Chapter 24: Data Quality Assessment and Profiling
Chapter Objectives
Upon completing this chapter, you will be able to:
- Understand the fundamental dimensions of data quality and their impact on machine learning model performance and business outcomes.
- Implement systematic data profiling techniques using modern Python libraries to generate comprehensive statistical summaries and visualizations of datasets.
- Analyze the results of data profiling reports to identify and categorize common data quality issues, including missing values, outliers, inconsistencies, and invalid entries.
- Design and apply automated data validation rule sets and expectation suites to enforce data quality constraints within a data pipeline.
- Optimize the pre-modeling workflow by integrating data quality assessment as a foundational step in the MLOps lifecycle.
- Evaluate the trade-offs between different data quality tools and strategies based on project requirements, data scale, and operational constraints.
Introduction
In the intricate architecture of modern AI systems, data serves as the foundational bedrock upon which all subsequent components are built. The performance, reliability, and fairness of any machine learning model are inextricably linked to the quality of the data used for its training and validation. The industry maxim “garbage in, garbage out” has never been more relevant; even the most sophisticated algorithms and powerful computational resources cannot compensate for flawed, incomplete, or inconsistent data. This chapter confronts this critical challenge head-on, moving beyond the simplistic notion of data as a mere commodity to treating it as a meticulously engineered asset. We will explore the principles and practices of Data Quality Assessment and Profiling, the systematic process of examining, measuring, and understanding data to ensure its fitness for purpose.
This chapter serves as a critical bridge between raw data acquisition and the more glamorous phases of feature engineering and model training. We will establish a rigorous framework for defining and measuring data quality across several key dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity. You will learn how to transition from manual, ad-hoc data inspection to automated, scalable, and repeatable profiling workflows. This is a cornerstone of modern MLOps, where data quality checks are not an afterthought but are embedded into continuous integration and continuous delivery (CI/CD) pipelines. By mastering the tools and techniques presented here, you will learn to diagnose data issues at their source, preventing them from propagating downstream where they can silently corrupt model behavior, introduce bias, and lead to costly business errors. This chapter will equip you with the essential skills to build robust, reliable, and trustworthy AI systems, starting with the most important ingredient: high-quality data.
Technical Background
The Foundational Dimensions of Data Quality
The concept of data quality is not monolithic; it is a multifaceted construct defined by several distinct, though often interrelated, dimensions. Understanding these dimensions provides a structured vocabulary for diagnosing issues and defining requirements. In an engineering context, these are not abstract ideals but measurable attributes that directly impact system performance. The successful implementation of any data-driven system begins with a clear definition of what constitutes “good” data for a specific use case.
Core Terminology and The Six Dimensions
The most widely accepted framework for data quality consists of six core dimensions. Completeness refers to the degree to which all required data is present. In a tabular dataset, this is often measured by the absence of null or missing values. For instance, a customer dataset for a marketing campaign is incomplete if a significant portion of entries lack an email address. The impact is direct: a model cannot learn from information that is not there, and missing data can introduce significant bias if the missingness is not random.
graph TD
    subgraph Trustworthy AI Systems
        direction TB
        A[<b>Trustworthy AI</b><br>Reliable, Fair, and Robust Models]
    end
    subgraph Foundational Pillars
        direction LR
        B(Completeness)
        C(Accuracy)
        D(Consistency)
        E(Validity)
        F(Uniqueness)
        G(Timeliness)
    end
    B --> A
    C --> A
    D --> A
    E --> A
    F --> A
    G --> A
    %% Styling
    classDef primary fill:#283044,stroke:#283044,stroke-width:2px,color:#ebf5ee;
    classDef process fill:#78a1bb,stroke:#78a1bb,stroke-width:1px,color:#283044;
    class A primary;
    class B,C,D,E,F,G process;
Accuracy is the dimension concerned with the conformity of a data value to a source of truth. It answers the question: “Is the information correct?” An address in a shipping database is inaccurate if it does not correspond to a real-world location. Accuracy is one of the most challenging dimensions to assess, as it often requires external validation or comparison with a “golden record.” An inaccurate dataset can lead a model to learn false patterns, resulting in flawed predictions and poor business decisions.
Consistency addresses the absence of contradictions within a dataset or across multiple datasets. A system lacks consistency if a patient’s date of birth is recorded differently in the admissions system versus the pharmacy system. Internally, a dataset is inconsistent if a customer’s age is listed as 30, but their date of birth implies they are 45. These logical discrepancies can confuse algorithms and erode trust in the data.
Validity ensures that data conforms to a predefined format, type, and range. A valid email address must contain an “@” symbol. A valid age for an employee should fall within a reasonable range, such as 18 to 75. Validity is often enforced through schema constraints and business rules. While a value can be valid (e.g., formatted correctly) but inaccurate (e.g., the wrong person’s email), ensuring validity is a crucial first-line defense against data entry errors and corruption.
Uniqueness guarantees that there are no duplicate records when a record is intended to represent a single real-world entity. In a user database, each user ID should be unique. Duplicate records can skew statistical analyses, inflate counts, and cause models to overweight the importance of the duplicated entities. Detecting uniqueness often involves identifying a primary key or a combination of attributes that uniquely identifies a record.
Finally, Timeliness refers to the degree to which data is up-to-date and available when needed. Financial transaction data must be timely to be useful for fraud detection. A model trained on stale data may fail to capture recent trends, leading to degraded performance. Timeliness is particularly critical in real-time or near-real-time applications where the “age” of the data is a primary feature.
The Six Dimensions of Data Quality
Dimension | Description | Example of a Problem |
---|---|---|
Completeness | The degree to which all required data is known and present. | A customer record is missing the email or phone_number, preventing contact. |
Accuracy | The degree to which data correctly represents the “real-world” object or event it describes. | A customer’s shipping address is listed as 123 Main St when they actually live at 321 Main St. |
Consistency | The absence of contradictions within a dataset or across multiple datasets. | A patient’s birthdate is 1985-06-10 in the billing system but 1985-10-06 in the clinical records system. |
Validity | The degree to which data conforms to the format, type, and range of its definition. | An age column contains a value of -5 or a string like “Twenty” instead of an integer. |
Uniqueness | Ensures that no entity or record exists more than once within the dataset. | The same customer CUST-001 appears as two separate rows in a customer table. |
Timeliness | The degree to which data is up-to-date and available within a useful timeframe. | A stock-trading algorithm receives price data that is 15 minutes delayed, making it useless for high-frequency trading. |
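Most of these dimensions can be approximated directly from a DataFrame before any dedicated tooling is introduced. The sketch below is a rough illustration, not a definitive scoring method: it assumes a table with a customer_id primary key and an email column, and estimates completeness, uniqueness, and validity (accuracy and timeliness generally require an external reference).

import pandas as pd

def quick_quality_metrics(df: pd.DataFrame, key_column: str, email_column: str) -> dict:
    """Rough per-dimension scores; accuracy and timeliness need an external reference."""
    email_ok = df[email_column].str.match(r"^[^@]+@[^@]+\.[^@]+$", na=False)
    return {
        # Completeness: share of non-null cells across the whole table
        "completeness": 1.0 - df.isna().sum().sum() / df.size,
        # Uniqueness: share of distinct values in the intended primary key
        "uniqueness": df[key_column].nunique(dropna=True) / len(df),
        # Validity: share of rows whose email matches a simple format rule
        "validity_email": email_ok.mean(),
    }

demo = pd.DataFrame({
    "customer_id": ["CUST-001", "CUST-002", "CUST-001"],
    "email": ["a@example.com", "bad-email", None],
})
print(quick_quality_metrics(demo, key_column="customer_id", email_column="email"))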
The Practice of Data Profiling
Data profiling is the process of systematically examining the data available in an existing source and collecting statistics and information about that data. The goal is to gain a deep understanding of the data’s structure, content, and interrelationships. It is an investigative process that forms the foundation of any data quality initiative. Profiling moves beyond assumptions and provides empirical evidence about the state of the data, making it the first practical step in any data-driven project.
Statistical Profiling Techniques
At its core, data profiling is a form of exploratory data analysis (EDA) focused on data quality. The process typically begins with generating summary statistics for each attribute or column in a dataset. For numerical data, this includes measures of central tendency and dispersion. The mean (\(\mu\)), median, and mode provide insights into the typical values, while the standard deviation (\(\sigma\)), variance (\(\sigma^2\)), and interquartile range (IQR) describe the spread and variability. These simple statistics can immediately highlight potential issues. For example, a minimum age of -5 or a maximum salary of \$500 million in an employee dataset are clear indicators of data entry errors.
Frequency distributions and histograms are essential for understanding the underlying patterns in the data. They reveal the prevalence of each value in categorical columns and the shape of the distribution for numerical columns (e.g., normal, skewed, bimodal). A frequency plot can quickly identify misspelled or inconsistently formatted categories (e.g., “USA”, “U.S.A.”, “United States”).
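In pandas, these first-pass checks are one-liners. A minimal sketch is shown below; the customers.csv file and the country and age column names are illustrative placeholders.

import pandas as pd

df = pd.read_csv("customers.csv")  # illustrative input file

# Central tendency and dispersion for every numerical column
print(df.describe())

# Frequency distribution of a categorical column exposes
# inconsistent spellings such as "USA", "U.S.A.", and "usa"
print(df["country"].value_counts(dropna=False))

# A quick histogram-style view of a numerical column
print(pd.cut(df["age"], bins=5).value_counts().sort_index())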
A critical aspect of statistical profiling is outlier detection. Outliers are data points that deviate significantly from the rest of the distribution. They can be legitimate but rare events, or they can be errors. A common statistical method for identifying outliers is the Z-score, which measures how many standard deviations a data point is from the mean: \( Z = \frac{x - \mu}{\sigma} \)
A common rule of thumb is to flag any data point with a Z-score greater than 3 or less than -3 as a potential outlier. Another robust method is using the IQR, where outliers are often defined as values that fall below \( Q_1 - 1.5 \times IQR \) or above \( Q_3 + 1.5 \times IQR \). Identifying these outliers is crucial because they can disproportionately influence the training of many machine learning models, such as linear regression.
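Both rules are straightforward to express with pandas. The sketch below flags potential outliers in a numerical Series using the Z-score and IQR criteria just described; the thresholds of 3 and 1.5 are the conventional defaults, and the sample values mirror the chapter's synthetic age column.

import pandas as pd

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Boolean mask: True where |Z| exceeds the threshold."""
    z = (s - s.mean()) / s.std()
    return z.abs() > threshold

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask: True outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

ages = pd.Series([34, 28, 45, 120, 31, 34])
print(ages[iqr_outliers(ages)])      # flags the implausible 120
print(ages[zscore_outliers(ages)])   # small samples may flag nothing at |Z| > 3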
Structural and Relationship Profiling
Beyond individual columns, data profiling also involves understanding the structure of the data and the relationships between columns. This includes validating data types, lengths, and formats. For instance, a column expected to contain dates should be profiled to ensure all values conform to the YYYY-MM-DD
format and do not contain text.
Relationship profiling involves discovering dependencies between columns. Correlation analysis is a key technique used to measure the strength and direction of a linear relationship between two numerical variables. The Pearson correlation coefficient, \(\rho\), ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear relationship. A correlation matrix provides a concise overview of all pairwise correlations in a dataset, which is useful for identifying multicollinearity—a situation where predictor variables are highly correlated, which can destabilize some regression models.
For categorical variables, relationships can be explored using contingency tables and chi-squared tests. Profiling also extends to discovering more complex dependencies, such as functional dependencies (e.g., a ZIP code determining the city and state) and referential integrity constraints in relational databases (e.g., ensuring every order_id
in a payments table corresponds to a valid order in the orders table). Understanding these relationships is vital for data consistency and for making informed decisions during feature engineering.
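A minimal sketch of both techniques follows, assuming SciPy is available and an orders.csv file with country and payment_method columns (illustrative names only).

import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("orders.csv")  # illustrative input file

# Pairwise Pearson correlations between numerical columns;
# values near +/-1 hint at redundancy or multicollinearity
corr_matrix = df.select_dtypes("number").corr(method="pearson")
print(corr_matrix)

# Association between two categorical columns via a contingency table
contingency = pd.crosstab(df["country"], df["payment_method"])
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")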
flowchart TD
    A[Raw Data Source] --> B{Data Profiling Engine};
    subgraph Profiling Stages
        B --> C[1- Column Profiling<br><i>Statistics, Frequencies, Data Types</i>];
        C --> D[2- Structural Profiling<br><i>Formats, Patterns, Lengths</i>];
        D --> E[3- Relationship Profiling<br><i>Correlation, Dependencies, Redundancy</i>];
    end
    E --> F([Data Profile Report<br><i>Summaries, Visualizations, Quality Alerts</i>]);
    %% Styling
    classDef startNode fill:#9b59b6,stroke:#9b59b6,stroke-width:1px,color:#ebf5ee;
    classDef processNode fill:#78a1bb,stroke:#78a1bb,stroke-width:1px,color:#283044;
    classDef decisionNode fill:#f39c12,stroke:#f39c12,stroke-width:1px,color:#283044;
    classDef endNode fill:#2d7a3d,stroke:#2d7a3d,stroke-width:2px,color:#ebf5ee;
    class A startNode;
    class B decisionNode;
    class C,D,E processNode;
    class F endNode;
Automated Data Quality Assessment
While manual profiling is insightful for initial exploration, it is not scalable or repeatable. In a production environment, data is constantly changing, and quality must be monitored continuously. This necessitates the use of automated data quality assessment frameworks. These tools translate the dimensions of data quality into machine-readable and executable tests, often called expectations or assertions.
Rule-Based Validation and Expectation Suites
Modern data quality tools, such as Great Expectations or dbt (Data Build Tool), allow engineers to define a declarative “suite” of expectations about their data. These expectations are assertions about what the data should look like. They are, in essence, unit tests for data. An expectation suite for a user table might include assertions like:
expect_column_values_to_not_be_null('user_id')
expect_column_values_to_be_unique('user_id')
expect_column_values_to_match_regex('email', '^[^@]+@[^@]+\.[^@]+$')
expect_column_values_to_be_between('age', 13, 100)
expect_column_to_exist('registration_date')
When a new batch of data arrives, this suite of expectations is run against it. The tool then generates a validation report detailing which expectations passed and which failed, along with the observed values that caused the failure. This provides immediate, actionable feedback.
Note: The power of this approach lies in its declarative nature. The engineer specifies what the data should look like, not how to check it. The framework handles the execution and reporting, making the process highly efficient.
Integrating Quality Gates into Data Pipelines
The true value of automated validation is realized when it is integrated into a data pipeline, creating “quality gates.” A quality gate is a step in a pipeline that halts execution or raises an alert if the data does not meet a predefined quality threshold. For example, a daily ETL (Extract, Transform, Load) job might be configured to fail if more than 5% of the values in a critical column are null.
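Even before adopting a dedicated framework, such a gate can be expressed in a few lines of Python at the start of a pipeline step. The sketch below is illustrative (the 5% threshold and column name are assumptions); raising an exception is enough for most orchestrators to mark the task as failed.

import pandas as pd

class DataQualityError(Exception):
    """Raised when a batch fails a quality gate."""

def null_rate_gate(df: pd.DataFrame, column: str, max_null_fraction: float = 0.05) -> None:
    """Halt the pipeline if too many values in a critical column are null."""
    null_fraction = df[column].isna().mean()
    if null_fraction > max_null_fraction:
        raise DataQualityError(
            f"{column}: {null_fraction:.1%} nulls exceeds the {max_null_fraction:.0%} limit"
        )

# Example: fail the daily ETL job if more than 5% of emails are missing
batch = pd.DataFrame({"email": ["a@example.com", None, "b@example.com", None]})
null_rate_gate(batch, "email")   # raises DataQualityError (50% nulls)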
graph TD A["Data Ingestion<br><i>(Batch or Stream)</i>"] --> B{"Data Quality Gate<br><i>(Great Expectations)</i>"}; B -- Validation PASS --> C[Feature Engineering]; C --> D[Model Training]; D --> E[Model Deployment]; B -- Validation FAIL --> F[Alerting & Remediation]; F --> G{Quarantine Bad Data}; F --> H{Notify Data Team}; %% Styling classDef primary fill:#283044,stroke:#283044,stroke-width:2px,color:#ebf5ee; classDef process fill:#78a1bb,stroke:#78a1bb,stroke-width:1px,color:#283044; classDef decision fill:#f39c12,stroke:#f39c12,stroke-width:1px,color:#283044; classDef success fill:#2d7a3d,stroke:#2d7a3d,stroke-width:2px,color:#ebf5ee; classDef warning fill:#f1c40f,stroke:#f1c40f,stroke-width:1px,color:#283044; classDef model fill:#e74c3c,stroke:#e74c3c,stroke-width:1px,color:#ebf5ee; class A primary; class B decision; class C,G,H process; class D,E model; class F warning;
This proactive approach prevents bad data from propagating downstream and corrupting models or analytics dashboards. It transforms data quality from a reactive, manual cleanup task into a proactive, automated governance process. In a mature MLOps environment, data quality metrics are tracked over time, just like software performance metrics. This allows teams to monitor for “data drift” or “data rot,” where the statistical properties of the data change over time, potentially invalidating the assumptions a model was trained on. This continuous monitoring is essential for maintaining the long-term health and performance of production AI systems.
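A simple way to approximate such monitoring is to compare each numerical feature's distribution in the latest batch against a reference sample captured at training time, for example with a two-sample Kolmogorov-Smirnov test. The sketch below uses synthetic data and an illustrative 0.05 significance threshold.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=50, scale=10, size=5_000)  # feature values captured at training time
current = rng.normal(loc=55, scale=10, size=5_000)    # today's batch has shifted upward

result = ks_2samp(reference, current)
if result.pvalue < 0.05:
    print(f"Possible drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
else:
    print("No significant distribution shift detected")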
Practical Examples and Implementation
Development Environment Setup
To effectively profile and validate data, we will use a standard Python data science stack. This environment is designed to be reproducible and uses industry-standard tools.
Core Components:
- Python: Version 3.11+ is recommended for its performance improvements and modern language features.
- Jupyter Notebook or VS Code: An interactive environment is ideal for exploratory data profiling and visualizing results.
- Virtual Environment: It is a strong best practice to isolate project dependencies. We will use Python’s built-in venv module.
Library Installation:
First, create and activate a virtual environment:
# Create a virtual environment named 'data_quality_env'
python -m venv data_quality_env
# Activate the environment (on macOS/Linux)
source data_quality_env/bin/activate
# Or on Windows
# .\data_quality_env\Scripts\activate
Next, install the necessary Python libraries using pip:
pip install pandas numpy ydata-profiling great-expectations
- pandas (v2.2+): The fundamental library for data manipulation and analysis in Python.
- numpy (v1.26+): Provides support for large, multi-dimensional arrays and matrices, and is a dependency for pandas.
- ydata-profiling (v4.7+): A powerful library for generating interactive HTML profiling reports from a pandas DataFrame with minimal code. It automates the generation of most of the statistics discussed in the technical background section.
- great-expectations (v0.18+): The leading open-source tool for data validation and testing. We will use it to create and run expectation suites.
Tip: Always pin your dependency versions in a requirements.txt file (pip freeze > requirements.txt) to ensure that your environment is reproducible by other team members and in deployment environments.
Core Implementation Examples
Let’s begin with a practical example. We will create a synthetic dataset of customer information with several deliberately introduced quality issues. Then, we will use ydata-profiling
to automatically discover these issues.
Creating a Synthetic Dataset with Quality Issues
We’ll use pandas to create a DataFrame. The issues to look for will be:
- Missing values in last_name and email.
- Inconsistent formatting in the country column.
- An invalid email format.
- An outlier in the age column.
- A duplicate customer_id.
import pandas as pd
import numpy as np
# Create a dictionary with synthetic data
data = {
'customer_id': ['CUST-001', 'CUST-002', 'CUST-003', 'CUST-004', 'CUST-005', 'CUST-001'],
'first_name': ['John', 'Jane', 'Peter', 'Mary', 'David', 'John'],
'last_name': ['Smith', 'Doe', np.nan, 'Jones', 'Williams', 'Smith'],
'email': ['john.s@example.com', 'jane.d@example.com', 'peter@', 'mary.j@example.com', np.nan, 'john.s@example.com'],
'age': [34, 28, 45, 120, 31, 34],
'country': ['USA', 'U.S.A.', 'Canada', 'UK', 'usa', 'USA'],
'registration_date': pd.to_datetime(['2022-01-15', '2022-03-10', '2022-05-20', '2022-07-01', '2022-08-12', '2022-01-15']),
'total_spent': [150.75, 200.50, 99.99, 350.00, 50.25, 150.75]
}
# Create the pandas DataFrame
customers_df = pd.DataFrame(data)
print("Synthetic Customer DataFrame:")
print(customers_df)
Synthetic Customer DataFrame:
customer_id first_name last_name email age country registration_date total_spent
0 CUST-001 John Smith john.s@example.com 34 USA 2022-01-15 150.75
1 CUST-002 Jane Doe jane.d@example.com 28 U.S.A. 2022-03-10 200.50
2 CUST-003 Peter NaN peter@ 45 Canada 2022-05-20 99.99
3 CUST-004 Mary Jones mary.j@example.com 120 UK 2022-07-01 350.00
4 CUST-005 David Williams NaN 31 usa 2022-08-12 50.25
5 CUST-001 John Smith john.s@example.com 34 USA 2022-01-15 150.75
Generating an Automated Profile Report
Now we can generate a comprehensive, interactive report using ydata-profiling. The listing below first tries a minimal-mode report, then a report with a custom Settings configuration, and finally falls back to a manual pandas-based quality analysis if report generation fails in your environment.
# Continuing from the previous listing: pandas, numpy, and customers_df are already defined.
# Solution 1: Use minimal configuration to avoid problematic calculations
from ydata_profiling import ProfileReport
try:
# Create a minimal configuration that disables chi-square tests
profile = ProfileReport(
customers_df,
title="Customer Data Quality Profile",
minimal=True, # Use minimal mode to avoid complex statistical calculations
interactions=None, # Disable interaction analysis
correlations=None, # Disable correlation analysis
missing_diagrams=None # Disable missing value diagrams
)
profile.to_file("customer_data_profile_minimal.html")
print("\nMinimal data profile report generated successfully!")
except Exception as e:
print(f"Minimal profiling failed: {e}")
# Solution 2: Use custom configuration to disable specific features
try:
from ydata_profiling.config import Settings
# Create custom settings
settings = Settings()
settings.correlations.calculate = False
settings.interactions.calculate = False
settings.describe.categorical.chi_squared_threshold = 0 # Disable chi-square
profile = ProfileReport(
customers_df,
title="Customer Data Quality Profile",
config=settings
)
profile.to_file("customer_data_profile_custom.html")
print("\nCustom configuration profile report generated successfully!")
except Exception as e2:
print(f"Custom configuration failed: {e2}")
# Solution 3: Manual data quality analysis
print("\nFalling back to manual data quality analysis:")
# Basic data info
print("\n=== DATASET OVERVIEW ===")
print(f"Shape: {customers_df.shape}")
print(f"Memory usage: {customers_df.memory_usage(deep=True).sum()} bytes")
# Missing values analysis
print("\n=== MISSING VALUES ===")
missing_stats = customers_df.isnull().sum()
missing_pct = (missing_stats / len(customers_df)) * 100
missing_df = pd.DataFrame({
'Missing Count': missing_stats,
'Missing Percentage': missing_pct.round(2)
})
print(missing_df[missing_df['Missing Count'] > 0])
# Duplicate analysis
print("\n=== DUPLICATES ===")
duplicates = customers_df.duplicated().sum()
print(f"Total duplicate rows: {duplicates}")
if duplicates > 0:
print("Duplicate rows:")
print(customers_df[customers_df.duplicated(keep=False)])
# Data type analysis
print("\n=== DATA TYPES ===")
print(customers_df.dtypes)
# Categorical analysis
print("\n=== CATEGORICAL COLUMNS ANALYSIS ===")
categorical_cols = customers_df.select_dtypes(include=['object']).columns
for col in categorical_cols:
print(f"\n{col.upper()}:")
value_counts = customers_df[col].value_counts(dropna=False)
print(value_counts)
# Check for data inconsistencies
if col == 'country':
# Count distinct raw spellings among values that normalize to "USA"
usa_variants = customers_df.loc[customers_df[col].str.upper().isin(['USA', 'U.S.A.']), col].nunique()
if usa_variants > 1:
print(" ⚠️ Inconsistent country naming detected (USA variants)")
if col == 'email':
invalid_emails = ~customers_df[col].str.contains(r'@.*\.', na=False)  # raw string avoids invalid-escape warnings
invalid_count = invalid_emails.sum()
if invalid_count > 0:
print(f" ⚠️ {invalid_count} potentially invalid email formats")
# Numerical analysis
print("\n=== NUMERICAL COLUMNS ANALYSIS ===")
numerical_cols = customers_df.select_dtypes(include=[np.number]).columns
for col in numerical_cols:
print(f"\n{col.upper()}:")
print(f" Count: {customers_df[col].count()}")
print(f" Mean: {customers_df[col].mean():.2f}")
print(f" Median: {customers_df[col].median():.2f}")
print(f" Min: {customers_df[col].min()}")
print(f" Max: {customers_df[col].max()}")
print(f" Std: {customers_df[col].std():.2f}")
# Check for outliers
if col == 'age':
unusual_ages = (customers_df[col] < 0) | (customers_df[col] > 100)  # flags implausible ages, including the planted 120
if unusual_ages.sum() > 0:
print(f" ⚠️ {unusual_ages.sum()} unusual age values detected")
# Data quality summary
print("\n=== DATA QUALITY SUMMARY ===")
total_cells = customers_df.size
missing_cells = customers_df.isnull().sum().sum()
quality_score = ((total_cells - missing_cells) / total_cells) * 100
print(f"Overall data completeness: {quality_score:.1f}%")
print(f"Total records: {len(customers_df)}")
print(f"Unique customers: {customers_df['customer_id'].nunique()}")
print(f"Duplicate records: {duplicates}")
# Recommendations
print("\n=== RECOMMENDATIONS ===")
if missing_cells > 0:
print("• Address missing values in critical fields")
if duplicates > 0:
print("• Remove or investigate duplicate records")
if 'usa' in customers_df['country'].str.lower().values:
print("• Standardize country naming conventions")
print("\nManual analysis completed successfully!")
# Additional utility function for data cleaning
def clean_customer_data(df):
"""Clean the customer dataframe"""
df_clean = df.copy()
# Standardize country names
df_clean['country'] = df_clean['country'].str.upper().replace({
'U.S.A.': 'USA',
'USA': 'USA'
})
# Remove duplicates
df_clean = df_clean.drop_duplicates()
# Fix invalid email formats (basic validation)
invalid_email_mask = ~df_clean['email'].str.contains(r'@.*\.', na=False)  # raw string avoids invalid-escape warnings
df_clean.loc[invalid_email_mask, 'email'] = np.nan
return df_clean
print("\n" + "="*50)
print("CLEANED DATASET:")
cleaned_df = clean_customer_data(customers_df)
print(cleaned_df)
When you open the generated HTML report (for example, customer_data_profile_minimal.html), you will find an interactive profile. It will automatically highlight:
- Warnings Tab: A summary of all potential issues, such as email has 1 invalid value, age has 1 outlier, country has high cardinality (due to inconsistent formatting), and customer_id has duplicate values.
- Per-Variable Details: For age, it will show the histogram and flag 120 as an extreme value. For country, it will list the distinct values “USA”, “U.S.A.”, and “usa”, making the inconsistency obvious. For email, it will show the missing value and provide a warning about the invalid format for “peter@”.
- Correlations: It will generate a correlation matrix for the numerical columns.
- Missing Values: It provides a matrix and count of missing values, clearly showing the NaN entries in last_name and email.
- Duplicate Rows: It will identify and show the fully duplicated row for CUST-001.
This example demonstrates the power of automated profiling tools to quickly surface a wide range of data quality issues that would be tedious and error-prone to find manually.
Step-by-Step Tutorials
Now let’s move from profiling (discovery) to validation (enforcement) using Great Expectations
. We will create a suite of expectations to formalize the quality rules for our customer dataset.
Setting up a Great Expectations Project
Creating and Running an Expectation Suite
This tutorial walks through the core Great Expectations workflow: creating a data context, registering a pandas DataFrame as a data asset, defining an expectation, and validating a batch. The example below uses the sample taxi dataset from the Great Expectations tutorials; the same pattern is then applied to our customers_df in the sketch that follows the example.
import great_expectations as gx
import pandas as pd
df = pd.read_csv(
"https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)
context = gx.get_context()
data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})
expectation = gx.expectations.ExpectColumnValuesToBeBetween(
column="passenger_count", min_value=1, max_value=6
)
validation_result = batch.validate(expectation)
print(validation_result)
Running this script executes the expectation against the DataFrame. The validation_result object contains a detailed report of the run, including whether the expectation passed and the observed values that caused any failures. In a fuller project setup, expectations are grouped into a suite and run via a checkpoint, and context.build_data_docs() generates an HTML site (Data Docs) that provides a clean, shareable report showing exactly which expectations passed and failed and which values caused the failures. This provides clear, actionable feedback for data engineers or analysts to fix the source data or the ingestion process.
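The same fluent pattern applies to our synthetic customers_df. The sketch below is one possible adaptation under the API shown above, not the only way to organize it; it validates a handful of expectations, mirroring the quality rules listed earlier in the chapter, one at a time against the customer batch.

import great_expectations as gx
import numpy as np
import pandas as pd

# The messy customer data from earlier in the chapter (subset of columns)
customers_df = pd.DataFrame({
    "customer_id": ["CUST-001", "CUST-002", "CUST-003", "CUST-004", "CUST-005", "CUST-001"],
    "email": ["john.s@example.com", "jane.d@example.com", "peter@",
              "mary.j@example.com", np.nan, "john.s@example.com"],
    "age": [34, 28, 45, 120, 31, 34],
})

context = gx.get_context()
data_source = context.data_sources.add_pandas("customers_pandas")
data_asset = data_source.add_dataframe_asset(name="customers asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("customers batch")
batch = batch_definition.get_batch(batch_parameters={"dataframe": customers_df})

expectations = [
    gx.expectations.ExpectColumnValuesToBeUnique(column="customer_id"),
    gx.expectations.ExpectColumnValuesToNotBeNull(column="email"),
    gx.expectations.ExpectColumnValuesToMatchRegex(column="email", regex=r"^[^@]+@[^@]+\.[^@]+$"),
    gx.expectations.ExpectColumnValuesToBeBetween(column="age", min_value=13, max_value=100),
]

# Validate each expectation against the batch and report pass/fail
for expectation in expectations:
    result = batch.validate(expectation)
    status = "PASS" if result.success else "FAIL"
    print(f"{status}: {expectation.__class__.__name__}")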
Integration and Deployment Examples
The real power of Great Expectations
comes from integrating it into automated pipelines. Imagine a daily process that ingests customer data from a remote source. We can create a Python script that acts as a quality gate.
import pandas as pd
import great_expectations as gx
import numpy as np
def get_daily_customer_data():
"""
Placeholder function to simulate fetching new data.
In a real scenario, this would read from a database, API, or file store.
"""
# Using the same messy data for demonstration
data = {
'customer_id': ['CUST-001', 'CUST-002', 'CUST-003', 'CUST-004', 'CUST-005', 'CUST-001'],
'email': ['john.s@example.com', 'jane.d@example.com', 'peter@', 'mary.j@example.com', np.nan, 'john.s@example.com'],
'age': [34, 28, 45, 120, 31, 34],
'country': ['USA', 'U.S.A.', 'Canada', 'UK', 'usa', 'USA'],
}
return pd.DataFrame(data)
def run_data_quality_gate(dataframe: pd.DataFrame) -> bool:
"""
Runs a Great Expectations checkpoint to validate a DataFrame.
Returns True if validation succeeds, False otherwise.
"""
try:
context = gx.get_context()
# Method 1: Using run_checkpoint with checkpoint name
# This assumes you have a pre-configured checkpoint
try:
result = context.run_checkpoint(
checkpoint_name="my_customer_data_checkpoint",
batch_request={
"runtime_parameters": {"batch_data": dataframe},
"batch_identifiers": {"default_identifier_name": "pipeline_run"},
},
)
except Exception as checkpoint_error:
print(f"Checkpoint method failed: {checkpoint_error}")
print("Falling back to direct validation...")
# Method 2: Direct validation if checkpoint doesn't exist
# Get or create expectation suite
suite_name = "customer_data_quality_suite"
try:
suite = context.get_expectation_suite(suite_name)
except:
# Create a basic suite if it doesn't exist
suite = context.add_expectation_suite(suite_name)
# Add some basic expectations
validator = context.get_validator(
batch_request=gx.core.batch.RuntimeBatchRequest(
datasource_name="default_pandas_datasource",
data_connector_name="default_runtime_data_connector_name",
data_asset_name="customer_data",
runtime_parameters={"batch_data": dataframe},
batch_identifiers={"default_identifier_name": "pipeline_run"}
),
expectation_suite_name=suite_name
)
# Add expectations
validator.expect_column_to_exist("customer_id")
validator.expect_column_to_exist("email")
validator.expect_column_to_exist("age")
validator.expect_column_to_exist("country")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_be_between("age", min_value=0, max_value=100)  # ages above 100 are treated as data entry errors
validator.save_expectation_suite()
# Run validation
validator = context.get_validator(
batch_request=gx.core.batch.RuntimeBatchRequest(
datasource_name="default_pandas_datasource",
data_connector_name="default_runtime_data_connector_name",
data_asset_name="customer_data",
runtime_parameters={"batch_data": dataframe},
batch_identifiers={"default_identifier_name": "pipeline_run"}
),
expectation_suite_name=suite_name
)
result = validator.validate()
# Check results
if not result["success"]:
print("Data quality check failed!")
print(f"Failed expectations: {len([r for r in result['results'] if not r['success']])}")
# Print details of failed expectations
for expectation_result in result["results"]:
if not expectation_result["success"]:
print(f" - {expectation_result['expectation_config']['expectation_type']}: {expectation_result.get('result', {}).get('partial_unexpected_list', 'See details')}")
return False
print("Data quality check passed successfully.")
return True
except Exception as e:
print(f"Error during data quality validation: {e}")
return False
# Alternative simplified approach using just expectations
def run_simple_data_quality_check(dataframe: pd.DataFrame) -> bool:
"""
Simplified data quality check without checkpoints.
Good for getting started or when you don't have pre-configured checkpoints.
"""
try:
context = gx.get_context()
# Create a validator directly
validator = context.get_validator(
batch_request=gx.core.batch.RuntimeBatchRequest(
datasource_name="default_pandas_datasource",
data_connector_name="default_runtime_data_connector_name",
data_asset_name="customer_data",
runtime_parameters={"batch_data": dataframe},
batch_identifiers={"default_identifier_name": "pipeline_run"}
),
create_expectation_suite_with_name="ad_hoc_suite"
)
# Define and run expectations
all_passed = True
# Check required columns exist
result = validator.expect_column_to_exist("customer_id")
if not result["success"]:
print("customer_id column missing")
all_passed = False
result = validator.expect_column_to_exist("email")
if not result["success"]:
print("email column missing")
all_passed = False
# Check for null values in critical columns
result = validator.expect_column_values_to_not_be_null("customer_id")
if not result["success"]:
print(f"Found {result['result']['unexpected_count']} null customer_ids")
all_passed = False
# Check age range
result = validator.expect_column_values_to_be_between("age", min_value=0, max_value=100)  # flags the implausible age of 120
if not result["success"]:
print(f"Found {result['result']['unexpected_count']} ages outside valid range")
print(f" Unexpected values: {result['result']['partial_unexpected_list']}")
all_passed = False
# Check email format (basic)
result = validator.expect_column_values_to_match_regex("email", r"^[^@]+@[^@]+\.[^@]+$")
if not result["success"]:
print(f"Found {result['result']['unexpected_count']} invalid email formats")
all_passed = False
if all_passed:
print("✅ All data quality checks passed!")
return True
else:
print("❌ Some data quality checks failed!")
return False
except Exception as e:
print(f"Error during validation: {e}")
return False
# --- Main pipeline execution logic ---
if __name__ == "__main__":
print("Fetching daily customer data...")
new_data = get_daily_customer_data()
print(f"Data shape: {new_data.shape}")
print(f"Data preview:\n{new_data}")
print("\n" + "="*50)
print("Running data quality gate (method 1)...")
is_data_valid = run_data_quality_gate(new_data)
print("\n" + "="*50)
print("Running simplified data quality check (method 2)...")
is_data_valid_simple = run_simple_data_quality_check(new_data)
print("\n" + "="*50)
if is_data_valid or is_data_valid_simple:
print("Proceeding with data processing and model training...")
# ... downstream processing logic would go here ...
else:
print("Halting pipeline due to poor data quality.")
# ... exit or error handling logic ...
This script can be scheduled to run daily using a workflow orchestrator like Apache Airflow, Prefect, or a simple cron job. If the data quality checks fail, the pipeline stops, preventing corrupted data from entering the production system. This automated quality gate is a fundamental component of a robust and reliable MLOps workflow.
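For illustration, a hypothetical Airflow DAG wrapping this gate might look like the sketch below; the DAG and task names are invented, and it assumes the functions above live in an importable customer_pipeline module. A failing gate raises an exception, so Airflow marks the task as failed and blocks downstream tasks.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical module containing the functions defined in the listing above
from customer_pipeline import get_daily_customer_data, run_simple_data_quality_check

def quality_gate_task() -> None:
    """Fetch the daily batch and fail the task if validation fails."""
    dataframe = get_daily_customer_data()
    if not run_simple_data_quality_check(dataframe):
        raise ValueError("Data quality gate failed; halting downstream tasks.")

with DAG(
    dag_id="daily_customer_quality_gate",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    validate = PythonOperator(
        task_id="validate_customer_data",
        python_callable=quality_gate_task,
    )
    # Downstream tasks (feature engineering, training) would be declared here
    # and set to depend on `validate`, e.g. validate >> train_model.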
Industry Applications and Case Studies
The principles of data quality assessment and profiling are not merely academic; they are foundational to the success of AI initiatives across all industries. The financial cost of poor data quality is substantial, manifesting in flawed analytics, failed projects, and missed opportunities.
A prime example is in the financial services industry for fraud detection. A model trained to detect fraudulent transactions relies on high-quality, timely data. If the data contains duplicate transaction IDs (uniqueness issue), the model might learn incorrect patterns. If transaction timestamps are inaccurate (accuracy/timeliness issue), the model’s ability to identify rapid, sequential fraudulent activities is compromised. Financial institutions implement rigorous data quality gates in their data ingestion pipelines, using tools like Great Expectations to validate data from various sources (SWIFT messages, credit card processors) before it is used for model training or real-time inference. The business impact is direct: higher accuracy in fraud detection reduces financial losses and protects customers.
In e-commerce and retail, data quality is crucial for personalization and supply chain management. Recommendation engines require clean and complete user browsing history and purchase data. If a significant portion of user IDs are not correctly mapped to user profiles (consistency issue), the recommendation engine’s effectiveness plummets, leading to a poor customer experience and lost sales. Similarly, inventory management systems rely on accurate sales data to forecast demand. Inaccurate or delayed sales data can lead to stockouts or overstocking, both of which have significant financial consequences. Companies like Amazon and Netflix invest heavily in automated data quality frameworks to ensure the data fueling their core business logic is trustworthy.
In the healthcare sector, the stakes are even higher, as data quality can impact patient outcomes. Electronic Health Records (EHR) often suffer from inconsistencies, with data entered differently across various hospital departments. A model designed to predict patient readmission risk might fail if a patient’s comorbidities are not recorded consistently (consistency/completeness issue). A dosage calculation model could produce dangerous recommendations if a patient’s weight is entered with a typo (accuracy issue). Healthcare providers are increasingly adopting data quality tools to create a “single source of truth” for patient data, validating information against clinical standards and running consistency checks across systems. This not only improves the reliability of predictive models but also enhances patient safety and operational efficiency.
Best Practices and Common Pitfalls
Successfully implementing a data quality program requires more than just tools; it requires a strategic approach and a culture that values data as a critical asset. Adhering to best practices can significantly increase the chances of success, while being aware of common pitfalls can help avoid costly mistakes.
A primary best practice is to shift data quality left, meaning that checks should be implemented as early as possible in the data lifecycle. It is far more efficient and effective to catch an issue at the point of data ingestion than to try to remediate it after it has propagated through multiple systems and been used in various analyses. This involves working with data producers to improve source system quality and establishing strict quality gates where data enters your ecosystem. This proactive stance is a hallmark of mature data organizations.
Another key practice is to treat data quality as code. Your validation rules and expectation suites should be version-controlled in a repository like Git, just like your application code. This enables collaboration, peer review, and the ability to roll back changes. It also allows you to link a specific version of your data validation logic to a specific version of a trained model, which is crucial for reproducibility and debugging. This “Data Quality as Code” approach is central to modern MLOps.
Furthermore, it is essential to involve domain experts in the process of defining data quality rules. A data engineer can identify that a column contains outliers, but only a domain expert (like a clinician in healthcare or a geologist in the energy sector) can determine whether that outlier is a genuine anomaly or a data entry error. Data quality is a team sport, and successful programs foster collaboration between technical teams and business stakeholders to define what “good” data means in a specific context.
A common pitfall is boiling the ocean, or trying to fix every data quality issue at once. This often leads to analysis paralysis and a lack of progress. A more effective approach is to be pragmatic and risk-based. Start by focusing on the most critical data elements—those that have the biggest impact on key business processes or high-value models. Prioritize fixing the issues that pose the greatest risk and deliver the most value, and then incrementally expand your data quality coverage over time. Another frequent mistake is viewing data quality as a one-time project. Data is dynamic, and systems change. Data quality must be a continuous process of monitoring, measurement, and improvement, deeply embedded into your organization’s daily operations.
Hands-on Exercises
- Basic Profiling and Interpretation:
- Objective: Gain hands-on experience with automated data profiling and learn to interpret the results.
- Task: Find a public dataset of interest (e.g., from Kaggle or the UCI Machine Learning Repository). It should have a mix of numerical and categorical features. Using
ydata-profiling
, generate an HTML report for this dataset. - Guidance: Carefully review the “Warnings” or “Alerts” section of the report. For each warning (e.g., high correlation, missing values, high cardinality), write a short paragraph explaining what the issue is and what its potential impact could be on a machine learning model.
- Success Criteria: You have successfully generated a report and can articulate the meaning of at least three different types of data quality warnings.
- Creating a Custom Expectation Suite:
- Objective: Learn to translate business rules into a formal
Great Expectations
suite. - Task: Using the synthetic customer dataset from the chapter, expand the
ExpectationSuite
with three new, custom expectations. For example, you could add an expectation thattotal_spent
should always be a positive number, or thatregistration_date
should not be in the future. - Guidance: Refer to the Great Expectations documentation for a gallery of available expectations. Run the validation and confirm that your new expectations are executed. Intentionally modify the DataFrame to make one of your new expectations fail and observe the output in the validation report.
- Success Criteria: You have successfully added and saved new expectations to a suite, and you can demonstrate both a passing and a failing validation run for your custom rules.
- Objective: Learn to translate business rules into a formal
- Team Project: Data Quality Audit of a Real-World System:
- Objective: Apply data quality assessment techniques in a collaborative, real-world scenario.
- Task (for a team of 2-3): Identify a data source within your organization or a complex public project (e.g., Wikipedia data, OpenStreetMap data). Your team’s goal is to perform a comprehensive data quality audit.
- Guidance:
- Phase 1 (Profiling): Each team member profiles a different subset of the data. The team then collaborates to synthesize the findings into a single data quality report, identifying the top 5 most critical quality issues.
- Phase 2 (Validation): As a team, create a
Great Expectations
suite that codifies the rules to detect these top 5 issues. - Phase 3 (Presentation): Prepare a short presentation for your peers or instructor that summarizes your findings, demonstrates your expectation suite, and proposes a high-level plan for remediating the identified issues.
- Success Criteria: The team produces a clear data profile summary, a working expectation suite, and a presentation that effectively communicates the data quality challenges and proposed solutions.
Tools and Technologies
The landscape of data quality tools is rich and evolving. The primary tools covered in this chapter, ydata-profiling
and Great Expectations
, represent two key philosophies: automated discovery and declarative validation, respectively. ydata-profiling
is excellent for initial exploration and rapid insights, making it a favorite among data scientists during the EDA phase. Great Expectations
is the industry standard for building robust, automated quality gates in production pipelines, favored by data engineers and MLOps practitioners.
Another major player in this space is dbt (Data Build Tool). While primarily a data transformation tool, dbt has built-in testing capabilities that allow you to define assertions (similar to expectations) directly within your data models. For example, you can specify that a column should be unique or not null directly in the YAML configuration for a model. This is extremely powerful for teams that have standardized on dbt for their data transformation workflows, as it co-locates transformation logic and quality tests.
For users in a big data ecosystem, tools like Apache Griffin and Deequ (developed by AWS) provide data quality solutions specifically for Apache Spark. They allow you to compute data quality metrics and run validation checks on massive datasets distributed across a cluster.
When selecting a tool, consider the context. For a data scientist exploring a new dataset, ydata-profiling
is often the fastest way to get started. For an engineering team building a production data pipeline, the rigor and automation capabilities of Great Expectations
or the integrated testing in dbt
are more appropriate choices.
Summary
- Data Quality is Foundational: The success of any AI/ML system is fundamentally dependent on the quality of its underlying data. The principle of “garbage in, garbage out” is a primary law of machine learning.
- A Multidimensional Concept: Data quality is not a single metric but is assessed across six core dimensions: completeness, accuracy, consistency, validity, uniqueness, and timeliness.
- Profiling is for Discovery: Data profiling is the exploratory process of using statistical analysis and other techniques to gain a deep understanding of a dataset’s characteristics and to uncover potential quality issues.
- Validation is for Enforcement: Automated data validation involves creating declarative rule sets, or Expectation Suites, that codify what high-quality data should look like. These suites are used to create quality gates in data pipelines.
- Automate and Shift Left: Best practices emphasize automating data quality checks and integrating them as early as possible in the data lifecycle to prevent bad data from propagating downstream.
- Tools for the Job: The modern data stack includes powerful open-source tools like
ydata-profiling
for rapid exploration andGreat Expectations
for robust, pipeline-integrated validation.
By mastering the concepts and tools in this chapter, you have gained one of the most critical and practical skills in AI engineering: the ability to systematically ensure that the data used to build and power intelligent systems is fit for its purpose.
Further Reading and Resources
- Great Expectations Documentation: (https://docs.greatexpectations.io/docs/) – The official and most comprehensive resource for learning and implementing Great Expectations.
ydata-profiling
GitHub Repository: (https://github.com/ydataai/ydata-profiling) – The official repository with documentation, examples, and community support.- “The Practitioner’s Guide to Data Quality” by T.J. Redman: A seminal book in the field that provides a deep dive into the business and technical aspects of data quality management.
- dbt (Data Build Tool) Documentation on Testing: (https://docs.getdbt.com/docs/build/tests) – An excellent resource for understanding how to implement data quality checks within a modern data transformation workflow.
- “Data Cleaning” by Ihab F. Ilyas and Xu Chu: An academic yet practical book that covers the algorithms and theory behind data cleaning and quality techniques.
- The MLOps Community Blog: (https://mlops.community/) – A valuable resource for articles and discussions on the practical aspects of operationalizing machine learning, where data quality is a frequent topic.
- “Fundamentals of Data Engineering” by Joe Reis and Matt Housley: This book provides excellent context on where data quality fits within the broader landscape of data engineering and system design.
Glossary of Terms
- Data Profiling: The process of examining data from an existing source and collecting statistics and information about its structure, content, quality, and relationships.
- Data Validation: The process of checking data against a set of predefined rules or expectations to ensure it meets quality standards.
- Expectation: In Great Expectations, a declarative, machine-readable assertion about data. For example,
expect_column_values_to_be_unique
. - Expectation Suite: A collection of expectations that define a set of quality standards for a dataset.
- Data Quality Dimensions: The six core attributes used to measure data quality: Completeness, Accuracy, Consistency, Validity, Uniqueness, and Timeliness.
- Quality Gate: An automated step in a data or MLOps pipeline that validates data against an expectation suite and halts or alters the pipeline’s execution if the data fails to meet the required quality threshold.
- MLOps (Machine Learning Operations): A set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. Data quality management is a core component of MLOps.
- Outlier: A data point that differs significantly from other observations. It can be a valid but extreme value or an error.
- Z-score: A statistical measurement that describes a value’s relationship to the mean of a group of values, measured in terms of standard deviations.