Manage data volume with GX
Data volume, a critical aspect of data quality, refers to the quantity of records or data points within a dataset. Managing data volume effectively is crucial for maintaining data integrity, ensuring system performance, and deriving accurate insights. Unexpected changes in data volume can signal issues in data collection, processing, or storage, potentially leading to skewed analyses or system failures. Volume management is intrinsically linked to other aspects of data quality, such as data completeness and consistency, forming a crucial part of a comprehensive data quality strategy.
Great Expectations (GX) offers a powerful set of tools for monitoring and validating data volume through its volume-focused Expectations. By integrating these Expectations into your data pipelines, you can establish robust checks that ensure your datasets maintain the expected volume, catch anomalies early, and prevent downstream issues in your data workflows.
This guide will walk you through leveraging GX to effectively manage and validate data volume, helping you maintain high-quality, reliable datasets.
Prerequisite knowledge
This article assumes basic familiarity with GX components and workflows. If you're new to GX, start with the GX Overview to familiarize yourself with key concepts and setup procedures.
Data preview
The examples in this article use a sample financial transaction dataset that is provided from a public Postgres database table. The sample data is also available in CSV format.
transfer_type | sender_account_number | recipient_fullname | transfer_amount | transfer_ts |
---|---|---|---|---|
domestic | 244084670977 | Jaxson Duke | 9143.40 | 2024-05-01 01:12 |
domestic | 954005011218 | Nelson O’Connell | 3285.21 | 2024-05-01 05:08 |
This dataset represents daily financial transactions. In a real-world scenario, you'd expect a certain volume of transactions to occur each day.
Key volume Expectations
GX provides several Expectations specifically designed for managing data volume. These can be added via the GX Cloud UI or using the GX Core Python library.
Expect Table Row Count To Be Between
Ensures that the number of rows in a dataset falls within a specified range.
Use Case: Validate that transaction volumes are within expected bounds, alerting to unusual spikes or drops in activity.
gxe.ExpectTableRowCountToBeBetween(min_value=2, max_value=5)
ExpectTableRowCountToBeBetween
in the Expectation Gallery.
Expect Table Row Count To Equal
Verifies that the dataset contains exactly the specified number of records.
Use Case: Ensure that a specific number of records are processed, useful for batch operations or reconciliation tasks.
gxe.ExpectTableRowCountToEqual(value=4)
ExpectTableRowCountToEqual
in the Expectation Gallery.
Expect Table Row Count To Equal Other Table
Compares the row count of the current table to another table within the same database.
Use Case: Verify data consistency across different stages of a pipeline or between source and target systems.
gxe.ExpectTableRowCountToEqualOtherTable(other_table_name="transactions_summary")
ExpectTableRowCountToEqualOtherTable
in the Expectation Gallery.
- Regularly adjust your
ExpectTableRowCountToBeBetween
thresholds based on historical data and growth patterns to maintain relevance. - Use
ExpectTableRowCountToEqual
in conjunction with time-based partitioning for precise daily volume checks. - Implement
ExpectTableRowCountToEqualOtherTable
to ensure data integrity across your data pipeline stages.
Example: Validate daily transaction volume
Context: In SQL tables, data is often timestamped on row creation. Tables can hold historical data created over long ranges of time, however, organizations generally want to validate volume for a specific time period: over a year, over a month, over a day. When data arrives on a regular cadence, it is also useful to be able to monitor volume over the most recent window of time.
Goal: Using the ExpectTableRowCountToBeBetween
Expectation and either GX Core or GX Cloud, validate daily data volume by batching a single Data Asset (a Postgres table) on a time-based column, transfer_ts
.
- GX Core
- GX Cloud
Run the following GX Core workflow.
import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd
# Create Data Context.
context = gx.get_context()
# Connect to sample data, create Data Source and Data Asset.
CONNECTION_STRING = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality"
data_source = context.data_sources.add_postgres(
"postgres database", connection_string=CONNECTION_STRING
)
data_asset = data_source.add_table_asset(
name="financial transfers table", table_name="volume_financial_transfers"
)
# Add a Batch Definition with partitioning by day.
batch_definition = data_asset.add_batch_definition_daily(
name="daily transfers", column="transfer_ts"
)
# Create an Expectation testing that each batch (day) contains between 1 and 5 rows.
volume_expectation = gxe.ExpectTableRowCountToBeBetween(min_value=1, max_value=5)
# Validate data volume for each day in date range and capture result.
START_DATE = "2024-05-01"
END_DATE = "2024-05-07"
validation_results_by_day = []
for date in list(pd.date_range(start=START_DATE, end=END_DATE).to_pydatetime()):
daily_batch = batch_definition.get_batch(
batch_parameters={"year": date.year, "month": date.month, "day": date.day}
)
result = daily_batch.validate(volume_expectation)
validation_results_by_day.append(
{
"date": date,
"expectation passed": result["success"],
"observed rows": result["result"]["observed_value"],
}
)
pd.DataFrame(validation_results_by_day)
Result:
date | expectation passed | observed rows |
---|---|---|
2024-05-01 | True | 4 |
2024-05-02 | True | 5 |
2024-05-03 | True | 5 |
2024-05-04 | True | 5 |
2024-05-05 | True | 5 |
2024-05-06 | False | 6 |
2024-05-07 | True | 5 |
Use the GX Cloud UI to walk through the following steps.
-
Create a Postgres Data Asset for the
volume_financial_transfers
table, using the connection string:Connection stringpostgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality
-
Profile the Data Asset.
-
Add an Expect table row count to be between Expectation to the freshly created Data Asset.
-
Populate the Expectation with a Min Value of
1
and a Max Value of5
. -
Save the Expectation.
-
Click Define batch.
-
For Validate by, select Day.
-
Set the Batch column to
transfer_ts
. -
Click Save.
-
Click the Validate button and define which batch to validate.
- Latest validates data for the most recent batch found in the Data Asset.
- Custom validates data for the batch provided.
- Click Validate.
- Review Validation Results.
GX solution: GX enables volume validation for yearly, monthly, or daily ranges of data. Data validation can be defined and run using either GX Core or GX Cloud.
Scenarios
The scenarios in this section outline common real-world use cases for data volume validation, and how GX can be applied to identify and monitor volume issues.
Data reconciliation across systems
Context: In many organizations, data is often stored and processed across multiple systems, such as source systems, data warehouses, and reporting databases. Ensuring data consistency across these systems is crucial for accurate reporting and decision-making. For example, in a banking environment, data might be stored in core banking platforms, data warehouses, and reporting databases, and ensuring consistency across these systems is essential for regulatory compliance and accurate financial reporting.
GX solution: Implement checks using ExpectTableRowCountToEqualOtherTable
to ensure data volume consistency between source and target systems in a data reconciliation process.
Monitoring data volume in real-time streaming pipelines
Context: Many organizations process large volumes of data in real-time for various purposes, such as fraud detection, system monitoring, or real-time analytics. Monitoring data volume in real-time streaming pipelines is essential to ensure that the volume remains within expected bounds and to detect any anomalies promptly. For instance, banks often process large volumes of data in real-time for fraud detection or market monitoring, and detecting volume anomalies quickly is crucial for mitigating risks.
GX solution: Implement checks using ExpectTableRowCountToBeBetween
to monitor data volume in real-time streaming pipelines and alert when anomalies are detected.
Batch processing verification
Context: In batch processing systems, it is important to verify that each batch contains the expected number of records to ensure complete processing. This is applicable across various industries, such as retail, where sales transactions might be processed in batches, or in healthcare, where patient records might be updated through batch processes. Ensuring that each batch contains the expected number of records is crucial for maintaining data integrity and avoiding data loss.
GX solution: Validate data using ExpectTableRowCountToEqual
to ensure that each processed batch contains exactly the expected number of records.
Avoid common volume validation pitfalls
-
Static Thresholds: Avoid using fixed thresholds that don't account for natural growth or seasonality. Regularly review and adjust your
ExpectTableRowCountToBeBetween
parameters. For example, an e-commerce platform might need different volume thresholds for regular days versus holiday seasons. -
Ignoring Data Skew: Data skew refers to the uneven distribution of data across partitions or nodes in a distributed system. Failing to account for data skew when validating volume can lead to misleading results. Monitor volume at the partition level and implement checks to detect and handle data skew.
-
Ignoring Trends: Don't overlook gradual changes in data volume over time. Implement trend analysis alongside point-in-time checks. GX can be used in conjunction with time-series analysis tools to detect and alert on unexpected volume trends.
-
Overlooking Granularity: Ensure volume checks are applied at the appropriate level of granularity (e.g., daily, hourly) to catch issues promptly. For instance, a social media analytics pipeline might require hourly volume checks to detect and respond to viral content quickly.
-
Neglecting Context: Remember that volume changes might be legitimate due to business events or system changes. Incorporate contextual information when possible. GX can be integrated with external systems to factor in known events or changes when validating volume expectations.
The path forward
Proactive management and validation of data volume is a key part of ensuring the quality and reliability of your data. Implementing the strategies explored in this article will help you to enhance your data's integrity and trustworthiness.
Volume management is a critical component of data quality, however, it is one facet of a comprehensive data quality strategy. As you continue to iterate on your data quality strategy, leverage the full spectrum of GX capabilities to create a robust, scalable, and trustworthy data ecosystem. Explore our broader data quality series to gain insights into how other critical aspects of data quality can be seamlessly integrated into your workflows.