Version: 1.5.8

Manage data volume with GX

Data volume, a critical aspect of data quality, refers to the quantity of records or data points within a dataset. Managing data volume effectively helps maintain data integrity, protect system performance, and derive accurate insights. Unexpected changes in data volume can signal issues in data collection, processing, or storage, potentially leading to skewed analyses or system failures. Volume management is closely linked to other aspects of data quality, such as data completeness and consistency, and belongs in any comprehensive data quality strategy.

Great Expectations (GX) offers a powerful set of tools for monitoring and validating data volume through its volume-focused Expectations. By integrating these Expectations into your data pipelines, you can establish robust checks that ensure your datasets maintain the expected volume, catch anomalies early, and prevent downstream issues in your data workflows.

This guide will walk you through leveraging GX to effectively manage and validate data volume, helping you maintain high-quality, reliable datasets.

Prerequisite knowledge

This article assumes basic familiarity with GX components and workflows. If you're new to GX, start with the GX Overview to familiarize yourself with key concepts and setup procedures.

Data preview

The examples in this article use a sample financial transaction dataset hosted in a public Postgres database table. The sample data is also available in CSV format.

transfer_type	sender_account_number	recipient_fullname	transfer_amount	transfer_ts
domestic	244084670977	Jaxson Duke	9143.40	2024-05-01 01:12
domestic	954005011218	Nelson O’Connell	3285.21	2024-05-01 05:08

This dataset represents daily financial transactions. In a real-world scenario, you'd expect a certain volume of transactions to occur each day.

Key volume Expectations

GX provides several Expectations specifically designed for managing data volume, all of which can be added directly in GX Cloud or GX Core.

Expect Table Row Count To Be Between

Ensures that the number of rows in a dataset falls within a specified range.

Use Case: Validate that transaction volumes are within expected bounds, alerting to unusual spikes or drops in activity.

Python
gxe.ExpectTableRowCountToBeBetween(min_value=2, max_value=5)

Automate this rule with GX Cloud

When you create a new Data Asset or add an Expectation, you can enable Anomaly Detection to catch volume changes that deviate from historical patterns.

View ExpectTableRowCountToBeBetween in the Expectation Gallery.

Expect Table Row Count To Equal

Verifies that the dataset contains exactly the specified number of records.

Use Case: Ensure that a specific number of records are processed, useful for batch operations or reconciliation tasks.

Python
gxe.ExpectTableRowCountToEqual(value=4)

View ExpectTableRowCountToEqual in the Expectation Gallery.

Expect Table Row Count To Equal Other Table

Compares the row count of the current table to another table within the same database.

Use Case: Verify data consistency across different stages of a pipeline or between source and target systems.

Python
gxe.ExpectTableRowCountToEqualOtherTable(other_table_name="transactions_summary")

View ExpectTableRowCountToEqualOtherTable in the Expectation Gallery.

GX tips for volume Expectations

Regularly adjust your ExpectTableRowCountToBeBetween thresholds based on historical data and growth patterns to maintain relevance. Alternatively, to save time, automate a forecasted range with GX Cloud.
Use ExpectTableRowCountToEqual in conjunction with time-based partitioning for precise daily volume checks.
Implement ExpectTableRowCountToEqualOtherTable to ensure data integrity across your data pipeline stages.
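
One way to derive thresholds from historical data, as the first tip suggests, is sketched below using only the Python standard library. The `suggest_row_count_bounds` helper, the sample counts, and the choice of the mean plus or minus three standard deviations are assumptions made for illustration; they are not GX features. The resulting pair could be supplied as `min_value` and `max_value` to ExpectTableRowCountToBeBetween.

```python
import math
import statistics


def suggest_row_count_bounds(historical_counts, num_stdevs=3.0):
    """Return (min_value, max_value) as mean +/- num_stdevs * sample stdev.

    Hypothetical helper for illustration; the rule itself is an assumption.
    """
    if len(historical_counts) < 2:
        raise ValueError("need at least two historical counts")
    mean = statistics.mean(historical_counts)
    stdev = statistics.stdev(historical_counts)
    # Row counts cannot be negative, so clamp the lower bound at zero.
    lower = max(0, math.floor(mean - num_stdevs * stdev))
    upper = math.ceil(mean + num_stdevs * stdev)
    return lower, upper


# Hypothetical daily row counts from the previous two weeks.
history = [100, 104, 98, 101, 97, 103, 99, 102, 100, 96, 105, 98, 101, 100]
min_value, max_value = suggest_row_count_bounds(history)
```

A rule like this reacts to recent variability but not to seasonality, so the window of history it is fed still needs periodic review.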

Example: Validate daily transaction volume

Context: In SQL tables, data is often timestamped on row creation. Tables can hold historical data created over long ranges of time. However, organizations generally want to validate volume for a specific time period, such as a year, a month, or a day. When data arrives on a regular cadence, it is also useful to monitor volume over the most recent window of time.

Goal: Using the ExpectTableRowCountToBeBetween Expectation and either GX Core or GX Cloud, validate daily data volume by batching a single Data Asset (a Postgres table) on a time-based column, transfer_ts.

GX Core

Run the following GX Core workflow.

Python
import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

# Create Data Context.
context = gx.get_context()

# Connect to sample data, create Data Source and Data Asset.
CONNECTION_STRING = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality"

data_source = context.data_sources.add_postgres(
    "postgres database", connection_string=CONNECTION_STRING
)
data_asset = data_source.add_table_asset(
    name="financial transfers table", table_name="volume_financial_transfers"
)

# Add a Batch Definition with partitioning by day.
batch_definition = data_asset.add_batch_definition_daily(
    name="daily transfers", column="transfer_ts"
)

# Create an Expectation testing that each batch (day) contains between 1 and 5 rows.
volume_expectation = gxe.ExpectTableRowCountToBeBetween(min_value=1, max_value=5)

# Validate data volume for each day in date range and capture result.
START_DATE = "2024-05-01"
END_DATE = "2024-05-07"

validation_results_by_day = []

for date in list(pd.date_range(start=START_DATE, end=END_DATE).to_pydatetime()):
    daily_batch = batch_definition.get_batch(
        batch_parameters={"year": date.year, "month": date.month, "day": date.day}
    )

    result = daily_batch.validate(volume_expectation)
    validation_results_by_day.append(
        {
            "date": date,
            "expectation passed": result["success"],
            "observed rows": result["result"]["observed_value"],
        }
    )

pd.DataFrame(validation_results_by_day)

Result:

date	expectation passed	observed rows
2024-05-01	True	4
2024-05-02	True	5
2024-05-03	True	5
2024-05-04	True	5
2024-05-05	True	5
2024-05-06	False	6
2024-05-07	True	5
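
As a possible follow-up step, the results can be filtered down to the days that need attention. The sketch below copies the values from the result table into the same list-of-dicts shape that `validation_results_by_day` has in the workflow above, with plain `date` objects standing in for the datetimes so that it runs without a database connection.

```python
from datetime import date

# Values copied from the result table; same keys as validation_results_by_day.
validation_results_by_day = [
    {"date": date(2024, 5, 1), "expectation passed": True, "observed rows": 4},
    {"date": date(2024, 5, 2), "expectation passed": True, "observed rows": 5},
    {"date": date(2024, 5, 3), "expectation passed": True, "observed rows": 5},
    {"date": date(2024, 5, 4), "expectation passed": True, "observed rows": 5},
    {"date": date(2024, 5, 5), "expectation passed": True, "observed rows": 5},
    {"date": date(2024, 5, 6), "expectation passed": False, "observed rows": 6},
    {"date": date(2024, 5, 7), "expectation passed": True, "observed rows": 5},
]

# Keep only the days whose row count fell outside the expected range.
failed_days = [r for r in validation_results_by_day if not r["expectation passed"]]

for r in failed_days:
    print(f"{r['date']}: {r['observed rows']} rows fell outside the expected range")
# prints "2024-05-06: 6 rows fell outside the expected range"
```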

GX Cloud

Use the GX Cloud UI to walk through the following steps.

Create a Postgres Data Asset for the volume_financial_transfers table, using the connection string:
```
postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality
```
Profile the Data Asset.
Add an "Expect table row count to be between" Expectation to the newly created Data Asset.
Populate the Expectation with a Min Value of 1 and a Max Value of 5.
Save the Expectation.
Click Define batch.
For Validate by, select Day.
Set the Batch column to transfer_ts.
Click Save.
Click the Validate button and define which batch to validate.

Latest validates data for the most recent batch found in the Data Asset.
Custom validates data for the batch provided.

Click Validate.
Review Validation Results.

GX solution: GX enables volume validation for yearly, monthly, or daily ranges of data. Data validation can be defined and run using either GX Core or GX Cloud.

Scenarios

The scenarios in this section outline common real-world use cases for data volume validation, and how GX can be applied to identify and monitor volume issues.

Data reconciliation across systems

Context: In many organizations, data is often stored and processed across multiple systems, such as source systems, data warehouses, and reporting databases. Ensuring data consistency across these systems is crucial for accurate reporting and decision-making. For example, in a banking environment, data might be stored in banking platforms, data warehouses, and reporting databases, and ensuring consistency across these systems is essential for regulatory compliance and accurate financial reporting.

GX solution: Implement checks using ExpectTableRowCountToEqualOtherTable to ensure data volume consistency between source and target systems in a data reconciliation process.
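
To make the idea concrete, the comparison behind a row-count reconciliation can be sketched in plain SQL. The example below uses an in-memory SQLite database from the Python standard library; the table names, the sample rows, and the `row_count` helper are hypothetical, and the code illustrates the concept rather than how GX implements the Expectation.

```python
import sqlite3


def row_count(connection, table_name):
    """Count rows in a table (hypothetical helper).

    table_name is interpolated into the SQL text, so it must come from
    trusted code and never from user input.
    """
    query = f'SELECT COUNT(*) FROM "{table_name}"'
    return connection.execute(query).fetchone()[0]


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE source_transfers (id INTEGER PRIMARY KEY, amount REAL)")
connection.execute("CREATE TABLE target_transfers (id INTEGER PRIMARY KEY, amount REAL)")

# The source system holds three records.
connection.executemany(
    "INSERT INTO source_transfers (amount) VALUES (?)",
    [(9143.40,), (3285.21,), (120.00,)],
)
# Simulate a load into the target that silently dropped one record.
connection.executemany(
    "INSERT INTO target_transfers (amount) VALUES (?)",
    [(9143.40,), (3285.21,)],
)

source_rows = row_count(connection, "source_transfers")
target_rows = row_count(connection, "target_transfers")
counts_match = source_rows == target_rows  # False: the target is one row short
```

Equal counts do not prove the two tables hold the same records, so a check like this is usually paired with completeness or uniqueness checks.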

Monitoring data volume in real-time streaming pipelines

Context: Many organizations process large volumes of data in real-time for various purposes, such as fraud detection, system monitoring, or real-time analytics. Monitoring data volume in real-time streaming pipelines is essential to ensure that the volume remains within expected bounds and to detect any anomalies promptly. For instance, banks often process large volumes of data in real-time for fraud detection or market monitoring, and detecting volume anomalies quickly is crucial for mitigating risks.

GX solution: Implement checks using ExpectTableRowCountToBeBetween to monitor data volume in real-time streaming pipelines and alert when anomalies are detected.
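
A range check needs a bounded set of rows to count, so a stream is typically cut into fixed windows first. The sketch below shows that step with the standard library only; the helper names, timestamps, and bounds are hypothetical. Each window's count could then be validated against a range, for example by treating each window as a batch.

```python
from datetime import datetime, timedelta


def count_per_window(event_times, start, end, window):
    """Count events in each fixed-size window of [start, end).

    Windows with no events are kept with a count of 0 so that a complete
    drop in volume is still visible.
    """
    counts = {}
    window_start = start
    while window_start < end:
        counts[window_start] = 0
        window_start += window
    for ts in event_times:
        if start <= ts < end:
            index = (ts - start) // window  # timedelta // timedelta gives an int
            counts[start + index * window] += 1
    return counts


def out_of_bounds_windows(counts, min_value, max_value):
    """Return the windows whose count falls outside [min_value, max_value]."""
    return {w: n for w, n in counts.items() if not min_value <= n <= max_value}


# Hypothetical events: two in the first minute, one in the second, none in the third.
start = datetime(2024, 5, 1, 0, 0)
end = start + timedelta(minutes=3)
events = [
    datetime(2024, 5, 1, 0, 0, 5),
    datetime(2024, 5, 1, 0, 0, 40),
    datetime(2024, 5, 1, 0, 1, 10),
]

counts = count_per_window(events, start, end, timedelta(minutes=1))
flagged = out_of_bounds_windows(counts, min_value=1, max_value=5)
```

Here only the third window is flagged, because it received no events at all.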

Batch processing verification

Context: In batch processing systems, it is important to verify that each batch contains the expected number of records to ensure complete processing. This is applicable across various industries, such as retail, where sales transactions might be processed in batches, or in healthcare, where patient records might be updated through batch processes. Ensuring that each batch contains the expected number of records is crucial for maintaining data integrity and avoiding data loss.

GX solution: Validate data using ExpectTableRowCountToEqual to ensure that each processed batch contains exactly the expected number of records.
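
The expected number of records for each batch often comes from a control file or manifest produced by the upstream system. The sketch below assumes such a manifest exists; the batch identifiers, the counts, and the `find_count_mismatches` helper are hypothetical. Each expected count could serve as the `value` for ExpectTableRowCountToEqual on the corresponding batch.

```python
def find_count_mismatches(expected_counts, actual_counts):
    """Return {batch_id: (expected, actual)} for batches whose counts differ.

    A batch that is missing from actual_counts is treated as having 0 records.
    """
    return {
        batch_id: (expected, actual_counts.get(batch_id, 0))
        for batch_id, expected in expected_counts.items()
        if actual_counts.get(batch_id, 0) != expected
    }


# Hypothetical manifest of expected counts and the counts actually processed.
expected_counts = {"batch-001": 500, "batch-002": 500, "batch-003": 250}
actual_counts = {"batch-001": 500, "batch-002": 498}

mismatches = find_count_mismatches(expected_counts, actual_counts)
```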

Avoid common volume validation pitfalls

Static Thresholds: Avoid using fixed thresholds for ExpectTableRowCountToBeBetween that don't account for natural growth or seasonality. Regularly review and adjust your parameters. Alternatively, to save time, automate Anomaly Detection with GX Cloud. For example, an e-commerce platform might need different volume thresholds for regular days versus holiday seasons.
Ignoring Data Skew: Data skew refers to the uneven distribution of data across partitions or nodes in a distributed system. Failing to account for data skew when validating volume can lead to misleading results. Monitor volume at the partition level and implement checks to detect and handle data skew.
Ignoring Trends: Don't overlook gradual changes in data volume over time. Implement trend analysis alongside point-in-time checks. GX can be used in conjunction with time-series analysis tools to detect and alert on unexpected volume trends.
Overlooking Granularity: Ensure volume checks are applied at the appropriate level of granularity (e.g., daily, hourly) to catch issues promptly. For instance, a social media analytics pipeline might require hourly volume checks to detect and respond to viral content quickly.
Neglecting Context: Remember that volume changes might be legitimate due to business events or system changes. Incorporate contextual information when possible. GX can be integrated with external systems to factor in known events or changes when validating volume expectations.
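
As an illustration of the trend pitfall above, the sketch below fits a least-squares line to a series of daily row counts. Every count in the hypothetical series sits inside a static range of 80 to 120, so point-in-time checks would pass each day, yet the slope shows a steady decline. The `daily_trend` helper and the sample data are assumptions made for illustration and are not part of GX.

```python
def daily_trend(counts):
    """Least-squares slope of counts against day index, in rows per day."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two daily counts")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Hypothetical week of daily row counts, each within a static 80 to 120 range.
daily_counts = [100, 98, 96, 94, 92, 90, 88]

all_within_range = all(80 <= n <= 120 for n in daily_counts)  # True
slope = daily_trend(daily_counts)  # -2.0 rows per day
```

A slope that stays negative over several weeks is a prompt to investigate or to revisit the thresholds, even though no single day failed its check.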

The path forward

Proactive management and validation of data volume is a key part of ensuring the quality and reliability of your data. Implementing the strategies explored in this article will help you to enhance your data's integrity and trustworthiness.

Volume management is a critical component of data quality, but it is only one facet of a comprehensive data quality strategy. As you continue to iterate on your data quality strategy, use the full range of GX capabilities to create a robust, scalable, and trustworthy data ecosystem. Explore our broader data quality series to learn how other critical aspects of data quality can be integrated into your workflows.
