Version: 1.1.2

Manage data volume with GX

Data volume, a critical aspect of data quality, refers to the quantity of records or data points within a dataset. Managing data volume effectively is crucial for maintaining data integrity, ensuring system performance, and deriving accurate insights. Unexpected changes in data volume can signal issues in data collection, processing, or storage, potentially leading to skewed analyses or system failures. Volume management is intrinsically linked to other aspects of data quality, such as data completeness and consistency, forming a crucial part of a comprehensive data quality strategy.

Great Expectations (GX) offers a powerful set of tools for monitoring and validating data volume through its volume-focused Expectations. By integrating these Expectations into your data pipelines, you can establish robust checks that ensure your datasets maintain the expected volume, catch anomalies early, and prevent downstream issues in your data workflows.

This guide will walk you through leveraging GX to effectively manage and validate data volume, helping you maintain high-quality, reliable datasets.

Prerequisite knowledge

This article assumes basic familiarity with GX components and workflows. If you're new to GX, start with the GX Overview to familiarize yourself with key concepts and setup procedures.

Data preview

The examples in this article use a sample financial transaction dataset hosted in a public Postgres database table. The sample data is also available in CSV format.

transfer_type | sender_account_number | recipient_fullname | transfer_amount | transfer_ts
domestic | 244084670977 | Jaxson Duke | 9143.40 | 2024-05-01 01:12
domestic | 954005011218 | Nelson O’Connell | 3285.21 | 2024-05-01 05:08

This dataset represents daily financial transactions. In a real-world scenario, you'd expect a certain volume of transactions to occur each day.

Key volume Expectations

GX provides several Expectations specifically designed for managing data volume. These can be added to an Expectation Suite via the GX Cloud UI or using the GX Core Python library.

Add a volume Expectation in GX Cloud

Expect Table Row Count To Be Between

Ensures that the number of rows in a dataset falls within a specified range.

Use Case: Validate that transaction volumes are within expected bounds, alerting to unusual spikes or drops in activity.

gxe.ExpectTableRowCountToBeBetween(min_value=2, max_value=5)
View ExpectTableRowCountToBeBetween in the Expectation Gallery.

Expect Table Row Count To Equal

Verifies that the dataset contains exactly the specified number of records.

Use Case: Ensure that a specific number of records are processed, useful for batch operations or reconciliation tasks.

gxe.ExpectTableRowCountToEqual(value=4)
View ExpectTableRowCountToEqual in the Expectation Gallery.

Expect Table Row Count To Equal Other Table

Compares the row count of the current table to another table within the same database.

Use Case: Verify data consistency across different stages of a pipeline or between source and target systems.

gxe.ExpectTableRowCountToEqualOtherTable(other_table_name="transactions_summary")
View ExpectTableRowCountToEqualOtherTable in the Expectation Gallery.

GX tips for volume Expectations
  • Regularly adjust your ExpectTableRowCountToBeBetween thresholds based on historical data and growth patterns to maintain relevance.
  • Use ExpectTableRowCountToEqual in conjunction with time-based partitioning for precise daily volume checks.
  • Implement ExpectTableRowCountToEqualOtherTable to ensure data integrity across your data pipeline stages.
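
One way to act on the first tip is to derive candidate bounds from recent row counts instead of choosing them by hand. The following is a minimal sketch under stated assumptions: the helper name suggest_row_count_bounds, the choice of mean plus or minus k sample standard deviations, and the history values are all hypothetical illustrations, not part of GX. The resulting numbers could then be supplied as min_value and max_value to ExpectTableRowCountToBeBetween.

```python
import math
import statistics


def suggest_row_count_bounds(daily_counts, k=3.0):
    """Return (min_value, max_value) as mean +/- k sample standard deviations."""
    if len(daily_counts) < 2:
        raise ValueError("need at least two historical counts")
    mean = statistics.mean(daily_counts)
    spread = k * statistics.stdev(daily_counts)
    # Round outward so the bounds never tighten due to rounding; clamp at zero.
    lower = max(0, math.floor(mean - spread))
    upper = math.ceil(mean + spread)
    return lower, upper


# Hypothetical history of daily row counts (not taken from the sample dataset).
history = [100, 104, 98, 102, 96]
min_value, max_value = suggest_row_count_bounds(history, k=3.0)
print(min_value, max_value)
```

A wider k tolerates more day-to-day variation at the cost of missing smaller anomalies, so the multiplier is itself something to revisit as volume patterns change.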

Example: Validate daily transaction volume

Context: In SQL tables, data is often timestamped on row creation. Tables can hold historical data created over long ranges of time, but organizations generally want to validate volume for a specific time period, such as a year, a month, or a day. When data arrives on a regular cadence, it is also useful to monitor volume over the most recent window of time.

Goal: Using the ExpectTableRowCountToBeBetween Expectation and either GX Core or GX Cloud, validate daily data volume by batching a single Data Asset (a Postgres table) on a time-based column, transfer_ts.

Run the following GX Core workflow.

import great_expectations as gx
import great_expectations.expectations as gxe
import pandas as pd

# Create Data Context.
context = gx.get_context()

# Connect to sample data, create Data Source and Data Asset.
CONNECTION_STRING = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality"

data_source = context.data_sources.add_postgres(
    "postgres database", connection_string=CONNECTION_STRING
)
data_asset = data_source.add_table_asset(
    name="financial transfers table", table_name="volume_financial_transfers"
)

# Add a Batch Definition with partitioning by day.
batch_definition = data_asset.add_batch_definition_daily(
    name="daily transfers", column="transfer_ts"
)

# Create an Expectation testing that each batch (day) contains between 1 and 5 rows.
volume_expectation = gxe.ExpectTableRowCountToBeBetween(min_value=1, max_value=5)

# Validate data volume for each day in date range and capture result.
START_DATE = "2024-05-01"
END_DATE = "2024-05-07"

validation_results_by_day = []

for date in list(pd.date_range(start=START_DATE, end=END_DATE).to_pydatetime()):
    daily_batch = batch_definition.get_batch(
        batch_parameters={"year": date.year, "month": date.month, "day": date.day}
    )

    result = daily_batch.validate(volume_expectation)
    validation_results_by_day.append(
        {
            "date": date,
            "expectation passed": result["success"],
            "observed rows": result["result"]["observed_value"],
        }
    )

pd.DataFrame(validation_results_by_day)

Result:

date | expectation passed | observed rows
2024-05-01 | True | 4
2024-05-02 | True | 5
2024-05-03 | True | 5
2024-05-04 | True | 5
2024-05-05 | True | 5
2024-05-06 | False | 6
2024-05-07 | True | 5

GX solution: GX enables volume validation for yearly, monthly, or daily ranges of data. Data validation can be defined and run using either GX Core or GX Cloud.
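
To make the idea of daily batching concrete, the sketch below reproduces the logic in plain Python, independent of GX: group rows by the calendar day of their timestamp, then check each day's count against a range. The function names and the bounds are hypothetical, and only the first two timestamps come from the data preview; the remaining ones are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def rows_per_day(timestamps):
    """Count rows per calendar day from 'YYYY-MM-DD HH:MM' strings."""
    days = (datetime.strptime(ts, "%Y-%m-%d %H:%M").date() for ts in timestamps)
    return Counter(days)


def check_daily_volume(counts, min_value, max_value):
    """Map each day to True when its row count lies within [min_value, max_value]."""
    return {day: min_value <= n <= max_value for day, n in sorted(counts.items())}


# First two values appear in the data preview; the rest are invented for this sketch.
transfer_ts = [
    "2024-05-01 01:12",
    "2024-05-01 05:08",
    "2024-05-02 09:30",
    "2024-05-02 10:00",
    "2024-05-02 11:45",
]

counts = rows_per_day(transfer_ts)
daily_result = check_daily_volume(counts, min_value=1, max_value=2)
for day, passed in daily_result.items():
    print(day, counts[day], passed)
```

In the GX workflow, the grouping step corresponds to the daily Batch Definition on transfer_ts and the range check corresponds to the Expectation; the sketch only mirrors that division of work.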

Scenarios

The scenarios in this section outline common real-world use cases for data volume validation, and how GX can be applied to identify and monitor volume issues.

Data reconciliation across systems

Context: Organizations often store and process data across multiple systems, such as source systems, data warehouses, and reporting databases, and consistency across them is crucial for accurate reporting and decision-making. In a banking environment, for example, the same records might pass through core banking platforms, data warehouses, and reporting databases, where consistency is also essential for regulatory compliance and accurate financial reporting.

GX solution: Implement checks using ExpectTableRowCountToEqualOtherTable to ensure data volume consistency between source and target systems in a data reconciliation process.
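
The comparison behind this kind of reconciliation can be illustrated without GX. The sketch below uses an in-memory SQLite database from the Python standard library; the table names transfers_source and transfers_target and the helper functions are hypothetical, chosen only to show a row-count comparison between two tables in the same database.

```python
import sqlite3


def row_count(conn, table_name):
    """Return the number of rows in a table that exists in this database."""
    known = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    # Table names cannot be bound as SQL parameters, so validate before formatting.
    if table_name not in known:
        raise ValueError(f"unknown table: {table_name}")
    return conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]


def row_counts_match(conn, table_name, other_table_name):
    """True when both tables contain the same number of rows."""
    return row_count(conn, table_name) == row_count(conn, other_table_name)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transfers_source (id INTEGER)")
conn.execute("CREATE TABLE transfers_target (id INTEGER)")
conn.executemany("INSERT INTO transfers_source VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO transfers_target VALUES (?)", [(1,), (2,)])

# One row has not reached the target yet, so the counts differ.
match_before_load = row_counts_match(conn, "transfers_source", "transfers_target")

conn.execute("INSERT INTO transfers_target VALUES (3)")
match_after_load = row_counts_match(conn, "transfers_source", "transfers_target")
print(match_before_load, match_after_load)
```

Equal row counts are a necessary signal rather than a sufficient one: two tables can have the same number of rows and still differ in content, so a count check is usually paired with other checks.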

Monitoring data volume in real-time streaming pipelines

Context: Many organizations process large volumes of data in real-time for various purposes, such as fraud detection, system monitoring, or real-time analytics. Monitoring data volume in real-time streaming pipelines is essential to ensure that the volume remains within expected bounds and to detect any anomalies promptly. For instance, banks often process large volumes of data in real-time for fraud detection or market monitoring, and detecting volume anomalies quickly is crucial for mitigating risks.

GX solution: Implement checks using ExpectTableRowCountToBeBetween to monitor data volume in real-time streaming pipelines and alert when anomalies are detected.
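
A range check needs a count to check, and in a streaming setting that count usually comes from a window of recent events. The sketch below shows one possible way to maintain such a count in plain Python. The SlidingWindowCounter class, the numeric timestamps in seconds, and the bounds are assumptions made for illustration; this is not a GX API, and it assumes events are recorded in non-decreasing time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events whose timestamps fall within the last window_seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, timestamp):
        # Assumes timestamps arrive in non-decreasing order.
        self._events.append(timestamp)

    def count(self, now):
        # Evict events at or before the cutoff, then report what remains.
        cutoff = now - self.window_seconds
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events)


def volume_in_bounds(count, min_value, max_value):
    """True when the windowed count lies within [min_value, max_value]."""
    return min_value <= count <= max_value


counter = SlidingWindowCounter(window_seconds=60)
for ts in (0, 10, 20, 70):  # hypothetical event times, in seconds
    counter.record(ts)

count_at_75 = counter.count(now=75)  # events at 20 and 70 remain in the window
count_at_85 = counter.count(now=85)  # only the event at 70 remains
print(count_at_75, volume_in_bounds(count_at_75, min_value=2, max_value=5))
print(count_at_85, volume_in_bounds(count_at_85, min_value=2, max_value=5))
```

The window length determines how quickly a drop or spike becomes visible, so it is worth choosing alongside the bounds rather than after them.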

Batch processing verification

Context: In batch processing systems, verifying that each batch contains the expected number of records confirms that processing was complete. This applies across industries: in retail, sales transactions might be processed in batches, and in healthcare, patient records might be updated through batch processes. A mismatch between expected and actual record counts points to incomplete processing or data loss.

GX solution: Validate data using ExpectTableRowCountToEqual to ensure that each processed batch contains exactly the expected number of records.
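
An exact-count check presupposes a source for the expected number, often a manifest or control total produced upstream. The sketch below shows that comparison in plain Python. The batch identifiers, the counts, and the verify_batches helper are hypothetical; in a GX workflow, each expected count would play the role of the value passed to ExpectTableRowCountToEqual.

```python
def verify_batches(expected_counts, observed_counts):
    """Return {batch_id: (expected, observed)} for every batch whose counts differ.

    A batch that is missing from observed_counts is treated as having 0 rows.
    """
    mismatches = {}
    for batch_id, expected in expected_counts.items():
        observed = observed_counts.get(batch_id, 0)
        if observed != expected:
            mismatches[batch_id] = (expected, observed)
    return mismatches


# Hypothetical control totals and observed row counts.
expected = {"batch_001": 4, "batch_002": 4, "batch_003": 4}
observed = {"batch_001": 4, "batch_002": 3}

mismatches = verify_batches(expected, observed)
print(mismatches)
```

Treating a missing batch as zero rows makes a batch that never arrived show up as a mismatch instead of being silently skipped.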

Avoid common volume validation pitfalls

  • Static Thresholds: Avoid using fixed thresholds that don't account for natural growth or seasonality. Regularly review and adjust your ExpectTableRowCountToBeBetween parameters. For example, an e-commerce platform might need different volume thresholds for regular days versus holiday seasons.

  • Ignoring Data Skew: Data skew refers to the uneven distribution of data across partitions or nodes in a distributed system. Failing to account for data skew when validating volume can lead to misleading results. Monitor volume at the partition level and implement checks to detect and handle data skew.

  • Ignoring Trends: Don't overlook gradual changes in data volume over time. Implement trend analysis alongside point-in-time checks. GX can be used in conjunction with time-series analysis tools to detect and alert on unexpected volume trends.

  • Overlooking Granularity: Ensure volume checks are applied at the appropriate level of granularity (e.g., daily, hourly) to catch issues promptly. For instance, a social media analytics pipeline might require hourly volume checks to detect and respond to viral content quickly.

  • Neglecting Context: Remember that volume changes might be legitimate due to business events or system changes. Incorporate contextual information when possible. GX can be integrated with external systems to factor in known events or changes when validating volume expectations.
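
The point about trends can be illustrated with a small calculation: a series of daily counts can pass a static range check on every day while still declining steadily. The sketch below fits a least-squares line to the counts and reports its slope. The function name, the sample counts, and the bounds of 90 to 110 are hypothetical, and this is a plain-Python illustration, not a GX feature.

```python
def volume_trend_slope(counts):
    """Least-squares slope of counts against their index (rows per period)."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two counts to estimate a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical daily row counts: each day is within bounds, but volume is shrinking.
daily_counts = [100, 98, 96, 94, 92]

all_within_bounds = all(90 <= c <= 110 for c in daily_counts)
slope = volume_trend_slope(daily_counts)
print(all_within_bounds, slope)
```

Here every point-in-time check passes while the slope shows a loss of two rows per day, which is the kind of gradual change a range check alone does not surface.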

The path forward

Proactive management and validation of data volume is a key part of ensuring the quality and reliability of your data. Implementing the strategies explored in this article will help you to enhance your data's integrity and trustworthiness.

Volume management is a critical component of data quality, but it is only one facet of a comprehensive data quality strategy. As you continue to iterate on your data quality strategy, leverage the full spectrum of GX capabilities to create a robust, scalable, and trustworthy data ecosystem. Explore our broader data quality series to gain insights into how other critical aspects of data quality can be seamlessly integrated into your workflows.