Define a Multi-source Expectation
A Multi-source Expectation executes one SQL query for each of two Data Sources and compares their results for equality. This can be helpful for validating consistency between systems during data migration or regular data loading processes. Multi-source Expectations can detect data drift introduced during the ETL process through discrepancies in schemas, counts, time windows, data types, and precision levels between Data Sources. Here are some examples of comparisons you can test:
- Every row in table A matches every row in table B.
- An aggregate metric of table A matches the same aggregate metric of table B.
- An aggregate metric of table A matches a different aggregate metric of table B. (For example, the count of rows where X is true in table A matches the count of rows where Y and Z are true in table B.)
To compare results for equality, each row returned by the query for the base Data Source will be compared to each row returned by the query for the comparison Data Source. When you configure a Multi-source Expectation, you set a failure threshold with mostly
. The Expectation will fail if the portion of identical rows between your two queries falls below this threshold.
The portion of identical rows is computed by dividing the number of matching rows by the maximum number of rows in either result. Here are some example scenarios:
Base result row count | Comparison result row count | Matched rows | Portion of identical rows |
---|---|---|---|
200 | 200 | 200 | 1 |
25 | 100 | 25 | .25 |
100 | 25 | 1 | .01 |
0 | 0 | 0 | 1 |
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core.
- A preconfigured Data Context.
- Recommended. Preconfigured Data Sources and Data Assets connected to your data for testing your Multi-source Expectation.
Procedure
To create a Multi-source Expectation, add an ExpectQueryResultsToMatchComparison
Expectation for the base Data Source.
- Instructions
- Sample code
-
Define your comparison Data Source.
Pythonmy_upstream_source = "my_comparison_data_source"
-
Determine your base and comparison SQL queries. Each query should be written in the dialect of the associated Data Source.
In this example, the queries will both select any rows where the passenger count is greater than
0
:Pythonmy_base_query = """
SELECT
*
FROM
my_downstream_table
WHERE
passenger_count > 0
"""
my_comparison_query = """
SELECT
*
FROM
my_upstream_table
WHERE
passenger_count > 0
""" -
Customize how the Expectation renders in Data Docs.
As with other Expectations, the
description
attribute contains the text describing the Expectation when your results are rendered into Data Docs. You can format thedescription
string with Markdown syntax:Pythonmy_description = "Both tables should have the same rows with passengers."
-
Create a new Expectation using the
ExpectQueryResultsToMatchComparison
class and your parameters.The class name
ExpectQueryResultsToMatchComparison
describes the functionality of the Expectation: it queries multiple Data Sources and compares the results for equality. When you create your Expectation, you can use a name that is more indicative of your specific use case. In this example, the multi-source Expectation will be used to validate that two tables both have the same rows with passengers.Pythonexpect_passenger_rows_to_match = gx.expectations.ExpectQueryResultsToMatchComparison(
base_query=my_base_query,
comparison_data_source_name=my_upstream_source,
comparison_query=my_comparison_query,
mostly=1,
description=my_description,
) -
Use your Multi-source Expectation.
Now that you've created a Multi-source Expectation, you can add it to an Expectation Suite for the base Data Source and validate it like any other Expectation.
import great_expectations as gx
# Define your comparison Data Source.
my_upstream_source = "my_comparison_data_source"
# Define your base and comparison SQL queries.
my_base_query = """
SELECT
*
FROM
my_downstream_table
WHERE
passenger_count > 0
"""
my_comparison_query = """
SELECT
*
FROM
my_upstream_table
WHERE
passenger_count > 0
"""
# Customize how the Expectation renders in Data Docs.
my_description = "Both tables should have the same rows with passengers."
# Create an Expectation using the ExpectQueryResultsToMatchComparison class and your parameters.
expect_passenger_rows_to_match = gx.expectations.ExpectQueryResultsToMatchComparison(
base_query=my_base_query,
comparison_data_source_name=my_upstream_source,
comparison_query=my_comparison_query,
mostly=1,
description=my_description,
)
# Test the Expectation.
context = gx.get_context()
data_source_name = "my_base_data_source"
data_asset_name = "my_data_asset"
batch_definition_name = "my_batch_definition"
batch = (
context.data_sources.get(data_source_name)
.get_asset(data_asset_name)
.get_batch_definition(batch_definition_name)
.get_batch()
)
batch.validate(expect_passenger_rows_to_match)
Limitations
Keep the following limitations in mind when working with Multi-source Expectations:
- The comparison is limited to the first 200 rows of each query result. If you anticipate that a query will return more than 200 rows, use an
ORDER BY
clause to control what is surfaced first for comparison. - Batches are not supported. To test a time-based interval of data, use timestamp windows in your base and comparison SQL queries.
- The Expectation configuration and validation results are not reflected on the comparison Data Source. The Expectation is always managed on the Data Asset where you initially configure it.