Version: 1.5.7

Define a Multi-source Expectation

A Multi-source Expectation executes one SQL query for each of two Data Sources and compares their results for equality. This can be helpful for validating consistency between systems during data migration or regular data loading processes. Multi-source Expectations can detect data drift introduced during the ETL process through discrepancies in schemas, counts, time windows, data types, and precision levels between Data Sources. Here are some examples of comparisons you can test:

Every row in table A matches every row in table B.
An aggregate metric of table A matches the same aggregate metric of table B.
An aggregate metric of table A matches a different aggregate metric of table B. (For example, the count of rows where X is true in table A matches the count of rows where Y and Z are true in table B.)

To compare results for equality, each row returned by the query for the base Data Source will be compared to each row returned by the query for the comparison Data Source. When you configure a Multi-source Expectation, you set a failure threshold with mostly. The Expectation will fail if the portion of identical rows between your two queries falls below this threshold.

The portion of identical rows is computed by dividing the number of matching rows by the maximum number of rows in either result. Here are some example scenarios:

Base result row count	Comparison result row count	Matched rows	Portion of identical rows
200	200	200	1
25	100	25	.25
100	25	1	.01
0	0	0	1

Prerequisites

Procedure

To create a Multi-source Expectation, add an ExpectQueryResultsToMatchComparison Expectation for the base Data Source.

Instructions
Sample code

Define your comparison Data Source.

Python
my_upstream_source = "my_comparison_data_source"

Determine your base and comparison SQL queries. Each query should be written in the dialect of the associated Data Source.

In this example, the queries will both select any rows where the passenger count is greater than 0:

Python
my_base_query = """
    SELECT
        *
    FROM
        my_downstream_table
    WHERE
        passenger_count > 0
    """

my_comparison_query = """
    SELECT
        *
    FROM
        my_upstream_table
    WHERE
        passenger_count > 0
    """

Customize how the Expectation renders in Data Docs.

As with other Expectations, the description attribute contains the text describing the Expectation when your results are rendered into Data Docs. You can format the description string with Markdown syntax:
Python
```
my_description = "Both tables should have the same rows with passengers."
```
Create a new Expectation using the ExpectQueryResultsToMatchComparison class and your parameters.

The class name ExpectQueryResultsToMatchComparison describes the functionality of the Expectation: it queries multiple Data Sources and compares the results for equality. When you create your Expectation, you can use a name that is more indicative of your specific use case. In this example, the multi-source Expectation will be used to validate that two tables both have the same rows with passengers.
Python
```
expect_passenger_rows_to_match = gx.expectations.ExpectQueryResultsToMatchComparison(
    base_query=my_base_query,
    comparison_data_source_name=my_upstream_source,
    comparison_query=my_comparison_query,
    mostly=1,
    description=my_description,
)
```
Use your Multi-source Expectation.

Now that you've created a Multi-source Expectation, you can add it to an Expectation Suite for the base Data Source and validate it like any other Expectation.

Python
import great_expectations as gx

# Define your comparison Data Source.
my_upstream_source = "my_comparison_data_source"


# Define your base and comparison SQL queries.
my_base_query = """
    SELECT
        *
    FROM
        my_downstream_table
    WHERE
        passenger_count > 0
    """

my_comparison_query = """
    SELECT
        *
    FROM
        my_upstream_table
    WHERE
        passenger_count > 0
    """

# Customize how the Expectation renders in Data Docs.
my_description = "Both tables should have the same rows with passengers."

# Create an Expectation using the ExpectQueryResultsToMatchComparison class and your parameters.
expect_passenger_rows_to_match = gx.expectations.ExpectQueryResultsToMatchComparison(
    base_query=my_base_query,
    comparison_data_source_name=my_upstream_source,
    comparison_query=my_comparison_query,
    mostly=1,
    description=my_description,
)

# Test the Expectation.
context = gx.get_context()
set_up_context_for_example(context)
set_up_context2_for_example(context)

data_source_name = "my_base_data_source"
data_asset_name = "my_data_asset"
batch_definition_name = "my_batch_definition"
batch = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
    .get_batch()
)

batch.validate(expect_passenger_rows_to_match)

Limitations

Keep the following limitations in mind when working with Multi-source Expectations:

The comparison is limited to the first 200 rows of each query result. If you anticipate that a query will return more than 200 rows, use an ORDER BY clause to control what is surfaced first for comparison.
Batches are not supported. To test a time-based interval of data, use timestamp windows in your base and comparison SQL queries.
The Expectation configuration and validation results are not reflected on the comparison Data Source. The Expectation is always managed on the Data Asset where you initially configure it.

Prerequisites​

Procedure​

Limitations​

Prerequisites

Procedure

Limitations