This guide walks you through using auto-initializing Expectations (verifiable assertions about data) to automate parameter estimation when you create Expectations interactively with a Batch (a selection of records from a Data Asset) or Batches that have been loaded into a Validator (used to run an Expectation Suite against data).
This guide assumes that you are creating and editing expectations in a Jupyter Notebook. This process is covered in the guide: How to create and edit expectations with instant feedback from a sample batch of data.
Additionally, this guide assumes that you are using a multi-batch Batch Request (provided to a Datasource in order to create a Batch) to provide your sample data. (Auto-initializing Expectations will work when run on a single Batch, but they really shine when run on multiple Batches that would otherwise have needed to be processed individually if a manual approach were taken.)
1. Determine if your Expectation is auto-initializing
Not all Expectations are auto-initializing. To be an auto-initializing Expectation, an Expectation must have parameters that can be estimated. As an example:
ExpectColumnToExist only takes in a Domain (which is the column name) and checks whether the column name is in the list of names in the table's metadata. It has no parameters to estimate, so it would not work under the auto-initializing framework.
Expectations that would work under the auto-initializing framework are the ones that have numeric ranges, like ExpectColumnMeanToBeBetween.
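To check whether the Expectation you are interested in works under the auto-initializing framework, run the is_expectation_auto_initializing() method of the Expectation class. For example, the command:
from great_expectations.expectations.expectation import Expectation
Expectation.is_expectation_auto_initializing("expect_column_to_exist")
will return False and print the message: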
The Expectation expect_column_to_exist is not able to be auto-initialized.
However, the command:
Expectation.is_expectation_auto_initializing("expect_column_mean_to_be_between")
will return True and print the message:
The Expectation expect_column_mean_to_be_between is able to be auto-initialized. Please run by using the auto=True parameter.
For the purposes of this guide, we will be using expect_column_mean_to_be_between as our example Expectation.
2. Run the Expectation with auto=True
Say you are interested in constructing an Expectation that captures the average distance of taxi trips across all of 2018. You have a Datasource (a standard API for accessing and interacting with data from a wide variety of source systems) that provides 12 Batches (one for each month of the year), and you know that expect_column_mean_to_be_between is the Expectation you want to implement.
The manual way
expect_column_mean_to_be_between() has the following parameters:
- column (str): The column name.
- min_value (float or None): The minimum value for the column mean.
- max_value (float or None): The maximum value for the column mean.
- strict_min (boolean): If True, the column mean must be strictly larger than min_value, default=False
- strict_max (boolean): If True, the column mean must be strictly smaller than max_value, default=False
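To make the role of these parameters concrete, here is a minimal pure-Python sketch of the comparison they describe. The function mean_is_between is a hypothetical helper written for this guide, not the library's implementation, and it assumes that a bound set to None simply means "no bound on that side".

```python
from typing import Optional


def mean_is_between(
    column_mean: float,
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
    strict_min: bool = False,
    strict_max: bool = False,
) -> bool:
    """Illustrative check only; hypothetical helper, not part of Great Expectations."""
    if min_value is not None:
        # strict_min=True requires the mean to be strictly larger than min_value.
        above_min = column_mean > min_value if strict_min else column_mean >= min_value
        if not above_min:
            return False
    if max_value is not None:
        # strict_max=True requires the mean to be strictly smaller than max_value.
        below_max = column_mean < max_value if strict_max else column_mean <= max_value
        if not below_max:
            return False
    return True


# A mean that sits exactly on the lower bound passes by default...
print(mean_is_between(2.83, min_value=2.83, max_value=3.06))
# ...but fails once the lower bound is made strict.
print(mean_is_between(2.83, min_value=2.83, max_value=3.06, strict_min=True))
```

The strict flags only matter when the column mean lands exactly on a bound, which is why both default to False.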
Without the auto-initialization framework, you would have to get the values for min_value and max_value for your series of 12 Batches by calculating the mean value for each Batch and using the calculated mean values to determine the min_value and max_value parameters to pass to your Expectation. Although not difficult, this would be a monotonous and time-consuming task.
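The manual approach can be sketched in plain Python as follows. This is an illustration under stated assumptions: the Batch names and trip distances are made-up values, manual_mean_bounds is a hypothetical helper, and taking the smallest and largest per-Batch mean is only one simple way to pick the bounds. It is not a description of how the library estimates parameters internally.

```python
from statistics import mean

# Hypothetical trip distances for three monthly Batches (made-up values).
batches = {
    "2018-01": [2.0, 3.0, 4.0],
    "2018-02": [2.5, 3.0],
    "2018-03": [3.0, 3.5, 4.0, 2.5],
}


def manual_mean_bounds(batches_by_name):
    """Return (min_value, max_value) as the smallest and largest per-Batch mean."""
    batch_means = {name: mean(values) for name, values in batches_by_name.items()}
    return min(batch_means.values()), max(batch_means.values())


min_value, max_value = manual_mean_bounds(batches)
print(f"min_value={min_value}, max_value={max_value}")
```

With real data, each list would be replaced by the column values loaded from one Batch, and the work would have to be repeated whenever the set of Batches changes.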
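Auto-initializing Expectations automate this sort of calculation across Batches. To perform the same calculation described above (the mean ranges across the 12 Batches in the 2018 taxi data), the only thing you need to do is run the Expectation with auto=True. In the call below, the column name trip_distance is an assumption; substitute the column that holds trip distances in your data:
expectation_result = validator.expect_column_mean_to_be_between(column="trip_distance", auto=True)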
Now the Expectation will calculate the min_value (2.83) and max_value (3.06) using all of the Batches that are loaded into the Validator. In our case, that means all 12 Batches associated with the 2018 taxi data.
3. Save your Expectation with the calculated values
Now that the Expectation's upper and lower bounds have been calculated from the Batches, you can save your Expectation Suite (a collection of verifiable assertions about data) and move on.
To view the full scripts that were used in this page, see them on GitHub: