Version: 0.18.21

Create an Expectation Suite with the Missingness Data Assistant

Caution

Missingness Data Assistant functionality is Experimental.

Use the information provided here to learn how you can use the Missingness Data Assistant to profile your data and automate the creation of an Expectation Suite.

The code used in the examples is available on GitHub: how_to_create_an_expectation_suite_with_the_missingness_data_assistant.py.

Prerequisites

A configured Data Context.
An understanding of how to configure a Data Source.
An understanding of how to configure a Batch Request.

Prepare your Data Source and Validator

The following examples use existing New York taxi trip data to create a Validator.

This is the Data Source configuration:

Python
datasource = context.sources.add_pandas_filesystem(
    name="taxi_multi_batch_datasource",  # custom name to assign to new datasource, can be used to retrieve datasource later
    base_directory="./data",  # replace with your data directory
)

This is the Validator configuration:

Python
validator = datasource.read_csv(
    asset_name="all_years",  # custom name to assign to the asset, can be used to retrieve asset later
    batching_regex=r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv",
)
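The batching_regex above decides which files become Batches and which values are captured as year and month. The following sketch uses only Python's re module to show what the pattern matches. The file names are hypothetical, and the use of fullmatch is an assumption made for illustration; the library's own matching logic may differ.

```python
from __future__ import annotations

import re

# Same pattern as the batching_regex in the Validator configuration above.
BATCHING_REGEX = r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"

# Hypothetical directory listing, used only for illustration.
file_names = [
    "yellow_tripdata_sample_2019-01.csv",
    "yellow_tripdata_sample_2019-02.csv",
    "yellow_tripdata_sample_2020-01.csv",
    "README.md",
]


def match_batches(pattern: str, names: list[str]) -> dict[str, dict[str, str]]:
    """Map each matching file name to the named groups captured from it."""
    compiled = re.compile(pattern)
    matched = {}
    for name in names:
        found = compiled.fullmatch(name)  # assumption: the whole name must match
        if found:
            matched[name] = found.groupdict()
    return matched


# Every sample file matches; README.md does not.
all_batches = match_batches(BATCHING_REGEX, file_names)

# Narrower pattern with the year fixed, so fewer files match.
only_2019 = match_batches(
    r"yellow_tripdata_sample_2019-(?P<month>\d{2})\.csv", file_names
)
```

With this listing, all_batches contains three files and only_2019 contains two. Fixing part of the pattern in this way is one option for limiting the number of Batches when you want to work with a subset of your data.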

Caution

The Missingness Data Assistant runs multiple queries against your Data Source. Data Assistant performance can vary significantly depending on the number of Batches, the number of records per Batch, and network latency. If Data Assistant runtimes are too long, use a subset of your data when defining your Data Source and Validator.

Run the Missingness Data Assistant

To run a Data Assistant, call its run(...) method. The run(...) method of the Missingness Data Assistant accepts several parameters. For instance, the exclude_column_names parameter lists the columns that should not be profiled.

Run the following code to define the columns to exclude:

Python
exclude_column_names = [
    "VendorID",
    "pickup_datetime",
    "dropoff_datetime",
    "RatecodeID",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "congestion_surcharge",
]
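Before you pass an exclusion list to the Data Assistant, it can help to confirm which columns remain and whether any excluded name is misspelled. The following sketch does this with plain Python. The all_columns list is an assumed header for the taxi sample, not something read from your data; replace it with the column names of your own files.

```python
# Assumed header for the taxi sample (hypothetical); replace with your own columns.
all_columns = [
    "VendorID",
    "pickup_datetime",
    "dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "RatecodeID",
    "store_and_fwd_flag",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "total_amount",
    "congestion_surcharge",
]

# Same exclusion list as in the example above.
exclude_column_names = [
    "VendorID",
    "pickup_datetime",
    "dropoff_datetime",
    "RatecodeID",
    "PULocationID",
    "DOLocationID",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "congestion_surcharge",
]

excluded = set(exclude_column_names)

# Columns left for profiling, in their original order.
profiled_columns = [name for name in all_columns if name not in excluded]

# Excluded names that do not appear in the header at all (likely typos).
unknown_columns = sorted(excluded - set(all_columns))
```

Under the assumed header, profiled_columns is ["passenger_count", "trip_distance", "store_and_fwd_flag", "total_amount"] and unknown_columns is empty.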

Run the following code to run the Missingness Data Assistant:

Python
data_assistant_result = context.assistants.missingness.run(
    validator=validator,
    exclude_column_names=exclude_column_names,
)

In this example, context is your Data Context instance.

Note

The example code uses the default estimation parameter ("exact").

If you consider your data to be valid, and want to produce Expectations with ranges that are identical to the data in the Validator, you don't need to alter the example code.

To identify potential outliers in your Batch Request data, pass estimation="flag_outliers" to the run(...) method.

Note

In addition to exclude_column_names, the run(...) method of the Missingness Data Assistant accepts other parameters, such as include_column_names, include_column_name_suffixes, and cardinality_limit_mode. To view the available parameters, see this information.

Save your Expectation Suite

After the Missingness Data Assistant's run(...) method has generated Expectations for your data, run the following code to create an Expectation Suite from the result, assign it to your Validator, and save it:

Python
validator.expectation_suite = data_assistant_result.get_expectation_suite(
    expectation_suite_name="my_custom_expectation_suite_name"
)
validator.save_expectation_suite(discard_failed_expectations=False)

Test your Expectation Suite

Run the following code to create a Checkpoint from your Validator, run it, and confirm that your data passes the Expectation Suite:

Python
checkpoint = context.add_or_update_checkpoint(
    name="yellow_tripdata_sample_all_years_checkpoint",
    validator=validator,
)
checkpoint_result = checkpoint.run()

assert checkpoint_result["success"] is True

Check the "success" key of the Checkpoint result to verify that your data passes the Expectation Suite.

Plot and inspect Metrics and Expectations

Run the following code to view Batch-level visualizations of the Metrics computed by the Missingness Data Assistant:

Python
data_assistant_result.plot_metrics()

[Figure: Plot Metrics]

Note

Hover over a data point to view more information about the Batch and its calculated Metric value.
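To make the idea of a Batch-level Metric more concrete, the following sketch counts missing values per column for two small Batches. It is a conceptual illustration only: the Batch labels and CSV contents are hypothetical, it uses only the standard library, and it is not how the Data Assistant computes its Metrics. The Metrics shown in your plots may be defined differently.

```python
import csv
import io

# Hypothetical Batches: two tiny CSV files held in memory, keyed by a made-up label.
batches = {
    "2019-01": "passenger_count,trip_distance\n1,2.5\n,3.1\n2,\n",
    "2019-02": "passenger_count,trip_distance\n1,0.9\n3,1.2\n",
}


def count_missing(csv_text):
    """Return {column name: number of empty cells} for one Batch."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {name: 0 for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            # Treat absent or blank cells as missing.
            if value is None or value.strip() == "":
                counts[name] += 1
    return counts


# One dictionary of per-column counts for each Batch.
missing_per_batch = {
    label: count_missing(text) for label, text in batches.items()
}
```

In this made-up data, the "2019-01" Batch has one missing value in each column and the "2019-02" Batch has none. A difference like this between Batches is the kind of variation a Batch-level plot is meant to make visible.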

Run the following code to view the Expectations that were produced, grouped by Expectation type:

Python
data_assistant_result.show_expectations_by_expectation_type()

Edit your Expectation Suite (Optional)

The Missingness Data Assistant creates as many Expectations as it can for the permitted columns. Although this can help with data analysis, it might be unnecessary. You might have some domain knowledge that is not reflected in the data that was sampled for the Profiling process. In these types of scenarios, you can edit your Expectation Suite to better align with your business requirements.

To edit your new Expectation Suite, see Edit an existing Expectation Suite.
