How to create a new Expectation Suite using Rule-Based Profilers

In this tutorial, you will develop hands-on experience with configuring a Rule-Based Profiler to create an Expectation Suite. You will Profile several batches of NYC yellow taxi trip data to come up with reasonable estimates for the ranges of expectations for several numeric columns.

Prerequisites: This how-to guide assumes you have:

Steps

1. Create a new Great Expectations project

  • Create a new directory, called taxi_profiling_tutorial
  • Within this directory, create another directory called data
  • Navigate to the top level of taxi_profiling_tutorial in a terminal and run great_expectations --v3-api init
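
If you would rather script the directory layout than create it by hand, the following is a minimal sketch using only the two directory names given above. The helper name create_project_layout is hypothetical and not part of Great Expectations; the init command still has to be run separately in a terminal.

```python
from pathlib import Path


def create_project_layout(root: Path) -> Path:
    """Create taxi_profiling_tutorial/ with a data/ subdirectory under root."""
    project_dir = root / "taxi_profiling_tutorial"
    data_dir = project_dir / "data"
    # parents=True also creates taxi_profiling_tutorial; exist_ok makes reruns safe.
    data_dir.mkdir(parents=True, exist_ok=True)
    return project_dir
```

Calling create_project_layout(Path.cwd()) would create both directories under the current working directory.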

2. Download the data

  • Download this directory of yellow taxi trip csv files from the Great Expectations GitHub repo. You can use a tool like DownGit to do so
  • Move the unzipped directory of csv files into the data directory that you created in Step 1

3. Setting up your Datasource

  • Follow the steps in the guide How to connect to data on a filesystem using Pandas. For this tutorial, we will set up the Datasource config from YAML. When you open your notebook to create, test, and save your Datasource config, replace the config docstring with the following docstring:
example_yaml = f"""
name: taxi_pandas
class_name: Datasource
execution_engine:
  class_name: PandasExecutionEngine
data_connectors:
  monthly:
    base_directory: ../<YOUR BASE DIR>/
    glob_directive: '*.csv'
    class_name: ConfiguredAssetFilesystemDataConnector
    assets:
      my_reports:
        base_directory: ./
        group_names:
          - name
          - year
          - month
        class_name: Asset
        pattern: (.+)_(\d.*)-(\d.*)\.csv
"""
  • Test your YAML config to make sure it works - you should see some of the taxi csv filenames listed
  • Save your datasource config
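
To see what the pattern and group_names in this config do, here is a small sketch that applies the same regular expression to a filename with Python's re module. The filename is a hypothetical one in the style of the sample taxi data, and parse_batch_identifiers is an illustrative helper, not the data connector's actual implementation.

```python
import re

# Pattern and group names copied from the data connector config above.
PATTERN = r"(.+)_(\d.*)-(\d.*)\.csv"
GROUP_NAMES = ["name", "year", "month"]


def parse_batch_identifiers(filename: str) -> dict:
    """Map a csv filename to its batch identifiers, or return {} if it does not match."""
    match = re.match(PATTERN, filename)
    if match is None:
        return {}
    return dict(zip(GROUP_NAMES, match.groups()))


# Hypothetical filename in the style of the sample taxi data.
print(parse_batch_identifiers("yellow_tripdata_sample_2019-03.csv"))
# {'name': 'yellow_tripdata_sample', 'year': '2019', 'month': '03'}
```

Each capture group becomes one identifier, which is what lets later BatchRequests select Batches by position within the asset.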

4. Configuring the Profiler

  • Now, we'll create a new script in the same top-level taxi_profiling_tutorial directory called profiler_script.py. If you prefer, you could open up a Jupyter Notebook and run this there instead.
  • At the top of this file, we will create a new YAML docstring assigned to a variable called profiler_config. This will look similar to the YAML docstring we used above when creating our Datasource. Over the next several steps, we will gradually build up this docstring by typing or pasting in the lines below.
profiler_config = """
"""

First, we'll add a variables key and some variables that we'll use. Next, we'll add a top-level rules key, and then the name of our rule:

variables:
  false_positive_rate: 0.01
  mostly: 1.0
rules:
  row_count_rule:

After that, we'll add our DomainBuilder. In this case, we'll use a TableDomainBuilder, which will indicate that any expectations we build for this Domain will be at the Table level. Each Rule in our Profiler config can only use one DomainBuilder.

    domain_builder:
        class_name: TableDomainBuilder

Next, we'll use a NumericMetricRangeMultiBatchParameterBuilder to get an estimate to use for the min_value and max_value of our expect_table_row_count_to_be_between expectation. This ParameterBuilder will take in a BatchRequest consisting of the five Batches prior to our current Batch, and use the row counts of each of those months to get a probable range of row counts that you could use in your ExpectationConfiguration.

    parameter_builders:
      - parameter_name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 0
        truncate_values:
          lower_bound: 0
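
The index value "-6:-1" reads like a Python slice. The sketch below assumes that the data connector applies it as a slice over Batches sorted oldest to newest; the batch labels and the parse_index helper are hypothetical and only illustrate why the slice yields the five Batches before the most recent one.

```python
def parse_index(index: str) -> slice:
    """Turn a "start:stop" string such as "-6:-1" into a Python slice."""
    start, stop = index.split(":")
    return slice(int(start) if start else None, int(stop) if stop else None)


# Hypothetical batch labels, oldest first; the last one is the most recent Batch.
batches = ["2018-09", "2018-10", "2018-11", "2018-12", "2019-01", "2019-02", "2019-03"]

selected = batches[parse_index("-6:-1")]
print(selected)
# ['2018-10', '2018-11', '2018-12', '2019-01', '2019-02']
```

The stop value of -1 excludes the last element, so the most recent Batch is left out of the estimate.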

A Rule can have multiple ParameterBuilders if needed, but in our case, we'll only use the one for now.

Finally, you would use an ExpectationConfigurationBuilder to actually build your expect_table_row_count_to_be_between expectation, where the Domain is the Domain returned by your TableDomainBuilder (your entire table), and the min_value and max_value are Parameters returned by your NumericMetricRangeMultiBatchParameterBuilder.

    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value.min_value
        max_value: $parameter.row_count_range.value.max_value
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details

You can see here that we use a special $ syntax to reference variables and parameters that have been previously defined in our config. A more thorough description of this syntax is given in the docstring for ParameterContainer.
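
As a rough mental model of the $ syntax, a reference can be thought of as a dotted path into a set of nested dictionaries. The sketch below is a simplified illustration only, not how ParameterContainer is implemented; the resolve function and the values in namespaces are hypothetical.

```python
from functools import reduce


def resolve(reference, namespaces):
    """Resolve "$namespace.key.subkey" against nested dicts; return other values unchanged."""
    if not (isinstance(reference, str) and reference.startswith("$")):
        return reference
    namespace, *path = reference[1:].split(".")
    return reduce(lambda node, key: node[key], path, namespaces[namespace])


# Hypothetical values standing in for what the Profiler would hold at run time.
namespaces = {
    "variables": {"false_positive_rate": 0.01, "mostly": 1.0},
    "parameter": {
        "row_count_range": {
            "value": {"min_value": 10000, "max_value": 10000},
            "details": {"metric_configuration": {"metric_name": "table.row_count"}},
        }
    },
}

kwargs = {
    "min_value": "$parameter.row_count_range.value.min_value",
    "max_value": "$parameter.row_count_range.value.max_value",
    "mostly": "$variables.mostly",
}
resolved = {key: resolve(value, namespaces) for key, value in kwargs.items()}
print(resolved)
# {'min_value': 10000, 'max_value': 10000, 'mostly': 1.0}
```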

  • When we put it all together, here is what our config with our single row_count_rule looks like:
variables:
  false_positive_rate: 0.01
  mostly: 1.0
rules:
  row_count_rule:
    domain_builder:
        class_name: TableDomainBuilder
    parameter_builders:
      - parameter_name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 0
        truncate_values:
          lower_bound: 0
    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value.min_value
        max_value: $parameter.row_count_range.value.max_value
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details

5. Running the Profiler

Now let's use our config to profile our data and create a simple Expectation Suite!

First, we'll do some basic set-up: create a DataContext and parse our YAML.

data_context = DataContext()

# Parse the YAML config into a dict
full_profiler_config_dict: dict = yaml.load(profiler_config)

Next, we'll instantiate our Profiler, passing in our config and our DataContext.

# Instantiate Profiler
profiler: Profiler = Profiler(
    profiler_config=full_profiler_config_dict,
    data_context=data_context,
)

Finally, we'll run profiler.profile() and save the resulting suite to a variable.

suite = profiler.profile(expectation_suite_name="test_suite_name")

Then, we can print our suite so we can see how it looks!

print(suite)

{
    "data_asset_type": None,
    "expectations": [
        {
            "kwargs": {"min_value": 10000, "max_value": 10000, "mostly": 1.0},
            "expectation_type": "expect_table_row_count_to_be_between",
            "meta": {
                "profiler_details": {
                    "metric_configuration": {
                        "metric_name": "table.row_count",
                        "metric_domain_kwargs": {},
                    }
                }
            },
        }
    ],
    "expectation_suite_name": "tmp_suite_Profiler_e66f7cbb",
}
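
The identical min_value and max_value in this output are what you would expect if every Batch in the window has the same row count. The sketch below shows one plausible way to turn per-batch metric values and a false positive rate into a range, using a normal approximation (mean plus or minus z standard deviations). This is an assumption made for illustration and may differ from the estimator NumericMetricRangeMultiBatchParameterBuilder actually uses; estimate_range and its arguments are hypothetical names.

```python
from statistics import NormalDist, mean, pstdev


def estimate_range(values, false_positive_rate, round_decimals=0, lower_bound=None):
    """Estimate a (min_value, max_value) range as mean +/- z * standard deviation."""
    # Two-sided interval: split the false positive rate across both tails.
    z = NormalDist().inv_cdf(1.0 - false_positive_rate / 2.0)
    center = mean(values)
    spread = z * pstdev(values)
    low, high = center - spread, center + spread
    # Optionally clamp the lower end, mirroring truncate_values.lower_bound.
    if lower_bound is not None:
        low = max(low, lower_bound)
    return round(low, round_decimals), round(high, round_decimals)


# Five hypothetical Batches that all contain 10000 rows: zero spread, so min == max.
print(estimate_range([10000] * 5, false_positive_rate=0.01, lower_bound=0))

# Hypothetical row counts that vary from month to month give a wider range.
print(estimate_range([9800, 10100, 10050, 9950, 10100], false_positive_rate=0.01))
```

A smaller false_positive_rate gives a larger z and therefore a wider range, which makes the resulting expectation less likely to fail on data that resembles the profiled Batches.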

6. Adding a Rule for Columns

Let's add one more rule to our Rule-Based Profiler config. This Rule will use the DomainBuilder to populate a list of all of the numeric columns in one Batch of taxi data (in this case, the most recent Batch). It will then use our NumericMetricRangeMultiBatchParameterBuilder looking at the five Batches prior to our most recent Batch to get probable ranges for the min and max values for each of those columns. Finally, it will use those ranges to add two ExpectationConfigurations for each of those columns: expect_column_min_to_be_between and expect_column_max_to_be_between. This rule will go directly below our previous rule.

As before, we will first add the name of our rule, and then specify the DomainBuilder.

  column_ranges_rule:
    domain_builder:
      class_name: SimpleSemanticTypeColumnDomainBuilder
      semantic_types:
        - numeric
      # BatchRequest yielding exactly one batch (March, 2019 trip data)
      batch_request:
        datasource_name: taxi_pandas
        data_connector_name: monthly
        data_asset_name: my_reports
        data_connector_query:
          index: -1

In this case, our DomainBuilder configuration is a bit more complex. First, we are using a SimpleSemanticTypeColumnDomainBuilder. This will take a table, and return a list of all columns that match the semantic_type specified - numeric in our case.

Then, we need to specify a BatchRequest that returns exactly one Batch of data (this is our data_connector_query with index equal to -1). This tells us which Batch to use to get the columns from which we will select our numeric columns. Though we might hope that all our Batches of data have the same columns, in actuality, there might be differences between the Batches, and so we explicitly specify the Batch we want to use here.
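
Conceptually, this DomainBuilder inspects one Batch and emits one Domain per numeric column. The following sketch mimics that idea on a plain dictionary of column values; it is a simplified stand-in, not the logic of SimpleSemanticTypeColumnDomainBuilder, and the column names and values are hypothetical.

```python
from numbers import Number


def numeric_columns(batch):
    """Return names of columns whose values are all numbers (bools excluded)."""
    return [
        name
        for name, values in batch.items()
        if values
        and all(isinstance(v, Number) and not isinstance(v, bool) for v in values)
    ]


# Hypothetical excerpt of a single Batch, keyed by column name.
latest_batch = {
    "vendor_id": [1, 2, 1],
    "passenger_count": [1, 3, 2],
    "trip_distance": [1.5, 2.6, 0.9],
    "store_and_fwd_flag": ["N", "N", "Y"],
}

# One Domain per numeric column, shaped like the kwargs an expectation would receive.
domains = [{"column": name} for name in numeric_columns(latest_batch)]
print(domains)
# [{'column': 'vendor_id'}, {'column': 'passenger_count'}, {'column': 'trip_distance'}]
```

Because the column list comes from a single Batch, a column that exists only in other Batches would not get a Domain, which is why the choice of Batch is made explicit in the config.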

After this, we specify our ParameterBuilders. This is very similar to the specification in our previous rule, except we will be specifying two NumericMetricRangeMultiBatchParameterBuilders to get a probable range for the min_value and max_value of each of our numeric columns. Thus one ParameterBuilder will take the column.min metric_name, and the other will take the column.max metric_name.

    parameter_builders:
      - parameter_name: min_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.min
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
      - parameter_name: max_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2

Finally, we'll put together our Domains and Parameters in our ExpectationConfigurationBuilders:

    expectation_configuration_builders:
      - expectation_type: expect_column_min_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.min_range.value.min_value
        max_value: $parameter.min_range.value.max_value
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.min_range.details
      - expectation_type: expect_column_max_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.max_range.value.min_value
        max_value: $parameter.max_range.value.max_value
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.max_range.details

Putting together our entire config, with both of our Rules, we get:

variables:
  false_positive_rate: 0.01
  mostly: 1.0
rules:
  row_count_rule:
    domain_builder:
        class_name: TableDomainBuilder
    parameter_builders:
      - parameter_name: row_count_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: table.row_count
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 0
        truncate_values:
          lower_bound: 0
    expectation_configuration_builders:
      - expectation_type: expect_table_row_count_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        min_value: $parameter.row_count_range.value.min_value
        max_value: $parameter.row_count_range.value.max_value
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.row_count_range.details
  column_ranges_rule:
    domain_builder:
      class_name: SimpleSemanticTypeColumnDomainBuilder
      semantic_types:
        - numeric
      # BatchRequest yielding exactly one batch (March, 2019 trip data)
      batch_request:
        datasource_name: taxi_pandas
        data_connector_name: monthly
        data_asset_name: my_reports
        data_connector_query:
          index: -1
    parameter_builders:
      - parameter_name: min_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.min
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
      - parameter_name: max_range
        class_name: NumericMetricRangeMultiBatchParameterBuilder
        batch_request:
            datasource_name: taxi_pandas
            data_connector_name: monthly
            data_asset_name: my_reports
            data_connector_query:
              index: "-6:-1"
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
        false_positive_rate: $variables.false_positive_rate
        round_decimals: 2
    expectation_configuration_builders:
      - expectation_type: expect_column_min_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.min_range.value.min_value
        max_value: $parameter.min_range.value.max_value
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.min_range.details
      - expectation_type: expect_column_max_to_be_between
        class_name: DefaultExpectationConfigurationBuilder
        module_name: great_expectations.rule_based_profiler.expectation_configuration_builder
        column: $domain.domain_kwargs.column
        min_value: $parameter.max_range.value.min_value
        max_value: $parameter.max_range.value.max_value
        mostly: $variables.mostly
        meta:
          profiler_details: $parameter.max_range.details

And if we re-instantiate our Profiler with our config which now has two rules, and then we re-run the Profiler, we'll have an updated suite with a table row count expectation for our table, and column min and column max expectations for each of our numeric columns!
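
To make the shape of the updated suite concrete, the sketch below assembles one table-level expectation plus a min and a max expectation for each numeric column. It only illustrates how the two rules combine; build_suite, the column names, and the ranges are hypothetical, and real values would come from the ParameterBuilders.

```python
def build_suite(row_count_range, column_ranges):
    """Combine one table-level expectation with min/max expectations per column."""
    expectations = [
        {
            "expectation_type": "expect_table_row_count_to_be_between",
            "kwargs": {"min_value": row_count_range[0], "max_value": row_count_range[1]},
        }
    ]
    for column, ranges in column_ranges.items():
        for metric in ("min", "max"):
            low, high = ranges[metric]
            expectations.append(
                {
                    "expectation_type": f"expect_column_{metric}_to_be_between",
                    "kwargs": {"column": column, "min_value": low, "max_value": high},
                }
            )
    return expectations


# Hypothetical ranges for two numeric columns.
column_ranges = {
    "passenger_count": {"min": (0, 1), "max": (6, 9)},
    "trip_distance": {"min": (0.0, 0.01), "max": (30.5, 60.2)},
}
example_suite = build_suite((10000, 10000), column_ranges)
print(len(example_suite))  # 1 table expectation + 2 per column = 5
```

With n numeric columns, a suite built this way holds 2n + 1 expectations.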

🚀Congratulations! You have successfully Profiled multi-batch data using a Rule-Based Profiler. Now you can try adding some new rules, or running your Profiler on some other data (remember to change the BatchRequest in your config)!🚀

Additional Notes

To view the full script used in this page, see it on GitHub: