Great Expectations provides a mechanism to automatically generate Expectations, using a feature called a Profiler. A
Profiler builds an Expectation Suite from one or more Data Assets. It usually also validates the data against the
newly-generated Expectation Suite to return a Validation Result. There are several Profilers included with Great
Expectations.
A Profiler makes it possible to quickly create a starting point for generating Expectations about a dataset. For
example, during the init flow, Great Expectations uses the UserConfigurableProfiler to demonstrate important
features of Expectations by creating and validating an Expectation Suite that has several different kinds of
Expectations built from a small sample of data. A Profiler is also critical to generating the Expectation Suites used
You can also extend Profilers to capture organizational knowledge about your data. For example, a team might have a convention that all columns named "id" are primary keys, whereas all columns ending with the suffix "_id" are foreign keys. In that case, when the team first encounters a new dataset that follows the convention, a Profiler could use that knowledge to add an expect_column_values_to_be_unique Expectation to the "id" column (but not, for example, to an "address_id" column).
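As a rough illustration of the naming convention above, the sketch below maps column names to Expectation configurations represented as plain dictionaries. The function name expectations_from_naming_convention and the dictionary layout are assumptions made for this example and are not part of the Great Expectations API; only the Expectation name expect_column_values_to_be_unique comes from the text above.

```python
def expectations_from_naming_convention(column_names):
    """Return one uniqueness Expectation per column named exactly "id".

    Hypothetical helper for illustration only. Columns ending in "_id" are
    foreign keys under the team's convention, so they get no uniqueness
    Expectation.
    """
    configs = []
    for name in column_names:
        if name == "id":  # primary key by convention
            configs.append(
                {
                    "expectation_type": "expect_column_values_to_be_unique",
                    "kwargs": {"column": name},
                }
            )
    return configs


columns = ["id", "address_id", "customer_name"]
suite = expectations_from_naming_convention(columns)
```

Here only the "id" column receives an Expectation; "address_id" is skipped because the comparison is an exact match rather than a suffix match.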
Rule-based profilers let users provide a highly configurable specification, composed of Rules, that builds an Expectation Suite by profiling existing data.
Imagine you have a table of Sales that comes in every month. You could profile last month's data, inspecting it in order to automatically create a number of expectations that you can use to validate next month's data.
A Rule in a rule-based profiler could say something like "Look at every column in my Sales table, and if that column is numeric, add an expect_column_values_to_be_between Expectation to my Expectation Suite, where the min_value for the Expectation is the minimum value for the column, and the max_value for the Expectation is the maximum value for the column."
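The profile-then-validate idea can be sketched in plain Python, without any Great Expectations calls. The functions profile_numeric_ranges and validate, and the sample rows, are hypothetical and exist only to show the flow: learn a range from last month's data, then check next month's data against it.

```python
def profile_numeric_ranges(rows):
    """Learn a min/max range for every numeric column in `rows`.

    `rows` is a list of dicts, one per record. This is an illustrative
    stand-in for profiling, not Great Expectations code.
    """
    expectations = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        # bool is a subclass of int, so exclude it explicitly.
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            expectations[column] = {"min_value": min(values), "max_value": max(values)}
    return expectations


def validate(rows, expectations):
    """Return the columns that have at least one value outside the learned range."""
    failed = []
    for column, bounds in expectations.items():
        if any(not bounds["min_value"] <= row[column] <= bounds["max_value"] for row in rows):
            failed.append(column)
    return failed


last_month = [{"region": "east", "amount": 120.0}, {"region": "west", "amount": 80.0}]
next_month = [{"region": "east", "amount": 95.0}, {"region": "west", "amount": 300.0}]

expectations = profile_numeric_ranges(last_month)
failed_columns = validate(next_month, expectations)
```

In this sample, the "amount" column fails validation for next month because 300.0 lies above the maximum of 120.0 seen last month, while the non-numeric "region" column is never profiled.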
Each rule in a rule-based profiler has three types of components:
- DomainBuilders: A DomainBuilder will inspect some data that you provide to the Profiler, and compile a list of Domains for which you would like to build expectations
- ParameterBuilders: A ParameterBuilder will inspect some data that you provide to the Profiler, and compile a dictionary of Parameters that you can use when constructing your ExpectationConfigurations
- ExpectationConfigurationBuilders: An ExpectationConfigurationBuilder will take the Domains compiled by the DomainBuilder, and assemble ExpectationConfigurations using Parameters built by the ParameterBuilder
In the above example, imagine your table of Sales has twenty columns, of which five are numeric:
- Your DomainBuilder would inspect all twenty columns, and then yield a list of the five numeric columns
- You would specify two ParameterBuilders: one which gets the min of a column, and one which gets a max. Your Profiler would loop over the Domain (or column) list built by the DomainBuilder and use the two ParameterBuilders to get the min and max for each column.
- Then the Profiler loops over the Domains built by the DomainBuilder and uses the ExpectationConfigurationBuilders to add an expect_column_values_to_be_between Expectation for each of these Domains, where the min_value and max_value are the values that we got from the ParameterBuilders.
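The three component types can be modeled as ordinary Python functions to make the division of labor concrete. This is a minimal sketch under stated assumptions: the Domain class, the function names, and the in-memory table are all hypothetical, and none of this is how Great Expectations implements its builders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Hypothetical stand-in for a Domain: here, just a column name."""
    column: str


def numeric_column_domain_builder(table):
    """Inspect every column and return a Domain for each numeric one."""
    domains = []
    for column, values in table.items():
        if values and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        ):
            domains.append(Domain(column=column))
    return domains


def min_parameter_builder(table, domain):
    """Compute the minimum value for the Domain's column."""
    return min(table[domain.column])


def max_parameter_builder(table, domain):
    """Compute the maximum value for the Domain's column."""
    return max(table[domain.column])


def between_expectation_configuration_builder(domain, parameters):
    """Assemble one Expectation configuration from a Domain and its Parameters."""
    return {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {
            "column": domain.column,
            "min_value": parameters["my_column_min"],
            "max_value": parameters["my_column_max"],
        },
    }


def run_rule(table):
    """Loop over Domains, build Parameters, then build configurations."""
    configs = []
    for domain in numeric_column_domain_builder(table):
        parameters = {
            "my_column_min": min_parameter_builder(table, domain),
            "my_column_max": max_parameter_builder(table, domain),
        }
        configs.append(between_expectation_configuration_builder(domain, parameters))
    return configs


sales = {
    "region": ["east", "west", "north"],
    "units": [3, 10, 7],
    "amount": [29.99, 120.0, 75.5],
}
suite = run_rule(sales)
```

With this three-column table, the domain builder selects "units" and "amount", and the rule produces one configuration per selected column.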
In addition to Rules, a rule-based profiler enables you to specify Variables, which are global and can be used in any of the Rules. For instance, you may want to reference the same BatchRequest or the same tolerance in multiple Rules, and declaring these as Variables will enable you to do so.
variables:
  my_last_month_sales_batch_request: # We will use this BatchRequest in our DomainBuilder and both of our ParameterBuilders so we can pinpoint the data to Profile
    datasource_name: my_sales_datasource
    data_connector_name: monthly_sales
    data_asset_name: sales_data
    data_connector_query:
      index: -1
  mostly_default: 0.95 # We can set a variable here that we can reference as the `mostly` value for our expectations below
rules:
  my_rule_for_numeric_columns: # This is the name of our Rule
    domain_builder:
      batch_request: $variables.my_last_month_sales_batch_request # We use the BatchRequest that we specified in Variables above using this $ syntax
      class_name: SemanticTypeColumnDomainBuilder # We use this class of DomainBuilder so we can specify the numeric type below
      semantic_types:
        - numeric
    parameter_builders:
      - parameter_name: my_column_min
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.min # This is the metric we want to get with this ParameterBuilder
        metric_domain_kwargs: $domain.domain_kwargs # This tells us to use the same Domain that is gotten by the DomainBuilder. We could also put a different column name in here to get a metric for that column instead.
      - parameter_name: my_column_max
        class_name: MetricParameterBuilder
        batch_request: $variables.my_last_month_sales_batch_request
        metric_name: column.max
        metric_domain_kwargs: $domain.domain_kwargs
    expectation_configuration_builders:
      - expectation_type: expect_column_values_to_be_between # This is the name of the expectation that we would like to add to our suite
        class_name: DefaultExpectationConfigurationBuilder
        column: $domain.domain_kwargs.column
        min_value: $parameter.my_column_min.value # We can reference the Parameters created by our ParameterBuilders using the same $ notation that we use to get Variables
        max_value: $parameter.my_column_max.value
        mostly: $variables.mostly_default
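One way to think about the $ references in the config above is as dotted lookups into a nested dictionary. The sketch below is only a mental model: the resolve_reference function and the context values are hypothetical, and this document does not describe how Great Expectations resolves these references internally.

```python
def resolve_reference(reference, context):
    """Resolve a "$a.b.c" string by walking nested dicts in `context`.

    Values that are not strings starting with "$" are returned unchanged.
    """
    if not (isinstance(reference, str) and reference.startswith("$")):
        return reference
    value = context
    for key in reference[1:].split("."):
        value = value[key]
    return value


# Assumed state while processing one Domain (the "amount" column).
context = {
    "variables": {"mostly_default": 0.95},
    "domain": {"domain_kwargs": {"column": "amount"}},
    "parameter": {
        "my_column_min": {"value": 80.0},
        "my_column_max": {"value": 120.0},
    },
}

# The reference strings are copied from the expectation_configuration_builders
# entry in the config above.
builder_kwargs = {
    "column": "$domain.domain_kwargs.column",
    "min_value": "$parameter.my_column_min.value",
    "max_value": "$parameter.my_column_max.value",
    "mostly": "$variables.mostly_default",
}

resolved = {key: resolve_reference(value, context) for key, value in builder_kwargs.items()}
```

Under this model, Variables are shared across every Rule, while the domain and parameter entries change each time the Profiler moves to the next Domain.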
You can see another example config containing multiple rules here: alice_user_workflow_verbose_profiler_config.yml
This config is used in the below diagram to provide a better sense of how the different parts of the Profiler config fit together. You can see a larger version of this file here.
- You can try out a tutorial that walks you through the set-up of a Rule-Based Profiler here: How to create a new Expectation Suite using Rule Based Profilers