Validate Data

Expectations describe data assets. Data assets are composed of batches. Validation checks expectations against a batch of data.

In other words, validation is the process of checking whether a batch of data from data asset X conforms to all expectations in expectation suite Y, where expectation suite Y is a collection of expectations you created to specify what a valid batch of data asset X should look like.

To run validation you need a batch of data. To get a batch of data you need:

  • to specify which data asset the batch is from

  • to specify an expectation suite to validate against

This tutorial explains each of these objects, shows how to obtain them, and demonstrates how to execute validation and view its results.

Video

If you prefer videos to written tutorials, James (one of the original core contributors) walks you through this tutorial in a video on YouTube.

0. Open Jupyter Notebook

This tutorial assumes that:

  • you ran great_expectations init and went through the steps covered in the previous tutorial: Run great_expectations init.

  • your current directory is the root of the project where you ran great_expectations init.

The dataset used in this tutorial is a folder with CSV files containing National Provider Identifier (NPI) data that are processed with pandas.

You can either follow the tutorial with the dataset that it uses or you can execute the same steps on your project with your own data.

If you get stuck, find a bug, or want to ask a question, go to our Slack; this is the best way to get help from the contributors and other users.

Validation is typically invoked inside the code of a data pipeline (e.g., an Airflow operator). This tutorial uses a Jupyter notebook as a validation playground.

The great_expectations init command created a great_expectations/notebooks/ folder in your project. The folder contains example notebooks for pandas, Spark and SQL datasources.

If you are following this tutorial using the NPI dataset, open the pandas notebook. If you are working with your dataset, see the instructions for your datasource:

pandas

jupyter notebook great_expectations/notebooks/pandas/validation_playground.ipynb

pyspark

jupyter notebook great_expectations/notebooks/spark/validation_playground.ipynb

SQLAlchemy

jupyter notebook great_expectations/notebooks/sql/validation_playground.ipynb

1. Get a DataContext Object

A DataContext represents a Great Expectations project. It organizes datasources, notification settings, data documentation sites, and storage and access for expectation suites and validation results. The DataContext is configured via a yml file stored in a directory called great_expectations; the configuration file as well as managed expectation suites should be stored in version control.

Instantiating a DataContext loads your project configuration and all its resources.

import great_expectations as ge

context = ge.data_context.DataContext()

To read more about DataContexts, see: DataContexts

2. List Data Assets

A Data Asset is data you can describe with expectations.

pandas

A Pandas datasource generates data assets from Pandas DataFrames or CSV files. In this example the pipeline processes NPI data that it reads from CSV files in the npidata directory into Pandas DataFrames. This is the data you want to describe with expectations. That directory and its files form a data asset, named “npidata” (based on the directory name).

pyspark

A Spark datasource generates data assets from Spark DataFrames or CSV files. The data loaded into a data asset is the data you want to describe and specify with expectations. If this example read CSV files in a directory called npidata into a Spark DataFrame, the resulting data asset would be called “npidata” based on the directory name.

SQLAlchemy

A SQLAlchemy datasource generates data assets from tables, views and query results.

  • If the data resided in a table (or view) in a database, it would be accessible as a data asset with the name of that table (or view).

  • If the data did not reside in a single table and, instead, the example pipeline ran a SQL query that fetched the data (probably from multiple tables), the result set of that query would be accessible as a data asset. The name of this data asset would be up to you (e.g., “npidata” or “npidata_query”).

Great Expectations’ jupyter_ux module has a convenience method that lists all data assets and expectation suites known to a Data Context:

great_expectations.jupyter_ux.list_available_data_asset_names(context)

Here is the output of this method when executed in our example project:

../_images/list_data_assets.png

npidata is the short name of the data asset. Full names of data assets in a DataContext consist of three parts, for example: data__dir/default/npidata. You don’t need to know (yet) how the namespace is managed and the exact meaning of each part. The DataContexts article describes this in detail.

3. Pick a data asset and expectation suite

The previous section showed how to list all data assets and expectation suites in a project.

In this section you choose a data asset name from this list.

The normalize_data_asset_name method converts the short name of a data asset to a full name:

data_asset_name = "npidata"
normalized_data_asset_name = context.normalize_data_asset_name(data_asset_name)
normalized_data_asset_name

Choose the expectation suite you will validate the batch against:

expectation_suite_name = "warning"

3.a. If you don’t have an expectation suite, let’s create a simple one

If you don’t have an expectation suite for this data asset, the notebook’s next cell will create a suite of very basic expectations, so that you have some expectations to play with. The suite will contain an expect_column_to_exist expectation for each column.

If you created an expectation suite for this data asset, you can skip executing the next cell (if you execute it, it will do nothing).
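For reference, here is a rough, hand-rolled sketch of what such a cell accomplishes: load a batch of the data asset (batch loading is covered in the next step) and add one expect_column_to_exist expectation per observed column. The generated notebook cell may use different helpers (for example, the suite may need to be created first with a helper such as create_expectation_suite), so treat this only as an illustration of the idea:

# Illustration only - the generated cell may use different helpers.
batch = context.get_batch(normalized_data_asset_name,
                          expectation_suite_name,
                          context.yield_batch_kwargs(data_asset_name))

# Add one expect_column_to_exist expectation per column observed in the batch
for column in batch.get_table_columns():
    batch.expect_column_to_exist(column)

# Persist the suite back to the project
batch.save_expectation_suite()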

4. Load a batch of data to validate

Expectations describe data assets. Data assets are composed of batches. Validation checks expectations against a batch of data.

For example, a batch could be the most recent day of log data. For a database table, a batch could be the data in that table at a particular time.

In order to validate a batch of data you will load it as a Great Expectations Dataset.

The DataContext’s get_batch method is used to load a batch of a data asset:

batch = context.get_batch(normalized_data_asset_name,
                          expectation_suite_name,
                          batch_kwargs)

Calling this method asks the Context to get a batch of data from the data asset normalized_data_asset_name and attach the expectation suite expectation_suite_name to it. The batch_kwargs argument specifies which batch of the data asset should be loaded.

If you have no preference as to which batch of the data asset should be loaded, use the yield_batch_kwargs method on the data context:

batch_kwargs = context.yield_batch_kwargs(data_asset_name)

This tutorial and its notebook provide a playground for validation. When Great Expectations is integrated into a data pipeline, the pipeline calls GE to validate a specific batch (an input to a pipeline’s step or its output).

How to specify batch_kwargs for fetching a particular batch:

batch_kwargs provide detailed instructions that tell the datasource how to construct a batch. Each datasource accepts different types of batch_kwargs:

pandas

A pandas datasource can accept batch_kwargs that describe either a path to a file or an existing DataFrame. For example, if the data asset is a collection of CSV files in a folder that are processed with Pandas, then a batch could be one of these files. Here is how to construct batch_kwargs that specify a particular file to load:

batch_kwargs = {'path': "PATH_OF_THE_FILE_YOU_WANT_TO_LOAD"}

To instruct get_batch to read CSV files with specific options (e.g., not to interpret the first line as the header or to use a specific separator), add them to the batch_kwargs.

See the complete list of options for Pandas read_csv.

batch_kwargs might look like the following:

{
    "path": "/data/npidata/npidata_pfile_20190902-20190908.csv",
    "partition_id": "npidata_pfile_20190902-20190908",
    "sep": null,
    "engine": "python"
}
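For instance, to load a pipe-separated file that has no header row, you might pass the corresponding reader options alongside the path (the option names are standard pandas read_csv arguments):

batch_kwargs = {
    'path': "PATH_OF_THE_FILE_YOU_WANT_TO_LOAD",
    'sep': '|',       # pipe-separated values
    'header': None    # do not interpret the first line as the header
}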

If you already loaded the data into a Pandas DataFrame, here is how you construct batch_kwargs that instruct the datasource to use your dataframe as a batch:

batch_kwargs = {'df': "YOUR_PANDAS_DF"}
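For example, if your pipeline already read one of the NPI files into a DataFrame, the batch can be constructed from it directly (the file path below just reuses the example path shown above):

import pandas as pd

# Reuses the example file path shown above
df = pd.read_csv("/data/npidata/npidata_pfile_20190902-20190908.csv")

batch_kwargs = {'df': df}
batch = context.get_batch(normalized_data_asset_name,
                          expectation_suite_name,
                          batch_kwargs)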

pyspark

A pyspark datasource can accept batch_kwargs that describe either a path to a file or an existing DataFrame. For example, if the data asset is a collection of CSV files in a folder that are processed with Spark, then a batch could be one of these files. Here is how to construct batch_kwargs that specify a particular file to load:

batch_kwargs = {'path': "PATH_OF_THE_FILE_YOU_WANT_TO_LOAD"}

To instruct get_batch to read CSV files with specific options (e.g., not to interpret the first line as the header or to use a specific separator), add them to the batch_kwargs.

See the complete list of options for Spark DataFrameReader.
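For example, mirroring the pandas case above, the options might be passed alongside the path like this (this assumes the extra keys are forwarded to the Spark DataFrameReader; check your datasource configuration if in doubt):

batch_kwargs = {
    'path': "PATH_OF_THE_FILE_YOU_WANT_TO_LOAD",
    'header': True,   # treat the first line as a header
    'sep': '|'        # pipe-separated values
}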

SQLAlchemy

A SQLAlchemy datasource can accept batch_kwargs that instruct it to load a batch from a table, a view, or a result set of a query:

If you would like to validate an entire table (or a view) in your database’s default schema:

batch_kwargs = {'table': "YOUR TABLE NAME"}

If you would like to validate an entire table or view from a non-default schema in your database:

batch_kwargs = {'table': "YOUR TABLE NAME", "schema": "YOUR SCHEMA"}

If you would like to validate using a query to construct a temporary table:

batch_kwargs = {'query': 'SELECT YOUR_ROWS FROM YOUR_TABLE'}
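Whichever form you choose, the resulting batch_kwargs are passed to get_batch in the same way as before (the query below is just the placeholder from above):

batch_kwargs = {'query': 'SELECT YOUR_ROWS FROM YOUR_TABLE'}  # placeholder query
batch = context.get_batch(normalized_data_asset_name,
                          expectation_suite_name,
                          batch_kwargs)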

The examples of batch_kwargs above can also be the outputs of “generators” used by Great Expectations. You can read about the default Generators’ behavior and how to implement additional generators in this article: Batch Generators.


Now you have the contents of one of the files loaded as a batch of the data asset data__dir/default/npidata.

5. Set a Run Id

A run_id links together validations of different data assets, making it possible to track “runs” of a pipeline and follow data assets as they are transformed, joined, annotated, enriched, or evaluated. The run id can be any string; by default, Great Expectations will use an ISO 8601-formatted UTC datetime string.

The default run_id generated by Great Expectations is built using the following code:

import datetime

run_id = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S.%fZ")

When you integrate validation in your pipeline, your pipeline runner probably has a run id that can be inserted here to make the integration smoother.
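For example, instead of generating a timestamp you might reuse an identifier your scheduler already assigns to the run (the value below is purely illustrative):

# Purely illustrative: reusing an orchestrator-provided run identifier keeps
# all validations from one pipeline run grouped under the same run_id.
run_id = "airflow__npidata_nightly__2019-09-02T00:00:00"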

6. Validate the batch

Validation evaluates your expectations against the given batch and produces a report that describes observed values and any places where expectations are not met. To validate the batch of data, call the validate() method on the batch:

validation_result = batch.validate(run_id=run_id)

In a data pipeline you may take specific actions based on the result of the validation.

A common pattern is to check the validation_result’s success key (True if the batch meets all the expectations in the expectation suite), and stop or issue a warning in the code in case of failure:

import logging
logger = logging.getLogger(__name__)

if validation_result["success"]:
    logger.info("This file meets all expectations from a valid batch of {0:s}".format(str(data_asset_name)))
else:
    logger.warning("This file is not a valid batch of {0:s}".format(str(data_asset_name)))

The validation_result object has detailed information about every expectation in the suite that was used to validate the batch: whether the batch met the expectation and even more details if it did not. You can read more about the result object’s structure here: Validation Results.
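For example, to quickly list the expectations that were not met, you can iterate over the per-expectation entries in the result (the keys used below follow the structure documented in Validation Results):

for result in validation_result["results"]:
    if not result["success"]:
        config = result["expectation_config"]
        print(config["expectation_type"], config["kwargs"])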

You can print this object out:

import json

print(json.dumps(validation_result, indent=4))

Here is what a part of this object looks like:

../_images/validation_playground_result_json.png

Don’t panic! This blob of JSON is meant for machines. Data Docs is a compiled HTML view of both expectation suites and validation results that is far more suitable for humans. You will see how easy it is to build them in the next sections.

7. Validation Operators

The validate() method evaluates one batch of data against one expectation suite and returns a dictionary of validation results. This is sufficient when you explore your data and get to know Great Expectations.

When deploying Great Expectations in a real data pipeline, you will typically discover these additional needs:

  • Validating a group of batches that are logically related (e.g. Did all my salesforce integrations work last night?).

  • Validating a batch against several expectation suites (e.g. Did my nightly clickstream event job have any critical failures I need to deal with asap or warnings I should investigate later?).

  • Doing something with the validation results (e.g., saving them for a later review, sending notifications in case of failures, etc.).

Validation Operators provide a convenient abstraction for both bundling the validation of multiple expectation suites and the actions that should be taken after the validation. See the Validation Operators And Actions Introduction for more information.

An instance of ActionListValidationOperator named action_list_operator is configured in the default great_expectations.yml configuration file. This operator validates each batch in the list passed as the assets_to_validate argument to its run method against the expectation suite included within that batch, and then invokes a list of configured actions on every validation result.
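To invoke this operator from the notebook, you would call something like the following on the DataContext (a hedged sketch; the exact method name and signature may vary across releases):

results = context.run_validation_operator(
    "action_list_operator",
    assets_to_validate=[batch],
    run_id=run_id)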

Below is the operator’s configuration snippet in the great_expectations.yml file:

action_list_operator:
  class_name: ActionListValidationOperator
  action_list:
    - name: store_validation_result
      action:
        class_name: StoreAction
    - name: store_evaluation_params
      action:
        class_name: ExtractAndStoreEvaluationParamsAction
    - name: update_data_docs
      action:
        class_name: UpdateDataDocsAction
    - name: send_slack_notification_on_validation_result
      action:
        class_name: SlackNotificationAction
        # put the actual webhook URL in the uncommitted/config_variables.yml file
        slack_webhook: ${validation_notification_slack_webhook}
        notify_on: all # possible values: "all", "failure", "success"
        renderer:
          module_name: great_expectations.render.renderer.slack_renderer
          class_name: SlackRenderer

We will show how to use the two actions most commonly used with this operator:

Save Validation Results

The DataContext object provides a configurable validations_store where GE can store validation_result objects for subsequent evaluation and review. By default, the DataContext stores results in the great_expectations/uncommitted/validations directory. To specify a different directory or use a remote store such as S3 or GCS, edit the stores section of the DataContext configuration:

stores:
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: FixedLengthTupleS3Backend
      bucket: my_bucket
      prefix: my_prefix

Validation results will be stored according to the same hierarchical namespace used to refer to data assets elsewhere in the context, and will have the run_id prepended: base_location/run_id/datasource_name/generator_name/generator_asset/expectation_suite_name.json.
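To make this concrete, the result of validating the “warning” suite from this tutorial against the S3 store above would be written to a key shaped roughly like this (the run_id value is illustrative):

my_prefix/20190903T052230.000000Z/data__dir/default/npidata/warning.json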

Removing the store_validation_result action from the action_list_operator configuration will disable automatically storing validation_result objects.

Send a Slack Notification

The last action in the action list of the Validation Operator above sends notifications using a user-provided callback function based on the validation result.

- name: send_slack_notification_on_validation_result
  action:
    class_name: SlackNotificationAction
    # put the actual webhook URL in the uncommitted/config_variables.yml file
    slack_webhook: ${validation_notification_slack_webhook}
    notify_on: all # possible values: "all", "failure", "success"
    renderer:
      module_name: great_expectations.render.renderer.slack_renderer
      class_name: SlackRenderer

GE includes a Slack-based notification in the base package. To enable Slack notifications for validation results, specify the Slack webhook URL in the uncommitted/config_variables.yml file:

validation_notification_slack_webhook: https://slack.com/your_webhook_url

8. View the Validation Results in Data Docs

Data Docs compiles raw Great Expectations objects including Expectations and Validations into structured documents such as HTML documentation. By default the HTML website is hosted on your local filesystem. When you are working in a team, the website can be hosted in the cloud (e.g., on S3) and serve as the shared source of truth for the team working on the data pipeline.

Read more about the capabilities and configuration of Data Docs here: Data Docs.

One of the actions executed by the validation operator in the previous section rendered the validation result as HTML and added this page to the Data Docs site.

You can open the page programmatically and examine the result:

context.open_data_docs()
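If you ever need to rebuild the whole site from the stored expectation suites and validation results (for example, after changing the Data Docs configuration), the DataContext also exposes a build_data_docs() method; a minimal sketch (availability may depend on your version):

context.build_data_docs()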

Congratulations!

Now you know how to validate a batch of data.

What’s next? Check out the collection of tutorials that walks you through a variety of useful Great Expectations workflows: Tutorials.