Expectations are the key concept in Great Expectations.
Each Expectation is a declarative, machine-verifiable assertion about the expected format, content, or behavior of your data. Great Expectations comes with dozens of built-in Expectations, and it’s possible to develop your own custom Expectations, too.
Admonition from Mr. Dickens.
“Take nothing on its looks; take everything on evidence. There’s no better rule.”
The CLI will help you create your first Expectation Suite. Suites are simply collections of Expectations. In order to create a new suite, we will use the built-in profiler to automatically create an Expectation Suite called
taxi.demo. Type the following into your terminal:
great_expectations --v3-api suite new
You will see the following output:
Using v3 (Batch Request) APIHow would you like to create your Expectation Suite? 1. Manually, without interacting with a sample batch of data (default) 2. Interactively, with a sample batch of data 3. Automatically, using a profiler: 3 A batch of data is required to edit the suite - let's help you to specify it.Select data_connector 1. default_inferred_data_connector_name 2. default_runtime_data_connector_name: 1 Which data asset (accessible by Data Connector "taxi_data_example_data_connector") would you like to use? 1. yellow_tripdata_sample_2019-01.csv 2. yellow_tripdata_sample_2019-02.csv: 1 Name the new Expectation Suite [yellow_tripdata_sample_2019-01.csv.warning]: taxi.demo When you run this notebook, Great Expectations will store these expectations in a new Expectation Suite "taxi.demo" here: <path_to_project>/great_expectations/expectations/taxi/demo.json Would you like to proceed? [Y/n]: <press Enter>
This will open up a Jupyter Notebook that helps you create the new suite. Before diving into the notebook, let’s first explain what we just did.
What just happened?
You may now wonder why we chose the first file in this step. Here’s an explanation: Recall that our data directory contains two CSV files:
yellow_tripdata_sample_2019-01contains the January 2019 taxi data. Since we want to build an Expectation Suite based on what we know about our taxi data from the January 2019 data set, we want to use it for profiling.
yellow_tripdata_sample_2019-02contains the February 2019 data, which we consider the “new” data set that we want to validate before using it in production. We’ll use it later when showing you how to validate data.
Makes sense, right?
After selecting the table, Great Expectations will open a Jupyter Notebook which will take you through the next part of this workflow.
Don’t execute the Jupyter Notebook cells just yet!
Notebooks are a simple way of interacting with the Great Expectations Python API. You could also just write all this in plain Python code, but for convenience, Great Expectations provides you some boilerplate code in notebooks.
Since notebooks are often less permanent, creating Expectations in a notebook also helps reinforce that the source of truth about Expectations is the Expectation Suite, not the code that generates the Expectations.
Let’s take a look through the notebook and see what’s happening in each cell:
- The first cell does several things: It imports all the relevant libraries, loads a Data Context, and creates a
Validator, which combines a Batch Request to define your batch of data, and an Expectation Suite.
- The second cell allows you to specify which columns you want to ignore when creating Expectations. Remember how we want to add some tests on the
passenger_countcolumn to ensure that its values range between 1 and 6? Let’s comment just this one line to include it:
ignored_columns = [ "vendor_id", "pickup_datetime", "dropoff_datetime", # "passenger_count", "trip_distance",
The next cell is where you configure a
UserConfigurableProfilerand instantiate it, which will then profile the data and create the relevant Expectations to add to your
taxi.demosuite. You can leave these defaults as-is for now - learn more about the available parameters here.
The last cell does several things again: It saves the Expectation Suite to disk, runs the validation against the loaded data batch, and then builds and opens Data Docs, so you can look at the Validation Results. We will explain the validation step later in the “Validate your data” section.
Let’s execute all the cells and wait for Great Expectations to open a browser window with Data Docs. Go to the next step in the tutorial for an explanation of what you see in Data Docs!