GX V0 to V1 Migration Guide
Overview
This guide to migrating your Great Expectations V0 configurations to V1 covers every Great Expectations domain object found in V0 and shows how it maps to its equivalent V1 domain object.
GX Cloud Context Users
If you are a GX Cloud user, you can try out GX V1 immediately! Cloud does the translation of your configurations for you. Your context = gx.get_context() call will return the updated configurations. You can inspect your configuration objects by calling all() on the appropriate domain namespace. For example, context.data_sources.all() will list all of your Data Sources that have been automatically translated to V1. If there are incompatible configurations, they are filtered out of this list; you can retrieve them by using a GX >=0.18.19 Python client. If you need to translate any of these missing configurations to 1.0, look at the API sections below for the domain object you are interested in to see a comparison of the V0 and V1 API calls and determine what you need to do to translate the configuration.
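For example, here is a minimal sketch of inspecting the translated configurations from a GX Cloud context. It assumes your Cloud credentials are set as environment variables and that the suites and checkpoints namespaces expose the same all() listing as data_sources:

import great_expectations as gx

# Connects to GX Cloud when GX_CLOUD_ACCESS_TOKEN and GX_CLOUD_ORGANIZATION_ID are set
context = gx.get_context()

# List configurations that Cloud has automatically translated to V1
print(context.data_sources.all())
print(context.suites.all())
print(context.checkpoints.all())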
GX File Context
In each section below you will see a side-by-side comparison of the configuration files for each domain object, along with a description of how they have changed and which features have been removed and added. You can use this as a basis for translating your configuration objects from V0 to V1.
Domain objects
Expectation Suites and Expectations
In GX 0.X and in GX 1.0, every Expectation Suite has its own configuration file. The path to these files in the Great Expectations project directory is:
gx/expectations/<suite_name>.json
Configuration file differences
Here is a side-by-side comparison of a suite called suite_for_yellow_tripdata:
- V0 Expectation Suite Configuration
- V1 Expectation Suite Configuration
{
"expectation_suite_name": "suite_for_yellow_tripdata",
"data_asset_type": "CSVAsset",
"evaluation_parameters": {
"parameter_name": "value"
},
"expectations": [
{
"expectation_type": "expect_column_values_to_be_between",
"kwargs": {
"column": "passenger_count",
"max_value": 4,
"min_value": 0
},
"meta": {}
},
{
"expectation_type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "VendorID",
"value_set": [
1,
2,
3,
4
]
},
"meta": {}
}
],
"ge_cloud_id": null,
"meta": {
"foo": "bar",
"great_expectations_version": "0.18.19"
}
}
{
"name": "suite_for_yellow_tripdata",
"suite_parameters": {
"parameter_name": "value"
},
"expectations": [
{
"type": "expect_column_values_to_be_between",
"kwargs": {
"column": "passenger_count",
"max_value": 4.0,
"min_value": 0.0
},
"meta": {},
"id": "24dc475c-38a3-4234-ab47-b13d0f233242"
},
{
"type": "expect_column_values_to_be_in_set",
"kwargs": {
"column": "VendorID",
"value_set": [
1,
2,
3,
4
]
},
"meta": {},
"id": "d8b3b4e9-296f-4dd5-bd29-aac6a00cba1c"
}
],
"id": "77373d6f-3561-4d62-b150-96c36dccbe55",
"meta": {
"foo": "bar",
"great_expectations_version": "1.0.0"
},
"notes": "This is a new field."
}
expectation_suite_name: This is now called name and has the name of the suite.
data_asset_type: This has been removed. Expectation suites can be associated with any asset type.
evaluation_parameters: This has been renamed to suite_parameters. The contents are unchanged.
expectations: This is a list of expectations. The expectation keys have changed as follows:
expectation_type: This has been changed to type.
kwargs: This is unchanged.
meta: A dictionary that a user can populate with whatever metadata they would like. The notes key that Great Expectations Cloud used has been pulled out into a top-level key.
id: This new field introduced in 1.0 can be any arbitrary, unique UUID. When migrating, generate and add a UUID.
notes (new field): This new top-level field replaces meta.notes. This is consumed by Great Expectations Cloud to display user notes on the Cloud UI.
ge_cloud_id: This is now id and is a required field. Migrators can generate a unique, arbitrary UUID and add it.
meta: The format is unchanged.
notes: This is new in 1.0 and is an arbitrary string.
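The new id fields only need to be unique; a quick way to generate one with the Python standard library when hand-editing a configuration:

import uuid

# Generate an arbitrary, unique UUID for each new "id" field
print(str(uuid.uuid4()))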
Expectation Suite API Calls
The suites above were created with the following API calls. This example demonstrates how to create an equivalent suite to your V0 suite in V1.
- V0 Expectation Suite API
- V1 Expectation Suite API
suite = context.add_expectation_suite(
expectation_suite_name="suite_for_yellow_tripdata",
meta={"foo": "bar", "notes": "Here are some suite notes."},
evaluation_parameters={"parameter_name": "value"},
data_asset_type="CSVAsset", # V1 no longer supports this argument, expectations are type independent
)
validator = context.get_validator(batch_request=asset.build_batch_request(), expectation_suite_name="suite_for_yellow_tripdata")
validator.expect_column_values_to_be_between(column="passenger_count", min_value=0, max_value=4)
validator.expect_column_values_to_be_in_set(column="VendorID", value_set=[1,2,3,4])
validator.save_expectation_suite(discard_failed_expectations=False)
suite = context.suites.add(
gx.ExpectationSuite(
name="suite_for_yellow_tripdata",
meta={"foo": "bar"},
suite_parameters={"parameter_name": "value"},
notes="Here are some suite notes.",
id="77373d6f-3561-4d62-b150-96c36dccbe55",
)
)
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(column="passenger_count", min_value=0, max_value=4))
suite.add_expectation(gxe.ExpectColumnValuesToBeInSet(column="VendorID", value_set=[1,2,3,4]))
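As a follow-up, here is a minimal sketch of retrieving and updating the migrated suite later. It assumes context.suites.get and suite.save are available in your 1.0 release:

# Retrieve the suite by name and update its top-level notes field
suite = context.suites.get("suite_for_yellow_tripdata")
suite.notes = "Updated notes for this suite."
suite.save()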
Data Sources and Data Assets
Data Source configurations are stored in the YAML file gx/great_expectations.yml, in the top-level block whose key is fluent_datasources. We'll walk through examples of different Data Source configurations in V0 and V1 so you can see how to translate between the two.
Pandas API
Pandas Filesystem Data
Here is a side-by-side comparison of a Data Source called pandas_fs_ds with 4 assets called yearly_taxi_data, monthly_taxi_data, daily_taxi_data, and arbitrary_taxi_data.
- V0 Pandas Filesystem Config
- V1 Pandas Filesystem Config
fluent_datasources:
pandas_fs_ds:
type: pandas_filesystem
assets:
yearly_taxi_data:
type: csv
batching_regex: sampled_yellow_tripdata_(?P<year>\d{4})\.csv
monthly_taxi_data:
type: csv
batching_regex: sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv
daily_taxi_data:
type: csv
batching_regex: sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv
arbitrary_taxi_data:
type: csv
batching_regex: sampled_yellow_tripdata_(?P<code>\w+)\.csv
base_directory: data
fluent_datasources:
pandas_fs_ds:
type: pandas_filesystem
id: 2ea309bf-bb5f-421b-ab6b-ea1cc9e70c8e
assets:
taxi_data:
type: csv
id: 34b98eca-790f-4504-ab4b-b65bc128b5ee
batch_definitions:
yearly_batches:
id: a04f8071-33d9-4834-b667-e3d8c2ca70aa
partitioner:
regex: sampled_yellow_tripdata_(?P<year>\d{4})\.csv
sort_ascending: true
monthly_batches:
id: f07aa73d-bf56-438e-9dc2-0d05fb7d32a1
partitioner:
regex: sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv
sort_ascending: true
param_names:
- year
- month
daily_batches:
id: 37b4b2eb-4b37-46c6-b51c-f2d21ba0e6d6
partitioner:
regex: sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv
sort_ascending: true
param_names:
- year
- month
- day
base_directory: data
In 0.X, a Data Source represents where the data lives and the execution engine (e.g. reading data from the local filesystem using pandas) and a Data Asset represents the data file format and how the data should be partitioned (e.g. a parameterized regex which matches file names). In 1.0, the Data Source has the same meaning. However, the Data Asset now only represents the data file format and there is a new concept, the Batch Definition, which represents how the data is partitioned. This manifests as an extra layer in the YAML asset block.
pandas_fs_ds (example): The keys below fluent_datasources are the names of the Data Sources. This is unchanged.
type: The type of Data Source. This is unchanged.
assets: A list of the Data Assets. Each key is an asset name in both V0 and V1. The asset value is different. In V0 the nested keys are:
type: This is unchanged
batching_regex: This has been replaced with batch_definitions. You will notice that the regex now lives in the partitioner's regex field. The batch_definitions configuration format is:
yearly_batches (example Batch Definition name): These keys are the names of the batch definitions.
id: This is an arbitrary UUID and can be chosen to be any unique UUID.
partitioner: This is a key with information about how the batch is defined.
regex: This is the regex previously living on the asset, keyed by batching_regex.
sort_ascending: A boolean. true if the batch order is increasing in time, false if the ordering is decreasing in time. Previously in V0 one could specify an order_by field on the asset which could sort the different date components in different orders (e.g. year could be sorted increasing in time while month could be sorted decreasing in time). This is no longer supported.
param_names: This is a list of the parameter names, which will be identical to the named matches from the regex. That is, the items will be year, month, or day. If this list would only contain year, it can be excluded from the configuration file.
id: This is a new field on the asset and is an arbitrary UUID. If migrating you can pick any unique UUID.
base_directory: The path to the data files. This is unchanged.
id: This is a new field on the Data Source and is an arbitrary UUID. If migrating you can pick any unique UUID.
We no longer support arbitrary batching regexes. Batches must be defined by one of our temporal batch definitions which are yearly, monthly, or daily.
Pandas Filesystem Creation via API
- V0 Pandas Filesystem Creation via API
- V1 Pandas Filesystem Creation via API
# Pandas Filesystem Data Source
datasource = context.sources.add_pandas_filesystem(name="pd_fs_ds", base_directory="data")
# Data Assets
yearly = datasource.add_csv_asset(name="yearly_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})\.csv")
monthly = datasource.add_csv_asset(name="monthly_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
daily = datasource.add_csv_asset(name="daily_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv")
arbitrary = datasource.add_csv_asset(name="arbitrary_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<code>\w+)\.csv")
# Pandas Filesystem Data Source
data_source = context.data_sources.add_pandas_filesystem(name="pd_fs_ds", base_directory="data")
# CSV Data Asset
file_csv_asset = data_source.add_csv_asset(name="taxi_data")
# Batch Definitions
yearly = file_csv_asset.add_batch_definition_yearly(name="yearly_batches", regex=r"sampled_yellow_tripdata_(?P<year>\d{4})\.csv")
monthly = file_csv_asset.add_batch_definition_monthly(name="monthly_batches", regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
daily = file_csv_asset.add_batch_definition_daily(name="daily_batches", regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv")
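To verify the translated configuration, here is a minimal sketch of fetching a single batch from one of the new Batch Definitions. It assumes BatchDefinition.get_batch accepts batch_parameters keyed by the regex group names:

# Fetch the January 2019 batch from the monthly Batch Definition defined above
batch = monthly.get_batch(batch_parameters={"year": "2019", "month": "01"})
print(batch.head())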
Pandas Dataframe
- V0 Pandas Dataframe Config
- V1 Pandas Dataframe Config
fluent_datasources:
pd_df_ds:
type: pandas
assets:
taxi_dataframe_asset:
type: dataframe
batch_metadata: {}
fluent_datasources:
pd_df_ds:
type: pandas
assets:
taxi_dataframe_asset:
type: dataframe
batch_metadata: {}
batch_definitions:
taxi_dataframe_batch_def:
id: bf0de640-7791-4654-86b0-5f737319e993
partitioner:
id: 352b392d-f0a5-4c7c-911f-fd68903599e0
id: 4e0a4b9c-efc2-40e8-8114-6a45ac697554
In both V0 and V1, a pandas Data Source reads in data from a pandas dataframe. In V1 there is a concept of a Batch Definition that is used to partition data into batches. For a pandas dataframe the only Batch Definition currently available is the whole dataframe Batch Definition.
pd_df_ds (example): The keys below fluent_datasources are the names of the Data Sources. This is unchanged.
assets: A list of the Data Assets. Each key is an asset name in both V0 and V1. The asset value is different.
type: The type of Data Source. This is unchanged.
batch_metadata: Arbitrary key/value pairs used to annotate the Data Asset. In V1 this is unchanged; it still describes the asset.
batch_definitions: This is new in V1. There is only 1 option here. The key is the name of the Batch Definition. It has 2 fields:
id: An arbitrary UUID. Migrators can assign any unique UUID.
partitioner: This is left empty since the whole dataframe is the only supported option.
id: In V1, the asset has a unique ID. Migrators can assign any unique UUID.
id: In V1, the Data Source has a unique ID. Migrators can assign any unique UUID.
Pandas Dataframe Creation via API
- V0 Pandas Dataframe Creation via API
- V1 Pandas Dataframe Creation via API
dataframe_ds = context.sources.add_pandas(name="pd_df_ds")
dataframe_asset = dataframe_ds.add_dataframe_asset(name="taxi_dataframe_asset")
dataframe_ds = context.data_sources.add_pandas(name="pd_df_ds")
dataframe_asset = dataframe_ds.add_dataframe_asset(name="taxi_dataframe_asset")
dataframe_bd = dataframe_asset.add_batch_definition_whole_dataframe(name="taxi_dataframe_batch_def")
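In V1 the actual dataframe is supplied at runtime rather than stored in the configuration. Here is a minimal sketch, assuming the dataframe is passed through batch_parameters when fetching a batch (the file path is hypothetical):

import pandas as pd

# Any pandas dataframe works here; the CSV path is only an example
df = pd.read_csv("data/sampled_yellow_tripdata_2019-01.csv")
batch = dataframe_bd.get_batch(batch_parameters={"dataframe": df})
print(batch.head())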
Snowflake API
Here is a side-by-side comparison of both a V0 Snowflake table Data Asset and a query Data Asset to their equivalents in V1. We walk through all the currently supported V1 Batch Definitions: yearly, monthly, daily, and whole table.
- V0 Snowflake Config
- V1 Snowflake Config
fluent_datasources:
snowflake_ds:
type: snowflake
assets:
yearly_taxi_data:
type: table
order_by:
- key: year
reverse: false
batch_metadata: {}
splitter:
column_name: pickup_datetime
method_name: split_on_year
table_name: TAXI_DATA_ALL_SAMPLES
schema_name:
monthly_taxi_data:
type: table
order_by:
- key: year
reverse: true
- key: month
reverse: true
batch_metadata: {}
splitter:
column_name: pickup_datetime
method_name: split_on_year_and_month
table_name: TAXI_DATA_ALL_SAMPLES
schema_name:
daily_taxi_data:
type: table
order_by:
- key: year
reverse: false
- key: month
reverse: false
- key: day
reverse: false
batch_metadata: {}
splitter:
column_name: pickup_datetime
method_name: split_on_year_and_month_and_day
table_name: TAXI_DATA_ALL_SAMPLES
schema_name:
all_taxi_data:
type: table
order_by: []
batch_metadata: {}
table_name: TAXI_DATA_ALL_SAMPLES
schema_name:
query_yearly:
type: query
order_by:
- key: year
reverse: false
batch_metadata: {}
splitter:
column_name: pickup_datetime
method_name: split_on_year
query: select * from TAXI_DATA_ALL_SAMPLES
query_monthly:
type: query
order_by:
- key: year
reverse: true
- key: month
reverse: true
batch_metadata: {}
splitter:
column_name: pickup_datetime
method_name: split_on_year_and_month
query: select * from TAXI_DATA_ALL_SAMPLES
query_daily:
type: query
order_by:
- key: year
reverse: false
- key: month
reverse: false
- key: day
reverse: false
batch_metadata: {}
splitter:
column_name: pickup_datetime
method_name: split_on_year_and_month_and_day
query: select * from TAXI_DATA_ALL_SAMPLES
whole_query:
type: query
order_by: []
batch_metadata: {}
query: select * from TAXI_DATA_ALL_SAMPLES
connection_string:
snowflake://<user_login_name>:<password>@<account_identifier>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>
fluent_datasources:
snowflake_ds:
type: snowflake
id: f4ac98d6-dccf-4373-b5f3-ac90ed60b139
assets:
taxi_data:
type: table
id: ad9e8ece-0c14-45bc-bcdd-ef2e40922df4
batch_metadata: {}
batch_definitions:
table_yearly:
id: 75a41bce-da84-425f-a3d3-92acd5c5f7f8
partitioner:
column_name: PICKUP_DATETIME
sort_ascending: true
method_name: partition_on_year
table_monthly:
id: 67ec396a-e7ca-499d-8cb7-84a803d976af
partitioner:
column_name: PICKUP_DATETIME
sort_ascending: false
method_name: partition_on_year_and_month
table_daily:
id: 7d410bd4-ca6d-464d-b82d-3b070e6fd229
partitioner:
column_name: PICKUP_DATETIME
sort_ascending: true
method_name: partition_on_year_and_month_and_day
whole_table:
id: bd88cdd9-a5f4-4bdf-bbf3-e43827996dd0
partitioner:
table_name: TAXI_DATA_ALL_SAMPLES
schema_name: public
query_data:
type: query
id: 44b0eccc-54f2-46e1-a6f9-3558662d4f8a
batch_metadata: {}
batch_definitions:
query_yearly:
id: 7f3909d4-912f-44aa-8140-7ab4e7b13f4e
partitioner:
column_name: PICKUP_DATETIME
sort_ascending: true
method_name: partition_on_year
query_monthly:
id: d0c347fc-03e5-4880-a8e8-1eff04432c2f
partitioner:
column_name: PICKUP_DATETIME
sort_ascending: false
method_name: partition_on_year_and_month
query_daily:
id: 1f6701bd-b470-4ddb-a001-4cc6167ab4d0
partitioner:
column_name: PICKUP_DATETIME
sort_ascending: true
method_name: partition_on_year_and_month_and_day
whole_query:
id: 4817cf80-1727-4aad-b31a-5552efeea441
partitioner:
query: SELECT * FROM TAXI_DATA_ALL_SAMPLES
connection_string:
snowflake://<user_login_name>:<password>@<account_identifier>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>
In 0.X, a Data Source represents where the data is persisted and the execution engine (e.g. the Snowflake database) and a Data Asset represents the data and how the data should be partitioned (e.g. by a datetime column). In 1.0, the Data Source has the same meaning. However, the Data Asset now represents only the data and there is a new concept, the Batch Definition, which represents how the data is partitioned. This manifests as an extra layer in the YAML asset block.
A few configurations are NO LONGER SUPPORTED:
- In V1, we currently only allow batching by time (e.g. year, month, day). In V0 one could split the data into batches in lots of ways. For example, a table could be split by a value in a column or a file regex could contain arbitrary match expressions. We consider non-time-based splitting to represent different conceptual assets that may happen to reside in the same table. For those, one should compute views or use a query asset.
- In V0, one could set the sorting order independently for the year, month, and day dimensions. That is, one could sort ascending by year, but then descending by month and day. In V1 we only allow sorting all the batches in ascending or descending order. For example, one can no longer sort year and month in opposite orders.
snowflake_ds (example): The keys under fluent_datasources are the names of the datasources.
type: The type of Data Source, this is unchanged.
assets: The keys to Data Assets are the names of the assets. In this example yearly_taxi_data is the name of a V0 asset. In V1, the asset is called taxi_data.
type: The type of asset (table or query). This is unchanged.
order_by: This is no longer a key. The information has moved inside the V1 batch definitions under the partitioner.
splitter: This is no longer a key; it has been replaced by batch_definitions. The format for batch definitions is:
table_yearly (an example): The name of the Batch Definition is the key to each configuration.
id: Migrators can assign any unique UUID.
partitioner: Contains the batching and sorting information. This has no value for a “whole table” partitioner.
column_name: The column on which to split the data. This must be a DATETIME field.
sort_ascending: A boolean. true sorts the batches in ascending time order (most recent batch last), while false sorts them in descending order (most recent batch first).
method_name: A string indicating the batching resolution. The options are: partition_on_year, partition_on_year_and_month, partition_on_year_and_month_and_day.
batch_metadata: This is unchanged.
table_name (TableAsset only): The name of the table that holds data for this asset. This is unchanged.
schema_name (TableAsset only): The name of the schema to which the table belongs. In V1 this is now a required field.
query (QueryAsset only): The query to be run to generate the data for this asset. This is unchanged.
id (New in V1): This is a new field in V1 and is a random UUID. Migrators can assign any unique UUID.
id (New in V1): An id for the Data Asset. Migrators can assign any unique UUID.
Snowflake Creation via API
- V0 Snowflake Creation via API
- V1 Snowflake Creation via API
# Create datasource
connection_string = "snowflake://<user_login_name>:<password>@<account_identifier>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>"
snowflake_ds = context.sources.add_snowflake(name="snowflake_ds", connection_string=connection_string)
# Create table assets
yearly_snowflake_asset = snowflake_ds.add_table_asset(name="yearly_taxi_data", table_name="TAXI_DATA_ALL_SAMPLES", order_by=["+year"])
yearly_snowflake_asset.add_splitter_year(column_name="pickup_datetime")
monthly_snowflake_asset = snowflake_ds.add_table_asset(name="monthly_taxi_data", table_name="TAXI_DATA_ALL_SAMPLES", order_by=["-year", "-month"])
monthly_snowflake_asset.add_splitter_year_and_month(column_name="pickup_datetime")
daily_snowflake_asset = snowflake_ds.add_table_asset(name="daily_taxi_data", table_name="TAXI_DATA_ALL_SAMPLES", order_by=["+year", "+month", "+day"])
daily_snowflake_asset.add_splitter_year_and_month_and_day(column_name="pickup_datetime")
whole_table_snowflake_asset = snowflake_ds.add_table_asset(name="all_taxi_data", table_name="TAXI_DATA_ALL_SAMPLES")
# Create query assets
yearly_query_asset = snowflake_ds.add_query_asset(name="query_yearly", query="select * from TAXI_DATA_ALL_SAMPLES", order_by=["+year"])
yearly_query_asset.add_splitter_year(column_name="pickup_datetime")
monthly_query_asset = snowflake_ds.add_query_asset(name="query_monthly", query="select * from TAXI_DATA_ALL_SAMPLES", order_by=["-year", "-month"])
monthly_query_asset.add_splitter_year_and_month(column_name="pickup_datetime")
daily_query_asset = snowflake_ds.add_query_asset(name="query_daily", query="select * from TAXI_DATA_ALL_SAMPLES", order_by=["+year", "+month", "+day"])
daily_query_asset.add_splitter_year_and_month_and_day(column_name="pickup_datetime")
query_whole_table_asset = snowflake_ds.add_query_asset(name="whole_query", query="select * from TAXI_DATA_ALL_SAMPLES")
# Create datasource
connection_string = "snowflake://<user_login_name>:<password>@<account_identifier>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>"
snowflake_ds = context.data_sources.add_snowflake(name="snowflake_ds", connection_string=connection_string)
# Create table asset and batch definitions
table_asset = snowflake_ds.add_table_asset(name="taxi_data", table_name="TAXI_DATA_ALL_SAMPLES")
table_yearly = table_asset.add_batch_definition_yearly(name="table_yearly", column="PICKUP_DATETIME", sort_ascending=True)
table_monthly = table_asset.add_batch_definition_monthly(name="table_monthly", column="PICKUP_DATETIME", sort_ascending=False)
table_daily = table_asset.add_batch_definition_daily(name="table_daily", column="PICKUP_DATETIME", sort_ascending=True)
whole_table = table_asset.add_batch_definition_whole_table(name="whole_table")
# Create query asset and batch definitions
query_asset = snowflake_ds.add_query_asset(name="query_data", query="SELECT * FROM TAXI_DATA_ALL_SAMPLES")
query_yearly = query_asset.add_batch_definition_yearly(name="query_yearly", column="PICKUP_DATETIME", sort_ascending=True)
query_monthly = query_asset.add_batch_definition_monthly(name="query_monthly", column="PICKUP_DATETIME", sort_ascending=False)
query_daily = query_asset.add_batch_definition_daily(name="query_daily", column="PICKUP_DATETIME", sort_ascending=True)
query_whole_table = query_asset.add_batch_definition_whole_table(name="whole_query")
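As with the filesystem assets, individual batches can be fetched from these Batch Definitions at runtime. A minimal sketch, assuming SQL batch parameters are passed as integers:

# Fetch the batch for January 2019 from the monthly table Batch Definition
batch = table_monthly.get_batch(batch_parameters={"year": 2019, "month": 1})
print(batch.head())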
Postgres API
The postgres Data Source/Asset migration from V0 to V1 is almost identical to the Snowflake one. All the fields, and how to migrate them from V0 to V1, are the same, so please refer to the Snowflake section for a description. The differences in values are:
- The type field value is postgres instead of snowflake.
- We are NOT requiring schemas in V1 for postgres table assets.
Here is an example great_expectations.yml fluent_datasources block and the creation of this Data Source and its assets via the API. The provided connection string points to a sample dataset GX maintains.
- V0 Postgres Config
- V1 Postgres Config
fluent_datasources:
postgres_ds:
type: postgres
assets:
yearly_taxi_data:
type: table
order_by:
- key: year
reverse: false
batch_metadata: {}
splitter:
column_name: pickup
method_name: split_on_year
table_name: nyc_taxi_data
schema_name:
monthly_taxi_data:
type: table
order_by:
- key: year
reverse: true
- key: month
reverse: true
batch_metadata: {}
splitter:
column_name: pickup
method_name: split_on_year_and_month
table_name: nyc_taxi_data
schema_name:
daily_taxi_data:
type: table
order_by:
- key: year
reverse: false
- key: month
reverse: false
- key: day
reverse: false
batch_metadata: {}
splitter:
column_name: pickup
method_name: split_on_year_and_month_and_day
table_name: nyc_taxi_data
schema_name:
all_taxi_data:
type: table
order_by: []
batch_metadata: {}
table_name: nyc_taxi_data
schema_name:
query_yearly:
type: query
order_by:
- key: year
reverse: false
batch_metadata: {}
splitter:
column_name: pickup
method_name: split_on_year
query: select * from nyc_taxi_data
query_monthly:
type: query
order_by:
- key: year
reverse: true
- key: month
reverse: true
batch_metadata: {}
splitter:
column_name: pickup
method_name: split_on_year_and_month
query: select * from nyc_taxi_data
query_daily:
type: query
order_by:
- key: year
reverse: false
- key: month
reverse: false
- key: day
reverse: false
batch_metadata: {}
splitter:
column_name: pickup
method_name: split_on_year_and_month_and_day
query: select * from nyc_taxi_data
whole_query:
type: query
order_by: []
batch_metadata: {}
query: select * from nyc_taxi_data
connection_string: postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_example_db
fluent_datasources:
postgres_ds:
type: postgres
id: cc4984f4-dbad-4488-8b0a-47ec47fc294c
assets:
taxi_data:
type: table
id: cb140e3c-d33f-4920-9bfc-2a23de990283
batch_metadata: {}
batch_definitions:
table_yearly:
id: 23e9d1c7-d22e-44f3-b1fa-eb0db1df4ce8
partitioner:
column_name: pickup
sort_ascending: true
method_name: partition_on_year
table_monthly:
id: be939a11-a257-4f9a-83c8-8efd1b25d9c9
partitioner:
column_name: pickup
sort_ascending: false
method_name: partition_on_year_and_month
table_daily:
id: 80fb4af2-2ab2-4a09-a05d-849835677c45
partitioner:
column_name: pickup
sort_ascending: true
method_name: partition_on_year_and_month_and_day
whole_table:
id: 09674cda-573c-400b-9a64-10dcdaecb60b
partitioner:
table_name: nyc_taxi_data
schema_name:
query_data:
type: query
id: 9ad6b38b-2337-4f51-bae2-31afb212c5f2
batch_metadata: {}
batch_definitions:
query_yearly:
id: 56455714-0622-46b0-857f-60d964e1d004
partitioner:
column_name: pickup
sort_ascending: true
method_name: partition_on_year
query_monthly:
id: e96513f1-12b8-419d-a1b9-4aacedfd396d
partitioner:
column_name: pickup
sort_ascending: false
method_name: partition_on_year_and_month
query_daily:
id: 996a2813-6eff-4c8a-88c6-5ca9ab60e275
partitioner:
column_name: pickup
sort_ascending: true
method_name: partition_on_year_and_month_and_day
whole_query:
id: f947cbc4-3d3b-4f92-bee0-4186fdac2b61
partitioner:
query: SELECT * FROM nyc_taxi_data
connection_string: postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_example_db
Postgresql Creation via API
- V0 Postgres Creation via API
- V1 Postgres Creation via API
# Creating a datasource
connection_string = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_example_db"
ds = context.sources.add_postgres(name="postgres_ds", connection_string=connection_string)
# Creating table assets
yearly_asset = ds.add_table_asset(name="yearly_taxi_data", table_name="nyc_taxi_data", order_by=["+year"])
yearly_asset.add_splitter_year(column_name="pickup")
monthly_asset = ds.add_table_asset(name="monthly_taxi_data", table_name="nyc_taxi_data", order_by=["-year", "-month"])
monthly_asset.add_splitter_year_and_month(column_name="pickup")
daily_asset = ds.add_table_asset(name="daily_taxi_data", table_name="nyc_taxi_data", order_by=["+year", "+month", "+day"])
daily_asset.add_splitter_year_and_month_and_day(column_name="pickup")
whole_table_asset = ds.add_table_asset(name="all_taxi_data", table_name="nyc_taxi_data")
# Creating query Assets
yearly_query_asset = ds.add_query_asset(name="query_yearly", query="select * from nyc_taxi_data", order_by=["+year"])
yearly_query_asset.add_splitter_year(column_name="pickup")
monthly_query_asset = ds.add_query_asset(name="query_monthly", query="select * from nyc_taxi_data", order_by=["-year", "-month"])
monthly_query_asset.add_splitter_year_and_month(column_name="pickup")
daily_query_asset = ds.add_query_asset(name="query_daily", query="select * from nyc_taxi_data", order_by=["+year", "+month", "+day"])
daily_query_asset.add_splitter_year_and_month_and_day(column_name="pickup")
query_whole_table_asset = ds.add_query_asset(name="whole_query", query="select * from nyc_taxi_data")
# Creating a datasource
connection_string = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_example_db"
ds = context.data_sources.add_postgres(name="postgres_ds", connection_string=connection_string)
# Creating a table asset and batch definitions
table_asset = ds.add_table_asset(name="taxi_data", table_name="nyc_taxi_data")
table_yearly = table_asset.add_batch_definition_yearly(name="table_yearly", column="pickup", sort_ascending=True)
table_monthly = table_asset.add_batch_definition_monthly(name="table_monthly", column="pickup", sort_ascending=False)
table_daily = table_asset.add_batch_definition_daily(name="table_daily", column="pickup", sort_ascending=True)
whole_table = table_asset.add_batch_definition_whole_table(name="whole_table")
# Creating a query asset and batch definitions
query_asset = ds.add_query_asset(name="query_data", query="SELECT * FROM nyc_taxi_data")
query_yearly = query_asset.add_batch_definition_yearly(name="query_yearly", column="pickup", sort_ascending=True)
query_monthly = query_asset.add_batch_definition_monthly(name="query_monthly", column="pickup", sort_ascending=False)
query_daily = query_asset.add_batch_definition_daily(name="query_daily", column="pickup", sort_ascending=True)
query_whole_table = query_asset.add_batch_definition_whole_table(name="whole_query")
Spark API
Spark Filesystem
This is almost identical to the pandas filesystem case, so we only present a daily and a yearly asset conversion here.
- V0 Spark Filesystem Config
- V1 Spark Filesystem Config
fluent_datasources:
spark_fs:
type: spark_filesystem
assets:
yearly_taxi_data:
type: csv
batching_regex: sampled_yellow_tripdata_(?P<year>\d{4})\.csv
daily_taxi_data:
type: csv
batching_regex: sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv
spark_config:
spark.executor.memory: 4g
persist: true
base_directory: data
fluent_datasources:
spark_fs:
type: spark_filesystem
id: 62a7c671-8f2a-468c-be53-a82576d7b436
assets:
taxi_data:
type: csv
id: 78d5ccc2-1697-490f-886a-c9672d5548c6
batch_definitions:
yearly_batches:
id: 4a0ff04f-a9fe-4c36-b680-0b1c61f4e0c2
partitioner:
regex: sampled_yellow_tripdata_(?P<year>\d{4})\.csv
sort_ascending: true
daily_batches:
id: b2e056fe-6f1d-4fdc-ab69-75d3a19f1a44
partitioner:
regex: sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv
sort_ascending: true
param_names:
- year
- month
- day
spark_config:
spark.executor.memory: 4g
persist: true
base_directory: data
Spark Filesystem API
- V0 Spark Filesystem API
- V1 Spark Filesystem API
import great_expectations as gx
context = gx.get_context(mode="file")
datasource = context.sources.add_spark_filesystem(name="spark_fs", base_directory="data", spark_config={"spark.executor.memory": "4g"}, persist=True)
yearly = datasource.add_csv_asset(name="yearly_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})\.csv")
daily = datasource.add_csv_asset(name="daily_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv")
import great_expectations as gx
context = gx.get_context(mode="file")
data_source = context.data_sources.add_spark_filesystem(name="spark_fs", base_directory="data", spark_config={"spark.executor.memory": "4g"}, persist=True)
asset = data_source.add_csv_asset(name="taxi_data")
yearly = asset.add_batch_definition_yearly(name="yearly_batches", regex=r"sampled_yellow_tripdata_(?P<year>\d{4})\.csv")
daily = asset.add_batch_definition_daily(name="daily_batches", regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv")
Spark Dataframe
Here is a side-by-side comparison of the Spark dataframe data source configuration.
- V0 Spark Dataframe Config
- V1 Spark Dataframe Config
fluent_datasources:
spark_ds:
type: spark
assets:
taxi_dataframe_asset:
type: dataframe
batch_metadata: {}
spark_config:
spark.executor.memory: 4g
force_reuse_spark_context: true
persist: true
fluent_datasources:
spark_ds:
type: spark
id: 134de28d-bfdc-4980-aa2e-4f59788afef3
assets:
taxi_dataframe_asset:
type: dataframe
id: 4110d2ff-5711-47df-a4be-eaefc2a638b4
batch_metadata: {}
batch_definitions:
taxi_dataframe_batch_def:
id: 76738b8b-28ab-4857-aa98-f0ff80c8f137
partitioner:
spark_config:
spark.executor.memory: 4g
force_reuse_spark_context: true
persist: true
Spark dataframe API
- V0 Spark Dataframe API
- V1 Spark Dataframe API
import great_expectations as gx
context = gx.get_context(mode="file")
dataframe_ds = context.sources.add_spark(name="spark_ds", spark_config={"spark.executor.memory": "4g"}, force_reuse_spark_context=True, persist=True)
dataframe_asset = dataframe_ds.add_dataframe_asset(name="taxi_dataframe_asset")
import great_expectations as gx
context = gx.get_context(mode="file")
dataframe_ds = context.data_sources.add_spark(name="spark_ds", spark_config={"spark.executor.memory": "4g"}, force_reuse_spark_context=True, persist=True)
dataframe_asset = dataframe_ds.add_dataframe_asset(name="taxi_dataframe_asset")
dataframe_bd = dataframe_asset.add_batch_definition_whole_dataframe(name="taxi_dataframe_batch_def")
Spark Directory Asset
Spark directory assets are different from our other dataframe Data Assets. These assets pull all the files from a directory into a single dataframe. Then, as with SQL Data Sources, one specifies a column when adding the Batch Definition. This column will be used to split the dataframe into batches.
For this example all the data files live in the directory data/data2/ relative to our project directory.
In V0, we split the data based on an exact string. In V1, our batch definitions are all based on datetime (e.g. batches are by day, month, or year).
- V0 Spark Directory Asset Config
- V1 Spark Directory Asset Config
fluent_datasources:
spark:
type: spark_filesystem
assets:
spark_asset:
type: directory_csv
header: true
data_directory: data2
spark_config:
spark.executor.memory: 4g
persist: true
base_directory: data
fluent_datasources:
spark:
type: spark_filesystem
id: a35e995d-dd60-45e4-90f0-061d2bda6544
assets:
spark_asset:
type: directory_csv
id: 9454840d-f064-4129-b8ff-38cfbb71af99
batch_definitions:
monthly:
id: 853d02de-54b1-45a7-a4e2-b9f8a8ca0a33
partitioner:
column_name: tpep_pickup_datetime
method_name: partition_on_year_and_month
header: true
data_directory: data2
spark_config:
spark.executor.memory: 4g
persist: true
base_directory: data
The 0.X configuration differs because 0.X only allows splitting the data into batches by exact string match, and it requires users to fully specify the batch request options (called batch parameters in GX 1.0). We are not showing all the Spark-specific configuration options; they are supported in the same way in GX 0.X and GX 1.0.
Spark directory asset API
- V0 Spark Directory Asset API
- V1 Spark Directory Asset API
import great_expectations as gx
context = gx.get_context(mode="file")
ds = context.sources.add_spark_filesystem(name="spark", base_directory="data", spark_config={"spark.executor.memory": "4g"}, persist=True)
asset = ds.add_directory_csv_asset(name="spark_asset", data_directory="data2", header=True)
# This must really be a year-month date column instead of a datetime column for splitting by month in GX 0.X
asset.add_splitter_column_value(column_name="tpep_pickup_datetime")
# There is no sorting added because in GX 0.X, one has to specify all parameters so sorting is a no-op
import great_expectations as gx
context = gx.get_context(mode="file")
ds = context.data_sources.add_spark_filesystem(name="spark", base_directory="data", spark_config={"spark.executor.memory": "4g"}, persist=True)
asset = ds.add_directory_csv_asset(name="spark_asset", data_directory="data2", header=True)
bd = asset.add_batch_definition_monthly(name="monthly", column="tpep_pickup_datetime")
b = bd.get_batch()
b.head(fetch_all=True)
Checkpoints
In V0, there were multiple equivalent ways to configure the exact same Checkpoint. This is because a Checkpoint object contained a validations parameter which was a list of the validations the Checkpoint would run. Each item in this list took all the arguments necessary for a validation such as the Expectation Suite, the Batch Request, the actions, etc. However, all these same arguments are also present on the Checkpoint initializer. Usually, if an argument was present in the validation, that would be used, but if any argument was not present in a validation, GX would fall back to the argument defined on the Checkpoint itself. We'd call these default values the "top-level values". In addition, if the validations argument was an empty list or None, GX would infer the Checkpoint had 1 validation and create one using only "top-level values". In this case, we'd call this validation a "top-level validation". This fallback led to some confusing behavior, especially since it wasn't consistently implemented.
In V1, we have removed all top-level arguments so every validation must be fully specified in the validation_definitions argument, which is the analog to the old validations argument. We've also promoted the Validation Definition to its own domain object since it encapsulates the unit of validation. Checkpoints are groupings of Validation Definitions packaged with actions that may be taken after a validation is run. With this in mind, the V0 Checkpoint configuration has been broken into 2 files: a Validation Definition configuration file and a Checkpoint configuration file.
We walk through 4 cases of V0 configuration files:
Case 1: An empty validations argument so only a top-level validation exists.
Case 2: No top-level validations so all values come from the validations argument.
Case 3: A validation with values specified both in the validation and on the top level.
Case 4: A validation with values specified on the top level that is overridden in the validation.
We hope these cases give enough breadth over the possible ways to convert a Checkpoint that migrators will find a helpful example. If there are missing cases that you'd like to see, please reach out.
Case 1: Empty Validations Argument
The V0 configuration lives in gx/checkpoints/<CHECKPOINT_NAME>.yml. In V1, the configuration is JSON and lives in 2 files: gx/checkpoints/<CHECKPOINT_NAME> and gx/validation_definitions/<VALIDATION_DEFINITION_NAME>.
- V0: gx/checkpoints/my_checkpoint.yml
- V1: gx/checkpoints/my_checkpoint and gx/validation_definitions/my_validation_definition
name: my_checkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template:
expectation_suite_name: my_suite
batch_request:
datasource_name: pd_fs_ds
data_asset_name: monthly_taxi_data
action_list:
- name: store_validation_result
action:
class_name: StoreValidationResultAction
- name: store_evaluation_params
action:
class_name: StoreEvaluationParametersAction
- name: update_data_docs
action:
class_name: UpdateDataDocsAction
- name: my_email_action
action:
class_name: EmailAction
notify_on: all
use_tls: true
use_ssl: false
renderer:
module_name: great_expectations.render.renderer.email_renderer
class_name: EmailRenderer
smtp_address: smtp.myserver.com
smtp_port: 587
sender_login: sender@myserver.com
sender_password: XXXXXXXXXX
sender_alias: alias@myserver.com
receiver_emails: receiver@myserver.com
evaluation_parameters: {}
runtime_configuration: {}
validations: []
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
gx/validation_definitions/my_validation_definition
{
"data": {
"asset": {
"id": "ae696e27-fb6a-45fb-a2a0-bf1b8627c07e",
"name": "taxi_data"
},
"batch_definition": {
"id": "9b396884-ef73-47f5-b8f7-c2fc1306589b",
"name": "monthly_batches"
},
"datasource": {
"id": "934fd0e2-4c34-4e88-be1a-6b56ed69d614",
"name": "pd_fs_ds"
}
},
"id": "cbd6552b-12d4-4b9f-92d5-1223eb6730d8",
"name": "my_validation_definition",
"suite": {
"id": "a71b700d-867a-46be-b5f2-6b9402dcc925",
"name": "my_suite"
}
}
gx/checkpoints/my_checkpoint
{
"actions": [
{
"name": "update_data_docs",
"site_names": [],
"type": "update_data_docs"
},
{
"name": "my_email_action",
"notify_on": "all",
"notify_with": null,
"receiver_emails": "receiver@myserver.com",
"renderer": {
"class_name": "EmailRenderer",
"module_name": "great_expectations.render.renderer.email_renderer"
},
"sender_alias": "alias@myserver.com",
"sender_login": "sender@myserver.com",
"sender_password": "XXXXXXXXXX",
"smtp_address": "smtp.myserver.com",
"smtp_port": "587",
"type": "email",
"use_ssl": false,
"use_tls": true
}
],
"id": "ff7a0cd3-6b64-463a-baa0-4b5b4d7512b5",
"name": "my_checkpoint",
"result_format": "SUMMARY",
"validation_definitions": [
{
"id": "cbd6552b-12d4-4b9f-92d5-1223eb6730d8",
"name": "my_validation_definition"
}
]
}
We provide a mapping from the V0 fields to the V1 fields along with any new V1 fields.
name: This gets mapped to the name field in the V1 Checkpoint configuration file.
config_version: This is no longer a parameter.
template_name: This is no longer a supported feature. If you need to migrate this over, you should find the template values and set them explicitly in the new Checkpoint.
module_name: This is no longer necessary since it is inferred, so it is no longer a supported parameter.
class_name: This is no longer necessary since it is inferred, so it is no longer a supported parameter.
run_name_template: This is no longer a supported feature.
expectation_suite_name: This is now found in the validation definition configuration in suite.name.
batch_request: There is no longer a batch request concept in V1. The Data Source and Data Asset are now found in the Validation Definition configuration's data field. The data field has 3 keys: asset, batch_definition, and datasource. Each value is a dictionary with the keys:
name: The name of the asset/batch_definition/datasource found in great_expectations.yml.
id: The id for the asset/batch_definition/datasource found in great_expectations.yml.
action_list: This now maps to the Checkpoint configuration's actions key, which is a list of dictionaries where each dictionary configures one action. The name for an action in a V0 action list maps to the V1 action dictionary's name key. A few things to note:
- V1 has no default actions.
- The store_validation_result action no longer exists since validation results are always stored; this is built into running a Checkpoint (and running a Validation Definition directly).
- The store_evaluation_params action no longer exists since runtime parameters must now be passed in at runtime, so we don't store defaults anywhere.
- The update_data_docs action is no longer automatically added and must be explicitly added. Its configuration is a list of site_names. If you've configured these in V0, you can move them over directly; they have the same values. There is a new field called type, which all actions have, that is a unique literal string for a particular action. For this action, type should be set to "update_data_docs".
evaluation_parameters: This is no longer supported at the Checkpoint level. In V0 one could also configure evaluation_parameters on the Expectation Suite. One can still do that there (now called suite_parameters, see the Expectation Suites and Expectations section), and using that Expectation Suite will enable these parameters for Checkpoints that use it.
runtime_configuration: The runtime configuration supported by V1 is result format. There is now an explicit result_format key in the checkpoint configuration whose value is one of the following strings: SUMMARY, COMPLETE, BASIC, BOOLEAN_ONLY.
validations: This is now the checkpoint configuration field validation_definitions which is a list of dictionaries where each item in the list corresponds to a validation definition. There are 2 keys in the Validation Definition dictionary:
id: This must match the top-level id field in the validation_definitions configuration file that corresponds to this validation definition.
name: This must match the top-level name field in the validation_definitions configuration file that corresponds to this validation definition.
There are now restrictions on which validations can be grouped together in a Checkpoint. Each Validation Definition in a Checkpoint must take the same batch parameters at runtime. So if you grouped together multiple validations in V0 whose batches are parameterized differently (e.g. one uses a "whole table" batch definition and another uses a "daily" batch definition), they will have to be split up into multiple Checkpoints.
profilers: This feature has been removed in V1. Some form of profilers will be re-introduced in V1 at a later date.
ge_cloud_id: This should be empty for file-based configurations and has been removed in V1.
expectation_suite_ge_cloud_id: This should be empty for file-based configurations and has been removed in V1.
Case 1: API calls
- V0 Checkpoint API
- V1 Checkpoint API
import great_expectations as gx
from great_expectations.checkpoint import EmailAction
context = gx.get_context(mode="file")
datasource = context.sources.add_pandas_filesystem(name="pd_fs_ds", base_directory="data")
monthly = datasource.add_csv_asset(name="monthly_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
suite = context.add_expectation_suite(
expectation_suite_name="my_suite",
data_asset_type="CSVAsset",
)
validator = context.get_validator(batch_request=monthly.build_batch_request(), expectation_suite_name="my_suite")
validator.expect_column_values_to_be_between(column="passenger_count", min_value=0, max_value=10)
validator.save_expectation_suite(discard_failed_expectations=False)
batch_request = monthly.build_batch_request()  # optionally: build_batch_request(options={"year": "2019", "month": "01"})
email_action_config = {
"name": "my_email_action",
"action": {
"class_name": "EmailAction",
"notify_on": "all",
"use_tls": True,
"use_ssl": False,
"renderer": {
"module_name": "great_expectations.render.renderer.email_renderer",
"class_name": "EmailRenderer"
},
"smtp_address": "smtp.myserver.com",
"smtp_port": 587,
"sender_login": "sender@myserver.com",
"sender_password": "XXXXXXXXXX",
"sender_alias": "alias@myserver.com",
"receiver_emails": "receiver@myserver.com",
}
}
action_list = [
{'name': 'store_validation_result',
'action': {'class_name': 'StoreValidationResultAction'}},
{'name': 'store_evaluation_params',
'action': {'class_name': 'StoreEvaluationParametersAction'}},
{'name': 'update_data_docs',
'action': {'class_name': 'UpdateDataDocsAction'}},
email_action_config
]
checkpoint_config = {
"name": "my_checkpoint",
"config_version": 1.0,
"class_name": "Checkpoint",
"module_name": "great_expectations.checkpoint",
"expectation_suite_name": "my_suite",
"batch_request": batch_request,
"action_list": action_list,
}
checkpoint = context.add_checkpoint(**checkpoint_config)
result = context.run_checkpoint("my_checkpoint")
import great_expectations as gx
import great_expectations.expectations as gxe
context = gx.get_context(mode="file")
data_source = context.data_sources.add_pandas_filesystem(name="pd_fs_ds", base_directory="./data")
file_csv_asset = data_source.add_csv_asset(name="taxi_data")
monthly = file_csv_asset.add_batch_definition_monthly(name="monthly_batches", regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(gxe.ExpectColumnValuesToBeBetween(column="passenger_count", min_value=0, max_value=10))
validation_definition = context.validation_definitions.add(
gx.ValidationDefinition(data=monthly, suite=suite, name="my_validation_definition")
)
checkpoint = context.checkpoints.add(
gx.Checkpoint(
name="my_checkpoint",
validation_definitions=[validation_definition],
actions=[
gx.checkpoint.UpdateDataDocsAction(name="update_data_docs"),
gx.checkpoint.EmailAction(
name="my_email_action",
notify_on="all",
use_tls=True,
use_ssl=False,
smtp_address="smtp.myserver.com",
smtp_port=587,
sender_login="sender@myserver.com",
sender_password="XXXXXXXXXX",
sender_alias="alias@myserver.com",
receiver_emails="receiver@myserver.com",
),
],
)
)
result = checkpoint.run()
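Because the monthly Batch Definition used here is parameterized by year and month, batch parameters can also be supplied when the Checkpoint runs (every Validation Definition in the Checkpoint must accept the same parameters, as noted above). A minimal sketch, assuming Checkpoint.run accepts a batch_parameters argument:

# Validate a specific month of data by passing batch parameters at run time
result = checkpoint.run(batch_parameters={"year": "2019", "month": "01"})
print(result.success)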
Case 2: No top-level arguments
We only show the V0 configuration and code samples here because the V1 configuration and code is identical to case 1.
One unique thing to notice: while in the API code snippet below all actions are defined in the validation argument, you will see in the configuration file that the actions get split up, with some defined on the top level and some on the validation. All actions will get run when the Checkpoint is run, which is inconsistent with the normal "overriding" behavior for values defined in the validation.
- V0 Checkpoint Configuration
name: my_checkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template:
expectation_suite_name:
batch_request: {}
action_list:
- name: store_validation_result
action:
class_name: StoreValidationResultAction
- name: store_evaluation_params
action:
class_name: StoreEvaluationParametersAction
- name: update_data_docs
action:
class_name: UpdateDataDocsAction
evaluation_parameters: {}
runtime_configuration: {}
validations:
- action_list:
- name: store_validation_result
action:
class_name: StoreValidationResultAction
- name: store_evaluation_params
action:
class_name: StoreEvaluationParametersAction
- name: update_data_docs
action:
class_name: UpdateDataDocsAction
- name: my_email_action
action:
class_name: EmailAction
notify_on: all
use_tls: true
use_ssl: false
renderer:
module_name: great_expectations.render.renderer.email_renderer
class_name: EmailRenderer
smtp_address: smtp.myserver.com
smtp_port: 587
sender_login: sender@myserver.com
sender_password: XXXXXXXXXX
sender_alias: alias@myserver.com
receiver_emails: receiver@myserver.com
batch_request:
datasource_name: pd_fs_ds
data_asset_name: monthly_taxi_data
options: {}
batch_slice:
expectation_suite_name: my_suite
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
Case 2: API calls
- V0 Checkpoint API
import great_expectations as gx
context = gx.get_context(mode="file")
datasource = context.sources.add_pandas_filesystem(name="pd_fs_ds", base_directory="data")
monthly = datasource.add_csv_asset(name="monthly_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
suite = context.add_expectation_suite(
expectation_suite_name="my_suite",
data_asset_type="CSVAsset",
)
validator = context.get_validator(batch_request=monthly.build_batch_request(), expectation_suite_name="my_suite")
validator.expect_column_values_to_be_between(column="passenger_count", min_value=0, max_value=10)
validator.save_expectation_suite(discard_failed_expectations=False)
batch_request = monthly.build_batch_request()
email_action_config = {
"name": "my_email_action",
"action": {
"class_name": "EmailAction",
"notify_on": "all",
"use_tls": True,
"use_ssl": False,
"renderer": {
"module_name": "great_expectations.render.renderer.email_renderer",
"class_name": "EmailRenderer"
},
"smtp_address": "smtp.myserver.com",
"smtp_port": 587,
"sender_login": "sender@myserver.com",
"sender_password": "XXXXXXXXXX",
"sender_alias": "alias@myserver.com",
"receiver_emails": "receiver@myserver.com",
}
}
action_list = [
{'name': 'store_validation_result',
'action': {'class_name': 'StoreValidationResultAction'}},
{'name': 'store_evaluation_params',
'action': {'class_name': 'StoreEvaluationParametersAction'}},
{'name': 'update_data_docs',
'action': {'class_name': 'UpdateDataDocsAction'}},
email_action_config
]
checkpoint_config = {
"name": "my_checkpoint",
"config_version": 1.0,
"class_name": "Checkpoint",
"module_name": "great_expectations.checkpoint",
"validations": [
{
"expectation_suite_name": "my_suite",
"batch_request": batch_request,
"action_list": action_list,
}
],
}
checkpoint = context.add_checkpoint(**checkpoint_config)
result_case_2 = context.run_checkpoint("my_checkpoint")
Case 3: Combined top level and validation configuration
We only show the V0 configuration and code samples here because the V1 configuration and code is identical to case 1.
- V0 Checkpoint Configuration
name: top_level_and_validation_checkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template:
expectation_suite_name: my_suite
batch_request: {}
action_list:
- name: store_validation_result
action:
class_name: StoreValidationResultAction
- name: store_evaluation_params
action:
class_name: StoreEvaluationParametersAction
- name: update_data_docs
action:
class_name: UpdateDataDocsAction
- name: my_email_action
action:
class_name: EmailAction
notify_on: all
use_tls: true
use_ssl: false
renderer:
module_name: great_expectations.render.renderer.email_renderer
class_name: EmailRenderer
smtp_address: smtp.myserver.com
smtp_port: 587
sender_login: sender@myserver.com
sender_password: XXXXXXXXXX
sender_alias: alias@myserver.com
receiver_emails: receiver@myserver.com
evaluation_parameters: {}
runtime_configuration: {}
validations:
- batch_request:
datasource_name: pd_fs_ds
data_asset_name: monthly_taxi_data
options: {}
batch_slice:
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
Case 3: API calls
- V0 Checkpoint API
import great_expectations as gx
context = gx.get_context(mode="file")
datasource = context.sources.add_pandas_filesystem(name="pd_fs_ds", base_directory="data")
monthly = datasource.add_csv_asset(name="monthly_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
suite = context.add_expectation_suite(
expectation_suite_name="my_suite",
data_asset_type="CSVAsset",
)
validator = context.get_validator(batch_request=monthly.build_batch_request(), expectation_suite_name="my_suite")
validator.expect_column_values_to_be_between(column="passenger_count", min_value=0, max_value=10)
validator.save_expectation_suite(discard_failed_expectations=False)
batch_request = monthly.build_batch_request()
email_action_config = {
"name": "my_email_action",
"action": {
"class_name": "EmailAction",
"notify_on": "all",
"use_tls": True,
"use_ssl": False,
"renderer": {
"module_name": "great_expectations.render.renderer.email_renderer",
"class_name": "EmailRenderer"
},
"smtp_address": "smtp.myserver.com",
"smtp_port": 587,
"sender_login": "sender@myserver.com",
"sender_password": "XXXXXXXXXX",
"sender_alias": "alias@myserver.com",
"receiver_emails": "receiver@myserver.com",
}
}
action_list = [
{'name': 'store_validation_result',
'action': {'class_name': 'StoreValidationResultAction'}},
{'name': 'store_evaluation_params',
'action': {'class_name': 'StoreEvaluationParametersAction'}},
{'name': 'update_data_docs',
'action': {'class_name': 'UpdateDataDocsAction'}},
email_action_config
]
checkpoint_config = {
"name": "top_level_and_validation_checkpoint",
"config_version": 1.0,
"class_name": "Checkpoint",
"module_name": "great_expectations.checkpoint",
"expectation_suite_name": "my_suite",
"action_list": action_list,
"validations": [
{
"batch_request": batch_request,
}
],
}
context.add_checkpoint(**checkpoint_config)
result = context.run_checkpoint("top_level_and_validation_checkpoint")
Case 4: Combined top level with validation configuration override
We only show the V0 configuration and code samples here because the V1 configuration and code is identical to case 1.
- V0 Checkpoint Configuration
name: top_level_and_validation_override_checkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template:
expectation_suite_name: my_suite
batch_request: {}
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
  - name: my_email_action
    action:
      class_name: EmailAction
      notify_on: all
      use_tls: true
      use_ssl: false
      renderer:
        module_name: great_expectations.render.renderer.email_renderer
        class_name: EmailRenderer
      smtp_address: smtp.myserver.com
      smtp_port: 587
      sender_login: sender@myserver.com
      sender_password: XXXXXXXXXX
      sender_alias: alias@myserver.com
      receiver_emails: receiver@myserver.com
evaluation_parameters: {}
runtime_configuration: {}
validations:
  - batch_request:
      datasource_name: pd_fs_ds
      data_asset_name: monthly_taxi_data
      options: {}
      batch_slice:
    expectation_suite_name: my_other_suite
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
Case 4: API calls
- V0 Checkpoint API
import great_expectations as gx
context = gx.get_context(mode="file")
datasource = context.sources.add_pandas_filesystem(name="pd_fs_ds", base_directory="data")
monthly = datasource.add_csv_asset(name="monthly_taxi_data", batching_regex=r"sampled_yellow_tripdata_(?P<year>\d{4})-(?P<month>\d{2})\.csv")
suite = context.add_expectation_suite(
expectation_suite_name="my_suite",
data_asset_type="CSVAsset",
)
validator = context.get_validator(batch_request=monthly.build_batch_request(), expectation_suite_name="my_suite")
validator.expect_column_values_to_be_between(column="passenger_count", min_value=0, max_value=10)
validator.save_expectation_suite(discard_failed_expectations=False)
other_suite = context.add_expectation_suite(
expectation_suite_name="my_other_suite",
data_asset_type="CSVAsset",
)
validator = context.get_validator(batch_request=monthly.build_batch_request(), expectation_suite_name="my_other_suite")
validator.expect_column_values_to_be_between(column="passenger_count", min_value=0, max_value=4)
validator.save_expectation_suite(discard_failed_expectations=False)
batch_request = monthly.build_batch_request()
email_action_config = {
"name": "my_email_action",
"action": {
"class_name": "EmailAction",
"notify_on": "all",
"use_tls": True,
"use_ssl": False,
"renderer": {
"module_name": "great_expectations.render.renderer.email_renderer",
"class_name": "EmailRenderer"
},
"smtp_address": "smtp.myserver.com",
"smtp_port": 587,
"sender_login": "sender@myserver.com",
"sender_password": "XXXXXXXXXX",
"sender_alias": "alias@myserver.com",
"receiver_emails": "receiver@myserver.com",
}
}
action_list = [
{'name': 'store_validation_result',
'action': {'class_name': 'StoreValidationResultAction'}},
{'name': 'store_evaluation_params',
'action': {'class_name': 'StoreEvaluationParametersAction'}},
{'name': 'update_data_docs',
'action': {'class_name': 'UpdateDataDocsAction'}},
email_action_config
]
checkpoint_config = {
"name": "top_level_and_validation_override_checkpoint",
"config_version": 1.0,
"class_name": "Checkpoint",
"module_name": "great_expectations.checkpoint",
"expectation_suite_name": "my_suite",
"action_list": action_list,
"validations": [
{
"expectation_suite_name": "my_other_suite",
"batch_request": batch_request,
}
],
}
context.add_checkpoint(**checkpoint_config)
result = context.run_checkpoint("top_level_and_validation_override_checkpoint")
Data Context Variables
The Data Context variables will be automatically converted for GX Cloud users when switching from V0 to V1. For file context users, we show the differences in the YAML below so you can translate the configuration block in great_expectations.yml.
- V0 Data Context Configuration
- V1 Data Context Configuration
config_version: 3.0
config_variables_file_path: uncommitted/config_variables.yml
plugins_directory: plugins/
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
  evaluation_parameter_store:
    class_name: EvaluationParameterStore
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/
  profiler_store:
    class_name: ProfilerStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: profilers/
expectations_store_name: expectations_store
validations_store_name: validations_store
evaluation_parameter_store_name: evaluation_parameter_store
checkpoint_store_name: checkpoint_store
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
anonymous_usage_statistics:
  data_context_id: a7441dab-9db7-4043-a3e7-011cdab54cfb
  enabled: false
  usage_statistics_url: https://qa.stats.greatexpectations.io/great_expectations/v1/usage_statistics
fluent_datasources:
  spark_fs:
    type: spark_filesystem
    assets:
      directory_csv_asset:
        type: directory_csv
        data_directory: data
    spark_config:
      spark.executor.memory: 4g
    persist: true
    base_directory: data
notebooks:
include_rendered_content:
  globally: false
  expectation_suite: false
  expectation_validation_result: false
config_version: 4.0
config_variables_file_path: uncommitted/config_variables.yml
plugins_directory: plugins/
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/
  validation_results_store:
    class_name: ValidationResultsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/
  checkpoint_store:
    class_name: CheckpointStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      suppress_store_backend_id: true
      base_directory: checkpoints/
  validation_definition_store:
    class_name: ValidationDefinitionStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: validation_definitions/
expectations_store_name: expectations_store
validation_results_store_name: validation_results_store
checkpoint_store_name: checkpoint_store
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
analytics_enabled: true
fluent_datasources:
  spark_ds:
    type: spark
    id: 134de28d-bfdc-4980-aa2e-4f59788afef3
    assets:
      taxi_dataframe_asset:
        type: dataframe
        id: 4110d2ff-5711-47df-a4be-eaefc2a638b4
        batch_metadata: {}
        batch_definitions:
          taxi_dataframe_batch_def:
            id: 76738b8b-28ab-4857-aa98-f0ff80c8f137
            partitioner:
    spark_config:
      spark.executor.memory: 4g
    force_reuse_spark_context: true
    persist: true
data_context_id: 12bc94a0-8ac3-4e97-bf90-03cd3d92f8c4
config_version: For V1, this should be set to 4.0.
config_variables_file_path: This is unchanged.
plugins_directory: This is unchanged.
stores: This is a dictionary mapping store names to their configuration. In V0 the key names were configurable; in V1 there is a fixed set of keys. The V1 keys are:
expectations_store: The configuration of the Expectations store. The value is unchanged from the value that was previously stored under the key named by the top-level variable expectations_store_name.
validation_results_store: The configuration of the Validation Results store. The value is nearly unchanged from the value that was previously stored under the key named by the top-level variable validations_store_name; the only difference is that the class_name ValidationsStore is now ValidationResultsStore.
checkpoint_store: This key and value are unchanged between V0 and V1.
validation_definition_store: Validation Definitions are a new concept in V1; a code sketch showing how they are created appears after this list. For file-based contexts, you can use this example V1 configuration directly. You can update the base_directory if you need to change the path where the configuration for Validation Definitions is stored.
expectations_store_name: While still present, this must now always be set to “expectations_store”.
validations_store_name: This key is now named “validation_results_store_name” and its value must be “validation_results_store”.
evaluation_parameter_store_name: This key has been removed. Evaluation parameters can no longer be stored; they are now a runtime concept called expectation_parameters. If you want to set a default value for an expectation parameter, do so in the code where you run the validation (see the sketch after this list).
checkpoint_store_name: This parameter name is unchanged. The value must be “checkpoint_store”.
data_docs_sites: This key and its value are unchanged in V1.
anonymous_usage_statistics:
enabled: This value is now the top-level key analytics_enabled.
data_context_id: This value is now the top-level key data_context_id.
usage_statistics_url: This field is no longer configurable.
fluent_datasources: While this appears in the great_expectations.yml file, it is not a data context variable. Please see the “Data Sources and Data Assets” portion of this doc for instructions on migrating this from V0 to V1.
notebooks: This is no longer supported and does not appear in V1’s configuration.
include_rendered_content: This only mattered for GX Cloud users and no longer appears in this configuration file.
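As referenced above, here is a minimal sketch of how Validation Definitions and runtime expectation parameters fit together in V1. It assumes a V1 File Data Context whose pd_fs_ds datasource and monthly_taxi_data asset have already been migrated; the batch definition name ("monthly"), the suite contents, and the parameter names (pc_min, pc_max) are illustrative choices, not prescribed by GX.
import great_expectations as gx
from great_expectations import expectations as gxe

context = gx.get_context(mode="file")

# Assumes the datasource, asset, and a batch definition named "monthly" already exist.
batch_definition = (
    context.data_sources.get("pd_fs_ds")
    .get_asset("monthly_taxi_data")
    .get_batch_definition("monthly")
)

# An Expectation whose bounds are read from runtime expectation parameters.
suite = context.suites.add(gx.ExpectationSuite(name="my_suite"))
suite.add_expectation(
    gxe.ExpectColumnValuesToBeBetween(
        column="passenger_count",
        min_value={"$PARAMETER": "pc_min"},
        max_value={"$PARAMETER": "pc_max"},
    )
)

# Validation Definitions are the new V1 objects persisted in the validation_definition_store.
validation_definition = context.validation_definitions.add(
    gx.ValidationDefinition(name="monthly_validation", data=batch_definition, suite=suite)
)

# Parameter values are supplied at run time instead of being persisted in a store.
results = validation_definition.run(expectation_parameters={"pc_min": 0, "pc_max": 4})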
New V1 Fields
data_context_id: If you previously had the field anonymous_usage_statistics.data_context_id set, use that value here. Otherwise, this can be set to a unique, arbitrary UUID; one way to generate one is shown below.
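For example (a generic Python snippet, not a GX API):
import uuid

# Paste the printed value into data_context_id in great_expectations.yml
print(uuid.uuid4())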
Store Backends
In previous versions of GX, we supported a number of configurable store backends, including ones that persisted to databases, S3, Google Cloud Platform, and Azure. V1 drops support for these: file contexts only use TupleFilesystemStoreBackend, and cloud contexts only use cloud stores. A number of GX users need to persist their configurations, or subsets of them, outside of their filesystem and either cannot or would prefer not to use cloud contexts. While GX no longer supports tooling for these persistence models directly, you can use external libraries or services to handle this, for example by copying your configuration to S3 via boto3, as sketched below.
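Here is a minimal sketch of that approach: copying a File Data Context's configuration to S3 with boto3. The bucket name, key prefix, and project path are placeholders, and this is a general-purpose pattern rather than a GX-supported store backend.
from pathlib import Path

import boto3

s3 = boto3.client("s3")
project_root = Path("gx")  # the directory containing great_expectations.yml
bucket = "my-gx-config-backup"  # placeholder bucket name

# Upload each configuration file, skipping the uncommitted directory.
for path in project_root.rglob("*"):
    if path.is_file() and "uncommitted" not in path.parts:
        key = f"gx-backup/{path.relative_to(project_root).as_posix()}"
        s3.upload_file(str(path), bucket, key)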
V1 API
In V1, the configuration of all Data Context variables can be changed via the Python API. For a Data Context named context, you can view a variable via context.variables.<variable_name> and update it via:
context.variables.<variable_name> = new_value
context.variables.save()
# At this time you need to reload the context to have it take effect
context = gx.get_context()
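For example, a minimal sketch of disabling analytics for a File Data Context, assuming the V1 variable names shown in the configuration above:
import great_expectations as gx

context = gx.get_context(mode="file")

# Update the variable and persist it to great_expectations.yml.
context.variables.analytics_enabled = False
context.variables.save()

# Reload the context so the change takes effect in the current process.
context = gx.get_context(mode="file")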