Version: 1.3.0

Connect to dataframe data

A dataframe is a set of data that resides in memory and is represented in your code by the variable it is assigned to. To connect to this in-memory data, you will define a Data Source based on the type of dataframe you are connecting to, a Data Asset that connects to the dataframe in question, and a Batch Definition that will return all of the records in the dataframe as a single Batch of data.

Create a Data Source

Because dataframes reside in memory, you do not need to specify the location of the data when you create your Data Source. Instead, the type of Data Source you create depends on the type of dataframe containing your data. Great Expectations has methods for connecting to both pandas and Spark dataframes.

Prerequisites

Procedure

  1. Define the Data Source parameters.

    A dataframe Data Source requires the following information:

    • name: A name by which to reference the Data Source. This should be unique among all Data Sources on the Data Context.

    Update data_source_name in the following code with a descriptive name for your Data Source:

    Python
    data_source_name = "my_data_source"
  2. Create the Data Source.

    To read a pandas dataframe you will need to create a pandas Data Source. Likewise, to read a Spark dataframe you will need to create a Spark Data Source.

    Execute the following code to create a pandas Data Source:

    Python
    data_source = context.data_sources.add_pandas(name=data_source_name)
    assert data_source.name == data_source_name

Create a Data Asset

A dataframe Data Asset is used to group your Validation Results. For instance, if you have a data pipeline with three stages and you wanted the Validation Results for each stage to be grouped together, you would create a Data Asset with a unique name representing each stage.

Prerequisites

Procedure

  1. Optional. Retrieve your Data Source.

    If you do not already have a variable referencing your pandas or Spark Data Source, you can retrieve a previously created one with:

    Python
    data_source_name = "my_data_source"
    data_source = context.data_sources.get(data_source_name)
  2. Define the Data Asset's parameters.

    A dataframe Data Asset requires the following information:

    • name: A name by which the Data Asset can be referenced. This should be unique among Data Assets on the Data Source.

    Update the data_asset_name parameter in the following code with a descriptive name for your Data Asset:

    Python
    data_asset_name = "my_dataframe_data_asset"
  3. Add a Data Asset to the Data Source.

    Execute the following code to add a Data Asset to your Data Source:

    Python
    data_asset = data_source.add_dataframe_asset(name=data_asset_name)

Create a Batch Definition

Typically, a Batch Definition is used to describe how the data within a Data Asset should be retrieved. With dataframes, all of the data in a given dataframe will always be retrieved as a Batch.

This means that Batch Definitions for dataframe Data Assets don't work to subdivide the data returned for validation. Instead, they serve as an additional layer of organization and allow you to further group your Validation Results. For example, if you have already used your dataframe Data Assets to group your Validation Results by pipeline stage, you could use two Batch Definitions to further group those results by having all automated validations use one Batch Definition and all manually executed validations use the other.

For API-managed Expectations only

If you use GX Cloud and GX Core together, note that Batch Definitions you create with the API apply to API-managed Expectations only.

Prerequisites

Procedure

  1. Optional. Retrieve your Data Asset.

    If you do not already have a variable referencing your pandas or Spark Data Asset, you can retrieve a previously created Data Asset with:

    Python
    data_source_name = "my_data_source"
    data_asset_name = "my_dataframe_data_asset"
    data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
  2. Define the Batch Definition's parameters.

    A dataframe Batch Definition requires the following information:

    • name: A name by which the Batch Definition can be referenced. This should be unique among Batch Definitions on the Data Asset.

    Because dataframes are always provided in their entirety, dataframe Batch Definitions always use the add_batch_definition_whole_dataframe() method.

    Update the value of batch_definition_name in the following code with something that describes your dataframe:

    Python
    batch_definition_name = "my_batch_definition"
  3. Add the Batch Definition to the Data Asset.

    Execute the following code to add a Batch Definition to your Data Asset:

    Python
    batch_definition = data_asset.add_batch_definition_whole_dataframe(
        batch_definition_name
    )

Provide a dataframe through Batch Parameters

Because dataframes exist in memory and cease to exist when a Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime as a Batch Parameter dictionary.

Prerequisites

Procedure

  1. Define the Batch Parameter dictionary.

    A dataframe can be added to a Batch Parameter dictionary by defining it as the value of the dictionary key dataframe:

    Python
    batch_parameters = {"dataframe": dataframe}

    The following example creates a dataframe by reading a .csv file and stores it in a Batch Parameter dictionary:

    Python
    import pandas

    csv_path = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
    dataframe = pandas.read_csv(csv_path)

    batch_parameters = {"dataframe": dataframe}
  2. Pass the Batch Parameter dictionary to a get_batch() or run() method call.

    Runtime Batch Parameters can be provided to the get_batch() method of a Batch Definition or to the run() method of a Validation Definition.

    The get_batch() method of a Batch Definition retrieves a single Batch of data. Runtime Batch Parameters can be provided to the get_batch() method to specify the data returned as a Batch. The validate() method of this Batch can then be used to test individual Expectations.

    Python
    import great_expectations as gx

    context = gx.get_context()
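    # This example assumes the Data Source, Data Asset, and Batch Definition
    # named below were already added to this Data Context in the earlier steps.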

    # Retrieve the dataframe Batch Definition
    data_source_name = "my_data_source"
    data_asset_name = "my_dataframe_data_asset"
    batch_definition_name = "my_batch_definition"
    batch_definition = (
        context.data_sources.get(data_source_name)
        .get_asset(data_asset_name)
        .get_batch_definition(batch_definition_name)
    )

    # Create an Expectation to test
    expectation = gx.expectations.ExpectColumnValuesToBeBetween(
        column="passenger_count", max_value=6, min_value=1
    )

    # Get the dataframe as a Batch
    batch = batch_definition.get_batch(batch_parameters=batch_parameters)

    # Test the Expectation
    validation_results = batch.validate(expectation)
    print(validation_results)

    The results generated by batch.validate() are not persisted in storage. This workflow is intended solely for interactively creating Expectations and exploring data.

    For further information on using an individual Batch to test Expectations see Test an Expectation.