Connect to dataframe data
A dataframe is a set of data that resides in-memory and is represented in your code by a variable to which it is assigned. To connect to this in-memory data you will define a Data Source based on the type of dataframe you are connecting to, a Data Asset that connects to the dataframe in question, and a Batch Definition that will return all of the records in the dataframe as a single Batch of data.
Create a Data Source
Because the dataframes reside in memory you do not need to specify the location of the data when you create your Data Source. Instead, the type of Data Source you create depends on the type of dataframe containing your data. Great Expectations has methods for connecting to both pandas and Spark dataframes.
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
Procedure
- Instructions
- Sample code
- Define the Data Source parameters.
A dataframe Data Source requires the following information:
name: A name by which to reference the Data Source. This should be unique among all Data Sources on the Data Context.
Update data_source_name in the following code with a descriptive name for your Data Source:
data_source_name = "my_data_source"
- Create the Data Source.
To read a pandas dataframe you will need to create a pandas Data Source. Likewise, to read a Spark dataframe you will need to create a Spark Data Source.
pandas: Execute the following code to create a pandas Data Source:
data_source = context.data_sources.add_pandas(name=data_source_name)
Spark: Execute the following code to create a Spark Data Source:
data_source = context.data_sources.add_spark(name=data_source_name)
Sample code for pandas:
import great_expectations as gx
# Retrieve your Data Context
context = gx.get_context()
# Define the Data Source name
data_source_name = "my_data_source"
# Add the Data Source to the Data Context
data_source = context.data_sources.add_pandas(name=data_source_name)
Sample code for Spark:
import great_expectations as gx
# Retrieve your Data Context
context = gx.get_context()
# Define the Data Source name
data_source_name = "my_data_source"
# Add the Data Source to the Data Context
data_source = context.data_sources.add_spark(name=data_source_name)
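The choice between a pandas and a Spark Data Source described above can also be made in code. The following is a minimal sketch, not part of GX itself: add_dataframe_data_source is a hypothetical helper, and it assumes that the top-level package of the dataframe's class (pandas or pyspark) is enough to tell the two types apart. Only the add_pandas() and add_spark() calls come from the steps above.

```python
def add_dataframe_data_source(context, name, dataframe):
    """Add a pandas or Spark Data Source that matches the dataframe's type.

    Hypothetical helper: `context` is assumed to be a GX Data Context.
    """
    # The top-level package of the dataframe's class distinguishes the two
    # dataframe types without importing pandas or pyspark here.
    package = type(dataframe).__module__.split(".")[0]
    if package == "pandas":
        return context.data_sources.add_pandas(name=name)
    if package == "pyspark":
        return context.data_sources.add_spark(name=name)
    raise TypeError(f"Unsupported dataframe type: {type(dataframe).__name__}")
```

With a Data Context in context and a dataframe in dataframe, a call such as add_dataframe_data_source(context, "my_data_source", dataframe) would return the new Data Source.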
Create a Data Asset
A dataframe Data Asset is used to group your Validation Results. For instance, if you have a data pipeline with three stages and you wanted the Validation Results for each stage to be grouped together, you would create a Data Asset with a unique name representing each stage.
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- A pandas or Spark dataframe Data Source.
Procedure
- Instructions
- Sample code
- Optional. Retrieve your Data Source.
If you do not already have a variable referencing your pandas or Spark Data Source, you can retrieve a previously created one with:
data_source_name = "my_data_source"
data_source = context.data_sources.get(data_source_name)
- Define the Data Asset's parameters.
A dataframe Data Asset requires the following information:
name: A name by which the Data Asset can be referenced. This should be unique among Data Assets on the Data Source.
Update the data_asset_name parameter in the following code with a descriptive name for your Data Asset:
data_asset_name = "my_dataframe_data_asset"
- Add a Data Asset to the Data Source.
Execute the following code to add a Data Asset to your Data Source:
data_asset = data_source.add_dataframe_asset(name=data_asset_name)
Sample code:
import great_expectations as gx
context = gx.get_context()
# Retrieve the Data Source
data_source_name = "my_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset name
data_asset_name = "my_dataframe_data_asset"
# Add a Data Asset to the Data Source
data_asset = data_source.add_dataframe_asset(name=data_asset_name)
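The pipeline-stage grouping described at the start of this section could be set up along the following lines. This is a sketch under assumptions: add_stage_assets is a hypothetical helper and the stage names are placeholders. It relies only on the add_dataframe_asset() call shown above.

```python
def add_stage_assets(data_source, stage_names):
    """Create one dataframe Data Asset per pipeline stage, keyed by stage name.

    Hypothetical helper: `data_source` is assumed to be a pandas or Spark
    dataframe Data Source.
    """
    assets = {}
    for stage in stage_names:
        # Each stage name doubles as the Data Asset name, so the names must
        # be unique among the Data Assets on this Data Source.
        assets[stage] = data_source.add_dataframe_asset(name=stage)
    return assets
```

For a three-stage pipeline, add_stage_assets(data_source, ["ingest", "transform", "export"]) would return one Data Asset per stage, so that the Validation Results of each stage are grouped separately.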
Create a Batch Definition
Typically, a Batch Definition is used to describe how the data within a Data Asset should be retrieved. With dataframes, all of the data in a given dataframe will always be retrieved as a Batch.
This means that Batch Definitions for dataframe Data Assets don't work to subdivide the data returned for validation. Instead, they serve as an additional layer of organization and allow you to further group your Validation Results. For example, if you have already used your dataframe Data Assets to group your Validation Results by pipeline stage, you could use two Batch Definitions to further group those results by having all automated validations use one Batch Definition and all manually executed validations use the other.
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- A pandas or Spark dataframe Data Asset.
Procedure
- Instructions
- Sample code
- Optional. Retrieve your Data Asset.
If you do not already have a variable referencing your pandas or Spark Data Asset, you can retrieve a previously created Data Asset with:
data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
- Define the Batch Definition's parameters.
A dataframe Batch Definition requires the following information:
name: A name by which the Batch Definition can be referenced. This should be unique among Batch Definitions on the Data Asset.
Because dataframes are always provided in their entirety, dataframe Batch Definitions always use the add_batch_definition_whole_dataframe() method.
Update the value of batch_definition_name in the following code with something that describes your dataframe:
batch_definition_name = "my_batch_definition"
- Add the Batch Definition to the Data Asset.
Execute the following code to add a Batch Definition to your Data Asset:
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)
Sample code:
import great_expectations as gx
context = gx.get_context()
# Retrieve the Data Asset
data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
# Define the Batch Definition name
batch_definition_name = "my_batch_definition"
# Add a Batch Definition to the Data Asset
batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)
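The example of separating automated and manually executed validations, given at the start of this section, might look like the following. It is a sketch only: add_run_mode_batch_definitions is a hypothetical helper and the Batch Definition names it generates are placeholders. The only GX call used is add_batch_definition_whole_dataframe(), as shown above.

```python
def add_run_mode_batch_definitions(data_asset, modes=("automated", "manual")):
    """Add one whole-dataframe Batch Definition per run mode.

    Hypothetical helper: `data_asset` is assumed to be a dataframe Data Asset.
    Returns a dict mapping each mode to its Batch Definition.
    """
    return {
        # Both definitions return the entire dataframe; they differ only in
        # name, which is what groups the Validation Results.
        mode: data_asset.add_batch_definition_whole_dataframe(f"{mode}_validations")
        for mode in modes
    }
```

Automated jobs would then use the Batch Definition stored under "automated", and interactive sessions the one stored under "manual".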
Provide a dataframe through Batch Parameters
Because dataframes exist in memory and cease to exist when a Python session ends, the dataframe itself is not saved as part of a Data Asset or Batch Definition. Instead, a dataframe created in the current Python session is passed in at runtime in a Batch Parameter dictionary.
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core
- Optional. To connect to data with Spark you will also need an installation of the Python dependencies for Spark.
- A preconfigured Data Context. These examples assume the variable context contains your Data Context.
- A Batch Definition on a pandas or Spark dataframe Data Asset.
- Data in a pandas or Spark dataframe. These examples assume the variable dataframe contains your pandas or Spark dataframe.
- Optional. A Validation Definition.
Procedure
- Define the Batch Parameter dictionary.
A dataframe can be added to a Batch Parameter dictionary by defining it as the value of the dictionary key dataframe:
batch_parameters = {"dataframe": dataframe}
The following examples create a dataframe by reading a .csv file and store it in a Batch Parameter dictionary:
pandas:
import pandas
csv_path = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
dataframe = pandas.read_csv(csv_path)
batch_parameters = {"dataframe": dataframe}
Spark:
from pyspark.sql import SparkSession
csv = "./data/folder_with_data/yellow_tripdata_sample_2019-01.csv"
spark = SparkSession.builder.appName("Read CSV").getOrCreate()
dataframe = spark.read.csv(csv, header=True, inferSchema=True)
batch_parameters = {"dataframe": dataframe}
- Pass the Batch Parameter dictionary to a get_batch() or run() method call.
Runtime Batch Parameters can be provided to the get_batch() method of a Batch Definition or to the run() method of a Validation Definition.
- Batch Definition
- Validation Definition
The get_batch() method of a Batch Definition retrieves a single Batch of data. Runtime Batch Parameters can be provided to the get_batch() method to specify the data returned as a Batch. The validate() method of this Batch can then be used to test individual Expectations.
import great_expectations as gx
context = gx.get_context()
# Retrieve the dataframe Batch Definition
data_source_name = "my_data_source"
data_asset_name = "my_dataframe_data_asset"
batch_definition_name = "my_batch_definition"
batch_definition = (
    context.data_sources.get(data_source_name)
    .get_asset(data_asset_name)
    .get_batch_definition(batch_definition_name)
)
# Create an Expectation to test
expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="passenger_count", max_value=6, min_value=1
)
# Get the dataframe as a Batch
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
# Test the Expectation
validation_results = batch.validate(expectation)
print(validation_results)
The results generated by batch.validate() are not persisted in storage. This workflow is solely intended for interactively creating Expectations and engaging in data exploration.
For further information on using an individual Batch to test Expectations see Test an Expectation.
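When exploring data interactively, it can be convenient to test several Expectations against the same Batch in one pass. The following is a minimal sketch, assuming that the object returned by batch.validate() exposes a boolean success attribute. check_expectations is a hypothetical helper and not part of GX.

```python
def check_expectations(batch, expectations):
    """Validate each Expectation against the Batch and collect pass/fail.

    Hypothetical helper: `batch` is assumed to come from get_batch(), and each
    result is assumed to expose a boolean `success` attribute.
    """
    outcomes = []
    for expectation in expectations:
        # As with a single batch.validate() call, nothing is persisted.
        result = batch.validate(expectation)
        outcomes.append((expectation, result.success))
    return outcomes
```

Calling check_expectations(batch, [expectation]) with the objects from the example above would return a list with one (expectation, success) pair.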
A Validation Definition's run() method validates an Expectation Suite against a Batch returned by a Batch Definition. Runtime Batch Parameters can be provided to a Validation Definition's run() method to specify the data returned in the Batch. This allows you to validate your dataframe by executing the Expectation Suite included in the Validation Definition.
import great_expectations as gx
context = gx.get_context()
# Retrieve a Validation Definition that uses the dataframe Batch Definition
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
# Validate the dataframe by passing it to the Validation Definition as Batch Parameters.
validation_results = validation_definition.run(batch_parameters=batch_parameters)
print(validation_results)
For more information on Validation Definitions see Run Validations.
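Because the dataframe is supplied at runtime rather than stored, the same Validation Definition can be reused for any number of dataframes created in the current session. The sketch below illustrates this under assumptions: run_for_dataframes is a hypothetical helper and the dictionary labels are placeholders. It uses only the run(batch_parameters=...) call shown above.

```python
def run_for_dataframes(validation_definition, dataframes):
    """Run one Validation Definition against several named dataframes.

    Hypothetical helper: `dataframes` maps a label to a pandas or Spark
    dataframe that matches the Validation Definition's Data Source type.
    """
    results = {}
    for label, dataframe in dataframes.items():
        # A fresh Batch Parameter dictionary is built for every dataframe.
        batch_parameters = {"dataframe": dataframe}
        results[label] = validation_definition.run(batch_parameters=batch_parameters)
    return results
```

For example, run_for_dataframes(validation_definition, {"january": dataframe}) would return a dictionary holding the Validation Results for that dataframe under the key "january".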