
How to connect to in-memory data in a Pandas dataframe

This guide will help you connect to your data in an in-memory Pandas dataframe. This will allow you to Validate and explore your data.

Prerequisites: This how-to guide assumes you have:
  • Completed the Getting Started Tutorial
  • A working installation of Great Expectations
  • Access to data in a Pandas dataframe

Steps​

1. Choose how to run the code in this guide​

Get an environment to run the code in this guide. Please choose an option below.

If you use the Great Expectations CLI (Command Line Interface), run this command to automatically generate a pre-configured Jupyter Notebook. Then you can follow along in the YAML-based workflow below:

great_expectations datasource new

2. Instantiate your project's DataContext​

Import these necessary packages and modules.

import pandas as pd
from ruamel import yaml

import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

3. Configure your Datasource​

Using this example configuration, we configure a RuntimeDataConnector as part of our Datasource, which will take in our in-memory dataframe:

datasource_yaml = f"""
name: example_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
    module_name: great_expectations.execution_engine
    class_name: PandasExecutionEngine
data_connectors:
    default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        batch_identifiers:
            - default_identifier_name
"""

Run this code to test your configuration.

context.test_yaml_config(datasource_yaml)

Note: Since the Datasource does not have data passed in until later, the output will show that no data_asset_names are currently available. This is to be expected.

4. Save the Datasource configuration to your DataContext​

Save the configuration into your DataContext by using the add_datasource() function.

context.add_datasource(**yaml.load(datasource_yaml))
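For reference, `yaml.load(datasource_yaml)` turns the YAML above into a plain Python dict, which `add_datasource(**...)` receives as keyword arguments. A minimal sketch of the equivalent dict form, mirroring the example configuration above:

```python
# Equivalent dict form of the example_datasource YAML configuration.
# add_datasource(**datasource_config) would receive these keyword arguments.
datasource_config = {
    "name": "example_datasource",
    "class_name": "Datasource",
    "module_name": "great_expectations.datasource",
    "execution_engine": {
        "module_name": "great_expectations.execution_engine",
        "class_name": "PandasExecutionEngine",
    },
    "data_connectors": {
        "default_runtime_data_connector_name": {
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["default_identifier_name"],
        }
    },
}
```

Some teams prefer the dict form because it avoids a YAML parsing step; both produce the same configuration.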

5. Test your new Datasource

Verify your new Datasource by loading data from it into a Validator using a RuntimeBatchRequest.

The dataframe we are using in this example looks like the following. Please feel free to substitute your own data.

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])
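Before wiring the dataframe into a batch request, it can help to sanity-check it with plain pandas. A quick sketch using the example dataframe above:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# Confirm the shape and column names before handing the frame to Great Expectations.
print(df.shape)          # (3, 3)
print(list(df.columns))  # ['a', 'b', 'c']
```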

Add the variable containing your dataframe (df in this example) to the batch_data key under runtime_parameters in your RuntimeBatchRequest.

batch_request = RuntimeBatchRequest(
    datasource_name="example_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR_MEANINGFUL_NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # df is your dataframe
    batch_identifiers={"default_identifier_name": "default_identifier"},
)

Then load data into the Validator.

context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())
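With a Validator in hand, you can start running expectations against the batch, for example `validator.expect_column_values_to_be_between("a", min_value=0, max_value=10)`. Conceptually, that expectation checks each value in a column against the given bounds; here is a pandas-only sketch of the equivalent check on the example dataframe (illustrative only, not the Great Expectations implementation):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=["a", "b", "c"])

# Pandas-only illustration of what an expectation like
# expect_column_values_to_be_between("a", min_value=0, max_value=10)
# evaluates: do all values in the column fall inside the bounds?
within_bounds = bool(df["a"].between(0, 10).all())
print(within_bounds)  # True
```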

πŸš€πŸš€ Congratulations! πŸš€πŸš€ You successfully connected Great Expectations with your data.

Additional Notes​

To view the full scripts used in this page, see them on GitHub.

Next Steps​

Now that you've connected to your data, you'll want to work on these core skills: