
How to configure a RuntimeDataConnector

This guide demonstrates how to configure a RuntimeDataConnector and only applies to the V3 (Batch Request) API. A RuntimeDataConnector allows you to specify a Batch using a Runtime Batch Request, which is used to create a Validator. A Validator is the key object used to create Expectations and validate datasets.

Prerequisites: This how-to guide assumes you have a working installation of Great Expectations and a configured Data Context.

A RuntimeDataConnector is a special kind of Data Connector that enables you to use a RuntimeBatchRequest to provide a Batch's data directly at runtime. The RuntimeBatchRequest can wrap an in-memory DataFrame, a filepath, or a SQL query, and must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run). The batch identifiers that must be passed in at runtime are specified in the RuntimeDataConnector's configuration.
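To illustrate that contract, here is a small hypothetical helper (not part of the Great Expectations API) that mimics the check a RuntimeDataConnector performs: the identifiers supplied at runtime must exactly match the names declared in the connector's configuration.

```python
# Hypothetical illustration only -- not part of the Great Expectations API.
def check_batch_identifiers(configured_names, provided):
    """Ensure runtime-supplied identifiers match the configured names."""
    missing = set(configured_names) - set(provided)
    extra = set(provided) - set(configured_names)
    if missing or extra:
        raise ValueError(f"missing: {sorted(missing)}, unexpected: {sorted(extra)}")
    return True

# A connector configured with a single identifier name requires the
# caller to supply exactly that key at runtime:
check_batch_identifiers(
    ["airflow_run_id"],
    {"airflow_run_id": "scheduled__2021-01-01T00:00:00"},
)  # passes; an unexpected or missing key would raise ValueError
```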

Steps

1. Instantiate your project's DataContext

Import the necessary packages and modules, then load your project's DataContext:

```python
import great_expectations as ge
from great_expectations.core.batch import RuntimeBatchRequest

context = ge.get_context()
```

2. Set up a Datasource

All of the examples below assume you’re testing configuration using something like:

datasource_yaml = """name: taxi_datasourceclass_name: Datasourceexecution_engine:  class_name: PandasExecutionEnginedata_connectors:  <DATACONNECTOR NAME GOES HERE>:    <DATACONNECTOR CONFIGURATION GOES HERE>"""context.test_yaml_config(yaml_config=datasource_config)

If you’re not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config

3. Add a RuntimeDataConnector to a Datasource configuration

This basic configuration can be used in multiple ways depending on how the RuntimeBatchRequest is configured:

datasource_yaml = """name: taxi_datasourceclass_name: Datasourcemodule_name: great_expectations.datasourceexecution_engine:  module_name: great_expectations.execution_engine  class_name: PandasExecutionEnginedata_connectors:  default_runtime_data_connector_name:    class_name: RuntimeDataConnector    batch_identifiers:      - default_identifier_name"""

Once the RuntimeDataConnector is configured, you can add your Datasource using:

```python
from ruamel import yaml

context.add_datasource(**yaml.load(datasource_yaml))
```

Example 1: RuntimeDataConnector for access to file-system data

At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the path to your data defined in runtime_parameters:

```python
batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<PATH TO YOUR DATA HERE>"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```

Next, you would pass that request into context.get_validator:

```python
validator = context.get_validator(
    batch_request=batch_request,
    create_expectation_suite_with_name="<MY EXPECTATION SUITE NAME>",
)
```

Example 2: RuntimeDataConnector that uses an in-memory DataFrame

At runtime, you would get a Validator from the Data Context by first defining a RuntimeBatchRequest with the DataFrame passed into batch_data in runtime_parameters:

```python
import pandas as pd

path = "<PATH TO YOUR DATA HERE>"
df = pd.read_csv(path)

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR MEANINGFUL NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"batch_data": df},  # Pass your DataFrame here.
    batch_identifiers={"default_identifier_name": "<YOUR MEANINGFUL IDENTIFIER>"},
)
```

Next, you would pass that request into context.get_validator:

```python
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="<MY EXPECTATION SUITE NAME>",
)
print(validator.head())
```

Additional Notes

To view the full script used in this page, see it on GitHub: