How to get one or more Batches of data from a configured Datasource

This guide will help you load a Batch (a selection of records from a Data Asset) for validation using an active Data Connector (which provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets). For guides on loading Batches of data from specific Datasources (which provide a standard API for accessing and interacting with data from a wide variety of source systems) using a Data Connector, see the Datasource-specific guides in the "Connecting to your data" section.

A Validator (used to run an Expectation Suite against data) knows how to Validate (apply an Expectation Suite to) a particular Batch of data on a particular Execution Engine (a system capable of processing data to compute Metrics) against a particular Expectation Suite (a collection of verifiable assertions about data). In interactive mode, the Validator can store and update an Expectation Suite while you conduct Data Discovery or Exploratory Data Analysis.

You can read more about the core classes that make Great Expectations run in our Core Concepts reference guide.

Prerequisites: This how-to guide assumes you have:

Steps: Loading one or more Batches of data

To load one or more Batch(es), the steps you will take are the same regardless of the type of Datasource or Data Connector you have set up. To learn more about Datasources, Data Connectors, and Batch(es), see our Datasources Core Concepts Guide in the Core Concepts reference guide.

1. Construct a BatchRequest

note

As outlined in the Datasource and Data Connector docs mentioned above, this Batch Request must reference a previously configured Datasource and Data Connector.

from great_expectations.core.batch import BatchRequest

# Here is an example BatchRequest for all batches associated with the specified DataAsset
batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
)

Since a BatchRequest can return multiple Batch(es), you can optionally provide additional parameters to filter the retrieved Batch(es). See the Datasources Core Concepts Guide for more information on filtering options beyond batch_filter_parameters and limit, including custom filter functions and sampling. The example BatchRequests below show several non-exhaustive possibilities.

# Here is an example data_connector_query filtering based on an index, which can be
# any valid Python slice. The example here retrieves the latest batch using -1:
data_connector_query_last_index = {
    "index": -1,
}
last_index_batch_request = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_last_index,
)
# This BatchRequest adds a query to retrieve only the twelve batches from 2020
data_connector_query_2020 = {
    "batch_filter_parameters": {"group_name_from_your_data_connector_eg_year": "2020"}
}
batch_request_2020 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020,
)
# This BatchRequest adds a query and a limit to retrieve only the first 5 batches from 2020.
# Note that the limit is applied after the data_connector_query filtering. This behavior
# differs from using an index, which is applied before the other query filters.
data_connector_query_2020_with_limit = {
    "batch_filter_parameters": {
        "group_name_from_your_data_connector_eg_year": "2020",
    }
}
batch_request_2020_with_limit = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_2020_with_limit,
    limit=5,
)
# Here is an example data_connector_query filtering based on parameters from group_names
# previously defined in a regex pattern in your Data Connector:
data_connector_query_202001 = {
    "batch_filter_parameters": {
        "group_name_from_your_data_connector_eg_year": "2020",
        "group_name_from_your_data_connector_eg_month": "01",
    }
}
batch_request_202001 = BatchRequest(
    datasource_name="insert_your_datasource_name_here",
    data_connector_name="insert_your_data_connector_name_here",
    data_asset_name="insert_your_data_asset_name_here",
    data_connector_query=data_connector_query_202001,
)
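The `index` parameter accepts anything that works as a Python index or slice over the sorted list of available batches. As a plain-Python illustration of those semantics (no Great Expectations required; the batch names here are made up):

```python
# A hypothetical, already-sorted list of batch identifiers, as a Data Connector
# might discover them for a Data Asset partitioned by month.
batch_names = ["2020-01", "2020-02", "2020-03", "2020-04"]

# "index": -1 behaves like Python's negative indexing: it selects the latest batch.
latest = batch_names[-1]
print(latest)  # 2020-04

# A slice such as "index": "-2:" would keep the last two batches.
last_two = batch_names[-2:]
print(last_two)  # ['2020-03', '2020-04']
```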

You may also wish to list available batches to verify that your BatchRequest retrieves the correct Batch(es), or simply to see which Batch(es) are available. You can use context.get_batch_list() for this purpose by passing it your BatchRequest:

batch_list = context.get_batch_list(batch_request=batch_request)
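The ordering noted earlier (limit applied after batch_filter_parameters, while index is applied before other query filters) can also be mimicked in plain Python. This sketch uses made-up batch metadata purely for illustration:

```python
# Hypothetical batch metadata, as if parsed from a Data Connector's regex group names.
batches = [{"year": "2019", "month": m} for m in ("11", "12")] + [
    {"year": "2020", "month": f"{m:02d}"} for m in range(1, 13)
]

# batch_filter_parameters keeps only the matching batches...
from_2020 = [b for b in batches if b["year"] == "2020"]

# ...and limit=5 is applied afterwards, so it trims the filtered result,
# not the original list of 14 batches.
first_five_of_2020 = from_2020[:5]
print(len(from_2020), len(first_five_of_2020))  # 12 5
```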

2. Get access to your Batches via a Validator

# Now we can review a sample of data using a Validator
context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)

3. Check your data

You can check that the Batch(es) that were loaded into your Validator are what you expect by running:

print(validator.batches)

You can also check that the first few lines of the Batch(es) you loaded into your Validator are what you expect by running:

print(validator.head())

Now that you have a Validator, you can use it to create Expectations or validate the data.

To view the full script used in this page, see it on GitHub: