
How to get a Batch of data from a configured Datasource

This guide will help you load a Batch for introspection and validation using an active Data Connector. For guides on loading Batches of data from specific Datasources using a Data Connector, see the Datasource-specific guides in the "Connecting to your data" section.

What used to be called a “Batch” in the old API was replaced with Validator. A Validator knows how to validate a particular Batch of data on a particular Execution Engine against a particular Expectation Suite. In interactive mode, the Validator can store and update an Expectation Suite while conducting Data Discovery or Exploratory Data Analysis.

You can read more about the core classes that make Great Expectations run in our Core Concepts reference guide.

Prerequisites: This how-to guide assumes you have a configured and loaded Data Context, and a configured Datasource with an active Data Connector.

To load a Batch, the steps you will take are the same regardless of the type of Datasource or Data Connector you have set up. To learn more about Datasources, Data Connectors and Batch(es) see our Datasources Core Concepts Guide in the Core Concepts reference guide.

  1. Construct a BatchRequest

    from great_expectations.core.batch import BatchRequest

    # Here is an example BatchRequest for all batches associated with the specified DataAsset
    batch_request = BatchRequest(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
    )

    Since a BatchRequest can return multiple Batch(es), you can optionally provide additional parameters to filter the retrieved Batch(es). See the Datasources Core Concepts Guide for more information on filtering options besides batch_filter_parameters and limit, including custom filter functions and sampling (a sketch of a custom filter function follows the examples below). The example BatchRequests below show several non-exhaustive possibilities.

    # This BatchRequest adds a query and limit to retrieve only the first 5 batches from 2020
    data_connector_query_2020 = {
        "batch_filter_parameters": {"param_1_from_your_data_connector_eg_year": "2020"}
    }
    batch_request_2020 = BatchRequest(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
        data_connector_query=data_connector_query_2020,
        limit=5,
    )
    # Here is an example `data_connector_query` filtering based on parameters from `group_names`
    # previously defined in a regex pattern in your Data Connector:
    data_connector_query_202001 = {
        "batch_filter_parameters": {
            "param_1_from_your_data_connector_eg_year": "2020",
            "param_2_from_your_data_connector_eg_month": "01",
        }
    }
    # This BatchRequest will use the above filter to retrieve only the batch from Jan 2020
    batch_request_202001 = BatchRequest(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
        data_connector_query=data_connector_query_202001,
    )
    # Here is an example `data_connector_query` filtering based on an `index` which can be
    # any valid python slice. The example here is retrieving the latest batch using `-1`:
    data_connector_query_last_index = {
        "index": -1,
    }
    last_index_batch_request = BatchRequest(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
        data_connector_query=data_connector_query_last_index,
    )
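
    The custom filter functions mentioned above are not covered by the examples in this guide; the following is a minimal sketch (assuming the same placeholder group names as in the examples above, with a hypothetical function name) of passing a `custom_filter_function` in the `data_connector_query`. The function receives each Batch's batch identifiers and returns True for the Batches to keep:

    # A minimal sketch of a custom filter function (placeholder names as in the examples above):
    # the function receives a dictionary of batch identifiers and returns True to keep that Batch
    def keep_2020_first_quarter(batch_identifiers: dict) -> bool:
        return (
            batch_identifiers["param_1_from_your_data_connector_eg_year"] == "2020"
            and batch_identifiers["param_2_from_your_data_connector_eg_month"] in ["01", "02", "03"]
        )

    data_connector_query_custom = {"custom_filter_function": keep_2020_first_quarter}
    batch_request_custom = BatchRequest(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
        data_connector_query=data_connector_query_custom,
    )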

    You may also wish to list available Batches to verify that your BatchRequest is retrieving the correct Batch(es), or simply to see which Batches are available. You can use context.get_batch_list() for this purpose, which accepts a variety of flexible input types similar to those of a BatchRequest. Some examples are shown below:

    # List all Batches associated with the DataAsset
    batch_list_all_a = context.get_batch_list(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
    )
    # Alternatively you can use the previously created batch_request to achieve the same thing
    batch_list_all_b = context.get_batch_list(batch_request=batch_request)
    # You can use a query to filter the batch_list
    batch_list_202001_query = context.get_batch_list(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
        data_connector_query=data_connector_query_202001,
    )
    # Or limit to a specific number of batches
    batch_list_all_limit_10 = context.get_batch_list(
        datasource_name="insert_your_datasource_name_here",
        data_connector_name="insert_your_data_connector_name_here",
        data_asset_name="insert_your_data_asset_name_here",
        limit=10,
    )
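
    To see what a given BatchRequest or query actually retrieved, here is a minimal sketch (assuming the batch lists created above) that inspects each returned Batch's batch definition:

    # A minimal sketch: count the retrieved Batches and print their batch identifiers
    print(f"Retrieved {len(batch_list_202001_query)} Batch(es)")
    for batch in batch_list_202001_query:
        # each Batch carries a batch_definition with its datasource, data connector,
        # data asset name and batch identifiers
        print(batch.batch_definition)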
  2. Get access to your Batch via a Validator

    # First create an expectation suite to use with our validator
    context.create_expectation_suite(
        expectation_suite_name="test_suite", overwrite_existing=True
    )

    # Now create our validator
    validator = context.get_validator(
        batch_request=last_index_batch_request, expectation_suite_name="test_suite"
    )
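
    If you want to confirm which Batch the Validator loaded (for example, that the `index: -1` query above returned the latest Batch), a minimal sketch is to inspect the Validator's active batch definition:

    # A minimal sketch: the active batch is the Batch the Validator will run Expectations against
    print(validator.active_batch_definition)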
  3. Check your data

    You can check that the first few lines of the Batch you loaded into your Validator are what you expect by running:

    print(validator.head())

    Now that you have a Validator, you can use it to create Expectations or validate the data.
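
    For example, here is a minimal sketch of working with the Validator interactively (the column name is a hypothetical placeholder for a column in your data):

    # A minimal sketch: each Expectation runs immediately against the loaded Batch
    # and is added to the "test_suite" Expectation Suite
    validator.expect_column_values_to_not_be_null(column="insert_a_column_name_here")

    # Persist the updated Expectation Suite, keeping any Expectations that failed against this Batch
    validator.save_expectation_suite(discard_failed_expectations=False)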