Version: 1.3.0

Retrieve a Batch of sample data

Expectations can be individually validated against a Batch of data. This allows you to test newly created Expectations, or to create and validate Expectations to further your understanding of new data. But first, you must retrieve a Batch of data to validate your Expectations against.

GX provides two methods of retrieving sample data for testing or data exploration. The first is to request a Batch of data from any Batch Definition you have previously configured. The second is to use the built-in pandas_default Data Source to read in a Batch of data from a data file, such as a .csv or .parquet file, without first defining a corresponding Data Source, Data Asset, and Batch Definition.
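As a minimal sketch of the second method, the helper below reads a data file through the built-in pandas_default Data Source. The helper name read_sample_batch is hypothetical, `context` is assumed to be an existing Data Context, and the read_csv and read_parquet calls are assumed to be available on pandas_default in your GX version.

```python
from pathlib import Path


def read_sample_batch(context, file_path):
    """Read a data file into a Batch using the pandas_default Data Source.

    Hypothetical helper: `context` is assumed to be an existing GX Data Context.
    """
    pandas_default = context.data_sources.pandas_default
    suffix = Path(file_path).suffix.lower()

    # Choose the reader that matches the file extension.
    if suffix == ".csv":
        return pandas_default.read_csv(file_path)
    if suffix == ".parquet":
        return pandas_default.read_parquet(file_path)
    raise ValueError(f"Unsupported file type for this sketch: {suffix!r}")
```

With a real Data Context, a call such as read_sample_batch(context, "./data/sample.csv") (the path is a placeholder) would be expected to return a Batch that can be used like any other.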

Batch Definitions both organize a Data Asset's records into Batches and provide a method for retrieving those records. Any Batch Definition can be used to retrieve a Batch of records for use in testing Expectations or data exploration.

Prerequisites

Procedure

  1. Retrieve your Batch Definition.

    Update the values of data_source_name, data_asset_name, and batch_definition_name in the following code and execute it to retrieve your Batch Definition from the Data Context:

    Python
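    # `context` is assumed to be your existing Data Context,
    # for example: context = gx.get_context()
    data_source_name = "my_data_source"
    data_asset_name = "my_data_asset"
    batch_definition_name = "my_batch_definition"

    # Look up the Data Source, then the Data Asset, then the Batch Definition.
    batch_definition = (
        context.data_sources.get(data_source_name)
        .get_asset(data_asset_name)
        .get_batch_definition(batch_definition_name)
    )

  2. Optional. Specify the Batch to retrieve.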

    Some Batch Definitions can only provide a single Batch. Whole table Batch Definitions on SQL Data Assets, file path and whole directory Batch Definitions on filesystem Data Assets, and all Batch Definitions for dataframe Data Assets provide all of the Data Asset's records as a single Batch. For these Batch Definitions there is no need to specify which Batch to retrieve because only one is available.

    Yearly, monthly, and daily Batch Definitions subdivide the Data Asset's records by date. This allows you to retrieve the data corresponding to a specific date from the Data Asset. If you do not specify a Batch to retrieve, these Batch Definitions return the first valid Batch they find. With the default configuration (sort ascending) this is the most recent Batch; if the Batch Definition has been configured to sort descending, it is the oldest Batch.

    Sorting of records with invalid dates

    Records that are missing the date information necessary to be sorted into a Batch will be treated as the "oldest" records and will be returned first when a Batch Definition is set to sort descending.

    You are not limited to retrieving only the most recent (or oldest, if the Batch Definition is set to sort descending) Batch. You can also request a specific Batch by providing a Batch Parameter dictionary.

    The Batch Parameter dictionary is a dictionary with keys indicating the year, month, and day of the data to retrieve and with values corresponding to those date components.

    Which keys are valid Batch Parameters depends on the type of date the Batch Definition is configured for:

    • Yearly Batch Definitions accept the key year.
    • Monthly Batch Definitions accept the keys year and month.
    • Daily Batch Definitions accept the keys year, month, and day.

    If the Batch Parameter dictionary is missing a key, the returned Batch will be the first Batch (as determined by the Batch Definition's sort ascending or sort descending configuration) that matches the date components that were provided.
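The selection rule described above can be illustrated with a small conceptual model. This is a plain-Python sketch, not GX's implementation: the function select_batch is hypothetical, each Batch of a daily Batch Definition is represented by a datetime.date, and the handling of unsupported keys is an assumption made for the example.

```python
from datetime import date

VALID_KEYS = ("year", "month", "day")  # keys accepted by a daily Batch Definition


def select_batch(available, batch_parameters=None, sort_ascending=True):
    """Conceptual model only: pick the date of the Batch that would be returned.

    `available` holds one datetime.date per Batch of a daily Batch Definition.
    """
    params = batch_parameters or {}
    unknown = set(params) - set(VALID_KEYS)
    if unknown:
        raise ValueError(f"Unsupported Batch Parameter keys: {sorted(unknown)}")

    # Keep only the Batches whose date agrees with every provided component.
    matches = [
        d for d in available
        if all(getattr(d, key) == int(value) for key, value in params.items())
    ]
    if not matches:
        raise LookupError("No Batch matches the provided Batch Parameters")

    # Sort ascending returns the most recent match; sort descending the oldest.
    return max(matches) if sort_ascending else min(matches)


batches = [date(2019, 1, 1), date(2019, 1, 15), date(2019, 2, 3), date(2020, 5, 9)]
latest_2019 = select_batch(batches, {"year": 2019})
latest_jan_2019 = select_batch(batches, {"year": "2019", "month": "01"})
oldest_2019 = select_batch(batches, {"year": 2019}, sort_ascending=False)
```

In this model, providing only the year narrows the candidates to 2019 and the sort configuration then decides which of the remaining Batches is returned.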

    The following are some sample Batch Parameter dictionaries for progressively more specific dates:

    Python
    # If you're using File Data Assets, pass values as strings
    yearly_batch_parameters = {"year": "2019"}
    monthly_batch_parameters = {"year": "2019", "month": "01"}
    daily_batch_parameters = {"year": "2019", "month": "01", "day": "01"}

    # Otherwise, pass values as integers
    integer_daily_batch_parameters = {"year": 2019, "month": 1, "day": 1}

    Note that the value format depends on the type of Data Asset: pass strings for File Data Assets and integers for all other Data Assets.
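If the date you want is already available as a Python date object, a Batch Parameter dictionary in either format can be derived from it. The helper below is a hypothetical convenience function, not part of GX. It assumes the zero-padded string form shown in the samples above ("01") is what your File Data Asset expects, which depends on how your file names encode dates.

```python
from datetime import date

# Keys used by each type of date-based Batch Definition.
KEYS_BY_GRANULARITY = {
    "yearly": ("year",),
    "monthly": ("year", "month"),
    "daily": ("year", "month", "day"),
}

# strftime codes that produce "2019", "01", "01" style strings.
STRING_FORMATS = {"year": "%Y", "month": "%m", "day": "%d"}


def batch_parameters_for(day, granularity="daily", as_strings=False):
    """Build a Batch Parameter dictionary from a datetime.date.

    Set as_strings=True for File Data Assets; leave it False for integers.
    """
    keys = KEYS_BY_GRANULARITY[granularity]
    if as_strings:
        return {key: day.strftime(STRING_FORMATS[key]) for key in keys}
    return {key: getattr(day, key) for key in keys}


file_monthly = batch_parameters_for(date(2019, 1, 1), "monthly", as_strings=True)
sql_daily = batch_parameters_for(date(2019, 1, 1), "daily")
```

Here file_monthly matches the monthly_batch_parameters sample above and sql_daily matches integer_daily_batch_parameters.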

  3. Retrieve a Batch of data.

    The Batch Definition's .get_batch(...) method retrieves a Batch of data. The Batch Parameters you provide to this method determine whether it returns the first valid Batch or the Batch for a specific date.

    Execute the following code to retrieve the first available Batch from the Batch Definition:

    Python
    batch = batch_definition.get_batch()
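To request a specific Batch instead, the Batch Parameter dictionary can be passed to the same method. The sketch below assumes that get_batch accepts the dictionary through a batch_parameters keyword argument. The wrapper function get_batch_for is hypothetical and exists only to keep the example self-contained.

```python
def get_batch_for(batch_definition, batch_parameters=None):
    """Return the Batch selected by batch_parameters, or the default Batch.

    Hypothetical wrapper: `batch_definition` is assumed to be a GX Batch
    Definition retrieved as shown in step 1.
    """
    if not batch_parameters:
        # No parameters: the Batch Definition returns its first valid Batch.
        return batch_definition.get_batch()
    # Parameters provided: request the Batch for the matching date.
    return batch_definition.get_batch(batch_parameters=batch_parameters)
```

For example, get_batch_for(batch_definition, {"year": 2019, "month": 1, "day": 1}) would request the Batch for January 1, 2019 from a daily Batch Definition on a non-file Data Asset.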
  4. Optional. Verify that the returned Batch is populated with records.

    You can verify that your Batch Definition was able to read in data and return a populated Batch by printing the header and first few records of the returned Batch:

    Python
    print(batch.head())