How to load a Pandas DataFrame as a Batch
This guide will help you load a Pandas DataFrame as a Batch for use in creating Expectations.
V2 (Batch Kwargs) API
Prerequisites: This how-to guide assumes you have already:
- Configured a Pandas/filesystem Datasource
- Identified a Pandas DataFrame that you would like to use as the data to validate.
Load or create a Data Context
The `context` referenced below can be loaded from disk or configured in code.

Load an on-disk Data Context via:

```python
import great_expectations as ge

context = ge.get_context()
```

Create an in-code Data Context using these instructions: How to instantiate a Data Context without a yml file
Obtain an Expectation Suite
```python
suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
```
Alternatively, if you have not already created a suite, you can do so now.
```python
suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
```
Construct batch_kwargs and get a Batch
`batch_kwargs` describe the data you plan to validate. Here we are using a Datasource you have configured and are passing in a DataFrame under the `"dataset"` key.

```python
batch_kwargs = {
    "datasource": "insert_your_datasource_name_here",
    "dataset": insert_your_dataframe_here,
    "data_asset_name": "optionally_insert_your_data_asset_name_here",
}
```
Then we get the Batch via:
```python
batch = context.get_batch(
    batch_kwargs=batch_kwargs,
    expectation_suite_name=suite
)
```
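As a concrete sketch of the step above, here is what `batch_kwargs` might look like for a small in-memory DataFrame. The datasource and data asset names are hypothetical placeholders, not values from your project:

```python
import pandas as pd

# A small hypothetical DataFrame standing in for the data you want to validate.
df = pd.DataFrame({"user_id": [1, 2, 3], "amount": [9.99, 24.50, 3.15]})

# batch_kwargs point Great Expectations at the data: the name of a configured
# Datasource plus the DataFrame itself under the "dataset" key.
batch_kwargs = {
    "datasource": "my_pandas_datasource",  # hypothetical datasource name
    "dataset": df,
    "data_asset_name": "transactions",     # optional, hypothetical asset name
}
```

This dict would then be passed to `context.get_batch` exactly as shown above.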
Check your data
You can check that the first few lines of your Batch are what you expect by running:
```python
batch.head()
```
Now that you have a Batch, you can use it to create Expectations or validate the data.
V3 (Batch Request) API
What the old API called a “batch” has been replaced by the Validator. A Validator knows how to validate a particular batch of data on a particular Execution Engine against a particular Expectation Suite. In interactive mode, the Validator can store and update an Expectation Suite while conducting Data Discovery or Exploratory Data Analysis.
You can read more about the core classes that make Great Expectations run in our Core Concepts reference guide.
Prerequisites: This how-to guide assumes you have already:
- Identified a Pandas DataFrame that you would like to use as the data to validate.
Load or create a Data Context
The `context` referenced below can be loaded from disk or configured in code.

Load an on-disk Data Context via:

```python
import great_expectations as ge

context = ge.get_context()
```

Create an in-code Data Context using these instructions: How to instantiate a Data Context without a yml file
Configure a Datasource
Configure a Datasource using the RuntimeDataConnector to connect to your DataFrame. Since we are reading a Pandas DataFrame, we use the `PandasExecutionEngine`. You can use `runtime_keys` to define what data you are able to attach as additional metadata to your DataFrame using the `partition_request` parameter (shown in step 3).

```yaml
insert_your_pandas_datasource_name_here:
  class_name: Datasource
  execution_engine:
    class_name: PandasExecutionEngine
  data_connectors:
    insert_your_runtime_data_connector_name_here:
      module_name: great_expectations.datasource.data_connector
      class_name: RuntimeDataConnector
      runtime_keys:
        - some_key_maybe_pipeline_stage
        - some_other_key_maybe_run_id
```
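If you prefer configuring the Datasource in code rather than YAML, the same configuration can be expressed as a Python dict. This is a sketch: the names are the same placeholders used above, and it assumes you pass the fields to `context.add_datasource` as keyword arguments:

```python
# Python-dict form of the YAML datasource configuration above.
# Apply it with, e.g.:
#   context.add_datasource("insert_your_pandas_datasource_name_here",
#                          **datasource_config)
datasource_config = {
    "class_name": "Datasource",
    "execution_engine": {"class_name": "PandasExecutionEngine"},
    "data_connectors": {
        "insert_your_runtime_data_connector_name_here": {
            "module_name": "great_expectations.datasource.data_connector",
            "class_name": "RuntimeDataConnector",
            # Only these keys may later appear in partition_identifiers.
            "runtime_keys": [
                "some_key_maybe_pipeline_stage",
                "some_other_key_maybe_run_id",
            ],
        }
    },
}
```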
Obtain an Expectation Suite
```python
suite = context.get_expectation_suite("insert_your_expectation_suite_name_here")
```
Alternatively, if you have not already created a suite, you can do so now.
```python
suite = context.create_expectation_suite("insert_your_expectation_suite_name_here")
```
Construct a BatchRequest
We will create a `BatchRequest` and pass it our DataFrame via the `batch_data` argument.

Attributes inside the `partition_request` are optional - you can use them to attach additional metadata to your DataFrame. When configuring the Data Connector, you used `runtime_keys` to define which keys are allowed.

NOTE: for now, `data_asset_name` can only be set to this predefined string: `"IN_MEMORY_DATA_ASSET"`. We will fix it very soon and will allow you to specify your own name.

```python
from great_expectations.core.batch import BatchRequest

batch_request = BatchRequest(
    datasource_name="insert_your_pandas_datasource_name_here",
    data_connector_name="insert_your_runtime_data_connector_name_here",
    batch_data=insert_your_dataframe_here,
    data_asset_name="IN_MEMORY_DATA_ASSET",
    partition_request={
        "partition_identifiers": {
            "some_key_maybe_pipeline_stage": "ingestion step 1",
            "some_other_key_maybe_run_id": "run 18"
        }
    }
)
```
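Because only keys declared under `runtime_keys` in the Data Connector configuration are accepted in `partition_identifiers`, a quick sanity check before building the request can catch typos early. This is a sketch using the example key names from the configuration above:

```python
# runtime_keys declared in the Data Connector configuration above.
runtime_keys = {"some_key_maybe_pipeline_stage", "some_other_key_maybe_run_id"}

# Metadata you intend to attach to the DataFrame via partition_request.
partition_identifiers = {
    "some_key_maybe_pipeline_stage": "ingestion step 1",
    "some_other_key_maybe_run_id": "run 18",
}

# Keys outside runtime_keys are typically rejected by the RuntimeDataConnector,
# so verify there are none before constructing the BatchRequest.
unknown_keys = set(partition_identifiers) - runtime_keys
assert not unknown_keys, f"keys not declared as runtime_keys: {unknown_keys}"
```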
Construct a Validator
```python
my_validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite=suite
)
```
Check your data
You can check that the first few lines of your Batch are what you expect by running:
```python
my_validator.active_batch.head()
```
Now that you have a Validator, you can use it to create Expectations or validate the data.