Version: 1.0.5

Connect to Filesystem data

Filesystem data consists of data stored in file formats such as .csv or .parquet, and located in an environment with a folder hierarchy such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or local and networked filesystems. GX can leverage either pandas or Spark to read this data.

To connect to your Filesystem data, you first create a Data Source which tells GX where your data files reside. You then configure Data Assets for your Data Source to tell GX which sets of records you want to be able to access from your Data Source. Finally, you will define Batch Definitions which allow you to request all the records retrieved from a Data Asset or further partition the returned records based on a specified date.

Create a Data Source

Data Sources tell GX where your data is located and how to connect to it. With Filesystem data this is done by directing GX to the folder or online location that contains the data files. GX supports accessing Filesystem data from Amazon S3, Azure Blob Storage, Google Cloud Storage, and local or networked filesystems.

Local or networked filesystem
Amazon S3
Azure Blob Storage
Google Cloud Storage

Prerequisites

Python version 3.8 to 3.11
An installation of GX Core
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
A preconfigured Data Context
Access to data files in a local or networked directory.

Quick access to sample data

All Data Contexts include a built in pandas_default Data Source. This Data Source gives access to all of the read_*(...) methods available in pandas.

The read_*(...) methods of the pandas_default Data Source allow you to load data into GX without first configuring a Data Source, Data Asset, and Batch Definition. However, it does not save configurations for reading files to the Data Context and provides less versatility than a fully configured Data Source, Data Asset, and Batch Definition. Therefore, the pandas_default Data Source is intended to facilitate testing Expectations and engaging in data exploration but is less suited for use in production and automated workflows.

Procedure

Instructions
Sample code

Define the Data Source's parameters.

The following information is required when you create a Filesystem Data Source for a local or networked directory:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- base_directory: The path to the folder that contains the data files, or the root folder of the directory hierarchy that contains the data files.
If you are using a File Data Context, you can provide a path that is relative to the Data Context's base_directory. Otherwise, you should provide the absolute path to the folder that contains your data.

In this example, a relative path is defined for a folder that contains taxi trip data for New York City in .csv format:
Python
```
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
```

Add a Filesystem Data Source to your Data Context.

GX can leverage either pandas or Spark as the backend for your Filesystem Data Source. To create your Data Source, execute one of the following sets of code:

pandas
Spark

Python
data_source = context.data_sources.add_pandas_filesystem(
    name=data_source_name, base_directory=source_folder
)

Python
data_source = context.data_sources.add_spark_filesystem(
    name=data_source_name, base_directory=source_folder
)

Choose from the following to see the full example code for a local or networked Data Source, using either pandas or Spark to read the data files:

pandas example
Spark example

Python
import great_expectations as gx

context = gx.get_context()

# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"

# Create the Data Source:
data_source = context.data_sources.add_pandas_filesystem(
    name=data_source_name, base_directory=source_folder
)

Python
import great_expectations as gx

context = gx.get_context()

# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"

# Create the Data Source:
data_source = context.data_sources.add_spark_filesystem(
    name=data_source_name, base_directory=source_folder
)

Prerequisites

Python version 3.8 to 3.11
An installation of GX Core with support for Amazon S3 dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
A preconfigured Data Context
Access to data files on a S3 bucket.

Procedure

Instructions
Sample code

Define the Data Source's parameters.

The following information is required when you create an Amazon S3 Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
Replace the variable values with your own and run the following Python code to define name, bucket_name and boto3_options:
Python
```
data_source_name = "my_filesystem_data_source"
bucket_name = "superconductive-docs-test"
boto3_options = {}
```
Additional options for boto3_options
The parameter boto3_options allows you to pass the following information:
- region_name: Your AWS region name.
- endpoint_url: specifies an S3 endpoint. You can provide an environment variable reference such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the environment variable S3_ENDPOINT.
For more information on secure storage and retrieval of credentials in GX see Configure credentials.

Add a S3 Filesystem Data Source to your Data Context.

GX can leverage either pandas or Spark as the backend for your S3 Filesystem Data Source. To create your Data Source, execute one of the following sets of code:

pandas
Spark

Python
data_source = context.data_sources.add_pandas_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)

Python
data_source = context.data_sources.add_spark_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)

Choose from the following to see the full example code for a S3 Filesystem Data Source, using either pandas or Spark to read the data files:

pandas example
Spark example

Python
import great_expectations as gx

context = gx.get_context()

# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "superconductive-docs-test"
boto3_options = {}

# Create the Data Source:
data_source = context.data_sources.add_pandas_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)

Python
import great_expectations as gx

context = gx.get_context()

# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "superconductive-docs-test"
boto3_options = {}

# Create the Data Source:
data_source = context.data_sources.add_spark_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)

Prerequisites

Python version 3.8 to 3.11
An installation of GX Core with support for Azure Blob Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
A preconfigured Data Context
Access to data files in Azure Blob Storage.

Procedure

Instructions
Sample code

Define the Data Source's parameters.

The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- azure_options: Authentication settings.
The azure_options parameter accepts a dictionary which should include two keys: credential and either account_url or conn_str.
- credential: An Azure Blob Storage token
- account_url: The url of your Azure Blob Storage account. If you provide this then conn_str should be left out of the dictionary.
- conn_str: The connection string for your Azure Blob Storage account. If you provide this then account_url should not be included in the dictionary.
To keep your credentials secure you can define them as environment variables or entries in config_variables.yml. For more information on secure storage and retrieval of credentials in GX see Configure credentials.

Update the variables in the following code and execute it to define name and azure_options. In this example, the value for account_url is pulled from the environment variable AZURE_STORAGE_ACCOUNT_URL and the value for credential is pulled from the environment variable AZURE_CREDENTIAL:
Python
```
data_source_name = "my_filesystem_data_source"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
```

Add an Azure Blob Storage Data Source to your Data Context.

GX can leverage either pandas or Spark as the backend for your Azure Blob Storage Data Source. To create your Data Source, execute one of the following sets of code:

pandas
Spark

Python
# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
    name=data_source_name, azure_options=azure_options
)

Python
data_source = context.data_sources.add_spark_abs(
    name=data_source_name, azure_options=azure_options
)

Choose from the following to see the full example code for a S3 Filesystem Data Source, using either pandas or Spark to read the data files:

pandas example
Spark example

Python
import great_expectations as gx

context = gx.get_context()

# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}

# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
    name=data_source_name, azure_options=azure_options
)

Python
import great_expectations as gx

context = gx.get_context()

# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}

# Create the Data Source:
data_source = context.data_sources.add_spark_abs(
    name=data_source_name, azure_options=azure_options
)

Prerequisites

Python version 3.8 to 3.11
An installation of GX Core with support for Google Cloud Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
A preconfigured Data Context
Access to data files in Google Cloud Storage.

Procedure

Instructions
Sample code

Set up the Data Source's credentials.

By default, GCS credentials are handled through the gcloud command line tool and the GOOGLE_APPLICATION_CREDENTIALS environment variable. The gcloud command line tool is used to set up authentication credentials, and the GOOGLE_APPLICATION_CREDENTIALS environment variable provides the path to the json file with those credentials.

For more information on using the gcloud command line tool, see Google Cloud's Cloud Storage client libraries documentation.
Define the Data Source's parameters.

The following information is required when you create a GCS Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. A dictionary that can be used to specify an alternative method for providing GCS credentials.
The gcs_options dictionary can be left empty if the default GOOGLE_APPLICATION_CREDENTIALS environment variable is populated. Otherwise, the gcs_options dictionary should have either the key filename or the key info.
- filename: The value of this key should be the specific path to your credentials json. If you provide this then the info key should be left out of the dictionary.
- info: The actual JSON data from your credentials file in the form of a string. If you provide this then the filename key should not be included in the dictionary.
Update the variables in the following code and execute it to define name, bucket_or_name, and gcs_options. In this example the default GOOGLE_APPLICATION_CREDENTIALS environment variable points to the location of the credentials json and therefore the gcs_options dictionary is left empty:
Python
```
data_source_name = "my_filesystem_data_source"
bucket_or_name = "test_docs_data"
gcs_options = {}
```

Add a Google Cloud Storage Data Source to your Data Context.

GX can leverage either pandas or Spark as the backend for your Google Cloud Storage Data Source. To create your Data Source, execute one of the following sets of code:

pandas
Spark

Python
data_source = context.data_sources.add_pandas_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)

Python
data_source = context.data_sources.add_spark_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)

Choose from the following to see the full example code for a S3 Filesystem Data Source, using either pandas or Spark to read the data files:

pandas example
Spark example

Python
data_source = context.data_sources.add_pandas_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)

Python
data_source = context.data_sources.add_spark_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)

Create a Data Asset

A Data Asset is a collection of related records within a Data Source. These records may be located within multiple files, but each Data Asset is only capable of reading a single specific file format which is determined when it is created. However, a Data Source may contain multiple Data Assets covering different file formats and groups of records.

GX provides two types of Data Assets for Filesystem Data Sources: File Data Assets and Directory Data Assets.

File Data Asset
Directory Data Asset

File Data Assets are used to retrieve data from individual files in formats such as .csv or .parquet. The file format that can be read by a File Data Asset is determined when the File Data Asset is created. The specific file that is read is determind by Batch Definitions that are added to the Data Asset after it is created.

Both Spark and pandas Filesystem Data Sources support File Data Assets for all supported Filesystem environments.

Directory Data Assets read one or more files in formats such as .csv or .parquet. The file format that can be read by a Directory Data Asset is determined when the Directory Data Asset is created. The data in the corresponding files is concatonated into a single table which can be retrieved as a whole, or further partitioned based on the value of a datetime field.

Spark Filesystem Data Sources support Directory Data Assets for all supported Filesystem environments. However, pandas Filesystem Data Sources do not support Directory Data Assets at all.

Local or networked filesystem
Amazon S3
Azure Blob Storage
Google Cloud Storage

Prerequisites

Python version 3.8 to 3.11.
An installation of GX Core.
A preconfigured Data Context.
Access to data files (such as .csv or .parquet files) in a local or networked folder hierarchy.
A pandas or Spark Filesystem Data Source configured for local or networked data files.

Procedure

Instructions
Sample code

Retrieve your Data Source.

Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

A File Data Asset for files in a local or networked folder hierarchy only needs one piece of information to be created.
- name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
This example uses taxi trip data stored in .csv files, so the name "taxi_csv_files" will be used for the Data Asset:
Python
```
asset_name = "taxi_csv_files"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
Python
```
file_csv_asset = data_source.add_csv_asset(name=asset_name)
```

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "taxi_csv_files"

# Add the Data Asset to the Data Source:
file_csv_asset = data_source.add_csv_asset(name=asset_name)

Prerequisites

Procedure

Instructions
Sample code

Retrieve your Data Source.

Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

A Directory Data Asset for files in a local or networked folder hierarchy only needs two pieces of information to be created.
- name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
- data_directory: The path of the containing the data files for the Data Asset. This path can be relative to the Data Source's base_directory.
This example uses taxi trip data stored in .csv files in the data/ folder within the Data Source's directory tree:
Python
```
asset_name = "taxi_csv_directory"
data_directory = "./data"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.

To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.

The following example creates a Data Asset that can read .csv file data:
Python
```
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, data_directory=data_directory
)
```

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "taxi_csv_directory"
data_directory = "./data"

# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, data_directory=data_directory
)

Prerequisites

Python version 3.8 to 3.11.
An installation of GX Core with support for Amazon S3 dependencies.
A preconfigured Data Context.
Access to data files in S3.
A Filesystem Data Source configured to access data files in S3.

Procedure

Instructions
Sample code

Retrieve your Data Source.

Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

context = gx.get_context()

# This example uses a File Data Context which already has
#  a Data Source defined.
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

To define a File Data Asset for data in an S3 bucket you provide the following elements:
- asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
- s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Sources S3 bucket:
Python
```
asset_name = "s3_taxi_csv_file_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
Python
```
s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)
```

Python
import great_expectations as gx

context = gx.get_context()

# This example uses a File Data Context which already has
#  a Data Source defined.
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_file_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"

# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)

Prerequisites

Procedure

Instructions
Sample code

Retrieve your Data Source.

Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

To define a Directory Data Asset for data in an S3 bucket you provide the following elements:
- asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
- s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
- data_directory: The path of the folder containing data files for the Data asset, relative to the root of the S3 bucket.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Sources S3 bucket:
Python
```
asset_name = "s3_taxi_csv_directory_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.

To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.

The following example creates a Data Asset that can read .csv file data:
Python
```
s3_file_data_asset = data_source.add_directory_csv_asset(
    name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
)
```

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_directory_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"

# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_directory_csv_asset(
    name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
)

Prerequisites

Python version 3.8 to 3.11.
An installation of GX Core with support for Azure Blob Storage dependencies.
A preconfigured Data Context.
Access to data files in Azure Blob Storage.
A pandas or Spark Filesystem Data Source configured for Azure Blob Storage data files.

Procedure

Instructions
Sample code

Retrieve your Data Source.

Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

# This example uses a File Data Context
#  which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

To define a File Data Asset for Azure Blob Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: The path to the data files for the Data Asset, relative to the root of the abs_container.
- abs_recursive_file_discovery: (Optional) A boolean (True/False) indicating if files should be searched recursively from subfolders. The default is False.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Azure Blob Storage container:
Python
```
asset_name = "abs_file_csv_asset"
abs_container = "superconductive-public"
abs_prefix = "data/taxi_yellow_tripdata_samples/"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
Python
```
file_asset = data_source.add_csv_asset(
    name=asset_name, abs_container=abs_container, abs_name_starts_with=abs_prefix
)
```

Python
import great_expectations as gx

# This example uses a File Data Context
#  which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "abs_file_csv_asset"
abs_container = "superconductive-public"
abs_prefix = "data/taxi_yellow_tripdata_samples/"

# Add the Data Asset to the Data Source:
file_asset = data_source.add_csv_asset(
    name=asset_name, abs_container=abs_container, abs_name_starts_with=abs_prefix
)

Prerequisites

Procedure

Instructions
Sample code

Retrieve your Data Source. Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

# This example uses a File Data Context
#  which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

To define a Directory Data Asset for Azure Blob Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: The path to the data files for the Data Asset in the Azure Blob Storage container. This should be relative to the root of the abs_container.
- data_directory: The path of the folder containing data files for the Data asset, relative to the root of the abs_container.
- abs_recursive_file_discovery: (Optional) A boolean (True/False) indicating if files should be searched recursively from subfolders. The default is False.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Azure Blob Storage container:
Python
```
asset_name = "abs_directory_asset"
abs_container = "superconductive-public"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.

To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.

The following example creates a Data Asset that can read .csv file data:
Python
```
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    data_directory=data_directory,
)
```

Python
import great_expectations as gx

# This example uses a File Data Context
#  which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "abs_directory_asset"
abs_container = "superconductive-public"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"

# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    data_directory=data_directory,
)

Prerequisites

Python version 3.8 to 3.11.
An installation of GX Core with support for Google Cloud Storage dependencies.
A preconfigured Data Context.
Access to data files in Google Cloud Storage.
A pandas or Spark Filesystem Data Source configured for Google Cloud Storage data files.

Procedure

Instructions
Sample code

Retrieve your Data Source.

Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

To define a File Data Asset for Google Cloud Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future. This should be unique among Data Assets on the same Data Source.
- gcs_prefix: The beginning of the object key name.
- gcs_delimiter: Optional. A character used to define the hierarchical structure of object keys within a bucket (default is "/").
- gcs_recursive_file_discovery: Optional. A boolean indicating if files should be searched recursively from subfolders (default is False).
- gcs_max_results: Optional. The maximum number of keys in a single response (default is 1000).
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Google Cloud Storage Data Source:
Python
```
asset_name = "gcs_taxi_csv_file_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
Python
```
file_csv_asset = data_source.add_csv_asset(name=asset_name, gcs_prefix=gcs_prefix)
```

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "gcs_taxi_csv_file_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"

# Add the Data Asset to the Data Source:
file_csv_asset = data_source.add_csv_asset(name=asset_name, gcs_prefix=gcs_prefix)

Prerequisites

Procedure

Instructions
Sample code

Retrieve your Data Source.

Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

Define your Data Asset's parameters.

To define a Directory Data Asset for Google Cloud Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future. This should be unique among Data Assets on the same Data Source.
- data_directory: The full path from your bucket root for the folder containing the data files.
- gcs_prefix: The beginning of the object key name.
- gcs_delimiter: Optional. A character used to define the hierarchical structure of object keys within a bucket (default is "/").
- gcs_recursive_file_discovery: Optional. A boolean indicating if files should be searched recursively from subfolders (default is False).
- gcs_max_results: Optional. The maximum number of keys in a single response (default is 1000).
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Google Cloud Storage Data Source:
Python
```
asset_name = "gcs_taxi_csv_directory_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
```
Add the Data Asset to your Data Source.

A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.

To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.

The following example creates a Data Asset that can read .csv file data:
Python
```
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, gcs_prefix=gcs_prefix, data_directory=data_directory
)
```

Python
import great_expectations as gx

# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)

# Define the Data Asset's parameters:
asset_name = "gcs_taxi_csv_directory_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"

# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, gcs_prefix=gcs_prefix, data_directory=data_directory
)

Create a Batch Definition

A Batch Definition determines which records in a Data Asset are retrieved for Validation. Batch Definitions can be configured to either provide all of the records in a Data Asset, or to subdivide the Data Asset based on a date.

File Data Asset
Directory Data Asset

Batch Definitions for File Data Assets can be configured to return the content of a specific file based on either a file path or a regex match for dates in the name of the file.

Prerequisites

A preconfigured Data Context. The variable context is used for your Data Context in the following example code.
A File Data Asset on a Filesystem Data Source.

Instructions
Sample code

Retrieve your Data Asset.

Replace the value of data_source_name with the name of your Data Source and the value of asset_name with the name of your Data Asset in the following code. Then execute it to retrieve an existing Data Source and Data Asset from your Data Context:
Python
```
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
```
Add a Batch Definition to the Data Asset.

A path Batch Definition returns all of the records in a specific data file as a single Batch. A partitioned Batch Definition will return the records of a single file in the Data Asset based on which file name matches a regex.
- Path
- Partitioned
To define a path Batch Definition you need to provide the following information:
- name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
- path: The path within the Data Asset of the data file containing the records to return.
Update the batch_definition_name and batch_definition_path variables and execute the following code to add a path Batch Definition to your Data Asset:
Python
batch_definition_name = "yellow_tripdata_sample_2019-01.csv" batch_definition_path = "folder_with_data/yellow_tripdata_sample_2019-01.csv" batch_definition = file_data_asset.add_batch_definition_path( name=batch_definition_name, path=batch_definition_path )
GX Core currently supports partitioning File Data Assets based on dates. The files can be returned by year, month, or day.
Yearly
Monthly
Daily
For example, say your Data Asset contains the following files with year dates in the file names:

yellow_tripdata_sample_2019.csv

yellow_tripdata_sample_2020.csv

yellow_tripdata_sample_2021.csv

You can create a regex that will match these files by replacing the year in the file names with a named regex matching pattern. This pattern's group should be named year.
For the above three files, the regex pattern would be:
Regular Expression
yellow_tripdata_sample_(?P<year>\d{4})\.csv
Update the batch_definition_name and batch_definition_regex variables in the following code, then execute it to create a yearly Batch Definition:
batch_definition_name = "yearly_yellow_tripdata_sample" batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})\.csv" batch_definition = file_data_asset.add_batch_definition_yearly( name=batch_definition_name, regex=batch_definition_regex )
Optional. Verify the Batch Definition is valid.
- Path
- Partitioned
A path Batch Definition always returns all records in a specific file as a single Batch. Therefore you do not need to provide any additional parameters to retrieve data from a path Batch Definition.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
Python
batch = batch_definition.get_batch() batch.head()
When retrieving a Batch from a partitioned Batch Definition, you can specify the date of the data to retrieve by providing a batch_parameters dictionary with keys that correspond to the regex matching groups in the Batch Definition. If you do not specify a date, the most recent date in the data is returned by default.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
Yearly
Monthly
Daily
Python
batch = batch_definition.get_batch(batch_parameters={"year": "2019"}) batch.head()
Optional. Create additional Batch Definitions.

A Data Asset can have multiple Batch Definitions as long as each Batch Definition has a unique name within that Data Asset. Repeat this procedure to add additional path or partitioned Batch Definitions to your Data Asset.

Full example code for path Batch Definitions and partitioned yearly, monthly, or daily Batch Definitions:

Path
Yearly
Monthly
Daily

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "yellow_tripdata_sample_2019-01.csv"
batch_definition_path = "folder_with_data/yellow_tripdata_sample_2019-01.csv"

batch_definition = file_data_asset.add_batch_definition_path(
    name=batch_definition_name, path=batch_definition_path
)

batch = batch_definition.get_batch()
batch.head()

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "yearly_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})\.csv"

batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, regex=batch_definition_regex
)

batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
batch.head()

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "monthly_yellow_tripdata_sample"
batch_definition_regex = (
    r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)

batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, regex=batch_definition_regex
)

batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
batch.head()

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "daily_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv"

batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, regex=batch_definition_regex
)

batch = batch_definition.get_batch(
    batch_parameters={"year": "2019", "month": "01", "day": "01"}
)
batch.head()

Batch Definitions for a Directory Data Asset can be configured to return all of the records for the files in the Data Asset, or to subdivide the Data Asset's records on the content of a Datetime field and only return the records that correspond to a specific year, month, or day.

Prerequisites

A preconfigured Data Context. The variable context is used for your Data Context in the following example code.
A File Data Asset on a Filesystem Data Source.

Procedure

Instructions
Sample code

Retrieve your Data Asset.

Replace the value of data_source_name with the name of your Data Source and the value of asset_name with the name of your Data Asset in the following code. Then execute it to retrieve an existing Data Source and Data Asset from your Data Context:
Python
```
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
```
Add a Batch Definition to the Data Asset.

A whole directory Batch Definition returns all of the records in a data file as a single Batch. A partitioned directory Batch Definition will subdivide the records according to a datetime field and return those records that match a specified year, month, or day.
- Whole directory
- Partitioned
Because a whole directory Batch Definition returns the records from all of the files it can read in the Data Asset you only need to provide one addditional piece of information to define one:
- name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
Update the batch_definition_name variable and execute the following code to create a whole directory Batch Definition:
Python
batch_definition_name = "yellow_tripdata" batch_definition = file_data_asset.add_batch_definition_whole_directory( name=batch_definition_name )
GX Core currently supports partitioning Directory Data Assets based on a datetime field. Therefore, to define a partitioned Directory Batch Definition you need to provide two pieces of information:
- name:A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
- column: The datetime column that records should be subdivided on.
The Batch Definition can be configured to return records by year, month, or day.
Yearly
Monthly
Daily
Update the batch_definition_name and batch_definition_column variables in the following code, then execute it to create a Batch Definition that subdivides the records in a directory by year:
Python
batch_definition_name = "yellow_tripdata_sample_yearly" batch_definition_column = "pickup_datetime" batch_definition = file_data_asset.add_batch_definition_yearly( name=batch_definition_name, column=batch_definition_column )
Optional. Verify the Batch Definition is valid.
- Whole directory
- Partitioned
A whole directory Batch Definition always returns all available records as a single Batch. Therefore you do not need to provide any additional parameters to retrieve data from a path Batch Definition.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
Python
batch = batch_definition.get_batch() batch.head()
When retrieving a Batch from a partitioned Batch Definition, you can specify the date of the data to retrieve by providing a batch_parameters dictionary with keys that correspond to the regex groups in the Batch Definition. If you do not specify a date, the most recent date in the data is returned by default.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
Yearly
Monthly
Daily
Python
batch = batch_definition.get_batch(batch_parameters={"year": "2019"}) batch.head()
Optional. Create additional Batch Definitions.

A Data Asset can have multiple Batch Definitions as long as each Batch Definition has a unique name within that Data Asset. Repeat this procedure to add additional whole directory or partitioned Batch Definitions to your Data Asset.

Full example code for whole directory Batch Definitions and partitioned yearly, monthly, or daily Batch Definitions:

Whole directory
Yearly
Monthly
Daily

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "yellow_tripdata"
batch_definition = file_data_asset.add_batch_definition_whole_directory(
    name=batch_definition_name
)

batch = batch_definition.get_batch()
batch.head()

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "yellow_tripdata_sample_yearly"
batch_definition_column = "pickup_datetime"

batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, column=batch_definition_column
)

batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
batch.head()

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "yellow_tripdata_sample_monthly"
batch_definition_column = "pickup_datetime"

batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, column=batch_definition_column
)

batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
batch.head()

Full sample code
import great_expectations as gx

context = gx.get_context()

data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)

batch_definition_name = "yellow_tripdata_sample_daily"
batch_definition_column = "pickup_datetime"

batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, column=batch_definition_column
)

batch = batch_definition.get_batch(
    batch_parameters={"year": "2019", "month": "01", "day": "01"}
)
batch.head()

Create a Data Source​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Create a Data Asset​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Prerequisites​

Procedure​

Create a Batch Definition​

Prerequisites​

Prerequisites​

Procedure​

Create a Data Source

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Create a Data Asset

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Prerequisites

Procedure

Create a Batch Definition

Prerequisites

Prerequisites

Procedure