Connect to Filesystem data
Filesystem data consists of data stored in file formats such as .csv or .parquet, and located in an environment with a folder hierarchy such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or local and networked filesystems.  GX can leverage either pandas or Spark to read this data.
To connect to your Filesystem data, you first create a Data Source which tells GX where your data files reside. You then configure Data Assets for your Data Source to tell GX which sets of records you want to be able to access from your Data Source. Finally, you will define Batch Definitions which allow you to request all the records retrieved from a Data Asset or further partition the returned records based on a specified date.
Create a Data Source
Data Sources tell GX where your data is located and how to connect to it. With Filesystem data this is done by directing GX to the folder or online location that contains the data files. GX supports accessing Filesystem data from Amazon S3, Azure Blob Storage, Google Cloud Storage, and local or networked filesystems.
- Local or networked filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Prerequisites
- Python version 3.9 to 3.13
- An installation of GX Core
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in a local or networked directory.
All Data Contexts include a built in pandas_default Data Source.  This Data Source gives access to all of the read_*(...) methods available in pandas.
The read_*(...) methods of the pandas_default Data Source allow you to load data into GX without first configuring a Data Source, Data Asset, and Batch Definition.  However, it does not save configurations for reading files to the Data Context and provides less versatility than a fully configured Data Source, Data Asset, and Batch Definition. Therefore, the pandas_default Data Source is intended to facilitate testing Expectations and engaging in data exploration but is less suited for use in production and automated workflows.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
  The following information is required when you create a Filesystem Data Source for a local or networked directory:
  - name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
  - base_directory: The path to the folder that contains the data files, or the root folder of the directory hierarchy that contains the data files.
  If you are using a File Data Context, you can provide a path that is relative to the Data Context's base_directory. Otherwise, you should provide the absolute path to the folder that contains your data.
  In this example, a relative path is defined for a folder that contains taxi trip data for New York City in .csv format:
    source_folder = "./data"
    data_source_name = "my_filesystem_data_source"
- Add a Filesystem Data Source to your Data Context.
  GX can leverage either pandas or Spark as the backend for your Filesystem Data Source. To create your Data Source, execute one of the following sets of code:
  pandas:
    data_source = context.data_sources.add_pandas_filesystem(
        name=data_source_name, base_directory=source_folder
    )
  Spark:
    data_source = context.data_sources.add_spark_filesystem(
        name=data_source_name, base_directory=source_folder
    )
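The difference between a relative and an absolute base_directory can be illustrated with the standard library alone. The following is a minimal sketch, not GX code: resolve_base_directory and context_root are hypothetical names, the example paths are made-up POSIX-style paths, and the sketch assumes that a relative path is anchored at the Data Context's base_directory as described above.

```python
from pathlib import Path


def resolve_base_directory(source_folder: str, context_root: str) -> Path:
    """Return an absolute path for source_folder.

    Hypothetical helper: relative paths are anchored at context_root,
    absolute paths are returned unchanged.
    """
    folder = Path(source_folder)
    if folder.is_absolute():
        return folder
    return (Path(context_root) / folder).resolve()


# Relative path, anchored at an assumed Data Context directory.
relative = resolve_base_directory("./data", "/projects/taxi/gx")

# Absolute path, used as-is regardless of the Data Context directory.
absolute = resolve_base_directory("/mnt/shared/taxi_data", "/projects/taxi/gx")
```

If the same configuration has to work from several working directories or machines, an absolute path avoids any dependence on where the Data Context lives.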
Choose from the following to see the full example code for a local or networked Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
# Create the Data Source:
data_source = context.data_sources.add_pandas_filesystem(
    name=data_source_name, base_directory=source_folder
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
# Create the Data Source:
data_source = context.data_sources.add_spark_filesystem(
    name=data_source_name, base_directory=source_folder
)
Prerequisites
- Python version 3.9 to 3.13
- An installation of GX Core with support for Amazon S3 dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in an S3 bucket.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
  The following information is required when you create an Amazon S3 Data Source:
  - name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
  - bucket_name: The Amazon S3 bucket name.
  - boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
  Replace the variable values with your own and run the following Python code to define data_source_name, bucket_name, and boto3_options:
    data_source_name = "my_filesystem_data_source"
    bucket_name = "great-expectations-docs-test"
    boto3_options = {}
  Additional options for boto3_options
  The boto3_options parameter allows you to pass the following information:
  - region_name: Your AWS region name.
  - endpoint_url: Specifies an S3 endpoint. You can provide an environment variable reference such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the environment variable S3_ENDPOINT.
  For more information on secure storage and retrieval of credentials in GX see Configure credentials.
- Add an S3 Filesystem Data Source to your Data Context.
  GX can leverage either pandas or Spark as the backend for your S3 Filesystem Data Source. To create your Data Source, execute one of the following sets of code:
  pandas:
    data_source = context.data_sources.add_pandas_s3(
        name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
    )
  Spark:
    data_source = context.data_sources.add_spark_s3(
        name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
    )
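The "${S3_ENDPOINT}" substitution described above is performed by GX itself. The following sketch only imitates the idea with Python's string.Template so the effect is easy to see; it is not GX's implementation. The helper substitute_env, the region name, and the endpoint URL are assumptions made for illustration.

```python
import os
from string import Template

# Assumed value, set here only so the sketch is self-contained.
os.environ["S3_ENDPOINT"] = "https://s3.example.com"

# The configuration stores a reference to the variable, not its value.
boto3_options = {
    "region_name": "us-east-1",
    "endpoint_url": "${S3_ENDPOINT}",
}


def substitute_env(options: dict) -> dict:
    """Replace ${NAME} references in string values with environment variables."""
    return {
        key: Template(value).substitute(os.environ) if isinstance(value, str) else value
        for key, value in options.items()
    }


resolved_options = substitute_env(boto3_options)
```

The stored dictionary still contains only the "${S3_ENDPOINT}" reference, which is what keeps the actual endpoint out of saved configuration and source control.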
Choose from the following to see the full example code for an S3 Filesystem Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "great-expectations-docs-test"
boto3_options = {}
# Create the Data Source:
data_source = context.data_sources.add_pandas_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "great-expectations-docs-test"
boto3_options = {}
# Create the Data Source:
data_source = context.data_sources.add_spark_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
Prerequisites
- Python version 3.9 to 3.13
- An installation of GX Core with support for Azure Blob Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in Azure Blob Storage.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
  The following information is required when you create a Microsoft Azure Blob Storage Data Source:
  - name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
  - azure_options: Authentication settings.
  The azure_options parameter accepts a dictionary which should include two keys: credential and either account_url or conn_str.
  - credential: An Azure Blob Storage token.
  - account_url: The URL of your Azure Blob Storage account. If you provide this then conn_str should be left out of the dictionary.
  - conn_str: The connection string for your Azure Blob Storage account. If you provide this then account_url should not be included in the dictionary.
  To keep your credentials secure you can define them as environment variables or entries in config_variables.yml. For more information on secure storage and retrieval of credentials in GX see Configure credentials.
  Update the variables in the following code and execute it to define name and azure_options. In this example, the value for account_url is pulled from the environment variable AZURE_STORAGE_ACCOUNT_URL and the value for credential is pulled from the environment variable AZURE_CREDENTIAL:
    data_source_name = "my_filesystem_data_source"
    azure_options = {
        "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
        "credential": "${AZURE_CREDENTIAL}",
    }
- Add an Azure Blob Storage Data Source to your Data Context.
  GX can leverage either pandas or Spark as the backend for your Azure Blob Storage Data Source. To create your Data Source, execute one of the following sets of code:
  pandas:
    # Create the Data Source:
    data_source = context.data_sources.add_pandas_abs(
        name=data_source_name, azure_options=azure_options
    )
  Spark:
    data_source = context.data_sources.add_spark_abs(
        name=data_source_name, azure_options=azure_options
    )
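The rules for azure_options (a credential plus exactly one of account_url or conn_str) can be checked before the dictionary is handed to GX. The following is a minimal sketch under that reading of the rules; check_azure_options is a hypothetical helper and is not part of GX.

```python
def check_azure_options(azure_options: dict) -> None:
    """Raise ValueError if the dictionary breaks the rules listed above.

    Hypothetical helper, not part of GX.
    """
    if "credential" not in azure_options:
        raise ValueError("azure_options must include 'credential'")
    has_url = "account_url" in azure_options
    has_conn_str = "conn_str" in azure_options
    # Exactly one of the two connection keys may be present.
    if has_url == has_conn_str:
        raise ValueError("provide exactly one of 'account_url' or 'conn_str'")


valid_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
check_azure_options(valid_options)  # passes silently
```

A check like this surfaces a misconfigured dictionary early, before any connection to Azure Blob Storage is attempted.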
Choose from the following to see the full example code for an Azure Blob Storage Filesystem Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
    name=data_source_name, azure_options=azure_options
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
# Create the Data Source:
data_source = context.data_sources.add_spark_abs(
    name=data_source_name, azure_options=azure_options
)
Prerequisites
- Python version 3.9 to 3.13
- An installation of GX Core with support for Google Cloud Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in Google Cloud Storage.
Procedure
- Instructions
- Sample code
- Set up the Data Source's credentials.
  By default, GCS credentials are handled through the gcloud command line tool and the GOOGLE_APPLICATION_CREDENTIALS environment variable. The gcloud command line tool is used to set up authentication credentials, and the GOOGLE_APPLICATION_CREDENTIALS environment variable provides the path to the json file with those credentials.
  For more information on using the gcloud command line tool, see Google Cloud's Cloud Storage client libraries documentation.
- Define the Data Source's parameters.
  The following information is required when you create a GCS Data Source:
  - name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
  - bucket_or_name: The GCS bucket or instance name.
  - gcs_options: Optional. A dictionary that can be used to specify an alternative method for providing GCS credentials.
  The gcs_options dictionary can be left empty if the default GOOGLE_APPLICATION_CREDENTIALS environment variable is populated. Otherwise, the gcs_options dictionary should have either the key filename or the key info.
  - filename: The value of this key should be the specific path to your credentials json. If you provide this then the info key should be left out of the dictionary.
  - info: The actual JSON data from your credentials file in the form of a string. If you provide this then the filename key should not be included in the dictionary.
  Update the variables in the following code and execute it to define name, bucket_or_name, and gcs_options. In this example the default GOOGLE_APPLICATION_CREDENTIALS environment variable points to the location of the credentials json and therefore the gcs_options dictionary is left empty:
    data_source_name = "my_filesystem_data_source"
    bucket_or_name = "test_docs_data"
    gcs_options = {}
- Add a Google Cloud Storage Data Source to your Data Context.
  GX can leverage either pandas or Spark as the backend for your Google Cloud Storage Data Source. To create your Data Source, execute one of the following sets of code:
  pandas:
    data_source = context.data_sources.add_pandas_gcs(
        name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
    )
  Spark:
    data_source = context.data_sources.add_spark_gcs(
        name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
    )
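The choice between an empty gcs_options dictionary, the filename key, and the info key can be written down as a small helper. The following is only a sketch of the rules stated above; build_gcs_options is a hypothetical function and is not part of GX, and the error it raises when no credentials are available is an assumption about what a sensible check would do.

```python
import os
from typing import Optional


def build_gcs_options(filename: Optional[str] = None, info: Optional[str] = None) -> dict:
    """Build a gcs_options dictionary following the rules listed above.

    Hypothetical helper, not part of GX.
    """
    if filename is not None and info is not None:
        raise ValueError("provide 'filename' or 'info', not both")
    if filename is not None:
        return {"filename": filename}
    if info is not None:
        return {"info": info}
    # Neither key given: the default environment variable must be populated.
    if not os.environ.get("GOOGLE_APPLICATION_CREDENTIALS"):
        raise ValueError("set GOOGLE_APPLICATION_CREDENTIALS or pass 'filename' or 'info'")
    return {}
```

Calling build_gcs_options() with no arguments corresponds to the example above, where the environment variable is set and the dictionary is left empty.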
Choose from the following to see the full example code for a Google Cloud Storage Filesystem Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
data_source = context.data_sources.add_pandas_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
data_source = context.data_sources.add_spark_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Create a Data Asset
A Data Asset is a collection of related records within a Data Source. These records may be located within multiple files, but each Data Asset is only capable of reading a single specific file format which is determined when it is created. However, a Data Source may contain multiple Data Assets covering different file formats and groups of records.
GX provides two types of Data Assets for Filesystem Data Sources: File Data Assets and Directory Data Assets.
- File Data Asset
- Directory Data Asset
File Data Assets are used to retrieve data from individual files in formats such as .csv or .parquet. The file format that can be read by a File Data Asset is determined when the File Data Asset is created. The specific file that is read is determined by Batch Definitions that are added to the Data Asset after it is created.
Both Spark and pandas Filesystem Data Sources support File Data Assets for all supported Filesystem environments.
Directory Data Assets read one or more files in formats such as .csv or .parquet. The file format that can be read by a Directory Data Asset is determined when the Directory Data Asset is created. The data in the corresponding files is concatenated into a single table which can be retrieved as a whole, or further partitioned based on the value of a datetime field.
Spark Filesystem Data Sources support Directory Data Assets for all supported Filesystem environments. However, pandas Filesystem Data Sources do not support Directory Data Assets at all.
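The conceptual difference between the two asset types can be shown without GX at all. The following sketch uses only the standard library; the file names, column names, and values are made up, and the code is an analogy for how the data is grouped, not a description of how GX reads files.

```python
import csv
import tempfile
from pathlib import Path


def read_csv_rows(path: Path) -> list:
    """Read one .csv file into a list of row dictionaries."""
    with path.open(newline="") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Two small monthly files with made-up values.
    (folder / "trips_2020-01.csv").write_text("pickup,fare\n2020-01-03,12.5\n")
    (folder / "trips_2020-02.csv").write_text(
        "pickup,fare\n2020-02-10,8.0\n2020-02-11,9.5\n"
    )

    # File-style access: records come from one specific file.
    single_file_rows = read_csv_rows(folder / "trips_2020-01.csv")

    # Directory-style access: every file is concatenated into one table.
    all_rows = []
    for path in sorted(folder.glob("*.csv")):
        all_rows.extend(read_csv_rows(path))

# The combined table can then be partitioned on a datetime field.
february_rows = [row for row in all_rows if row["pickup"].startswith("2020-02")]
```

In the file-style case the file itself is the unit of data, while in the directory-style case the unit is defined by a column in the combined table.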
- Local or networked filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core.
- A preconfigured Data Context.
- Access to data files (such as .csv or .parquet files) in a local or networked folder hierarchy.
- A pandas or Spark Filesystem Data Source configured for local or networked data files.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
  Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
    import great_expectations as gx
    # This example uses a File Data Context which already has
    # a Data Source defined.
    context = gx.get_context()
    data_source_name = "my_filesystem_data_source"
    data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
  A File Data Asset for files in a local or networked folder hierarchy only needs one piece of information to be created.
  - name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
  This example uses taxi trip data stored in .csv files, so the name "taxi_csv_files" will be used for the Data Asset:
    asset_name = "taxi_csv_files"
- Add the Data Asset to your Data Source.
  A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
  - To see the file formats supported by a pandas File Data Source, refer to the .add_*_asset(...) methods in the PandasFilesystemDatasource reference page.
  - To see the file formats supported by a Spark File Data Source, refer to the .add_*_asset(...) methods in the SparkFilesystemDatasource reference page.
  The following example creates a Data Asset that can read .csv file data:
    file_csv_asset = data_source.add_csv_asset(name=asset_name)
import great_expectations as gx
# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "taxi_csv_files"
# Add the Data Asset to the Data Source:
file_csv_asset = data_source.add_csv_asset(name=asset_name)
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core with support for Spark dependencies.
- A preconfigured Data Context.
- A Spark Filesystem Data Source configured to access data files in a local or networked folder hierarchy.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
  Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
    import great_expectations as gx
    # This example uses a File Data Context which already has
    # a Data Source defined.
    context = gx.get_context()
    data_source_name = "my_filesystem_data_source"
    data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
  A Directory Data Asset for files in a local or networked folder hierarchy only needs two pieces of information to be created.
  - name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
  - data_directory: The path of the folder containing the data files for the Data Asset. This path can be relative to the Data Source's base_directory.
  This example uses taxi trip data stored in .csv files in the data/ folder within the Data Source's directory tree:
    asset_name = "taxi_csv_directory"
    data_directory = "./data"
- Add the Data Asset to your Data Source.
  A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
  To see the file formats supported by a Spark Directory Data Source, refer to the .add_directory_*_asset(...) methods in the SparkFilesystemDatasource reference page.
  The following example creates a Data Asset that can read .csv file data:
    directory_csv_asset = data_source.add_directory_csv_asset(
        name=asset_name, data_directory=data_directory
    )
import great_expectations as gx
# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "taxi_csv_directory"
data_directory = "./data"
# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, data_directory=data_directory
)
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core with support for Amazon S3 dependencies.
- A preconfigured Data Context.
- Access to data files in S3.
- A Filesystem Data Source configured to access data files in S3.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
  Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
    import great_expectations as gx
    context = gx.get_context()
    # This example uses a File Data Context which already has
    # a Data Source defined.
    data_source_name = "my_filesystem_data_source"
    data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
  To define a File Data Asset for data in an S3 bucket you provide the following elements:
  - asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
  - s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
  This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Source's S3 bucket:
    asset_name = "s3_taxi_csv_file_asset"
    s3_prefix = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
  A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
  - To see the file formats supported by a pandas File Data Source, refer to the .add_*_asset(...) methods in the PandasFilesystemDatasource reference page.
  - To see the file formats supported by a Spark File Data Source, refer to the .add_*_asset(...) methods in the SparkFilesystemDatasource reference page.
  The following example creates a Data Asset that can read .csv file data:
    s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)
import great_expectations as gx
context = gx.get_context()
# This example uses a File Data Context which already has
#  a Data Source defined.
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_file_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)
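Because S3 has no real folders, a prefix such as s3_prefix works by matching the start of each object key. The following sketch shows that matching on a hand-written list of keys; the key names are hypothetical and no S3 call is made.

```python
# Hypothetical object keys in a bucket; no S3 request is made here.
object_keys = [
    "data/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-01.csv",
    "data/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-02.csv",
    "data/other/readme.txt",
    "logs/2019-01.log",
]

s3_prefix = "data/taxi_yellow_tripdata_samples/"

# Keep only the keys that begin with the prefix.
matching_keys = [key for key in object_keys if key.startswith(s3_prefix)]

# The part after the prefix is what looks like a file name.
file_names = [key[len(s3_prefix):] for key in matching_keys]
```

Keeping the trailing slash on the prefix prevents it from also matching a sibling "folder" whose name merely starts with the same characters.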
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core with support for Amazon S3 dependencies and Spark dependencies.
- A preconfigured Data Context.
- A Filesystem Data Source configured to access data files in S3.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
  Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
    import great_expectations as gx
    # This example uses a File Data Context which already has
    # a Data Source defined.
    context = gx.get_context()
    data_source_name = "my_filesystem_data_source"
    data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
  To define a Directory Data Asset for data in an S3 bucket you provide the following elements:
  - asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
  - s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
  - data_directory: The path of the folder containing data files for the Data Asset, relative to the root of the S3 bucket.
  This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Source's S3 bucket:
    asset_name = "s3_taxi_csv_directory_asset"
    s3_prefix = "data/taxi_yellow_tripdata_samples/"
    data_directory = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
  A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
  To see the file formats supported by a Spark Directory Data Source, refer to the .add_directory_*_asset(...) methods in the SparkFilesystemDatasource reference page.
  The following example creates a Data Asset that can read .csv file data:
    s3_file_data_asset = data_source.add_directory_csv_asset(
        name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
    )
import great_expectations as gx
# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_directory_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_directory_csv_asset(
    name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
)
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core with support for Azure Blob Storage dependencies.
- A preconfigured Data Context.
- Access to data files in Azure Blob Storage.
- A pandas or Spark Filesystem Data Source configured for Azure Blob Storage data files.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
  Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
    import great_expectations as gx
    # This example uses a File Data Context
    # which already has a Data Source configured.
    context = gx.get_context()
    data_source_name = "my_filesystem_data_source"
    data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
  To define a File Data Asset for Azure Blob Storage you provide the following elements:
  - name: A name by which you can reference the Data Asset in the future.
  - abs_container: The name of your Azure Blob Storage container.
  - abs_name_starts_with: The path to the data files for the Data Asset, relative to the root of the abs_container.
  - abs_recursive_file_discovery: (Optional) A boolean (True/False) indicating if files should be searched recursively from subfolders. The default is False.
  This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Azure Blob Storage container:
    asset_name = "abs_file_csv_asset"
    abs_container = "superconductive-public"
    abs_prefix = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
  A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
  - To see the file formats supported by a pandas File Data Source, refer to the .add_*_asset(...) methods in the PandasFilesystemDatasource reference page.
  - To see the file formats supported by a Spark File Data Source, refer to the .add_*_asset(...) methods in the SparkFilesystemDatasource reference page.
  The following example creates a Data Asset that can read .csv file data:
    file_asset = data_source.add_csv_asset(
        name=asset_name, abs_container=abs_container, abs_name_starts_with=abs_prefix
    )
import great_expectations as gx
# This example uses a File Data Context
#  which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "abs_file_csv_asset"
abs_container = "superconductive-public"
abs_prefix = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
file_asset = data_source.add_csv_asset(
    name=asset_name, abs_container=abs_container, abs_name_starts_with=abs_prefix
)
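The effect of a recursive file discovery flag such as abs_recursive_file_discovery can be sketched on a plain list of blob names. The following is an illustration only: discover is a hypothetical function, the blob names are made up, and the assumption that a non-recursive search returns only blobs directly under the prefix is this sketch's reading of "searched recursively from subfolders", not a statement about GX internals.

```python
def discover(blob_names, prefix, recursive=False, delimiter="/"):
    """Return blob names under prefix.

    Hypothetical helper: when recursive is False, blobs in "subfolders"
    (names with a further delimiter after the prefix) are skipped.
    """
    found = []
    for name in blob_names:
        if not name.startswith(prefix):
            continue
        remainder = name[len(prefix):]
        if recursive or delimiter not in remainder:
            found.append(name)
    return found


# Made-up blob names in a container.
blob_names = [
    "data/taxi_yellow_tripdata_samples/sample_2019-01.csv",
    "data/taxi_yellow_tripdata_samples/2020/sample_2020-01.csv",
    "data/other/sample.csv",
]
abs_prefix = "data/taxi_yellow_tripdata_samples/"

top_level_only = discover(blob_names, abs_prefix)
including_subfolders = discover(blob_names, abs_prefix, recursive=True)
```

With the default of False, data organized into per-year or per-month subfolders would not be found, so the flag matters whenever the container has more than one level below the prefix.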
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core with support for Azure Blob Storage dependencies and Spark dependencies.
- A preconfigured Data Context.
- A Filesystem Data Source configured to access data files in Azure Blob Storage.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
  Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
    import great_expectations as gx
    # This example uses a File Data Context
    # which already has a Data Source configured.
    context = gx.get_context()
    data_source_name = "my_filesystem_data_source"
    data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
  To define a Directory Data Asset for Azure Blob Storage you provide the following elements:
  - name: A name by which you can reference the Data Asset in the future.
  - abs_container: The name of your Azure Blob Storage container.
  - abs_name_starts_with: The path to the data files for the Data Asset in the Azure Blob Storage container. This should be relative to the root of the abs_container.
  - data_directory: The path of the folder containing data files for the Data Asset, relative to the root of the abs_container.
  - abs_recursive_file_discovery: (Optional) A boolean (True/False) indicating if files should be searched recursively from subfolders. The default is False.
  This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Azure Blob Storage container:
    asset_name = "abs_directory_asset"
    abs_container = "superconductive-public"
    abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
    data_directory = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
  A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
  To see the file formats supported by a Spark Directory Data Source, refer to the .add_directory_*_asset(...) methods in the SparkFilesystemDatasource reference page.
  The following example creates a Data Asset that can read .csv file data:
    directory_csv_asset = data_source.add_directory_csv_asset(
        name=asset_name,
        abs_container=abs_container,
        abs_name_starts_with=abs_name_starts_with,
        data_directory=data_directory,
    )
import great_expectations as gx
# This example uses a File Data Context
#  which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "abs_directory_asset"
abs_container = "superconductive-public"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    data_directory=data_directory,
)
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core with support for Google Cloud Storage dependencies.
- A preconfigured Data Context.
- Access to data files in Google Cloud Storage.
- A pandas or Spark Filesystem Data Source configured for Google Cloud Storage data files.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source. Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
 import great_expectations as gx
 # This example uses a File Data Context which already has
 # a Data Source defined.
 context = gx.get_context()
 data_source_name = "my_filesystem_data_source"
 data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters. To define a File Data Asset for Google Cloud Storage you provide the following elements:
 - name: A name by which you can reference the Data Asset in the future. This should be unique among Data Assets on the same Data Source.
 - gcs_prefix: The beginning of the object key name.
 - gcs_delimiter: Optional. A character used to define the hierarchical structure of object keys within a bucket (default is "/").
 - gcs_recursive_file_discovery: Optional. A boolean indicating if files should be searched recursively from subfolders (default is False).
 - gcs_max_results: Optional. The maximum number of keys in a single response (default is 1000).
 This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Google Cloud Storage Data Source:
 asset_name = "gcs_taxi_csv_file_asset"
 gcs_prefix = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source. A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
 - To see the file formats supported by a pandas File Data Source, refer to the .add_*_asset(...) methods in the PandasFilesystemDatasource reference page.
 - To see the file formats supported by a Spark File Data Source, refer to the .add_*_asset(...) methods in the SparkFilesystemDatasource reference page.
 The following example creates a Data Asset that can read .csv file data:
 file_csv_asset = data_source.add_csv_asset(name=asset_name, gcs_prefix=gcs_prefix)
import great_expectations as gx
# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "gcs_taxi_csv_file_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
file_csv_asset = data_source.add_csv_asset(name=asset_name, gcs_prefix=gcs_prefix)
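The gcs_prefix, gcs_delimiter, and gcs_recursive_file_discovery parameters described above can be pictured with a small pure-Python sketch. This is not GX or Google Cloud Storage client code: the select_keys function and the object keys are hypothetical, and the sketch only assumes that a prefix narrows the listing to matching keys and that, without recursive discovery, keys containing a further delimiter after the prefix (that is, keys in nested "subfolders") are skipped.

```python
def select_keys(keys, prefix, delimiter="/", recursive=False):
    """Return the object keys under `prefix` (hypothetical helper, not a GX API)."""
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        # A delimiter in the remainder means the object sits in a nested "subfolder".
        if not recursive and delimiter in remainder:
            continue
        selected.append(key)
    return selected


# Hypothetical object keys in a bucket.
object_keys = [
    "data/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-01.csv",
    "data/taxi_yellow_tripdata_samples/first_3_files/yellow_tripdata_sample_2019-02.csv",
    "data/other/readme.txt",
]
prefix = "data/taxi_yellow_tripdata_samples/"

flat = select_keys(object_keys, prefix)
nested = select_keys(object_keys, prefix, recursive=True)
print(flat)
print(nested)
```

With these assumptions, the non-recursive call keeps only the file directly under the prefix, while the recursive call also keeps the file in the nested folder.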
Prerequisites
- Python version 3.9 to 3.13.
- An installation of GX Core with support for Google Cloud Storage dependencies and Spark dependencies.
- A preconfigured Data Context.
- A Filesystem Data Source configured to access data files in Google Cloud Storage.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source. Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
 import great_expectations as gx
 # This example uses a File Data Context which already has
 # a Data Source defined.
 context = gx.get_context()
 data_source_name = "my_filesystem_data_source"
 data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters. To define a Directory Data Asset for Google Cloud Storage you provide the following elements:
 - name: A name by which you can reference the Data Asset in the future. This should be unique among Data Assets on the same Data Source.
 - data_directory: The full path from your bucket root for the folder containing the data files.
 - gcs_prefix: The beginning of the object key name.
 - gcs_delimiter: Optional. A character used to define the hierarchical structure of object keys within a bucket (default is "/").
 - gcs_recursive_file_discovery: Optional. A boolean indicating if files should be searched recursively from subfolders (default is False).
 - gcs_max_results: Optional. The maximum number of keys in a single response (default is 1000).
 This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Google Cloud Storage Data Source:
 asset_name = "gcs_taxi_csv_directory_asset"
 gcs_prefix = "data/taxi_yellow_tripdata_samples/"
 data_directory = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source. A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source. To see the file formats supported by a Spark Directory Data Source, refer to the .add_directory_*_asset(...) methods in the SparkFilesystemDatasource reference page.
 The following example creates a Data Asset that can read .csv file data:
 directory_csv_asset = data_source.add_directory_csv_asset(
     name=asset_name, gcs_prefix=gcs_prefix, data_directory=data_directory
 )
import great_expectations as gx
# This example uses a File Data Context which already has
#  a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "gcs_taxi_csv_directory_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, gcs_prefix=gcs_prefix, data_directory=data_directory
)
Create a Batch Definition
A Batch Definition allows you to request all the records from a Data Asset or a subset based on the contents of a date and time field.
- File Data Asset
- Directory Data Asset
Batch Definitions for File Data Assets can be configured to return the content of a specific file based on either a file path or a regex match for dates in the name of the file.
Prerequisites
- A preconfigured Data Context. The variable context is used for your Data Context in the following example code.
- A File Data Asset on a Filesystem Data Source.
Procedure
- Instructions
- Sample code
- Retrieve your Data Asset. Replace the value of data_source_name with the name of your Data Source and the value of data_asset_name with the name of your Data Asset in the following code. Then execute it to retrieve an existing Data Source and Data Asset from your Data Context:
 data_source_name = "my_filesystem_data_source"
 data_asset_name = "my_file_data_asset"
 file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
- Add a Batch Definition to the Data Asset. A path Batch Definition returns all of the records in a specific data file as a single Batch. A partitioned Batch Definition will return the records of a single file in the Data Asset based on which file name matches a regex.
 - Path
 - Partitioned
 To define a path Batch Definition you need to provide the following information:
 - name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
 - path: The path within the Data Asset of the data file containing the records to return. When using a Filesystem Data Source, path is relative to the Data Source parameter base_directory.
 Update the batch_definition_name and batch_definition_path variables and execute the following code to add a path Batch Definition to your Data Asset:
 batch_definition_name = "yellow_tripdata_sample_2019-01.csv"
 batch_definition_path = "folder_with_data/yellow_tripdata_sample_2019-01.csv"
 batch_definition = file_data_asset.add_batch_definition_path(
     name=batch_definition_name, path=batch_definition_path
 )
 GX Core currently supports partitioning File Data Assets based on dates. The files can be returned by year, month, or day. To define a Batch Definition, you need to provide the following information:
 - name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
 - regex: A regular expression used to match against file names. When using a Filesystem Data Source, regex is relative to the Data Source parameter base_directory.
 - Yearly
 - Monthly
 - Daily
 For example, say your Data Asset contains the following files with year dates in the file names:
 - yellow_tripdata_sample_2019.csv
 - yellow_tripdata_sample_2020.csv
 - yellow_tripdata_sample_2021.csv
 You can create a regex that will match these files by replacing the year in the file names with a named regex matching pattern. This pattern's group should be named year.
 For the above three files, the regex pattern would be:
 yellow_tripdata_sample_(?P<year>\d{4})\.csv
 Update the batch_definition_name and batch_definition_regex variables in the following code, then execute it to create a yearly Batch Definition:
 batch_definition_name = "yearly_yellow_tripdata_sample"
 batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})\.csv"
 batch_definition = file_data_asset.add_batch_definition_yearly(
     name=batch_definition_name, regex=batch_definition_regex
 )
 For example, say your Data Asset contains the following files with year and month dates in their names:
 - yellow_tripdata_sample_2019-01.csv
 - yellow_tripdata_sample_2019-02.csv
 - yellow_tripdata_sample_2019-03.csv
 You can create a regex that will match these files by replacing the year and month in the file names with named regex matching patterns. These patterns should be correspondingly named year and month.
 For the above three files, the regex pattern would be:
 yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv
 Update the batch_definition_name and batch_definition_regex variables in the following code, then execute it to create a monthly Batch Definition:
 batch_definition_name = "monthly_yellow_tripdata_sample"
 batch_definition_regex = (
     r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
 )
 batch_definition = file_data_asset.add_batch_definition_monthly(
     name=batch_definition_name, regex=batch_definition_regex
 )
 For example, say your Data Asset contains the following files with year, month, and day dates in their names:
 - yellow_tripdata_sample_2019-01-15.csv
 - yellow_tripdata_sample_2019-01-16.csv
 - yellow_tripdata_sample_2019-01-17.csv
 You can create a regex that will match these files by replacing the year, month, and day in the file names with named regex matching patterns. These patterns should be correspondingly named year, month, and day.
 For the above three files, the regex pattern would be:
 yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv
 Update the batch_definition_name and batch_definition_regex variables in the following code, then execute it to create a daily Batch Definition:
 batch_definition_name = "daily_yellow_tripdata_sample"
 batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv"
 batch_definition = file_data_asset.add_batch_definition_daily(
     name=batch_definition_name, regex=batch_definition_regex
 )
- Optional. Verify the Batch Definition is valid. A path Batch Definition always returns all records in a specific file as a single Batch. Therefore you do not need to provide any additional parameters to retrieve data from a path Batch Definition. After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
 batch = batch_definition.get_batch()
 print(batch.head())
 When retrieving a Batch from a partitioned Batch Definition, you can specify the date of the data to retrieve by providing a batch_parameters dictionary with keys that correspond to the regex matching groups in the Batch Definition. If you do not specify a date, the most recent date in the data is returned by default.
 After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
 - Yearly:
 batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
 print(batch.head())
 - Monthly:
 batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
 print(batch.head())
 - Daily (the requested day must match one of the available files, such as the 2019-01-15 file listed earlier):
 batch = batch_definition.get_batch(
     batch_parameters={"year": "2019", "month": "01", "day": "15"}
 )
 print(batch.head())
- Optional. Create additional Batch Definitions. A Data Asset can have multiple Batch Definitions as long as each Batch Definition has a unique name within that Data Asset. Repeat this procedure to add additional path or partitioned Batch Definitions to your Data Asset.
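To see how the named groups in a partitioning regex relate to the keys of a batch_parameters dictionary, the monthly pattern from the procedure above can be tried against a list of file paths using Python's standard re module. This is only an illustrative sketch: the file list is hypothetical, and the use of fullmatch is an assumption made for the illustration rather than a statement about how GX applies the pattern internally.

```python
import re

# The monthly pattern used in this guide, with named groups "year" and "month".
batch_definition_regex = (
    r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)
pattern = re.compile(batch_definition_regex)

# Hypothetical paths relative to the Data Source's base_directory.
file_paths = [
    "folder_with_data/yellow_tripdata_sample_2019-01.csv",
    "folder_with_data/yellow_tripdata_sample_2019-02.csv",
    "folder_with_data/yellow_tripdata_sample_2019-03.csv",
    "folder_with_data/README.md",
]

# Map each matching path to the values captured by the named groups.
matched = {}
for path in file_paths:
    match = pattern.fullmatch(path)
    if match:
        matched[path] = match.groupdict()

for path, groups in matched.items():
    print(path, groups)
```

The captured values are strings such as "2019" and "01", which is why the File Data Asset examples in this guide pass strings in batch_parameters.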
Full example code for path Batch Definitions and partitioned yearly, monthly, or daily Batch Definitions:
- Path
- Yearly
- Monthly
- Daily
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_2019-01.csv"
batch_definition_path = "folder_with_data/yellow_tripdata_sample_2019-01.csv"
batch_definition = file_data_asset.add_batch_definition_path(
    name=batch_definition_name, path=batch_definition_path
)
batch = batch_definition.get_batch()
print(batch.head())
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yearly_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})\.csv"
batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, regex=batch_definition_regex
)
batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
print(batch.head())
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "monthly_yellow_tripdata_sample"
batch_definition_regex = (
    r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)
batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, regex=batch_definition_regex
)
batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
print(batch.head())
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "daily_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv"
batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, regex=batch_definition_regex
)
# The requested day must match an available file, e.g. yellow_tripdata_sample_2019-01-15.csv
batch = batch_definition.get_batch(
    batch_parameters={"year": "2019", "month": "01", "day": "15"}
)
print(batch.head())
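The verification step for partitioned Batch Definitions notes that the most recent date is returned when no date is specified. The selection logic can be sketched in plain Python as follows. The pick_file function and the file names are hypothetical, and the sketch assumes that "most recent" means the largest (year, month) pair captured by the regex; it is not GX's implementation.

```python
import re

pattern = re.compile(r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv")

# Hypothetical file names, deliberately out of order.
file_names = [
    "yellow_tripdata_sample_2019-02.csv",
    "yellow_tripdata_sample_2018-12.csv",
    "yellow_tripdata_sample_2019-03.csv",
    "yellow_tripdata_sample_2019-01.csv",
]


def pick_file(file_names, batch_parameters=None):
    """Return the file matching batch_parameters, or the most recent one if none are given."""
    candidates = []
    for name in file_names:
        match = pattern.fullmatch(name)
        if match is None:
            continue
        params = match.groupdict()
        # Skip files whose captured groups differ from the requested parameters.
        if batch_parameters and any(
            params.get(key) != value for key, value in batch_parameters.items()
        ):
            continue
        sort_key = (int(params["year"]), int(params["month"]))
        candidates.append((sort_key, name))
    if not candidates:
        raise LookupError("no file matches the requested batch parameters")
    # The largest (year, month) pair is treated as the most recent.
    return max(candidates)[1]


latest = pick_file(file_names)
january = pick_file(file_names, {"year": "2019", "month": "01"})
print(latest)
print(january)
```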
Batch Definitions for a Directory Data Asset can be configured to return all of the records for the files in the Data Asset, or to subdivide the Data Asset's records on the content of a Datetime field and only return the records that correspond to a specific year, month, or day.
Prerequisites
- A preconfigured Data Context. The variable context is used for your Data Context in the following example code.
- A Directory Data Asset on a Filesystem Data Source.
Procedure
- Instructions
- Sample code
- Retrieve your Data Asset. Replace the value of data_source_name with the name of your Data Source and the value of data_asset_name with the name of your Data Asset in the following code. Then execute it to retrieve an existing Data Source and Data Asset from your Data Context:
 import great_expectations as gx
 context = gx.get_context()
 data_source_name = "my_filesystem_data_source"
 data_asset_name = "my_directory_data_asset"
 file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
- Add a Batch Definition to the Data Asset. A whole directory Batch Definition returns all of the records from all of the data files in the directory as a single Batch. A partitioned directory Batch Definition will subdivide the records according to a datetime field and return those records that match a specified year, month, or day.
 - Whole directory
 - Partitioned
 Because a whole directory Batch Definition returns the records from all of the files it can read in the Data Asset, you only need to provide one additional piece of information to define one:
 - name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
 Update the batch_definition_name variable and execute the following code to create a whole directory Batch Definition:
 batch_definition_name = "yellow_tripdata"
 batch_definition = file_data_asset.add_batch_definition_whole_directory(
     name=batch_definition_name
 )
 GX Core currently supports partitioning Directory Data Assets based on a datetime field. Therefore, to define a partitioned Directory Batch Definition you need to provide two pieces of information:
 - name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
 - column: The datetime column that records should be subdivided on.
 The Batch Definition can be configured to return records by year, month, or day.
 - Yearly
 - Monthly
 - Daily
 Update the batch_definition_name and batch_definition_column variables in the following code, then execute it to create a Batch Definition that subdivides the records in a directory by year:
 batch_definition_name = "yellow_tripdata_sample_yearly"
 batch_definition_column = "pickup_datetime"
 batch_definition = file_data_asset.add_batch_definition_yearly(
     name=batch_definition_name, column=batch_definition_column
 )
 Update the batch_definition_name and batch_definition_column variables in the following code, then execute it to create a Batch Definition that subdivides the records in a directory by month:
 batch_definition_name = "yellow_tripdata_sample_monthly"
 batch_definition_column = "pickup_datetime"
 batch_definition = file_data_asset.add_batch_definition_monthly(
     name=batch_definition_name, column=batch_definition_column
 )
 Update the batch_definition_name and batch_definition_column variables in the following code, then execute it to create a Batch Definition that subdivides the records in a directory by day:
 batch_definition_name = "yellow_tripdata_sample_daily"
 batch_definition_column = "pickup_datetime"
 batch_definition = file_data_asset.add_batch_definition_daily(
     name=batch_definition_name, column=batch_definition_column
 )
- Optional. Verify the Batch Definition is valid. A whole directory Batch Definition always returns all available records as a single Batch. Therefore you do not need to provide any additional parameters to retrieve data from a whole directory Batch Definition. After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
 batch = batch_definition.get_batch()
 print(batch.head())
 When retrieving a Batch from a partitioned Batch Definition, you can specify the date of the data to retrieve by providing a batch_parameters dictionary with keys that correspond to the date parts the Batch Definition partitions on (year, month, and day). If you do not specify a date, the most recent date in the data is returned by default.
 After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
 - Yearly:
 batch = batch_definition.get_batch(batch_parameters={"year": 2019})
 print(batch.head())
 - Monthly:
 batch = batch_definition.get_batch(batch_parameters={"year": 2019, "month": 1})
 print(batch.head())
 - Daily:
 batch = batch_definition.get_batch(
     batch_parameters={"year": 2019, "month": 1, "day": 1}
 )
 print(batch.head())
- Optional. Create additional Batch Definitions. A Data Asset can have multiple Batch Definitions as long as each Batch Definition has a unique name within that Data Asset. Repeat this procedure to add additional whole directory or partitioned Batch Definitions to your Data Asset.
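What it means to subdivide a directory's records on a datetime column can be illustrated without GX or Spark. The following sketch groups a few hypothetical in-memory records by the year, month, or day of their pickup_datetime value. The partition function and the sample records are assumptions made for illustration only; the actual partitioning is performed by GX through the Data Source's execution engine.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records standing in for rows read from the directory's files.
records = [
    {"pickup_datetime": datetime(2019, 1, 1, 8, 30), "fare": 12.5},
    {"pickup_datetime": datetime(2019, 1, 15, 17, 5), "fare": 7.0},
    {"pickup_datetime": datetime(2019, 2, 3, 9, 45), "fare": 21.0},
    {"pickup_datetime": datetime(2020, 1, 1, 0, 10), "fare": 9.5},
]

# Date parts used as the grouping key for each granularity.
FIELDS = {
    "yearly": ("year",),
    "monthly": ("year", "month"),
    "daily": ("year", "month", "day"),
}


def partition(records, column, granularity):
    """Group records by the date parts of `column` for the given granularity."""
    groups = defaultdict(list)
    for record in records:
        value = record[column]
        key = tuple(getattr(value, field) for field in FIELDS[granularity])
        groups[key].append(record)
    return dict(groups)


yearly = partition(records, "pickup_datetime", "yearly")
monthly = partition(records, "pickup_datetime", "monthly")
daily = partition(records, "pickup_datetime", "daily")
print(sorted(yearly))
print(len(monthly[(2019, 1)]))
```

In this picture, requesting batch_parameters of {"year": 2019, "month": 1} corresponds to selecting the group keyed by (2019, 1).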
Full example code for whole directory Batch Definitions and partitioned yearly, monthly, or daily Batch Definitions:
- Whole directory
- Yearly
- Monthly
- Daily
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata"
batch_definition = file_data_asset.add_batch_definition_whole_directory(
    name=batch_definition_name
)
batch = batch_definition.get_batch()
print(batch.head())
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_yearly"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, column=batch_definition_column
)
batch = batch_definition.get_batch(batch_parameters={"year": 2019})
print(batch.head())
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_monthly"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, column=batch_definition_column
)
batch = batch_definition.get_batch(batch_parameters={"year": 2019, "month": 1})
print(batch.head())
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_daily"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, column=batch_definition_column
)
batch = batch_definition.get_batch(
    batch_parameters={"year": 2019, "month": 1, "day": 1}
)
print(batch.head())
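The examples on this page pass zero-padded strings in batch_parameters for File Data Assets (values that match the regex groups, such as "01") and integers for Directory Data Assets (such as 1). A small helper can build either form from a Python date. The function names below are hypothetical and the sketch relies only on the conventions shown in this guide's examples.

```python
from datetime import date

# Date parts included in batch_parameters for each granularity.
_FIELDS = {
    "yearly": ("year",),
    "monthly": ("year", "month"),
    "daily": ("year", "month", "day"),
}
# Widths used when zero-padding values for regex-based File Data Assets.
_WIDTHS = {"year": 4, "month": 2, "day": 2}


def directory_batch_parameters(day, granularity="daily"):
    """Integer values, as in the Directory Data Asset examples."""
    return {field: getattr(day, field) for field in _FIELDS[granularity]}


def file_batch_parameters(day, granularity="daily"):
    """Zero-padded string values, as in the File Data Asset examples."""
    return {
        field: str(getattr(day, field)).zfill(_WIDTHS[field])
        for field in _FIELDS[granularity]
    }


file_params = file_batch_parameters(date(2019, 1, 15))
directory_params = directory_batch_parameters(date(2019, 1, 1), "monthly")
print(file_params)
print(directory_params)
```

The resulting dictionaries could then be passed as the batch_parameters argument of batch_definition.get_batch(...) in the corresponding examples above.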