Connect to Filesystem data
Filesystem data consists of data stored in file formats such as .csv or .parquet, and located in an environment with a folder hierarchy such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or local and networked filesystems. GX can leverage either pandas or Spark to read this data.
To connect to your Filesystem data, you first create a Data Source which tells GX where your data files reside. You then configure Data Assets for your Data Source to tell GX which sets of records you want to be able to access from your Data Source. Finally, you will define Batch Definitions which allow you to request all the records retrieved from a Data Asset or further partition the returned records based on a specified date.
Create a Data Source
Data Sources tell GX where your data is located and how to connect to it. With Filesystem data this is done by directing GX to the folder or online location that contains the data files. GX supports accessing Filesystem data from Amazon S3, Azure Blob Storage, Google Cloud Storage, and local or networked filesystems.
- Local or networked filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in a local or networked directory.
All Data Contexts include a built-in pandas_default Data Source. This Data Source gives access to all of the read_*(...) methods available in pandas.
The read_*(...) methods of the pandas_default Data Source allow you to load data into GX without first configuring a Data Source, Data Asset, and Batch Definition. However, it does not save configurations for reading files to the Data Context and provides less versatility than a fully configured Data Source, Data Asset, and Batch Definition. Therefore, the pandas_default Data Source is intended to facilitate testing Expectations and engaging in data exploration but is less suited for use in production and automated workflows.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
The following information is required when you create a Filesystem Data Source for a local or networked directory:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- base_directory: The path to the folder that contains the data files, or the root folder of the directory hierarchy that contains the data files.
If you are using a File Data Context, you can provide a path that is relative to the Data Context's base_directory. Otherwise, you should provide the absolute path to the folder that contains your data.
In this example, a relative path is defined for a folder that contains taxi trip data for New York City in .csv format:
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
- Add a Filesystem Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your Filesystem Data Source. To create your Data Source, execute one of the following sets of code:
pandas:
data_source = context.data_sources.add_pandas_filesystem(
    name=data_source_name, base_directory=source_folder
)
Spark:
data_source = context.data_sources.add_spark_filesystem(
    name=data_source_name, base_directory=source_folder
)
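Whether to pass a relative or an absolute path depends on where the data folder sits. As a minimal sketch, the hypothetical helper below (not part of GX) resolves a candidate folder against an assumed root directory with pathlib and reports whether the input was relative and whether it falls inside that root. The names describe_source_folder and context_root, and the example paths, are assumptions for illustration.

```python
from pathlib import Path


def describe_source_folder(context_root: str, source_folder: str) -> dict:
    """Resolve source_folder against context_root (hypothetical helper, not a GX API)."""
    root = Path(context_root).resolve()
    candidate = Path(source_folder)
    # A relative path is interpreted against the assumed root;
    # an absolute path is used unchanged.
    resolved = candidate if candidate.is_absolute() else (root / candidate).resolve()
    return {
        "resolved": resolved,
        "is_relative_input": not candidate.is_absolute(),
        "inside_root": resolved == root or root in resolved.parents,
    }


# Example paths are made up for illustration.
info = describe_source_folder("/tmp/my_project", "./data")
print(info["resolved"])
```

A check like this can be run before creating the Data Source to catch a mistyped folder early.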
Choose from the following to see the full example code for a local or networked Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
# Create the Data Source:
data_source = context.data_sources.add_pandas_filesystem(
name=data_source_name, base_directory=source_folder
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
# Create the Data Source:
data_source = context.data_sources.add_spark_filesystem(
name=data_source_name, base_directory=source_folder
)
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core with support for Amazon S3 dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in an S3 bucket.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
The following information is required when you create an Amazon S3 Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
Replace the variable values with your own and run the following Python code to define name, bucket_name, and boto3_options:
data_source_name = "my_filesystem_data_source"
bucket_name = "superconductive-docs-test"
boto3_options = {}
Additional options for boto3_options
The parameter boto3_options allows you to pass the following information:
- region_name: Your AWS region name.
- endpoint_url: Specifies an S3 endpoint. You can provide an environment variable reference such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the environment variable S3_ENDPOINT.
For more information on secure storage and retrieval of credentials in GX see Configure credentials.
- Add an S3 Filesystem Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your S3 Filesystem Data Source. To create your Data Source, execute one of the following sets of code:
pandas:
data_source = context.data_sources.add_pandas_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
Spark:
data_source = context.data_sources.add_spark_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
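To make the effect of an environment variable reference such as "${S3_ENDPOINT}" concrete, here is a minimal sketch that previews the expansion with Python's string.Template. It only illustrates the outcome described for boto3_options. GX performs the real substitution itself, and this is not its implementation. The helper name preview_config_value, the region, and the endpoint URL are assumptions for illustration.

```python
import os
from string import Template


def preview_config_value(value: str, env=None) -> str:
    """Preview how a "${NAME}" reference would expand (illustrative only)."""
    env = os.environ if env is None else env
    # Template.substitute raises KeyError when a referenced variable is missing,
    # which surfaces a forgotten environment variable early.
    return Template(value).substitute(env)


# Hypothetical values; replace them with your own region and endpoint.
boto3_options = {
    "region_name": "us-east-1",
    "endpoint_url": "${S3_ENDPOINT}",
}

# A stand-in for the real environment so the preview is reproducible.
fake_env = {"S3_ENDPOINT": "http://localhost:9000"}
expanded = {key: preview_config_value(val, fake_env) for key, val in boto3_options.items()}
print(expanded["endpoint_url"])
```

Keeping the literal "${S3_ENDPOINT}" string in boto3_options, rather than the expanded value, is what keeps the endpoint out of your source code.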
Choose from the following to see the full example code for an S3 Filesystem Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "superconductive-docs-test"
boto3_options = {}
# Create the Data Source:
data_source = context.data_sources.add_pandas_s3(
name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "superconductive-docs-test"
boto3_options = {}
# Create the Data Source:
data_source = context.data_sources.add_spark_s3(
name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core with support for Azure Blob Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in Azure Blob Storage.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- azure_options: Authentication settings.
The azure_options parameter accepts a dictionary which should include two keys: credential and either account_url or conn_str.
- credential: An Azure Blob Storage token.
- account_url: The URL of your Azure Blob Storage account. If you provide this then conn_str should be left out of the dictionary.
- conn_str: The connection string for your Azure Blob Storage account. If you provide this then account_url should not be included in the dictionary.
To keep your credentials secure you can define them as environment variables or entries in config_variables.yml. For more information on secure storage and retrieval of credentials in GX see Configure credentials.
Update the variables in the following code and execute it to define name and azure_options. In this example, the value for account_url is pulled from the environment variable AZURE_STORAGE_ACCOUNT_URL and the value for credential is pulled from the environment variable AZURE_CREDENTIAL:
data_source_name = "my_filesystem_data_source"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
- Add an Azure Blob Storage Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your Azure Blob Storage Data Source. To create your Data Source, execute one of the following sets of code:
pandas:
# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
    name=data_source_name, azure_options=azure_options
)
Spark:
data_source = context.data_sources.add_spark_abs(
    name=data_source_name, azure_options=azure_options
)
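The rules for the azure_options dictionary (a credential key plus exactly one of account_url or conn_str) are easy to get wrong when editing by hand. The following is a minimal sketch of a pre-flight check that encodes those rules as stated in this guide. The function check_azure_options is hypothetical and is not part of GX. GX may apply its own, different validation.

```python
def check_azure_options(azure_options: dict) -> None:
    """Raise ValueError if the dictionary breaks the rules described in this guide.

    Hypothetical pre-flight check; it is not part of GX.
    """
    if "credential" not in azure_options:
        raise ValueError("azure_options must include 'credential'")
    has_url = "account_url" in azure_options
    has_conn = "conn_str" in azure_options
    # Exactly one of the two connection keys should be present.
    if has_url == has_conn:
        raise ValueError("provide exactly one of 'account_url' or 'conn_str'")


azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
check_azure_options(azure_options)  # a valid dictionary passes silently
```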
Choose from the following to see the full example code for an Azure Blob Storage Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
"account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
"credential": "${AZURE_CREDENTIAL}",
}
# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
name=data_source_name, azure_options=azure_options
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
"account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
"credential": "${AZURE_CREDENTIAL}",
}
# Create the Data Source:
data_source = context.data_sources.add_spark_abs(
name=data_source_name, azure_options=azure_options
)
Prerequisites
- Python version 3.8 to 3.11
- An installation of GX Core with support for Google Cloud Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in Google Cloud Storage.
Procedure
- Instructions
- Sample code
- Set up the Data Source's credentials.
By default, GCS credentials are handled through the gcloud command line tool and the GOOGLE_APPLICATION_CREDENTIALS environment variable. The gcloud command line tool is used to set up authentication credentials, and the GOOGLE_APPLICATION_CREDENTIALS environment variable provides the path to the json file with those credentials.
For more information on using the gcloud command line tool, see Google Cloud's Cloud Storage client libraries documentation.
- Define the Data Source's parameters.
The following information is required when you create a GCS Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. A dictionary that can be used to specify an alternative method for providing GCS credentials.
The gcs_options dictionary can be left empty if the default GOOGLE_APPLICATION_CREDENTIALS environment variable is populated. Otherwise, the gcs_options dictionary should have either the key filename or the key info.
- filename: The value of this key should be the specific path to your credentials json. If you provide this then the info key should be left out of the dictionary.
- info: The actual JSON data from your credentials file in the form of a string. If you provide this then the filename key should not be included in the dictionary.
Update the variables in the following code and execute it to define name, bucket_or_name, and gcs_options. In this example the default GOOGLE_APPLICATION_CREDENTIALS environment variable points to the location of the credentials json and therefore the gcs_options dictionary is left empty:
data_source_name = "my_filesystem_data_source"
bucket_or_name = "test_docs_data"
gcs_options = {}
- Add a Google Cloud Storage Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your Google Cloud Storage Data Source. To create your Data Source, execute one of the following sets of code:
pandas:
data_source = context.data_sources.add_pandas_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Spark:
data_source = context.data_sources.add_spark_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
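The choice between an empty gcs_options dictionary, a filename key, and an info key can be expressed as a small decision function. The sketch below follows the rules as described in this guide. The helper build_gcs_options and the example paths are hypothetical and not part of GX.

```python
import os


def build_gcs_options(filename=None, info=None, env=None) -> dict:
    """Return a gcs_options dict that follows the rules described in this guide.

    Hypothetical helper, not a GX API.
    """
    env = os.environ if env is None else env
    if filename is not None and info is not None:
        raise ValueError("provide 'filename' or 'info', not both")
    if filename is not None:
        return {"filename": filename}
    if info is not None:
        return {"info": info}
    if env.get("GOOGLE_APPLICATION_CREDENTIALS"):
        # The default credentials variable is set, so no extra options are needed.
        return {}
    raise ValueError(
        "no GCS credentials found: set GOOGLE_APPLICATION_CREDENTIALS "
        "or pass filename or info"
    )


# Made-up environment and path, used only to demonstrate the default case.
gcs_options = build_gcs_options(env={"GOOGLE_APPLICATION_CREDENTIALS": "/path/to/key.json"})
print(gcs_options)
```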
Choose from the following to see the full example code for a Google Cloud Storage Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
data_source = context.data_sources.add_pandas_gcs(
name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
data_source = context.data_sources.add_spark_gcs(
name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Create a Data Asset
A Data Asset is a collection of related records within a Data Source. These records may be located within multiple files, but each Data Asset is only capable of reading a single specific file format which is determined when it is created. However, a Data Source may contain multiple Data Assets covering different file formats and groups of records.
GX provides two types of Data Assets for Filesystem Data Sources: File Data Assets and Directory Data Assets.
- File Data Asset
- Directory Data Asset
File Data Assets are used to retrieve data from individual files in formats such as .csv or .parquet. The file format that can be read by a File Data Asset is determined when the File Data Asset is created. The specific file that is read is determined by Batch Definitions that are added to the Data Asset after it is created.
Both Spark and pandas Filesystem Data Sources support File Data Assets for all supported Filesystem environments.
Directory Data Assets read one or more files in formats such as .csv or .parquet. The file format that can be read by a Directory Data Asset is determined when the Directory Data Asset is created. The data in the corresponding files is concatenated into a single table which can be retrieved as a whole, or further partitioned based on the value of a datetime field.
Spark Filesystem Data Sources support Directory Data Assets for all supported Filesystem environments. However, pandas Filesystem Data Sources do not support Directory Data Assets at all.
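The difference between the two asset types can be illustrated without GX at all. The following sketch uses Python's csv module on made-up, in-memory files: file-style access reads one file on its own, while directory-style access concatenates the rows of every file into one table that can then be partitioned on a datetime field. The file names and values are invented for illustration, and this is a conceptual picture, not how GX or Spark read data internally.

```python
import csv
import io

# Made-up in-memory "files" standing in for .csv files in one folder.
files = {
    "trips_2019-01.csv": "pickup,fare\n2019-01-03,12.5\n2019-01-20,7.0\n",
    "trips_2019-02.csv": "pickup,fare\n2019-02-11,9.25\n",
}


def read_rows(text: str) -> list:
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


# File-style access: one file is read on its own.
single_file_rows = read_rows(files["trips_2019-01.csv"])

# Directory-style access: rows from every file are concatenated into one table.
all_rows = [row for name in sorted(files) for row in read_rows(files[name])]

# The combined table can then be partitioned on a datetime field, here by month.
february = [row for row in all_rows if row["pickup"].startswith("2019-02")]

print(len(single_file_rows), len(all_rows), len(february))
```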
- Local or networked filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core.
- A preconfigured Data Context.
- Access to data files (such as .csv or .parquet files) in a local or networked folder hierarchy.
- A pandas or Spark Filesystem Data Source configured for local or networked data files.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
- Define your Data Asset's parameters.
A File Data Asset for files in a local or networked folder hierarchy only needs one piece of information to be created.
- name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
This example uses taxi trip data stored in .csv files, so the name "taxi_csv_files" will be used for the Data Asset:
asset_name = "taxi_csv_files"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
file_csv_asset = data_source.add_csv_asset(name=asset_name)
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "taxi_csv_files"
# Add the Data Asset to the Data Source:
file_csv_asset = data_source.add_csv_asset(name=asset_name)
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core with support for Spark dependencies.
- A preconfigured Data Context.
- A Spark Filesystem Data Source configured to access data files in a local or networked folder hierarchy.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
- Define your Data Asset's parameters.
A Directory Data Asset for files in a local or networked folder hierarchy only needs two pieces of information to be created.
- name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
- data_directory: The path of the folder containing the data files for the Data Asset. This path can be relative to the Data Source's base_directory.
This example uses taxi trip data stored in .csv files in the data/ folder within the Data Source's directory tree:
asset_name = "taxi_csv_directory"
data_directory = "./data"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, data_directory=data_directory
)
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "taxi_csv_directory"
data_directory = "./data"
# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
name=asset_name, data_directory=data_directory
)
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core with support for Amazon S3 dependencies.
- A preconfigured Data Context.
- Access to data files in S3.
- A Filesystem Data Source configured to access data files in S3.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
context = gx.get_context()
# This example uses a File Data Context which already has
# a Data Source defined.
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
- Define your Data Asset's parameters.
To define a File Data Asset for data in an S3 bucket you provide the following elements:
- asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
- s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Source's S3 bucket:
asset_name = "s3_taxi_csv_file_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)
import great_expectations as gx
context = gx.get_context()
# This example uses a File Data Context which already has
# a Data Source defined.
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_file_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core with support for Amazon S3 dependencies and Spark dependencies.
- A preconfigured Data Context.
- A Filesystem Data Source configured to access data files in S3.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
- Define your Data Asset's parameters.
To define a Directory Data Asset for data in an S3 bucket you provide the following elements:
- asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
- s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
- data_directory: The path of the folder containing data files for the Data Asset, relative to the root of the S3 bucket.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Source's S3 bucket:
asset_name = "s3_taxi_csv_directory_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
s3_file_data_asset = data_source.add_directory_csv_asset(
    name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
)
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_directory_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_directory_csv_asset(
name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
)
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core with support for Azure Blob Storage dependencies.
- A preconfigured Data Context.
- Access to data files in Azure Blob Storage.
- A pandas or Spark Filesystem Data Source configured for Azure Blob Storage data files.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context
# which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
- Define your Data Asset's parameters.
To define a File Data Asset for Azure Blob Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: The path to the data files for the Data Asset, relative to the root of the abs_container.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating if files should be searched recursively from subfolders. The default is False.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Azure Blob Storage container:
asset_name = "abs_file_csv_asset"
abs_container = "superconductive-public"
abs_prefix = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
file_asset = data_source.add_csv_asset(
    name=asset_name, abs_container=abs_container, abs_name_starts_with=abs_prefix
)
import great_expectations as gx
# This example uses a File Data Context
# which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "abs_file_csv_asset"
abs_container = "superconductive-public"
abs_prefix = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
file_asset = data_source.add_csv_asset(
name=asset_name, abs_container=abs_container, abs_name_starts_with=abs_prefix
)
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core with support for Azure Blob Storage dependencies and Spark dependencies.
- A preconfigured Data Context.
- A Filesystem Data Source configured to access data files in Azure Blob Storage.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context
# which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
- Define your Data Asset's parameters.
To define a Directory Data Asset for Azure Blob Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future.
- abs_container: The name of your Azure Blob Storage container.
- abs_name_starts_with: The path to the data files for the Data Asset in the Azure Blob Storage container. This should be relative to the root of the abs_container.
- data_directory: The path of the folder containing data files for the Data Asset, relative to the root of the abs_container.
- abs_recursive_file_discovery: Optional. A boolean (True/False) indicating if files should be searched recursively from subfolders. The default is False.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Azure Blob Storage container:
asset_name = "abs_directory_asset"
abs_container = "superconductive-public"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name,
    abs_container=abs_container,
    abs_name_starts_with=abs_name_starts_with,
    data_directory=data_directory,
)
import great_expectations as gx
# This example uses a File Data Context
# which already has a Data Source configured.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "abs_directory_asset"
abs_container = "superconductive-public"
abs_name_starts_with = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
name=asset_name,
abs_container=abs_container,
abs_name_starts_with=abs_name_starts_with,
data_directory=data_directory,
)
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core with support for Google Cloud Storage dependencies.
- A preconfigured Data Context.
- Access to data files in Google Cloud Storage.
- A pandas or Spark Filesystem Data Source configured for Google Cloud Storage data files.
Procedure
- Instructions
- Sample code
-
Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
Python
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
-
Define your Data Asset's parameters.
To define a File Data Asset for Google Cloud Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future. This should be unique among Data Assets on the same Data Source.
- gcs_prefix: The beginning of the object key name.
- gcs_delimiter: Optional. A character used to define the hierarchical structure of object keys within a bucket (default is "/").
- gcs_recursive_file_discovery: Optional. A boolean indicating if files should be searched recursively from subfolders (default is False).
- gcs_max_results: Optional. The maximum number of keys in a single response (default is 1000).
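To make the roles of the prefix, delimiter, and recursive discovery settings more concrete, here is a minimal sketch in plain Python. It does not call GX or any Google Cloud client. The function name list_keys and the object keys are hypothetical, and the actual listing behavior of Google Cloud Storage may differ in its details.

```python
def list_keys(keys, prefix, delimiter="/", recursive=False):
    """Return the keys under `prefix`; skip nested "subfolders" unless recursive."""
    selected = []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        # A delimiter in the remainder means the object sits in a subfolder.
        if delimiter in remainder and not recursive:
            continue
        selected.append(key)
    return selected


# Hypothetical object keys in a bucket.
keys = [
    "data/taxi_yellow_tripdata_samples/yellow_tripdata_sample_2019-01.csv",
    "data/taxi_yellow_tripdata_samples/archive/yellow_tripdata_sample_2018-12.csv",
    "data/other/readme.txt",
]
prefix = "data/taxi_yellow_tripdata_samples/"

top_level = list_keys(keys, prefix)
all_levels = list_keys(keys, prefix, recursive=True)
print(top_level)
print(all_levels)
```

With recursive left at False, only the file directly under the prefix is selected. Setting it to True also picks up the file in the hypothetical archive/ subfolder.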
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Google Cloud Storage Data Source:
Python
asset_name = "gcs_taxi_csv_file_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
-
Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
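As a complement to the API documentation, the following sketch lists the methods on an object whose names match the add_*_asset pattern. The helper asset_methods and the StubDataSource class are hypothetical and exist only so the example runs without GX installed. Whether a real Data Source exposes every asset method through dir() is an assumption to check against your installed version.

```python
import fnmatch


def asset_methods(data_source, pattern="add_*_asset"):
    """Return the sorted names of callable attributes matching the pattern."""
    return sorted(
        name
        for name in dir(data_source)
        if fnmatch.fnmatchcase(name, pattern) and callable(getattr(data_source, name))
    )


class StubDataSource:
    """Hypothetical stand-in so the sketch runs without GX installed."""

    def add_csv_asset(self, name): ...

    def add_parquet_asset(self, name): ...

    def delete_asset(self, name): ...


methods = asset_methods(StubDataSource())
print(methods)
```

Passing your own data_source object in place of the stub, or the pattern "add_directory_*_asset" for Directory Data Assets, would be the natural way to try this against a real Data Source.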
The following example creates a Data Asset that can read .csv file data:
Python
file_csv_asset = data_source.add_csv_asset(name=asset_name, gcs_prefix=gcs_prefix)
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "gcs_taxi_csv_file_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
file_csv_asset = data_source.add_csv_asset(name=asset_name, gcs_prefix=gcs_prefix)
Prerequisites
- Python version 3.8 to 3.11.
- An installation of GX Core with support for Google Cloud Storage dependencies and Spark dependencies.
- A preconfigured Data Context.
- A Filesystem Data Source configured to access data files in Google Cloud Storage.
Procedure
- Instructions
- Sample code
-
Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
Python
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
-
Define your Data Asset's parameters.
To define a Directory Data Asset for Google Cloud Storage you provide the following elements:
- name: A name by which you can reference the Data Asset in the future. This should be unique among Data Assets on the same Data Source.
- data_directory: The full path from your bucket root for the folder containing the data files.
- gcs_prefix: The beginning of the object key name.
- gcs_delimiter: Optional. A character used to define the hierarchical structure of object keys within a bucket (default is "/").
- gcs_recursive_file_discovery: Optional. A boolean indicating if files should be searched recursively from subfolders (default is False).
- gcs_max_results: Optional. The maximum number of keys in a single response (default is 1000).
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Google Cloud Storage Data Source:
Python
asset_name = "gcs_taxi_csv_directory_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
-
Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
Python
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, gcs_prefix=gcs_prefix, data_directory=data_directory
)
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.get_datasource(data_source_name)
# Define the Data Asset's parameters:
asset_name = "gcs_taxi_csv_directory_asset"
gcs_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, gcs_prefix=gcs_prefix, data_directory=data_directory
)
Create a Batch Definition
A Batch Definition determines which records in a Data Asset are retrieved for Validation. Batch Definitions can be configured to either provide all of the records in a Data Asset, or to subdivide the Data Asset based on a date.
- File Data Asset
- Directory Data Asset
Batch Definitions for File Data Assets can be configured to return the content of a specific file based on either a file path or a regex match for dates in the name of the file.
Prerequisites
- A preconfigured Data Context. The variable context is used for your Data Context in the following example code.
- A File Data Asset on a Filesystem Data Source.
- Instructions
- Sample code
-
Retrieve your Data Asset.
Replace the value of data_source_name with the name of your Data Source and the value of data_asset_name with the name of your Data Asset in the following code. Then execute it to retrieve an existing Data Source and Data Asset from your Data Context:
Python
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
-
Add a Batch Definition to the Data Asset.
A path Batch Definition returns all of the records in a specific data file as a single Batch. A partitioned Batch Definition will return the records of a single file in the Data Asset based on which file name matches a regex.
- Path
- Partitioned
To define a path Batch Definition you need to provide the following information:
- name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
- path: The path within the Data Asset of the data file containing the records to return.
Update the batch_definition_name and batch_definition_path variables and execute the following code to add a path Batch Definition to your Data Asset:
Python
batch_definition_name = "yellow_tripdata_sample_2019-01.csv"
batch_definition_path = "folder_with_data/yellow_tripdata_sample_2019-01.csv"
batch_definition = file_data_asset.add_batch_definition_path(
    name=batch_definition_name, path=batch_definition_path
)
GX Core currently supports partitioning File Data Assets based on dates. The files can be returned by year, month, or day.
- Yearly
- Monthly
- Daily
For example, say your Data Asset contains the following files with year dates in the file names:
- yellow_tripdata_sample_2019.csv
- yellow_tripdata_sample_2020.csv
- yellow_tripdata_sample_2021.csv
You can create a regex that will match these files by replacing the year in the file names with a named regex matching pattern. This pattern's group should be named year.
For the above three files, the regex pattern would be:
Regular Expression
yellow_tripdata_sample_(?P<year>\d{4})\.csv
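Before using a pattern in a Batch Definition, it can help to check it against your file names with Python's standard re module. The sketch below does that for the yearly pattern. The fourth file name is a hypothetical non-yearly file added to show a non-match, and the use of fullmatch is a choice made for this illustration rather than a statement about how GX applies the regex.

```python
import re

# The yearly pattern from this guide, written as a raw string.
pattern = re.compile(r"yellow_tripdata_sample_(?P<year>\d{4})\.csv")

file_names = [
    "yellow_tripdata_sample_2019.csv",
    "yellow_tripdata_sample_2020.csv",
    "yellow_tripdata_sample_2021.csv",
    "yellow_tripdata_sample_2019-01.csv",  # hypothetical non-yearly file
]

# Map each matching file name to the year captured by the named group.
years = {}
for file_name in file_names:
    match = pattern.fullmatch(file_name)
    if match:
        years[file_name] = match.group("year")

print(years)
```

Only the three yearly files end up in the dictionary, because the monthly-style name has extra characters between the year and the .csv extension.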
Update the batch_definition_name and batch_definition_regex variables in the following code, then execute it to create a yearly Batch Definition:
batch_definition_name = "yearly_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})\.csv"
batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, regex=batch_definition_regex
)
For example, say your Data Asset contains the following files with year and month dates in their names:
- yellow_tripdata_sample_2019-01.csv
- yellow_tripdata_sample_2019-02.csv
- yellow_tripdata_sample_2019-03.csv
You can create a regex that will match these files by replacing the year and month in the file names with named regex matching patterns. These patterns should be correspondingly named year and month.
For the above three files, the regex pattern would be:
Regular Expression
yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv
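As a small illustration of what the named groups capture, the sketch below applies the monthly pattern to one of the example file names using Python's standard re module. It only shows the captured values and is not a depiction of how GX evaluates the regex internally.

```python
import re

# The monthly pattern from this guide, written as a raw string.
pattern = re.compile(r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv")

match = pattern.fullmatch("yellow_tripdata_sample_2019-02.csv")
# groupdict() maps each named group to the text it captured.
captured = match.groupdict()
print(captured)
```

The captured values are strings, so the month keeps its leading zero ("02" rather than 2).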
Update the batch_definition_name and batch_definition_regex variables in the following code, then execute it to create a monthly Batch Definition:
batch_definition_name = "monthly_yellow_tripdata_sample"
batch_definition_regex = (
    r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)
batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, regex=batch_definition_regex
)
For example, say your Data Asset contains the following files with year, month, and day dates in their names:
- yellow_tripdata_sample_2019-01-15.csv
- yellow_tripdata_sample_2019-01-16.csv
- yellow_tripdata_sample_2019-01-17.csv
You can create a regex that will match these files by replacing the year, month, and day in the file names with named regex matching patterns. These patterns should be correspondingly named year, month, and day.
For the above three files, the regex pattern would be:
Regular Expression
yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv
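The daily pattern captures enough information to build a full calendar date. The following sketch, which uses only the Python standard library, converts the captured groups into datetime.date values and sorts the example files chronologically. The helper file_date is hypothetical and is not part of GX.

```python
import re
from datetime import date

# The daily pattern from this guide, written as a raw string.
pattern = re.compile(
    r"yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv"
)

file_names = [
    "yellow_tripdata_sample_2019-01-16.csv",
    "yellow_tripdata_sample_2019-01-15.csv",
    "yellow_tripdata_sample_2019-01-17.csv",
]


def file_date(file_name):
    """Convert the captured year, month, and day into a datetime.date."""
    match = pattern.fullmatch(file_name)
    if match is None:
        raise ValueError(f"{file_name!r} does not match the daily pattern")
    return date(int(match["year"]), int(match["month"]), int(match["day"]))


# Sort chronologically and pick the file with the latest date.
ordered = sorted(file_names, key=file_date)
latest = ordered[-1]
print(latest)
```

Building a date object also rejects impossible dates, such as a month of 13, that the \d{2} groups alone would accept.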
Update the batch_definition_name and batch_definition_regex variables in the following code, then execute it to create a daily Batch Definition:
batch_definition_name = "daily_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv"
batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, regex=batch_definition_regex
)
-
Optional. Verify the Batch Definition is valid.
A path Batch Definition always returns all records in a specific file as a single Batch. Therefore you do not need to provide any additional parameters to retrieve data from a path Batch Definition.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
Python
batch = batch_definition.get_batch()
batch.head()
When retrieving a Batch from a partitioned Batch Definition, you can specify the date of the data to retrieve by providing a batch_parameters dictionary with keys that correspond to the regex matching groups in the Batch Definition. If you do not specify a date, the most recent date in the data is returned by default.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
- Yearly
Python
batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
batch.head()
- Monthly
Python
batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
batch.head()
- Daily
Python
batch = batch_definition.get_batch(
    batch_parameters={"year": "2019", "month": "01", "day": "01"}
)
batch.head()
-
Optional. Create additional Batch Definitions.
A Data Asset can have multiple Batch Definitions as long as each Batch Definition has a unique name within that Data Asset. Repeat this procedure to add additional path or partitioned Batch Definitions to your Data Asset.
Full example code for path Batch Definitions and partitioned yearly, monthly, or daily Batch Definitions:
- Path
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_2019-01.csv"
batch_definition_path = "folder_with_data/yellow_tripdata_sample_2019-01.csv"
batch_definition = file_data_asset.add_batch_definition_path(
    name=batch_definition_name, path=batch_definition_path
)
batch = batch_definition.get_batch()
batch.head()
- Yearly
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yearly_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})\.csv"
batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, regex=batch_definition_regex
)
batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
batch.head()
- Monthly
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "monthly_yellow_tripdata_sample"
batch_definition_regex = (
    r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})\.csv"
)
batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, regex=batch_definition_regex
)
batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
batch.head()
- Daily
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_file_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "daily_yellow_tripdata_sample"
batch_definition_regex = r"folder_with_data/yellow_tripdata_sample_(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})\.csv"
batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, regex=batch_definition_regex
)
batch = batch_definition.get_batch(
    batch_parameters={"year": "2019", "month": "01", "day": "01"}
)
batch.head()
Batch Definitions for a Directory Data Asset can be configured to return all of the records for the files in the Data Asset, or to subdivide the Data Asset's records on the content of a Datetime field and only return the records that correspond to a specific year, month, or day.
Prerequisites
- A preconfigured Data Context. The variable context is used for your Data Context in the following example code.
- A Directory Data Asset on a Filesystem Data Source.
Procedure
- Instructions
- Sample code
-
Retrieve your Data Asset.
Replace the value of data_source_name with the name of your Data Source and the value of data_asset_name with the name of your Data Asset in the following code. Then execute it to retrieve an existing Data Source and Data Asset from your Data Context:
Python
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
-
Add a Batch Definition to the Data Asset.
A whole directory Batch Definition returns the records from all of the data files in the directory as a single Batch. A partitioned directory Batch Definition will subdivide the records according to a datetime field and return those records that match a specified year, month, or day.
- Whole directory
- Partitioned
Because a whole directory Batch Definition returns the records from all of the files it can read in the Data Asset, you only need to provide one additional piece of information to define one:
- name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
Update the batch_definition_name variable and execute the following code to create a whole directory Batch Definition:
Python
batch_definition_name = "yellow_tripdata"
batch_definition = file_data_asset.add_batch_definition_whole_directory(
    name=batch_definition_name
)
GX Core currently supports partitioning Directory Data Assets based on a datetime field. Therefore, to define a partitioned Directory Batch Definition you need to provide two pieces of information:
- name: A name by which you can reference the Batch Definition in the future. This should be unique within the Data Asset.
- column: The datetime column that records should be subdivided on.
The Batch Definition can be configured to return records by year, month, or day.
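To illustrate what subdividing records on a datetime column means, here is a minimal sketch in plain Python. It does not use GX or Spark. The records, the partition helper, and the PARTS mapping are all hypothetical, and the grouping logic is only an approximation of the idea, not GX's implementation.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical records; "pickup_datetime" mirrors the column name used in this guide.
records = [
    {"pickup_datetime": datetime(2019, 1, 15, 8, 30), "fare": 12.5},
    {"pickup_datetime": datetime(2019, 1, 16, 9, 0), "fare": 7.0},
    {"pickup_datetime": datetime(2019, 2, 1, 18, 45), "fare": 21.0},
    {"pickup_datetime": datetime(2020, 1, 15, 7, 15), "fare": 9.5},
]

# Which datetime parts identify a partition at each granularity.
PARTS = {
    "yearly": ("year",),
    "monthly": ("year", "month"),
    "daily": ("year", "month", "day"),
}


def partition(records, column, granularity):
    """Group records by the date parts of `column` for the given granularity."""
    groups = defaultdict(list)
    for record in records:
        value = record[column]
        key = tuple(getattr(value, part) for part in PARTS[granularity])
        groups[key].append(record)
    return dict(groups)


yearly = partition(records, "pickup_datetime", "yearly")
monthly = partition(records, "pickup_datetime", "monthly")
print(sorted(yearly))
print(sorted(monthly))
```

The same four records fall into two yearly groups but three monthly groups, which is why the granularity you choose determines how many distinct Batches are available.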
- Yearly
- Monthly
- Daily
Update the batch_definition_name and batch_definition_column variables in the following code, then execute it to create a Batch Definition that subdivides the records in a directory by year:
Python
batch_definition_name = "yellow_tripdata_sample_yearly"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, column=batch_definition_column
)
Update the batch_definition_name and batch_definition_column variables in the following code, then execute it to create a Batch Definition that subdivides the records in a directory by month:
Python
batch_definition_name = "yellow_tripdata_sample_monthly"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, column=batch_definition_column
)
Update the batch_definition_name and batch_definition_column variables in the following code, then execute it to create a Batch Definition that subdivides the records in a directory by day:
Python
batch_definition_name = "yellow_tripdata_sample_daily"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, column=batch_definition_column
)
-
Optional. Verify the Batch Definition is valid.
A whole directory Batch Definition always returns all available records as a single Batch. Therefore you do not need to provide any additional parameters to retrieve data from a whole directory Batch Definition.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
Python
batch = batch_definition.get_batch()
batch.head()
When retrieving a Batch from a partitioned Batch Definition, you can specify the date of the data to retrieve by providing a batch_parameters dictionary with keys that correspond to the date parts (year, month, and day) used by the Batch Definition. If you do not specify a date, the most recent date in the data is returned by default.
After retrieving your data you can verify that the Batch Definition is valid by printing the first few retrieved records with batch.head():
- Yearly
Python
batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
batch.head()
- Monthly
Python
batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
batch.head()
- Daily
Python
batch = batch_definition.get_batch(
    batch_parameters={"year": "2019", "month": "01", "day": "01"}
)
batch.head()
-
Optional. Create additional Batch Definitions.
A Data Asset can have multiple Batch Definitions as long as each Batch Definition has a unique name within that Data Asset. Repeat this procedure to add additional whole directory or partitioned Batch Definitions to your Data Asset.
Full example code for whole directory Batch Definitions and partitioned yearly, monthly, or daily Batch Definitions:
- Whole directory
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata"
batch_definition = file_data_asset.add_batch_definition_whole_directory(
    name=batch_definition_name
)
batch = batch_definition.get_batch()
batch.head()
- Yearly
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_yearly"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_yearly(
    name=batch_definition_name, column=batch_definition_column
)
batch = batch_definition.get_batch(batch_parameters={"year": "2019"})
batch.head()
- Monthly
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_monthly"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_monthly(
    name=batch_definition_name, column=batch_definition_column
)
batch = batch_definition.get_batch(batch_parameters={"year": "2019", "month": "01"})
batch.head()
- Daily
import great_expectations as gx
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_asset_name = "my_directory_data_asset"
file_data_asset = context.data_sources.get(data_source_name).get_asset(data_asset_name)
batch_definition_name = "yellow_tripdata_sample_daily"
batch_definition_column = "pickup_datetime"
batch_definition = file_data_asset.add_batch_definition_daily(
    name=batch_definition_name, column=batch_definition_column
)
batch = batch_definition.get_batch(
    batch_parameters={"year": "2019", "month": "01", "day": "01"}
)
batch.head()