Connect to Filesystem data
Filesystem data consists of data stored in file formats such as .csv or .parquet and located in an environment with a folder hierarchy, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or local and networked filesystems. GX can leverage either pandas or Spark to read this data.
To connect to your Filesystem data, you first create a Data Source that tells GX where your data files reside. You then configure Data Assets for your Data Source to tell GX which sets of records you want to be able to access. Finally, you define Batch Definitions, which allow you to request all the records retrieved from a Data Asset or further partition the returned records based on a specified date. A brief end-to-end sketch of these three steps is shown below.
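The following is a minimal sketch of the full workflow for a local pandas Filesystem Data Source. The folder path, names, and file name are illustrative assumptions, and the Batch Definition step assumes the add_batch_definition_path method for selecting a single file; each step is covered in detail in the sections that follow.
import great_expectations as gx

context = gx.get_context()

# 1. Data Source: point GX at the folder that holds the data files.
data_source = context.data_sources.add_pandas_filesystem(
    name="my_filesystem_data_source", base_directory="./data"  # hypothetical folder
)

# 2. Data Asset: declare the file format you want to read from that folder.
asset = data_source.add_csv_asset(name="taxi_csv_files")

# 3. Batch Definition: request a specific file from the Data Asset.
batch_definition = asset.add_batch_definition_path(
    name="one_file", path="yellow_tripdata_sample_2019-01.csv"  # hypothetical file
)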
Create a Data Source
Data Sources tell GX where your data is located and how to connect to it. With Filesystem data this is done by directing GX to the folder or online location that contains the data files. GX supports accessing Filesystem data from Amazon S3, Azure Blob Storage, Google Cloud Storage, and local or networked filesystems.
- Local or networked filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in a local or networked directory.
All Data Contexts include a built-in pandas_default Data Source, which gives access to all of the read_*(...) methods available in pandas.
The read_*(...) methods of the pandas_default Data Source allow you to load data into GX without first configuring a Data Source, Data Asset, and Batch Definition. However, the pandas_default Data Source does not save configurations for reading files to the Data Context, and it provides less versatility than a fully configured Data Source, Data Asset, and Batch Definition. Therefore, the pandas_default Data Source is intended to facilitate testing Expectations and engaging in data exploration, but it is less suited for use in production and automated workflows.
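For example, the following sketch reads a single .csv file through the pandas_default Data Source and returns a Batch that can be explored or validated directly. The file path is an illustrative assumption:
import great_expectations as gx

context = gx.get_context()

# Read a file directly through the built-in pandas_default Data Source.
# No Data Source, Data Asset, or Batch Definition configuration is saved.
batch = context.data_sources.pandas_default.read_csv(
    "./data/yellow_tripdata_sample_2019-01.csv"  # hypothetical path
)
print(batch.head())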
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
The following information is required when you create a Filesystem Data Source for a local or networked directory:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- base_directory: The path to the folder that contains the data files, or the root folder of the directory hierarchy that contains the data files.
If you are using a File Data Context, you can provide a path that is relative to the Data Context's base_directory. Otherwise, you should provide the absolute path to the folder that contains your data.
In this example, a relative path is defined for a folder that contains taxi trip data for New York City in .csv format:
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
- Add a Filesystem Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your Filesystem Data Source. To create your Data Source, execute one of the following sets of code:
- pandas
- Spark
data_source = context.data_sources.add_pandas_filesystem(
    name=data_source_name, base_directory=source_folder
)
data_source = context.data_sources.add_spark_filesystem(
    name=data_source_name, base_directory=source_folder
)
Choose from the following to see the full example code for a local or networked Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
# Create the Data Source:
data_source = context.data_sources.add_pandas_filesystem(
name=data_source_name, base_directory=source_folder
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
# This path is relative to the `base_directory` of the Data Context.
source_folder = "./data"
data_source_name = "my_filesystem_data_source"
# Create the Data Source:
data_source = context.data_sources.add_spark_filesystem(
name=data_source_name, base_directory=source_folder
)
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core with support for Amazon S3 dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in an S3 bucket.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
The following information is required when you create an Amazon S3 Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- bucket_name: The Amazon S3 bucket name.
- boto3_options: Optional. Additional options for the Data Source. In the following examples, the default values are used.
Replace the variable values with your own and run the following Python code to define data_source_name, bucket_name, and boto3_options:
data_source_name = "my_filesystem_data_source"
bucket_name = "great-expectations-docs-test"
boto3_options = {}
Additional options for boto3_options
The boto3_options parameter allows you to pass the following information:
- region_name: Your AWS region name.
- endpoint_url: Specifies an S3 endpoint. You can provide an environment variable reference such as "${S3_ENDPOINT}" to securely include this in your code. The string "${S3_ENDPOINT}" will be replaced with the value of the environment variable S3_ENDPOINT.
For more information on secure storage and retrieval of credentials in GX see Configure credentials.
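As an illustrative sketch, a boto3_options dictionary that pins the AWS region and points at a custom S3-compatible endpoint might look like the following. The region value and the S3_ENDPOINT environment variable are assumptions for demonstration:
# Example boto3_options with a region and an endpoint pulled from an environment variable:
boto3_options = {
    "region_name": "us-east-1",  # hypothetical region
    "endpoint_url": "${S3_ENDPOINT}",  # replaced with the value of the S3_ENDPOINT environment variable
}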
- Add an S3 Filesystem Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your S3 Filesystem Data Source. To create your Data Source, execute one of the following sets of code:
- pandas
- Spark
data_source = context.data_sources.add_pandas_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
data_source = context.data_sources.add_spark_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
Choose from the following to see the full example code for an S3 Filesystem Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "great-expectations-docs-test"
boto3_options = {}
# Create the Data Source:
data_source = context.data_sources.add_pandas_s3(
name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
bucket_name = "great-expectations-docs-test"
boto3_options = {}
# Create the Data Source:
data_source = context.data_sources.add_spark_s3(
name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core with support for Azure Blob Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in Azure Blob Storage.
Procedure
- Instructions
- Sample code
- Define the Data Source's parameters.
The following information is required when you create a Microsoft Azure Blob Storage Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- azure_options: Authentication settings.
The azure_options parameter accepts a dictionary which should include two keys: credential and either account_url or conn_str.
- credential: An Azure Blob Storage token.
- account_url: The URL of your Azure Blob Storage account. If you provide this, then conn_str should be left out of the dictionary.
- conn_str: The connection string for your Azure Blob Storage account. If you provide this, then account_url should not be included in the dictionary.
To keep your credentials secure you can define them as environment variables or entries in config_variables.yml. For more information on secure storage and retrieval of credentials in GX see Configure credentials.
Update the variables in the following code and execute it to define name and azure_options. In this example, the value for account_url is pulled from the environment variable AZURE_STORAGE_ACCOUNT_URL and the value for credential is pulled from the environment variable AZURE_CREDENTIAL:
data_source_name = "my_filesystem_data_source"
azure_options = {
    "account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
    "credential": "${AZURE_CREDENTIAL}",
}
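If you prefer to authenticate with a connection string instead of an account URL, the azure_options dictionary uses the conn_str key in place of account_url. The following is a sketch that assumes a hypothetical AZURE_CONNECTION_STRING environment variable:
data_source_name = "my_filesystem_data_source"
azure_options = {
    "conn_str": "${AZURE_CONNECTION_STRING}",  # hypothetical environment variable holding the connection string
    "credential": "${AZURE_CREDENTIAL}",
}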
- Add an Azure Blob Storage Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your Azure Blob Storage Data Source. To create your Data Source, execute one of the following sets of code:
- pandas
- Spark
# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
    name=data_source_name, azure_options=azure_options
)
data_source = context.data_sources.add_spark_abs(
    name=data_source_name, azure_options=azure_options
)
Choose from the following to see the full example code for an Azure Blob Storage Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
"account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
"credential": "${AZURE_CREDENTIAL}",
}
# Create the Data Source:
data_source = context.data_sources.add_pandas_abs(
name=data_source_name, azure_options=azure_options
)
import great_expectations as gx
context = gx.get_context()
# Define the Data Source's parameters:
data_source_name = "my_filesystem_data_source"
azure_options = {
"account_url": "${AZURE_STORAGE_ACCOUNT_URL}",
"credential": "${AZURE_CREDENTIAL}",
}
# Create the Data Source:
data_source = context.data_sources.add_spark_abs(
name=data_source_name, azure_options=azure_options
)
Prerequisites
- Python version 3.9 to 3.12
- An installation of GX Core with support for Google Cloud Storage dependencies
- Optional. To create a Spark Filesystem Data Source you will also need to install the Spark Python dependencies.
- A preconfigured Data Context
- Access to data files in Google Cloud Storage.
Procedure
- Instructions
- Sample code
- Set up the Data Source's credentials.
By default, GCS credentials are handled through the gcloud command line tool and the GOOGLE_APPLICATION_CREDENTIALS environment variable. The gcloud command line tool is used to set up authentication credentials, and the GOOGLE_APPLICATION_CREDENTIALS environment variable provides the path to the JSON file with those credentials.
For more information on using the gcloud command line tool, see Google Cloud's Cloud Storage client libraries documentation.
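Before creating the Data Source, you can optionally confirm that the GOOGLE_APPLICATION_CREDENTIALS environment variable is set and points at an existing credentials file. This check is a convenience sketch, not part of the GX API:
import os

# Optional sanity check: verify the default GCS credentials are discoverable.
credentials_path = os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
if not credentials_path or not os.path.isfile(credentials_path):
    raise RuntimeError(
        "Set GOOGLE_APPLICATION_CREDENTIALS to the path of your GCS credentials JSON file."
    )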
- Define the Data Source's parameters.
The following information is required when you create a GCS Data Source:
- name: A descriptive name used to reference the Data Source. This should be unique within the Data Context.
- bucket_or_name: The GCS bucket or instance name.
- gcs_options: Optional. A dictionary that can be used to specify an alternative method for providing GCS credentials.
The gcs_options dictionary can be left empty if the default GOOGLE_APPLICATION_CREDENTIALS environment variable is populated. Otherwise, the gcs_options dictionary should have either the key filename or the key info.
- filename: The path to your credentials JSON file. If you provide this, then the info key should be left out of the dictionary.
- info: The actual JSON data from your credentials file in the form of a string. If you provide this, then the filename key should not be included in the dictionary.
Update the variables in the following code and execute it to define name, bucket_or_name, and gcs_options. In this example the default GOOGLE_APPLICATION_CREDENTIALS environment variable points to the location of the credentials JSON file, and therefore the gcs_options dictionary is left empty:
data_source_name = "my_filesystem_data_source"
bucket_or_name = "test_docs_data"
gcs_options = {}
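If you cannot rely on the default environment variable, a sketch of gcs_options that points GX at a specific credentials file might look like the following. The path is an illustrative assumption:
# Alternative: provide the credentials file explicitly instead of relying on
# the GOOGLE_APPLICATION_CREDENTIALS environment variable.
gcs_options = {"filename": "/path/to/your/credentials.json"}  # hypothetical path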
- Add a Google Cloud Storage Data Source to your Data Context.
GX can leverage either pandas or Spark as the backend for your Google Cloud Storage Data Source. To create your Data Source, execute one of the following sets of code:
- pandas
- Spark
data_source = context.data_sources.add_pandas_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
data_source = context.data_sources.add_spark_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Choose from the following to see the full example code for a Google Cloud Storage Data Source, using either pandas or Spark to read the data files:
- pandas example
- Spark example
import great_expectations as gx
context = gx.get_context()
# Uses data_source_name, bucket_or_name, and gcs_options from the previous step.
data_source = context.data_sources.add_pandas_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
import great_expectations as gx
context = gx.get_context()
# Uses data_source_name, bucket_or_name, and gcs_options from the previous step.
data_source = context.data_sources.add_spark_gcs(
    name=data_source_name, bucket_or_name=bucket_or_name, gcs_options=gcs_options
)
Create a Data Asset
A Data Asset is a collection of related records within a Data Source. These records may be located within multiple files, but each Data Asset is only capable of reading a single specific file format which is determined when it is created. However, a Data Source may contain multiple Data Assets covering different file formats and groups of records.
GX provides two types of Data Assets for Filesystem Data Sources: File Data Assets and Directory Data Assets.
- File Data Asset
- Directory Data Asset
File Data Assets are used to retrieve data from individual files in formats such as .csv or .parquet. The file format that can be read by a File Data Asset is determined when the File Data Asset is created. The specific file that is read is determined by Batch Definitions that are added to the Data Asset after it is created.
Both Spark and pandas Filesystem Data Sources support File Data Assets for all supported Filesystem environments.
Directory Data Assets read one or more files in formats such as .csv or .parquet. The file format that can be read by a Directory Data Asset is determined when the Directory Data Asset is created. The data in the corresponding files is concatenated into a single table which can be retrieved as a whole, or further partitioned based on the value of a datetime field.
Spark Filesystem Data Sources support Directory Data Assets for all supported Filesystem environments. However, pandas Filesystem Data Sources do not support Directory Data Assets at all.
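As a quick illustration of the difference, the following hedged sketch assumes a Spark Filesystem Data Source named data_source and a folder of .csv files; the asset names and folder are illustrative:
# File Data Asset: each Batch Definition added later selects an individual .csv file.
file_asset = data_source.add_csv_asset(name="taxi_csv_files")

# Directory Data Asset (Spark only): all .csv files under data_directory are read
# and concatenated into a single table.
directory_asset = data_source.add_directory_csv_asset(
    name="taxi_csv_directory", data_directory="./data"  # hypothetical folder
)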
- Local or networked filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core.
- A preconfigured Data Context.
- Access to data files (such as .csv or .parquet files) in a local or networked folder hierarchy.
- A pandas or Spark Filesystem Data Source configured for local or networked data files.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
A File Data Asset for files in a local or networked folder hierarchy only needs one piece of information to be created.
- name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
This example uses taxi trip data stored in .csv files, so the name "taxi_csv_files" will be used for the Data Asset:
asset_name = "taxi_csv_files"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
file_csv_asset = data_source.add_csv_asset(name=asset_name)
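Other file formats follow the same pattern with the corresponding add_*_asset method. For example, assuming your Data Source's folder also holds Parquet files, a .parquet Data Asset could be added with a sketch like this (the asset name is hypothetical):
# Hypothetical Data Asset for .parquet files in the same Data Source:
file_parquet_asset = data_source.add_parquet_asset(name="taxi_parquet_files")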
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "taxi_csv_files"
# Add the Data Asset to the Data Source:
file_csv_asset = data_source.add_csv_asset(name=asset_name)
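Once the Data Asset has been added, you can optionally confirm that it is registered on the Data Source. This quick check assumes the Data Source's get_asset method:
# Optional check: retrieve the newly added Data Asset back from the Data Source.
retrieved_asset = data_source.get_asset(asset_name)
print(retrieved_asset)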
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core with support for Spark dependencies.
- A preconfigured Data Context.
- A Spark Filesystem Data Source configured to access data files in a local or networked folder hierarchy.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
A Directory Data Asset for files in a local or networked folder hierarchy only needs two pieces of information to be created.
- name: A descriptive name with which to reference the Data Asset. This name should be unique among all Data Assets for the same Data Source.
- data_directory: The path of the folder containing the data files for the Data Asset. This path can be relative to the Data Source's base_directory.
This example uses taxi trip data stored in .csv files in the data/ folder within the Data Source's directory tree:
asset_name = "taxi_csv_directory"
data_directory = "./data"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
directory_csv_asset = data_source.add_directory_csv_asset(
    name=asset_name, data_directory=data_directory
)
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "taxi_csv_directory"
data_directory = "./data"
# Add the Data Asset to the Data Source:
directory_csv_asset = data_source.add_directory_csv_asset(
name=asset_name, data_directory=data_directory
)
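With the Directory Data Asset in place, the concatenated table can later be requested through a Batch Definition. As a hedged preview of that next step, a whole-directory Batch Definition might be added as follows, assuming the add_batch_definition_whole_directory method and a hypothetical definition name:
# Hypothetical preview: request the entire concatenated directory as one Batch.
batch_definition = directory_csv_asset.add_batch_definition_whole_directory(
    name="all_taxi_csv_files"
)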
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core with support for Amazon S3 dependencies.
- A preconfigured Data Context.
- Access to data files in S3.
- A Filesystem Data Source configured to access data files in S3.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
context = gx.get_context()
# This example uses a File Data Context which already has
# a Data Source defined.
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
To define a File Data Asset for data in an S3 bucket you provide the following elements:
- asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
- s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Source's S3 bucket:
asset_name = "s3_taxi_csv_file_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
- To see the file formats supported by a pandas File Data Source, reference the .add_*_asset(...) methods in the API documentation for a PandasFilesystemDatasource.
- To see the file formats supported by a Spark File Data Source, reference the .add_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)
import great_expectations as gx
context = gx.get_context()
# This example uses a File Data Context which already has
# a Data Source defined.
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_file_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_csv_asset(name=asset_name, s3_prefix=s3_prefix)
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core with support for Amazon S3 dependencies and Spark dependencies.
- A preconfigured Data Context.
- A Filesystem Data Source configured to access data files in S3.
Procedure
- Instructions
- Sample code
- Retrieve your Data Source.
Replace the value of data_source_name in the following code with the name of your Data Source and execute it to retrieve your Data Source from the Data Context:
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
- Define your Data Asset's parameters.
To define a Directory Data Asset for data in an S3 bucket you provide the following elements:
- asset_name: A name by which you can reference the Data Asset in the future. This should be unique within the Data Source.
- s3_prefix: The path to the data files for the Data Asset, relative to the root of the S3 bucket.
- data_directory: The path of the folder containing data files for the Data Asset, relative to the root of the S3 bucket.
This example uses taxi trip data stored in .csv files in the data/taxi_yellow_tripdata_samples/ folder within the Data Source's S3 bucket:
asset_name = "s3_taxi_csv_directory_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
- Add the Data Asset to your Data Source.
A new Data Asset is created and added to a Data Source simultaneously. The file format that the Data Asset can read is determined by the method used when the Data Asset is added to the Data Source.
To see the file formats supported by a Spark Directory Data Source, reference the .add_directory_*_asset(...) methods in the API documentation for a SparkFilesystemDatasource.
The following example creates a Data Asset that can read .csv file data:
s3_file_data_asset = data_source.add_directory_csv_asset(
    name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
)
import great_expectations as gx
# This example uses a File Data Context which already has
# a Data Source defined.
context = gx.get_context()
data_source_name = "my_filesystem_data_source"
data_source = context.data_sources.get(data_source_name)
# Define the Data Asset's parameters:
asset_name = "s3_taxi_csv_directory_asset"
s3_prefix = "data/taxi_yellow_tripdata_samples/"
data_directory = "data/taxi_yellow_tripdata_samples/"
# Add the Data Asset to the Data Source:
s3_file_data_asset = data_source.add_directory_csv_asset(
name=asset_name, s3_prefix=s3_prefix, data_directory=data_directory
)
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core with support for Azure Blob Storage dependencies.
- A preconfigured Data Context.
- Access to data files in Azure Blob Storage.
- A pandas or Spark Filesystem Data Source configured for Azure Blob Storage data files.