How to configure a DataConnector to introspect and partition a file system or blob store
This guide will help you introspect and partition any file type data store (e.g., filesystem, cloud blob storage) using an Active Data Connector. For background on connecting to different backends, please see the Datasource-specific guides in the "Connecting to your data" section.
File-based introspection and partitioning are useful for:
- Exploring the types, subdirectory locations, and filepath naming structures of the files in your dataset, and
- Organizing the discovered files into Data Assets according to the identified structures.
Partitioning enables you to select the desired subsets of your dataset for Validation.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
We will use the "Yellow Taxi" dataset to walk you through the configuration of Data Connectors. Starting with the bare-bones version of either an Inferred Asset Data Connector or a Configured Asset Data Connector, we gradually build out the configuration to achieve the introspection of your files with semantics consistent with your goals.
To learn more about Datasources, Data Connectors, and Batches, please see our Datasources Guide.
Preliminary Steps
1. Instantiate your project's DataContext
Import Great Expectations.
import great_expectations as gx
2. Obtain DataContext
Load your DataContext into memory using the get_context() method.
context = gx.get_context()
Configuring Inferred Asset Data Connector and Configured Asset Data Connector
This section covers both options in turn: the first set of steps below configures an Inferred Asset Data Connector, and the second set configures a Configured Asset Data Connector.
1. Configure your Datasource
Start with an elementary Datasource configuration, containing just one general Inferred Asset Data Connector component:
datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <path_to_your_data_here>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
"""
Using the above example configuration, add in the path to a directory that contains your data. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Given that the glob_directive in the example configuration is *.csv, if you specified a directory containing CSV files, then you will see them listed as Available data_asset_names in the output of test_yaml_config().
Feel free to adjust your configuration and re-run test_yaml_config() to experiment as pertinent to your case.
An integral part of the recommended approach, illustrated throughout this exercise, is the use of the internal Great Expectations utility

context.test_yaml_config(
    yaml_string,
    pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)

to verify the correctness of the proposed YAML configuration before incorporating it and trying to use it.
For instance, try the following erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- the default_inferred_data_connector_name configuration section):
buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <path_to_your_data_here>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names: # required "data_asset_name" reserved group name for "InferredAssetFilePathDataConnector" is absent
      - nonexistent_group_name
"""
Then add in the path to a directory that contains your data, and again run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports only one data_asset_name, called DEFAULT_ASSET_NAME, signaling a misconfiguration.
Now try another erroneous DataConnector configuration as part of your Datasource (you can paste it directly underneath -- or instead of -- your existing DataConnector configuration sections):
another_buggy_data_connector_yaml = f"""
buggy_inferred_data_connector_name:
  class_name: InferredAssetFilesystemDataConnector
  base_directory: <path_to_bad_data_directory_here>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
"""
where you would add in the path to a directory that does not exist; then run this code again to test your configuration:
context.test_yaml_config(datasource_yaml)
You will see that the list of Data Assets is empty. Feel free to experiment with the arguments to

context.test_yaml_config(
    yaml_string,
    pretty_print: bool = True,
    return_mode: str = "instantiated_class",
    shorten_tracebacks: bool = False,
)
For instance, running

context.test_yaml_config(yaml_string, return_mode="report_object")

will return the information appearing in standard output as a Python dictionary.
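For example, here is a minimal sketch (assuming the datasource_yaml string defined above) of capturing and inspecting the returned report; the exact dictionary keys may vary across Great Expectations versions:

# Capture the report as a dictionary instead of reading standard output.
report = context.test_yaml_config(datasource_yaml, return_mode="report_object")
print(list(report.keys()))  # inspect the report's top-level structure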
Any structural errors (e.g., bad indentation, typos in class and configuration key names, etc.) will raise an exception that is sent to standard error. Running

context.test_yaml_config(yaml_string, shorten_tracebacks=True)

produces a shortened traceback showing the line numbers where the exception occurred, most likely caused by the failure of the required class (in this case InferredAssetFilesystemDataConnector) to be instantiated successfully.
2. Save the Datasource configuration to your DataContext
Once the basic Datasource configuration is error-free and satisfies your requirements, save it into your DataContext by using the add_datasource() function.

import yaml  # assumes PyYAML (any YAML parser providing safe_load() will do)

context.add_datasource(**yaml.safe_load(datasource_yaml))
3. Get names of available Data Assets
Getting the names of available data assets using an Inferred Asset Data Connector gives you visibility into the types and naming structures of files in your filesystem or blob storage:

available_data_asset_names = context.datasources[
    "taxi_datasource"
].get_available_data_asset_names(
    data_connector_names="default_inferred_data_connector_name"
)["default_inferred_data_connector_name"]

assert len(available_data_asset_names) == 36
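To eyeball the discovered naming structure, you can print the returned names (a minimal illustration):

# Print the discovered asset names to inspect file types and naming patterns.
for name in sorted(available_data_asset_names):
    print(name)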
1. Add Configured Asset Data Connector to your Datasource
Set up the bare-bones Configured Asset Data Connector to gradually apply structure to the discovered assets and partition them according to this structure. To begin, add the following configured_data_connector_name section to your Datasource configuration (please feel free to change the name as you deem appropriate for your use case):
datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <path_to_your_data_here>
    glob_directive: "*.csv"
    default_regex:
      pattern: (.*)
      group_names:
        - data_asset_name
    assets: {{}}
"""
Now run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
The message Available data_asset_names (0 of 0), corresponding to the configured_data_connector_name Data Connector, should appear in standard output, correctly reflecting the fact that the assets section of the configuration is empty.
2. Add a Data Asset for Configured Asset Data Connector to partition only by file name and type
You can employ a data asset that reflects a relatively general file structure (e.g., taxi_data_flat in the example configuration) to represent files in a directory that share a certain prefix (e.g., yellow_tripdata_sample_) and whose contents are of the desired type (e.g., CSV).
configured_data_connector_yaml = f"""
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <path_to_your_data_here>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\\.csv
      group_names:
        - filename
"""
Now run test_yaml_config() as part of evolving and testing components of your Great Expectations YAML configuration:
context.test_yaml_config(datasource_yaml)
Verify that exactly one Data Asset is reported for the configured_data_connector_name Data Connector, and that the structure of the file names corresponding to the identified Data Asset, taxi_data_flat, is consistent with the regular expression pattern specified in the configuration for this Data Asset.
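As a quick sanity check, you can also confirm that the new Data Asset resolves to Batches. Below is a minimal sketch, assuming the Datasource above has already been saved with add_datasource():

from great_expectations.core.batch import BatchRequest

# One Batch per file matched by the taxi_data_flat pattern.
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="taxi_data_flat",
)
batch_list = context.get_batch_list(batch_request=batch_request)
print(len(batch_list))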
3. Add a Data Asset for Configured Asset Data Connector to partition by year and month
In recognition of a finer observed file path structure, you can refine the partitioning strategy. For instance, the taxi_data_year_month asset in the following example configuration identifies three parts of a file path: name (as in "company name"), year, and month:
configured_data_connector_yaml = f"""
configured_data_connector_name:
  class_name: ConfiguredAssetFilesystemDataConnector
  base_directory: <path_to_your_data_here>
  glob_directive: "*.csv"
  default_regex:
    pattern: (.*)
    group_names:
      - data_asset_name
  assets:
    taxi_data_flat:
      base_directory: samples_2020
      pattern: (yellow_tripdata_sample_.+)\\.csv
      group_names:
        - filename
    taxi_data_year_month:
      base_directory: samples_2020
      pattern: ([\\w]+)_tripdata_sample_(\\d{{4}})-(\\d{{2}})\\.csv
      group_names:
        - name
        - year
        - month
"""
and run
context.test_yaml_config(datasource_yaml)
Verify that two Data Assets (taxi_data_flat and taxi_data_year_month) are now reported for the configured_data_connector_name Data Connector, and that the structures of the file names corresponding to the two identified Data Assets are consistent with the regular expression patterns specified in the configuration for these Data Assets.
This partitioning affords a rich set of filtering capabilities ranging from specifying the exact values of the file name structure's components to allowing Python functions for implementing custom criteria.
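For illustration, here is a minimal sketch of both filtering styles via the data_connector_query field of a BatchRequest, assuming the taxi_data_year_month asset above and sample files for early 2020:

from great_expectations.core.batch import BatchRequest

# Exact values for the partitioned group names (year and month).
batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="taxi_data_year_month",
    data_connector_query={
        "batch_filter_parameters": {"year": "2020", "month": "01"},
    },
)
batch_list = context.get_batch_list(batch_request=batch_request)

# A custom Python function over the batch identifiers (here: first quarter only).
batch_request_q1 = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="configured_data_connector_name",
    data_asset_name="taxi_data_year_month",
    data_connector_query={
        "custom_filter_function": lambda batch_identifiers: batch_identifiers["month"]
        in ("01", "02", "03"),
    },
)
batch_list_q1 = context.get_batch_list(batch_request=batch_request_q1)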
Finally, once your Data Connector configuration satisfies your requirements, save the enclosing Datasource into your DataContext using

context.add_datasource(**yaml.safe_load(datasource_yaml))
Consult the How to get one or more Batches of data from a configured Datasource guide for examples of the considerable flexibility in querying Batch objects along the different dimensions materialized as a result of partitioning the dataset as specified by the taxi_data_flat and taxi_data_year_month Data Assets.
To view the full scripts used in this page, see them on GitHub: