How to choose which DataConnector to use

This guide demonstrates how to choose which Data Connectors to configure within your Datasources. A Data Connector provides the configuration details, based on the source data system, that a Datasource needs to define Data Assets. A Datasource provides a standard API for accessing and interacting with data from a wide variety of source systems.

Prerequisites: This how-to guide assumes you have:

Great Expectations provides three types of DataConnector classes. Two classes are for connecting to Data Assets (collections of records within a Datasource, usually named based on the underlying data system and sliced to correspond to a desired specification) stored as file-system-like data (this includes files on disk, but also S3 object stores, etc.) as well as relational database data:

  • An InferredAssetDataConnector infers data_asset_name by using a regex that takes advantage of patterns that exist in the filename or folder structure.
  • A ConfiguredAssetDataConnector gives users the most fine-grained control, and requires an explicit listing of each Data Asset you want to connect to.

InferredAssetDataConnectors           | ConfiguredAssetDataConnectors
--------------------------------------|----------------------------------------
InferredAssetFilesystemDataConnector  | ConfiguredAssetFilesystemDataConnector
InferredAssetFilePathDataConnector    | ConfiguredAssetFilePathDataConnector
InferredAssetAzureDataConnector       | ConfiguredAssetAzureDataConnector
InferredAssetGCSDataConnector         | ConfiguredAssetGCSDataConnector
InferredAssetS3DataConnector          | ConfiguredAssetS3DataConnector
InferredAssetSqlDataConnector         | ConfiguredAssetSqlDataConnector
InferredAssetDBFSDataConnector        | ConfiguredAssetDBFSDataConnector

InferredAssetDataConnectors and ConfiguredAssetDataConnectors are used to define Data Assets and their associated data_references. A Data Asset is an abstraction that can consist of one or more data_references to CSVs or relational database tables. For instance, you might have a yellow_tripdata Data Asset containing information about taxi rides, which consists of twelve data_references to twelve CSVs, each consisting of one month of data.

The third type of DataConnector class is for providing a Batch's data directly at runtime (a Batch is a selection of records from a Data Asset):

  • A RuntimeDataConnector enables you to use a RuntimeBatchRequest to wrap an in-memory dataframe, a filepath, or a SQL query. The request must include batch identifiers that uniquely identify the data (e.g. a run_id from an Airflow DAG run).

If you know, for example, that your Pipeline Runner will already have your batch data in memory at runtime, you can configure a RuntimeDataConnector with unique batch identifiers. See the documents How to configure a RuntimeDataConnector and How to create a Batch of data from an in-memory Spark or Pandas dataframe to get started with RuntimeDataConnectors.
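As a rough illustration of the constraints described above, the following sketch assembles the pieces a runtime request carries: exactly one wrapped data source plus at least one batch identifier. The helper `build_runtime_batch_request_kwargs` is hypothetical and is not part of Great Expectations. The runtime parameter key names (`batch_data`, `path`, `query`) and the connector name are assumptions modeled on the three options listed above, so check the referenced guides before relying on them.

```python
from typing import Any, Dict

# Assumed key names for the three things a runtime request can wrap:
# an in-memory dataframe, a filepath, or a SQL query.
RUNTIME_PARAMETER_KEYS = {"batch_data", "path", "query"}


def build_runtime_batch_request_kwargs(
    datasource_name: str,
    data_connector_name: str,
    data_asset_name: str,
    runtime_parameters: Dict[str, Any],
    batch_identifiers: Dict[str, str],
) -> Dict[str, Any]:
    """Hypothetical helper: validate and bundle runtime request fields."""
    supplied = RUNTIME_PARAMETER_KEYS & set(runtime_parameters)
    if len(supplied) != 1:
        raise ValueError(
            "Provide exactly one of: " + ", ".join(sorted(RUNTIME_PARAMETER_KEYS))
        )
    if not batch_identifiers:
        raise ValueError("At least one batch identifier (e.g. run_id) is required")
    return {
        "datasource_name": datasource_name,
        "data_connector_name": data_connector_name,
        "data_asset_name": data_asset_name,
        "runtime_parameters": dict(runtime_parameters),
        "batch_identifiers": dict(batch_identifiers),
    }


request_kwargs = build_runtime_batch_request_kwargs(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",  # hypothetical name
    data_asset_name="yellow_tripdata",
    runtime_parameters={"path": "<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv"},
    batch_identifiers={"run_id": "example_run_2019_01"},  # e.g. an Airflow run_id
)
print(request_kwargs["batch_identifiers"])
```

The batch identifiers are what let you tell one runtime Batch from another later on, which is why the sketch rejects an empty mapping.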

If you aren't sure which of the remaining DataConnector types to use, the following examples use DataConnector classes designed to connect to files on disk, namely InferredAssetFilesystemDataConnector and ConfiguredAssetFilesystemDataConnector, to demonstrate the difference between them.

When to use an InferredAssetDataConnector

If you have the following <MY DIRECTORY>/ directory in your filesystem, and you want to treat the yellow_tripdata_*.csv files as batches within the yellow_tripdata Data Asset, and also do the same for files in the green_tripdata directory:

<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv

Then this configuration:

datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_inferred_data_connector_name:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: <MY DIRECTORY>/
    glob_directive: "*/*.csv"
    default_regex:
      group_names:
        - data_asset_name
        - year
        - month
      pattern: (.*)/.*(\d{4})-(\d{2})\.csv
"""

will make available the following Data Assets and data_references:

Available data_asset_names (2 of 2):
green_tripdata (3 of 3): ['green_tripdata/*2019-01.csv', 'green_tripdata/*2019-02.csv', 'green_tripdata/*2019-03.csv']
yellow_tripdata (3 of 3): ['yellow_tripdata/*2019-01.csv', 'yellow_tripdata/*2019-02.csv', 'yellow_tripdata/*2019-03.csv']

Unmatched data_references (0 of 0):[]

Note that the InferredAssetFilesystemDataConnector infers data_asset_names from the regex you provide. This is the key difference between an InferredAssetDataConnector and a ConfiguredAssetDataConnector, and it requires that one of the group_names in the default_regex configuration be data_asset_name.
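To see why data_asset_name must be one of the group_names, it can help to apply the pattern by hand. The sketch below uses only Python's `re` module on the relative paths from the listing. It illustrates the grouping idea and is not Great Expectations' internal implementation; the function `infer_assets` is a hypothetical name.

```python
import re
from collections import defaultdict

# Pattern and group names copied from the default_regex configuration.
PATTERN = re.compile(r"(.*)/.*(\d{4})-(\d{2})\.csv")
GROUP_NAMES = ["data_asset_name", "year", "month"]

# Paths relative to <MY DIRECTORY>/, as in the directory listing.
RELATIVE_PATHS = [
    "yellow_tripdata/yellow_tripdata_2019-01.csv",
    "yellow_tripdata/yellow_tripdata_2019-02.csv",
    "yellow_tripdata/yellow_tripdata_2019-03.csv",
    "green_tripdata/2019-01.csv",
    "green_tripdata/2019-02.csv",
    "green_tripdata/2019-03.csv",
]


def infer_assets(relative_paths):
    """Group paths by the captured data_asset_name; collect non-matching paths."""
    assets = defaultdict(list)
    unmatched = []
    for path in relative_paths:
        match = PATTERN.match(path)
        if match is None:
            unmatched.append(path)
            continue
        groups = dict(zip(GROUP_NAMES, match.groups()))
        asset_name = groups.pop("data_asset_name")
        # The remaining groups (year, month) identify the batch within the asset.
        assets[asset_name].append((path, groups))
    return dict(assets), unmatched


assets, unmatched = infer_assets(RELATIVE_PATHS)
for name in sorted(assets):
    print(name, [identifiers for _, identifiers in assets[name]])
print("unmatched:", unmatched)
```

Because the first capture group is the folder name, both directories become Data Assets without being listed explicitly, while year and month distinguish the individual files within each one.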

The glob_directive is provided to give the DataConnector information about the directory structure to expect for each Data Asset. The default glob_directive for the InferredAssetFilesystemDataConnector is "*" and therefore must be overridden when your data_references exist in subdirectories.
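The effect of the glob pattern can be checked with the standard library alone. The sketch below assumes that glob_directive follows ordinary glob semantics, as implemented by Python's `pathlib.Path.glob`. It builds a throwaway directory that mirrors part of the listing and compares "*" with "*/*.csv".

```python
import tempfile
from pathlib import Path

SAMPLE_FILES = (
    "yellow_tripdata/yellow_tripdata_2019-01.csv",
    "green_tripdata/2019-01.csv",
)

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for rel in SAMPLE_FILES:
        target = base / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()

    # "*" only matches entries directly under the base directory;
    # here those are the two subdirectories, so no files are found.
    top_level_files = sorted(
        p.relative_to(base).as_posix() for p in base.glob("*") if p.is_file()
    )

    # "*/*.csv" descends exactly one level, which is where the CSVs live.
    nested_files = sorted(
        p.relative_to(base).as_posix() for p in base.glob("*/*.csv")
    )

print("files matched by '*':", top_level_files)
print("files matched by '*/*.csv':", nested_files)
```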

When to use a ConfiguredAssetDataConnector

On the other hand, ConfiguredAssetFilesystemDataConnector requires an explicit listing of each Data Asset you want to connect to. This tends to be helpful when the naming conventions for your Data Assets are less standardized, but you have a strong understanding of the semantics governing the segmentation of data (files, database tables).

If you have the same <MY DIRECTORY>/ directory in your filesystem,

<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-01.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-02.csv
<MY DIRECTORY>/yellow_tripdata/yellow_tripdata_2019-03.csv
<MY DIRECTORY>/green_tripdata/2019-01.csv
<MY DIRECTORY>/green_tripdata/2019-02.csv
<MY DIRECTORY>/green_tripdata/2019-03.csv

Then this configuration:

datasource_yaml = r"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
  default_configured_data_connector_name:
    class_name: ConfiguredAssetFilesystemDataConnector
    base_directory: <MY DIRECTORY>/
    assets:
      yellow_tripdata:
        base_directory: yellow_tripdata/
        pattern: yellow_tripdata_(\d{4})-(\d{2})\.csv
        group_names:
          - year
          - month
      green_tripdata:
        base_directory: green_tripdata/
        pattern: (\d{4})-(\d{2})\.csv
        group_names:
          - year
          - month
"""

will make available the following Data Assets and data_references:

Available data_asset_names (2 of 2):
green_tripdata (3 of 3): ['2019-01.csv', '2019-02.csv', '2019-03.csv']
yellow_tripdata (3 of 3): ['yellow_tripdata_2019-01.csv', 'yellow_tripdata_2019-02.csv', 'yellow_tripdata_2019-03.csv']

Unmatched data_references (0 of 0):[]
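With the configured variant, each asset carries its own pattern, so the asset name comes from the configuration rather than from a capture group. A minimal sketch of that idea, using plain `re` and a dictionary that mirrors the assets section above, follows. The function `batch_identifiers_for` is a hypothetical name and this is not how Great Expectations implements it internally.

```python
import re
from typing import Dict, Optional

# Mirrors the assets section of the configuration above.
ASSETS = {
    "yellow_tripdata": {
        "base_directory": "yellow_tripdata/",
        "pattern": r"yellow_tripdata_(\d{4})-(\d{2})\.csv",
        "group_names": ["year", "month"],
    },
    "green_tripdata": {
        "base_directory": "green_tripdata/",
        "pattern": r"(\d{4})-(\d{2})\.csv",
        "group_names": ["year", "month"],
    },
}


def batch_identifiers_for(asset_name: str, filename: str) -> Optional[Dict[str, str]]:
    """Match a filename (relative to the asset's base_directory) against
    that asset's own pattern; return the named groups or None."""
    config = ASSETS[asset_name]
    match = re.match(config["pattern"], filename)
    if match is None:
        return None
    return dict(zip(config["group_names"], match.groups()))


print(batch_identifiers_for("yellow_tripdata", "yellow_tripdata_2019-03.csv"))
print(batch_identifiers_for("green_tripdata", "2019-02.csv"))
# A green-style filename does not satisfy the yellow_tripdata pattern.
print(batch_identifiers_for("yellow_tripdata", "2019-02.csv"))
```

This is the trade-off between the two connector types: more configuration per asset, in exchange for patterns that do not have to share a single naming convention.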

Additional Notes

To view the full script used in this page, see it on GitHub: