How to configure a ConfiguredAssetDataConnector

This guide demonstrates how to configure a ConfiguredAssetDataConnector, and provides several examples you can use for configuration.

Great Expectations provides two types of DataConnector classes for connecting to file-system-like data, which includes files on disk as well as object stores such as S3:

  • A ConfiguredAssetDataConnector requires an explicit listing of each DataAsset you want to connect to. This allows more fine-tuning, but also requires more setup.

  • An InferredAssetDataConnector infers data_asset_name by using a regex that takes advantage of patterns that exist in the filename or folder structure.

If you’re not sure which one to use, please check out How to choose which DataConnector to use.

Set up a Datasource

All of the examples below assume you’re testing configuration using something like:

import great_expectations as ge
context = ge.get_context()
config = """
  class_name: Datasource
  execution_engine:
    class_name: PandasExecutionEngine
  data_connectors:
    my_filesystem_data_connector:
      # data_connector configuration goes here
"""
context.test_yaml_config(
    name="my_pandas_datasource",
    yaml_config=config
)

If you’re not familiar with the test_yaml_config method, please check out: How to configure Data Context components using test_yaml_config.

Choose a DataConnector

ConfiguredAssetDataConnectors, such as ConfiguredAssetFilesystemDataConnector and ConfiguredAssetS3DataConnector, require each DataAsset to be explicitly named. Each DataAsset can have its own regex pattern and group_names; if configured, these override any pattern or group_names under default_regex.

Imagine you have the following files in my_directory/:

my_directory/alpha-1.csv
my_directory/alpha-2.csv
my_directory/alpha-3.csv

We could create a DataAsset alpha that contains 3 data_references (alpha-1.csv, alpha-2.csv, and alpha-3.csv). In that case, the configuration would look like the following:

my_data_source:
  class_name: Datasource
  execution_engine:
    class_name: PandasExecutionEngine
  data_connectors:
    my_filesystem_data_connector:
      class_name: ConfiguredAssetFilesystemDataConnector
      base_directory: my_directory/
      default_regex:
      assets:
        alpha:
          pattern: alpha-(.*)\.csv
          group_names:
            - index

Notice that we have specified a pattern that captures the number after alpha- in the filename and assigns it to the group_name index.

The configuration would also work with a regex that captures the entire filename (i.e. pattern: (.*)\.csv). However, capturing the index on its own allows batch_identifiers to be used to retrieve a specific Batch of the DataAsset.
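
To make the difference between the two patterns concrete, here is a minimal sketch that uses only Python's standard re module. It is not how Great Expectations is implemented internally, and the helper name to_batch_identifiers is hypothetical; it simply pairs each capture group with the corresponding entry in group_names.

```python
import re

def to_batch_identifiers(pattern, group_names, filename):
    """Pair each regex capture group with its group name (illustrative helper)."""
    match = re.match(pattern, filename)
    if match is None:
        return None  # the filename is not matched by this pattern
    return dict(zip(group_names, match.groups()))

# Capturing only the number after "alpha-" yields a compact identifier.
index_only = to_batch_identifiers(r"alpha-(.*)\.csv", ["index"], "alpha-2.csv")
print(index_only)  # {'index': '2'}

# Capturing the whole filename stem still matches, but is less specific.
whole_name = to_batch_identifiers(r"(.*)\.csv", ["index"], "alpha-2.csv")
print(whole_name)  # {'index': 'alpha-2'}
```

With the first pattern, the value you would pass as a batch identifier is just "2"; with the second, it would have to be the full stem "alpha-2".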

Later on we could retrieve the data in alpha-2.csv of alpha as its own batch using context.get_batch() by specifying {"index": "2"} as the batch_identifier.

my_batch = context.get_batch(
    datasource_name="my_data_source",
    data_connector_name="my_filesystem_data_connector",
    data_asset_name="alpha",
    batch_identifiers={"index": "2"},
)

This ability to access specific Batches using batch_identifiers is very useful when validating DataAssets that span multiple files. For more information on batches and batch_identifiers, please refer to the Core Concepts document.

A corresponding configuration for ConfiguredAssetS3DataConnector would look similar but would require bucket and prefix values instead of base_directory.

class_name: ConfiguredAssetS3DataConnector
bucket: MY_S3_BUCKET
prefix: MY_S3_BUCKET_PREFIX
default_regex:
assets:
    alpha:
      pattern: alpha-(.*)\.csv
      group_names:
        - index

The following examples will show scenarios that ConfiguredAssetDataConnectors can help you analyze, using ConfiguredAssetFilesystemDataConnector.

Note: For simplicity, the examples will only show the configuration for data_connectors.

Example 1: Basic Configuration for a single DataAsset

Continuing the example above, imagine you have the following files in the directory test/:

test/alpha-1.csv
test/alpha-2.csv
test/alpha-3.csv

Then this configuration…

class_name: ConfiguredAssetFilesystemDataConnector
base_directory: test/
default_regex:
assets:
    alpha:
      pattern: alpha-(.*)\.csv
      group_names:
        - index

…will make available alpha as a single DataAsset with the following data_references:

Available data_asset_names (1 of 1):
    alpha (3 of 3): [
        'alpha-1.csv',
        'alpha-2.csv',
        'alpha-3.csv'
    ]

Once configured, you can get a Validator from the Data Context as follows:

my_validator = context.get_validator(
    datasource_name="my_data_source",
    data_connector_name="my_filesystem_data_connector",
    data_asset_name="alpha",
    batch_identifiers={
        "index": "2"
    },
    # the suite with this name must exist by the time of this call
    expectation_suite_name="my_expectation_suite",
)

But what if the regex does not match any files in the directory?

Then this configuration…

class_name: ConfiguredAssetFilesystemDataConnector
base_directory: test/
default_regex:
assets:
    alpha:
      pattern: beta-(.*)\.csv
      group_names:
        - index

…will give you this output:

Successfully instantiated ConfiguredAssetFilesystemDataConnector
Available data_asset_names (1 of 1):
    alpha (0 of 0): []

Unmatched data_references (3 of 3): ['alpha-1.csv', 'alpha-2.csv', 'alpha-3.csv']

Notice that alpha has 0 data_references, and there are 3 Unmatched data_references listed. This indicates that some part of the configuration is incorrect and needs to be reviewed. In our case, changing pattern to alpha-(.*)\.csv fixes the problem and gives the same output as above.
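
The matched and unmatched counts can be reproduced outside of Great Expectations with a short sketch. This assumes filenames are tested against the pattern with re.match, and the helper split_references is a hypothetical name used only for illustration.

```python
import re

def split_references(pattern, filenames):
    """Return (matched, unmatched) filenames for a regex pattern."""
    regex = re.compile(pattern)
    matched = [name for name in filenames if regex.match(name)]
    unmatched = [name for name in filenames if not regex.match(name)]
    return matched, unmatched

files = ["alpha-1.csv", "alpha-2.csv", "alpha-3.csv"]

# The misconfigured pattern matches nothing, so every file is unmatched.
matched, unmatched = split_references(r"beta-(.*)\.csv", files)
print(len(matched), len(unmatched))  # 0 3

# The corrected pattern matches all three files.
fixed_matched, fixed_unmatched = split_references(r"alpha-(.*)\.csv", files)
print(len(fixed_matched), len(fixed_unmatched))  # 3 0
```

Checking a pattern this way before running test_yaml_config can be a quick way to spot a typo in the regex.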

Example 2: Basic configuration with more than one DataAsset

Here’s a similar example, but this time two data_assets are mixed together in one folder.

Note: For an equivalent configuration using InferredAssetFilesystemDataConnector, please see Example 2 in How to configure an InferredAssetDataConnector.

test_data/alpha-2020-01-01.csv
test_data/beta-2020-01-01.csv
test_data/alpha-2020-01-02.csv
test_data/beta-2020-01-02.csv
test_data/alpha-2020-01-03.csv
test_data/beta-2020-01-03.csv

Then this configuration…

class_name: ConfiguredAssetFilesystemDataConnector
base_directory: test_data/
assets:
    alpha:
        group_names:
            - year
            - month
            - day
        pattern: alpha-(\d{4})-(\d{2})-(\d{2})\.csv
    beta:
        group_names:
            - year
            - month
            - day
        pattern: beta-(\d{4})-(\d{2})-(\d{2})\.csv

…will now make alpha and beta both available as DataAssets, with the following data_references:

Available data_asset_names (2 of 2):
    alpha (3 of 3): [
        'alpha-2020-01-01.csv',
        'alpha-2020-01-02.csv',
        'alpha-2020-01-03.csv'
    ]

    beta (3 of 3): [
        'beta-2020-01-01.csv',
        'beta-2020-01-02.csv',
        'beta-2020-01-03.csv'
    ]

Unmatched data_references (0 of 0): []
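
As a rough illustration of how the two patterns separate the mixed files, the sketch below sorts the filenames into per-asset lists of date identifiers. It assumes one group name per capture group (year, month, day), and the assets dictionary is a hypothetical in-memory stand-in for the YAML, not a Great Expectations object.

```python
import re

# Hypothetical mirror of the asset configuration: (pattern, group_names).
assets = {
    "alpha": (r"alpha-(\d{4})-(\d{2})-(\d{2})\.csv", ["year", "month", "day"]),
    "beta": (r"beta-(\d{4})-(\d{2})-(\d{2})\.csv", ["year", "month", "day"]),
}

files = [
    "alpha-2020-01-01.csv", "beta-2020-01-01.csv",
    "alpha-2020-01-02.csv", "beta-2020-01-02.csv",
    "alpha-2020-01-03.csv", "beta-2020-01-03.csv",
]

# Collect the identifiers of every file matched by each asset's pattern.
batches = {name: [] for name in assets}
for filename in files:
    for name, (pattern, group_names) in assets.items():
        match = re.match(pattern, filename)
        if match:
            batches[name].append(dict(zip(group_names, match.groups())))

print(batches["alpha"][1])  # {'year': '2020', 'month': '01', 'day': '02'}
```

Each file lands in exactly one asset, which corresponds to the empty list of unmatched data_references in the output above.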

Example 3: Example with Nested Folders

In the following example, files are placed in folders that match the data_asset_names we want: A, B, C, and D.

test_dir/A/A-1.csv
test_dir/A/A-2.csv
test_dir/A/A-3.csv
test_dir/B/B-1.txt
test_dir/B/B-2.txt
test_dir/B/B-3.txt
test_dir/C/C-2017.csv
test_dir/C/C-2018.csv
test_dir/C/C-2019.csv
test_dir/D/D-aaa.csv
test_dir/D/D-bbb.csv
test_dir/D/D-ccc.csv
test_dir/D/D-ddd.csv
test_dir/D/D-eee.csv

Then this configuration…

module_name: great_expectations.datasource.data_connector
class_name: ConfiguredAssetFilesystemDataConnector
base_directory: test_dir/
assets:
    A:
        base_directory: A/
    B:
        base_directory: B/
        pattern: (.*)-(.*)\.txt
        group_names:
            - part_1
            - part_2
    C:
        glob_directive: "*"
        base_directory: C/
    D:
        glob_directive: "*"
        base_directory: D/
default_regex:
    pattern: (.*)-(.*)\.csv
    group_names:
        - part_1
        - part_2

…will now make A, B, C and D available as DataAssets, with the following data_references:

Available data_asset_names (4 of 4):
    A (3 of 3): [
        'A-1.csv',
        'A-2.csv',
        'A-3.csv',
    ]
    B (3 of 3): [
        'B-1.txt',
        'B-2.txt',
        'B-3.txt',
    ]
    C (3 of 3): [
        'C-2017.csv',
        'C-2018.csv',
        'C-2019.csv',
    ]
    D (5 of 5): [
        'D-aaa.csv',
        'D-bbb.csv',
        'D-ccc.csv',
        'D-ddd.csv',
        'D-eee.csv',
    ]
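
The way an asset-level pattern takes precedence over default_regex can be sketched with plain dictionaries. This is an illustration under the assumption that asset-level keys simply replace the matching default keys; effective_regex and identifiers are hypothetical helper names, not part of the Great Expectations API.

```python
import re

default_regex = {"pattern": r"(.*)-(.*)\.csv", "group_names": ["part_1", "part_2"]}

# Only B overrides the default; A, C, and D fall back to default_regex.
assets = {
    "A": {},
    "B": {"pattern": r"(.*)-(.*)\.txt", "group_names": ["part_1", "part_2"]},
    "C": {},
    "D": {},
}

def effective_regex(asset_config, default):
    """Asset-level keys win; anything missing comes from the default."""
    return {**default, **asset_config}

def identifiers(asset_name, filename):
    """Apply the asset's effective regex to a filename."""
    regex = effective_regex(assets[asset_name], default_regex)
    match = re.match(regex["pattern"], filename)
    return dict(zip(regex["group_names"], match.groups())) if match else None

print(identifiers("B", "B-1.txt"))     # {'part_1': 'B', 'part_2': '1'}
print(identifiers("C", "C-2017.csv"))  # {'part_1': 'C', 'part_2': '2017'}
print(identifiers("A", "A-1.txt"))     # None
```

The last call returns None because A uses the default .csv pattern, which is why B needs its own pattern to pick up the .txt files.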

Example 4: Example with Explicit data_asset_names and more complex nesting

In this example, the assets alpha, beta and gamma are being explicitly defined in the configuration, and have a more complex nesting pattern.

my_base_directory/alpha/files/go/here/alpha-202001.csv
my_base_directory/alpha/files/go/here/alpha-202002.csv
my_base_directory/alpha/files/go/here/alpha-202003.csv
my_base_directory/beta_here/beta-202001.txt
my_base_directory/beta_here/beta-202002.txt
my_base_directory/beta_here/beta-202003.txt
my_base_directory/beta_here/beta-202004.txt
my_base_directory/gamma-202001.csv
my_base_directory/gamma-202002.csv
my_base_directory/gamma-202003.csv
my_base_directory/gamma-202004.csv
my_base_directory/gamma-202005.csv

The following configuration…

class_name: ConfiguredAssetFilesystemDataConnector
base_directory: my_base_directory/
default_regex:
    pattern: ^(.+)-(\d{4})(\d{2})\.(csv|txt)$
    group_names:
        - data_asset_name
        - year_dir
        - month_dir
assets:
    alpha:
        base_directory: alpha/files/go/here/
        glob_directive: "*.csv"
    beta:
        base_directory: beta_here/
        glob_directive: "*.txt"
    gamma:
        glob_directive: "*.csv"

…will make alpha, beta and gamma available as DataAssets, with the following data_references:

Available data_asset_names (3 of 3):
    alpha (3 of 3): [
        'alpha-202001.csv',
        'alpha-202002.csv',
        'alpha-202003.csv'
    ]
    beta (4 of 4):  [
        'beta-202001.txt',
        'beta-202002.txt',
        'beta-202003.txt',
        'beta-202004.txt'
    ]
    gamma (5 of 5): [
        'gamma-202001.csv',
        'gamma-202002.csv',
        'gamma-202003.csv',
        'gamma-202004.csv',
        'gamma-202005.csv',
    ]
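
To see how a per-asset directory combined with a glob narrows down which files belong to each asset, the sketch below recreates the layout in a temporary directory and lists it with pathlib. It assumes that glob_directive behaves like an ordinary non-recursive glob and that each asset directory is resolved relative to the connector's base_directory; the assets dictionary is a hypothetical stand-in for the YAML.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the file listing for this example.
relative_paths = (
    [f"alpha/files/go/here/alpha-2020{m:02d}.csv" for m in range(1, 4)]
    + [f"beta_here/beta-2020{m:02d}.txt" for m in range(1, 5)]
    + [f"gamma-2020{m:02d}.csv" for m in range(1, 6)]
)

# Hypothetical mirror of the asset settings: (directory relative to base, glob).
assets = {
    "alpha": ("alpha/files/go/here", "*.csv"),
    "beta": ("beta_here", "*.txt"),
    "gamma": ("", "*.csv"),
}

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "my_base_directory"
    for rel in relative_paths:
        path = base / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()

    # A non-recursive glob only sees files directly inside the asset's directory.
    listing = {
        name: sorted(p.name for p in (base / subdir).glob(pattern))
        for name, (subdir, pattern) in assets.items()
    }

for name, names in listing.items():
    print(name, len(names))
```

Because the glob for gamma is not recursive, it picks up only the five files directly under the base directory and not the alpha files nested several levels below it.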

Additional Resources