How to configure a Pandas/S3 Datasource

This guide shows how to configure a Pandas Datasource that reads data stored as files on AWS S3.


Docs for Stable API (up to 0.12.x)

Prerequisites: This how-to guide assumes you have already:

To add an S3-backed Pandas datasource, do the following:

  1. Edit your great_expectations/great_expectations.yml file

    Update your datasources: section to include a PandasDatasource.

        datasources:
          pandas_s3:
            class_name: PandasDatasource
  2. Load data from S3 using native S3 path-based Batch Kwargs.

    Because Pandas natively supports reading from S3 paths, this simple configuration is enough to load data from S3 by path.

    from great_expectations.data_context import DataContext

    context = DataContext()
    batch_kwargs = {
        "datasource": "pandas_s3",
        "path": "s3a://my_bucket/my_prefix/key.csv",
    }
    batch = context.get_batch(batch_kwargs, "existing_expectation_suite_name")
  3. Optionally, configure a BatchKwargsGenerator that will allow you to generate Data Assets and Partitions from your S3 bucket.

    Update your datasource configuration to include the new Batch Kwargs Generator:

        datasources:
          pandas_s3:
            class_name: PandasDatasource
            batch_kwargs_generators:
              pandas_s3_generator:  # the generator name is your choice
                class_name: S3GlobReaderBatchKwargsGenerator
                bucket: your_s3_bucket # Only the bucket name here (i.e., no prefix)
                assets:
                  your_first_data_asset_name:
                    prefix: prefix_to_folder_containing_your_first_data_asset_files/ # trailing slash is important
                    regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
                  your_second_data_asset_name:
                    prefix: prefix_to_folder_containing_your_second_data_asset_files/ # trailing slash is important
                    regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
                  your_third_data_asset_name:
                    prefix: prefix_to_folder_containing_your_third_data_asset_files/ # trailing slash is important
                    regex_filter: .*  # The regex filter will filter the results returned by S3 for the prefix to only those matching the regex. Note: construct your regex to match the entire S3 key (including the prefix).
            module_name: great_expectations.datasource
            data_asset_type:
              class_name: PandasDataset
              module_name: great_expectations.dataset

    Update the configuration of the assets: section to reflect your project’s data storage system. There is no limit on the number of data assets, but you should only keep the ones that are actually used in the configuration file (i.e., delete the unused ones from the above template and/or add as many as needed for your project).
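
The note about matching the entire S3 key can be illustrated with plain Python. This is a minimal sketch, not the generator's actual code: it assumes the filter is applied with `re.match` against the full key, and the `filter_keys` helper and the sample keys are hypothetical.

```python
import re


def filter_keys(keys, prefix, regex_filter):
    """Keep keys under `prefix` whose full key matches `regex_filter`."""
    pattern = re.compile(regex_filter)
    # re.match anchors at the start of the key, so the regex must
    # account for the prefix as well as the file name.
    return [k for k in keys if k.startswith(prefix) and pattern.match(k)]


# Hypothetical keys, as an S3 listing might return them.
keys = [
    "first_asset/2020-01-01.csv",
    "first_asset/2020-01-02.csv",
    "first_asset/README.txt",
    "second_asset/2020-01-01.csv",
]

# A regex that covers the prefix selects the CSV files of the first asset.
matched = filter_keys(keys, "first_asset/", r"first_asset/.*\.csv")

# A regex written against the file name alone matches nothing.
unmatched = filter_keys(keys, "first_asset/", r"\d{4}-\d{2}-\d{2}\.csv")

print(matched)
print(unmatched)
```

Under that assumption, a filter that describes only the file name silently yields an empty asset, which is worth ruling out first when an asset appears to have no files.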

    Note: Multiple datasources can be configured in the Data Context by adding a configuration block for each one under the datasources: section. Each datasource name should be at the same level of indentation.

  4. Optionally, run great_expectations suite scaffold to verify your new Datasource and BatchKwargsGenerator configurations.

    Since you edited the Great Expectations configuration file, the updated configuration should be tested to make sure that no errors were introduced.

    1. From the command line, run:

      great_expectations suite scaffold name_of_new_expectation_suite
      Select a datasource
          1. local_filesystem
          2. some_sql_db
          3. pandas_s3
      : 3

      Note: If pandas_s3 is the only configured datasource, this prompt is skipped and pandas_s3 is selected automatically.

    2. Choose to see “a list of data assets in this datasource”

      Would you like to:
          1. choose from a list of data assets in this datasource
          2. enter the path of a data file
      : 1
    3. Verify that all your data assets appear in the list

      Which data would you like to use?
          1. your_first_data_asset_name (file)
          2. your_second_data_asset_name (file)
          3. your_third_data_asset_name (file)
          Don't see the name of the data asset in the list above? Just type it

      When you select the number corresponding to a data asset, a Jupyter notebook will open, pre-populated with the code for adding expectations to the expectation suite specified on the command line against the data set you selected.

      Check the composition of the batch_kwargs variable at the top of the notebook to make sure that the S3 file used appropriately corresponds to the data set you selected. Repeat this check for all data sets you configured. An inconsistency is likely due to an incorrect regular expression pattern in the respective data set configuration.
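
The check described above can also be done in code rather than by eye. The following is a rough sketch that assumes the S3 location appears in batch_kwargs as a URL string like the one used in step 2; the helper names `split_s3_path` and `path_matches_asset` are hypothetical and not part of Great Expectations.

```python
from urllib.parse import urlparse


def split_s3_path(path):
    """Split an S3 URL into (bucket, key); accepts s3://, s3a:// and s3n://."""
    parsed = urlparse(path)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        raise ValueError(f"not an S3 path: {path}")
    return parsed.netloc, parsed.path.lstrip("/")


def path_matches_asset(path, bucket, prefix):
    """Return True if `path` points into `bucket` under `prefix`."""
    path_bucket, key = split_s3_path(path)
    return path_bucket == bucket and key.startswith(prefix)


# The path from step 2, checked against the expected bucket and prefix.
ok = path_matches_asset(
    "s3a://my_bucket/my_prefix/key.csv", "my_bucket", "my_prefix/"
)
# A path under a different prefix fails the check.
wrong = path_matches_asset(
    "s3a://my_bucket/other_prefix/key.csv", "my_bucket", "my_prefix/"
)
print(ok, wrong)
```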

Docs for Experimental API (0.13)

Prerequisites: This how-to guide assumes you have already:

To add an S3-backed Pandas datasource, do the following:

  1. Install the required modules

    If you haven’t already, install these modules for connecting to S3.

    pip install boto3
    pip install fsspec
    pip install s3fs
  2. Instantiate a DataContext

    import great_expectations as ge
    context = ge.get_context()
  3. Create or copy a yaml config

    Parameters can be set as strings, or passed in as environment variables. In the following example, a yaml config is configured for a Datasource, with a ConfiguredAssetS3DataConnector and a PandasExecutionEngine. The S3 bucket name and prefix are passed in as environment variables.

    Note: The ConfiguredAssetS3DataConnector used in this example is closely related to the InferredAssetS3DataConnector, with some key differences. More information can be found in the Core Great Expectations Concepts document.
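
The config string in this step interpolates `bucket` and `prefix`, so both need to be defined first. One way to do that, sketched here with hypothetical environment variable names and placeholder fallbacks (neither comes from Great Expectations), is:

```python
import os

# GE_S3_BUCKET and GE_S3_PREFIX are hypothetical names; use whatever
# variables your environment defines. The fallback values are placeholders.
bucket = os.environ.get("GE_S3_BUCKET", "my_bucket")
prefix = os.environ.get("GE_S3_PREFIX", "my_prefix/")

print(f"bucket={bucket} prefix={prefix}")
```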

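    config = f"""
            class_name: Datasource
            execution_engine:
                class_name: PandasExecutionEngine
            data_connectors:
                my_data_connector:
                    class_name: ConfiguredAssetS3DataConnector
                    bucket: {bucket}
                    prefix: {prefix}
                    assets:
                        test_asset:
                    default_regex:
                        pattern: (.+)\\.csv
                        group_names:
                            - full_name
        """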

    Additional examples of yaml configurations for various filesystems and databases can be found in the following document: How to configure DataContext components using test_yaml_config
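
To see what the pattern and the full_name group in the config capture, the regular expression can be tried on its own. This is only an illustration using Python's `re` module on a file name from the sample output in this guide; the `batch_identifiers` helper is hypothetical, and how the data connector applies the pattern internally is not shown here.

```python
import re

# Pattern and group names as they appear in the config after the
# f-string turns "\\." into "\.".
pattern = r"(.+)\.csv"
group_names = ["full_name"]


def batch_identifiers(data_reference):
    """Map each regex group to its name, or return None if nothing matches."""
    match = re.match(pattern, data_reference)
    if match is None:
        return None
    return dict(zip(group_names, match.groups()))


print(batch_identifiers("abe_20201119_200.csv"))
print(batch_identifiers("notes.txt"))
```

The first call prints `{'full_name': 'abe_20201119_200'}` and the second prints `None`, since the name does not end in .csv.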

  4. Run context.test_yaml_config.


    When executed, test_yaml_config will instantiate the component and run through a self_check procedure to verify that the component works as expected.

    The resulting output will look something like this:

    Attempting to instantiate class from config...
        Instantiating as a Datasource, since class_name is Datasource
    Instantiating class from config without an explicit class_name is dangerous. Consider adding an explicit class_name for None
        Successfully instantiated Datasource
    Execution engine: PandasExecutionEngine
    Data connectors:
        my_data_connector : ConfiguredAssetS3DataConnector
        Available data_asset_names (1 of 1):
            test_asset (1 of 1): ['abe_20201119_200.csv']
        Unmatched data_references (0 of 0): []
        Choosing an example data reference...
            Reference chosen: abe_20201119_200.csv
            Fetching batch data...
            Showing 5 rows
       Unnamed: 0                                           Name PClass    Age     Sex  Survived  SexCode
    0           1                   Allen, Miss Elisabeth Walton    1st  29.00  female         1        1
    1           2                    Allison, Miss Helen Loraine    1st   2.00  female         0        1
    2           3            Allison, Mr Hudson Joshua Creighton    1st  30.00    male         0        0
    3           4  Allison, Mrs Hudson JC (Bessie Waldo Daniels)    1st  25.00  female         0        1
    4           5                  Allison, Master Hudson Trevor    1st   0.92    male         1        0

    Note: In this example, the yaml config only creates the datasource for the current session. After you exit Python, the datasource and its configuration will be gone. To make them persistent, copy your yaml config into the datasources section of your great_expectations/great_expectations.yml config file.

    If something about your configuration wasn’t set up correctly, test_yaml_config will raise an error. Whenever possible, test_yaml_config provides helpful warnings and error messages. It can’t solve every problem, but it can solve many. For example, if your credentials are not allowed to list the bucket, the traceback ends like this:

    raise error_class(parsed_response, operation_name)
    botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the ListObjectsV2 operation: Access Denied

Additional Notes

  1. Additional options are available for a more fine-grained customization of the S3-backed Pandas data sources.

    delimiter: "/"  # This is the delimiter for the bucket keys (paths inside the buckets).  By default, it is "/".
    boto3_options:
      endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    reader_options:  # Note that reader options can be specified globally or per-asset.
        sep: ","
    max_keys: 100  # The maximum number of keys to fetch in a single request to S3 (default is 100).
  2. Errors in generated BatchKwargs during configuration of the S3GlobReaderBatchKwargsGenerator are likely due to an incorrect regular expression pattern in the respective data set configuration.

  3. The default values of the various options satisfy the vast majority of scenarios. However, in certain cases, the developers may need to override them. For instance, reader_options, which can be specified globally and/or at the per-asset level, provide a mechanism for customizing the separator character inside CSV files.

  4. Note that specifying the --no-jupyter flag on the command line will initialize the specified expectation suite in the great_expectations/expectations directory, but suppress the launching of the Jupyter notebook.

    great_expectations suite scaffold name_of_new_expectation_suite --no-jupyter

    If you resume editing the given expectation suite at a later time, please first verify that the batch_kwargs contain the correct S3 path for the intended data source.