great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator

Module Contents

Classes

S3GlobReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)

The S3 batch kwargs generator supports generating batches of data from an S3 bucket.

great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator.boto3
great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator.logger
class great_expectations.datasource.batch_kwargs_generator.s3_batch_kwargs_generator.S3GlobReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)

Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator

The S3 batch kwargs generator supports generating batches of data from an S3 bucket. Each asset must be defined individually with a prefix and a key filter (regex_filter in the example below). Several additional configuration parameters are available for assets (see below).

Example configuration:

datasources:
  my_datasource:
    ...
    batch_kwargs_generator:
      my_s3_generator:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my_bucket.my_organization.priv
        reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
        reader_options:  # Note that reader options can be specified globally or per-asset
          sep: ","
        delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
        boto3_options:
          endpoint_url: $S3_ENDPOINT  # Use the S3_ENDPOINT environment variable to determine which endpoint to use
        max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to S3. When batch_kwargs are accessed through an iterator, the iterator silently fetches again if more keys are available
        assets:
          my_first_asset:
            prefix: my_first_asset/
            regex_filter: .*  # Keeps only the keys returned by S3 for this prefix that match the regex
            directory_assets: True
          access_logs:
            prefix: access_logs
            regex_filter: access_logs/2019.*\.csv\.gz
            sep: "~"
            max_keys: 100
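The interaction between prefix, regex_filter, and max_keys in the configuration above can be illustrated with a small, self-contained sketch. It does not call S3 or Great Expectations: BUCKET_KEYS, list_page, and iter_asset_keys are hypothetical names invented for this illustration, and the use of re.match (anchored at the start of the key) is an assumption rather than a statement about the library's internals.

```python
import re
from typing import Iterator, List, Optional, Tuple

# Hypothetical in-memory stand-in for the keys stored in a bucket.
BUCKET_KEYS = [
    "access_logs/2018-12-31.csv.gz",
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2019-01-02.csv.gz",
    "access_logs/2019-01-02.txt",
    "my_first_asset/part-0.parquet",
]


def list_page(
    keys: List[str], prefix: str, max_keys: int, start: int = 0
) -> Tuple[List[str], Optional[int]]:
    """Return one page of keys under `prefix` and the next start index (None when done)."""
    matching = [k for k in keys if k.startswith(prefix)]
    page = matching[start:start + max_keys]
    next_start = start + max_keys if start + max_keys < len(matching) else None
    return page, next_start


def iter_asset_keys(
    keys: List[str], prefix: str, regex_filter: str = ".*", max_keys: int = 1000
) -> Iterator[str]:
    """Yield keys under `prefix` that match `regex_filter`, fetching page by page."""
    pattern = re.compile(regex_filter)
    start: Optional[int] = 0
    while start is not None:
        # Each loop iteration models one listing request of at most `max_keys` keys.
        page, start = list_page(keys, prefix, max_keys, start)
        for key in page:
            if pattern.match(key):
                yield key


selected = list(
    iter_asset_keys(
        BUCKET_KEYS,
        prefix="access_logs",
        regex_filter=r"access_logs/2019.*\.csv\.gz",
        max_keys=2,
    )
)
print(selected)
```

With max_keys=2 the four keys under access_logs are read in two pages, and the caller sees one continuous stream of matching keys, which is the behavior the max_keys comment describes.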
recognized_batch_parameters
property reader_options(self)
property assets(self)
property bucket(self)
get_available_data_asset_names(self)

Return the list of asset names known by this batch kwargs generator.

Returns

A list of available names

_get_iterator(self, data_asset_name, reader_method=None, reader_options=None, limit=None)
_build_batch_kwargs_path_iter(self, path_list, reader_options=None, limit=None)
_build_batch_kwargs(self, batch_parameters)
_build_batch_kwargs_from_key(self, key, asset_config=None, reader_method=None, reader_options=None, limit=None)
_get_asset_options(self, asset_config, iterator_dict)
_build_asset_iterator(self, asset_config, iterator_dict, reader_method=None, reader_options=None, limit=None)
get_available_partition_ids(self, generator_asset=None, data_asset_name=None)

Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

data_asset_name – the data asset whose partitions should be returned.

Returns

A list of partition_id strings

_partitioner(self, key, asset_config)
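As a rough illustration of how a partitioner can turn object keys into partition_id strings, the sketch below extracts an ID from each key with a regular expression and collects the distinct results. It is a standalone example under stated assumptions, not the library's implementation: partition_id_from_key, available_partition_ids, and the partition_regex and match_group_id parameters are hypothetical names chosen for this sketch.

```python
import re
from typing import List, Optional


def partition_id_from_key(
    key: str, partition_regex: str, match_group_id: int = 1
) -> Optional[str]:
    """Return the capture group that identifies the partition, or None if no match."""
    match = re.match(partition_regex, key)
    if match is None:
        return None
    return match.group(match_group_id)


def available_partition_ids(keys: List[str], partition_regex: str) -> List[str]:
    """Apply the partitioner to every key and keep distinct IDs in first-seen order."""
    ids: List[str] = []
    for key in keys:
        partition_id = partition_id_from_key(key, partition_regex)
        if partition_id is not None and partition_id not in ids:
            ids.append(partition_id)
    return ids


# Hypothetical keys for a single data asset.
example_keys = [
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2019-01-02.csv.gz",
    "access_logs/README.txt",
]
partition_ids = available_partition_ids(
    example_keys, r"access_logs/(\d{4}-\d{2}-\d{2})\.csv\.gz"
)
print(partition_ids)
```

Keys that do not match the pattern are skipped here. A real partitioner might instead fall back to using the whole key as the ID.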