great_expectations.datasource.batch_kwargs_generator.glob_reader_batch_kwargs_generator

Module Contents

Classes

GlobReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None, reader_method=None)

GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.

great_expectations.datasource.batch_kwargs_generator.glob_reader_batch_kwargs_generator.logger
class great_expectations.datasource.batch_kwargs_generator.glob_reader_batch_kwargs_generator.GlobReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None, reader_method=None)

Bases: great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator

GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.

A more interesting asset_glob might look like the following:

daily_logs:
  glob: daily_logs/*.csv
  partition_regex: daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv

The “glob” key ensures that every csv file in the daily_logs directory is considered a batch for this data asset. The “partition_regex” key ensures that files whose basename begins with a date (with components separated by a hyphen, space, forward slash, or period) are identified by a partition_id equal to just the date portion of their name. The regex requires exactly one separator character between date components, so a name such as 20200115_server.csv does not match.
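The effect of the partition_regex can be checked with Python's re module alone. The following is a minimal sketch, not the generator's own code: it assumes the first capture group is what becomes the partition_id, and the helper name extract_partition_id and the sample file names are invented for illustration.

```python
import re

# partition_regex copied from the daily_logs example above.
PARTITION_REGEX = (
    r"daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.]"
    r"(0[1-9]|[12][0-9]|3[01]))_(.*)\.csv"
)


def extract_partition_id(relative_path):
    """Return the date portion of the path, or None if the regex does not match.

    Assumption: the first capture group is used as the partition_id.
    """
    match = re.match(PARTITION_REGEX, relative_path)
    return match.group(1) if match else None


# Hypothetical file names, relative to the generator's base directory.
sample_paths = [
    "daily_logs/2020-01-15_server.csv",
    "daily_logs/2019.12.31_client.csv",
    "daily_logs/summary.csv",
]
partition_ids = {path: extract_partition_id(path) for path in sample_paths}
```

A file such as daily_logs/summary.csv still matches the glob, so it is still a batch of the asset; it simply has no date for the regex to extract. How the generator labels such files is not described here.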

A fully configured GlobReaderBatchKwargsGenerator in yml might look like the following:

my_datasource:
  class_name: PandasDatasource
  batch_kwargs_generators:
    my_generator:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/log
      reader_options:
        sep: "%"
        header: 0
      reader_method: csv
      asset_globs:
        wifi_logs:
          glob: wifi*.log
          partition_regex: wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log
          reader_method: csv
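To see which files the wifi_logs asset above would pick up, the glob and partition_regex can be emulated with the standard library. This is a rough sketch under stated assumptions rather than the generator's implementation: the function list_batches, the temporary directory standing in for /var/log, and the file names are all hypothetical, and the first capture group is assumed to be the partition_id.

```python
import glob
import os
import re
import tempfile

# Values copied from the wifi_logs asset in the configuration above.
ASSET_GLOB = "wifi*.log"
PARTITION_REGEX = r"wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log"


def list_batches(base_directory):
    """Map each file matching ASSET_GLOB to a partition id (None if no match)."""
    batches = {}
    for path in sorted(glob.glob(os.path.join(base_directory, ASSET_GLOB))):
        relative_path = os.path.relpath(path, base_directory)
        match = re.match(PARTITION_REGEX, relative_path)
        batches[relative_path] = match.group(1) if match else None
    return batches


# A throwaway directory stands in for /var/log; the file names are invented.
file_names = (
    "wifi-01-15-2020.log",
    "wifi-12-31-2019-backup.log",
    "wifi-notes.log",
    "syslog.log",
)
with tempfile.TemporaryDirectory() as base_directory:
    for name in file_names:
        with open(os.path.join(base_directory, name), "w") as handle:
            handle.write("time%signal\n")
    batches = list_batches(base_directory)
```

Here syslog.log is excluded by the glob, while wifi-notes.log matches the glob but carries no date for the regex to capture.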
recognized_batch_parameters
property reader_options(self)
property asset_globs(self)
property reader_method(self)
property base_directory(self)
get_available_data_asset_names(self)

Return the list of asset names known by this batch kwargs generator.

Returns

A list of available names

get_available_partition_ids(self, generator_asset=None, data_asset_name=None)

Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

data_asset_name – the data asset whose partitions should be returned.

Returns

A list of partition_id strings

_build_batch_kwargs(self, batch_parameters)
_get_data_asset_paths(self, data_asset_name)

Returns a list of filepaths associated with the given data_asset_name

Parameters

data_asset_name – the data asset whose file paths should be returned.

Returns

paths (list)

_get_data_asset_config(self, data_asset_name)
_get_iterator(self, data_asset_name, reader_method=None, reader_options=None, limit=None)
_build_batch_kwargs_path_iter(self, path_list, glob_config, reader_method=None, reader_options=None, limit=None)
_build_batch_kwargs_from_path(self, path, glob_config, reader_method=None, reader_options=None, limit=None)
_partitioner(self, path, glob_config)