great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator

Module Contents

Classes

BatchKwargsGenerator(name, datasource)

BatchKwargsGenerators produce identifying information, called “batch_kwargs,” that datasources can use to get individual batches of data.

great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.logger
class great_expectations.datasource.batch_kwargs_generator.batch_kwargs_generator.BatchKwargsGenerator(name, datasource)

Bases: object

BatchKwargsGenerators produce identifying information, called “batch_kwargs,” that datasources can use to get individual batches of data. They add flexibility in how to obtain data, such as with time-based partitioning, downsampling, or other techniques appropriate for the datasource.

For example, a batch kwargs generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
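As a concrete illustration of the SQL example above (the values and keys here are hypothetical; the exact kwargs a datasource accepts depend on its type and version), such batch_kwargs might be a plain dictionary:

```python
# Hypothetical batch_kwargs for the SQL example above. The "query" key is
# what a SqlAlchemyDatasource could execute to materialize the batch; the
# exact set of accepted keys depends on the datasource.
batch_kwargs = {
    "query": (
        "SELECT * FROM events "
        "WHERE event_timestamp >= '2012-02-07' "
        "AND event_timestamp < '2012-02-08'"
    ),
    "data_asset_name": "events",
}

print(batch_kwargs["data_asset_name"])
```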

A batch is a sample from a data asset, sliced according to a particular rule. For example, an hourly slice of the Events table or the “most recent user records.”

A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_kwargs” assembled by the batch kwargs generator. While not every datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.

Example Generator Configurations follow:

my_datasource_1:
  class_name: PandasDatasource
  batch_kwargs_generators:
    # This generator will provide two data assets, corresponding to the globs defined under the "file_logs"
    # and "data_asset_2" keys. The file_logs asset will be partitioned according to the match group
    # defined in partition_regex
    default:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/logs
      reader_options:
        sep: "%"
      globs:
        file_logs:
          glob: logs/*.gz
          partition_regex: logs/file_(\d{0,4})_\.log\.gz
        data_asset_2:
          glob: data/*.csv

my_datasource_2:
  class_name: PandasDatasource
  batch_kwargs_generators:
    # This generator will create one data asset per subdirectory in /data
    # Each asset will have partitions corresponding to the filenames in that subdirectory
    default:
      class_name: SubdirReaderBatchKwargsGenerator
      reader_options:
        sep: "%"
      base_directory: /data

my_datasource_3:
  class_name: SqlalchemyDatasource
  batch_kwargs_generators:
    # This generator will search for a file named with the name of the requested data asset and the
    # .sql suffix to open with a query to use to generate data
    default:
      class_name: QueryBatchKwargsGenerator
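To see how the `partition_regex` in the `my_datasource_1` example drives partitioning, note that the first match group of the regex becomes the partition_id for each matching file. A minimal sketch of that extraction with the standard-library `re` module (the file names are invented for illustration):

```python
import re

# The partition_regex from the my_datasource_1 example above: the first
# match group (the digits) becomes the partition_id for each matching file.
partition_regex = re.compile(r"logs/file_(\d{0,4})_\.log\.gz")

# Hypothetical files under the generator's base_directory.
paths = [
    "logs/file_2017_.log.gz",
    "logs/file_2018_.log.gz",
    "data/other.csv",  # does not match; no partition_id
]

partition_ids = [m.group(1) for p in paths if (m := partition_regex.match(p))]
print(partition_ids)  # -> ['2017', '2018']
```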
_batch_kwargs_type
recognized_batch_parameters
property name(self)
abstract _get_iterator(self, data_asset_name, **kwargs)
abstract get_available_data_asset_names(self)

Return the list of asset names known by this batch kwargs generator.

Returns

A list of available names

abstract get_available_partition_ids(self, generator_asset=None, data_asset_name=None)

Applies the current _partitioner to the batches available on data_asset_name and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

data_asset_name – the data asset whose partitions should be returned.

Returns

A list of partition_id strings

get_config(self)
reset_iterator(self, generator_asset=None, data_asset_name=None, **kwargs)
get_iterator(self, generator_asset=None, data_asset_name=None, **kwargs)
build_batch_kwargs(self, data_asset_name=None, partition_id=None, **kwargs)
abstract _build_batch_kwargs(self, batch_parameters)
yield_batch_kwargs(self, generator_asset=None, data_asset_name=None, **kwargs)
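The abstract methods above define the contract a concrete generator must satisfy. As a standalone sketch of that contract (deliberately not subclassing the real base class, which additionally provides iterator caching, configuration handling, and deprecated `generator_asset` aliases), a minimal in-memory generator might look like this:

```python
# Standalone sketch of the BatchKwargsGenerator contract. The class and
# asset data below are invented for illustration; the real base class in
# great_expectations also caches iterators and manages configuration.
class InMemoryBatchKwargsGenerator:
    def __init__(self, name, assets):
        # assets: {data_asset_name: {partition_id: raw data}}
        self.name = name
        self._assets = assets

    def get_available_data_asset_names(self):
        """Return the list of asset names known by this generator."""
        return sorted(self._assets)

    def get_available_partition_ids(self, data_asset_name=None):
        """Return valid partition_id strings for the named asset."""
        return sorted(self._assets[data_asset_name])

    def _build_batch_kwargs(self, batch_parameters):
        # Identifying information a datasource could use to fetch the batch.
        return {
            "data_asset_name": batch_parameters["data_asset_name"],
            "partition_id": batch_parameters["partition_id"],
        }

    def _get_iterator(self, data_asset_name, **kwargs):
        for partition_id in self.get_available_partition_ids(data_asset_name):
            yield self._build_batch_kwargs(
                {"data_asset_name": data_asset_name, "partition_id": partition_id}
            )

    def yield_batch_kwargs(self, data_asset_name=None, **kwargs):
        # Simplified: builds a fresh iterator each call, where the real
        # base class would reuse a cached one.
        return next(self._get_iterator(data_asset_name, **kwargs))


gen = InMemoryBatchKwargsGenerator("default", {"events": {"2012-02-07": []}})
print(gen.get_available_data_asset_names())        # -> ['events']
print(gen.yield_batch_kwargs(data_asset_name="events"))
```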