Generator Module

class great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator(name, datasource)

BatchKwargsGenerators produce identifying information, called “batch_kwargs”, that datasources can use to get individual batches of data. They add flexibility in how data is obtained, for example through time-based partitioning, downsampling, or other techniques appropriate for the datasource.

For example, a generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.
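
As a minimal sketch of the example above, the following function builds query-style batch_kwargs for one day of a table. The function name, the "query" key, and the "timestamp" column are assumptions made for illustration, not the library's confirmed interface.

```python
from datetime import date, timedelta


def build_daily_query_batch_kwargs(table, day):
    """Return hypothetical query-style batch_kwargs for one calendar day."""
    next_day = day + timedelta(days=1)
    # The half-open interval [day, next_day) covers every timestamp on `day`.
    query = (
        f"SELECT * FROM {table} "
        f"WHERE timestamp >= '{day.isoformat()}' "
        f"AND timestamp < '{next_day.isoformat()}'"
    )
    return {"query": query}


batch_kwargs = build_daily_query_batch_kwargs("events", date(2012, 2, 7))
print(batch_kwargs["query"])
```

A half-open date range is used so that the query does not depend on the precision of the timestamp column.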

A batch is a sample from a data asset, sliced according to a particular rule: for example, an hourly slice of the Events table or the “most recent users records.”

A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_kwargs” assembled by the generator. While not every datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.

Example Generator Configurations follow:

my_datasource_1:
  class_name: PandasDatasource
  generators:
    # This generator will provide two data assets, corresponding to the globs defined under the "file_logs"
    # and "data_asset_2" keys. The file_logs asset will be partitioned according to the match group
    # defined in partition_regex
    default:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/logs
      reader_options:
        sep: "
      asset_globs:
        file_logs:
          glob: logs/*.gz
          partition_regex: logs/file_(\d{0,4})_\.log\.gz
        data_asset_2:
          glob: data/*.csv

my_datasource_2:
  class_name: PandasDatasource
  generators:
    # This generator will create one data asset per subdirectory in /data
    # Each asset will have partitions corresponding to the filenames in that subdirectory
    default:
      class_name: SubdirReaderBatchKwargsGenerator
      reader_options:
        sep: "
      base_directory: /data

my_datasource_3:
  class_name: SqlAlchemyDatasource
  generators:
    # This generator will search for a file named with the name of the requested generator asset and the
    # .sql suffix to open with a query to use to generate data
    default:
      class_name: QueryBatchKwargsGenerator
recognized_batch_parameters = {}
property name
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

get_config()
reset_iterator(generator_asset, **kwargs)
get_iterator(generator_asset, **kwargs)
build_batch_kwargs(name=None, partition_id=None, **kwargs)

The key workhorse. Docs forthcoming.

yield_batch_kwargs(generator_asset, **kwargs)

InMemoryGenerator

QueryBatchKwargsGenerator

class great_expectations.datasource.generator.query_generator.QueryBatchKwargsGenerator(name='default', datasource=None, query_store_backend=None, queries=None)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

Produce query-style batch_kwargs from SQL files stored on disk.
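
The idea can be sketched with the standard library alone: look for a file named after the requested asset with a .sql suffix and wrap its contents as batch_kwargs. The helper name, the directory layout, and the "query" key below are assumptions for illustration; the class itself accepts a query_store_backend argument, which this sketch does not model.

```python
import tempfile
from pathlib import Path


def load_query_batch_kwargs(query_dir, generator_asset):
    """Hypothetical helper: read <generator_asset>.sql and wrap it as batch_kwargs."""
    sql_path = Path(query_dir) / f"{generator_asset}.sql"
    if not sql_path.is_file():
        raise FileNotFoundError(f"No query file for asset {generator_asset!r}")
    return {"query": sql_path.read_text().strip()}


# Demonstrate with a throwaway directory holding one query file.
with tempfile.TemporaryDirectory() as query_dir:
    (Path(query_dir) / "recent_users.sql").write_text("SELECT * FROM users LIMIT 100\n")
    result = load_query_batch_kwargs(query_dir, "recent_users")

print(result)
```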

recognized_batch_parameters = {'name', 'partition_id', 'query_parameters'}
add_query(generator_asset, query)
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

TableBatchKwargsGenerator

class great_expectations.datasource.generator.table_generator.TableBatchKwargsGenerator(name='default', datasource=None, assets=None)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

Provide access to already materialized tables or views in a database.

TableBatchKwargsGenerator can be used to define specific data asset names that take and substitute parameters, for example to support referring to the same data asset but with different schemas depending on provided batch_kwargs.

Python's string template syntax ($-prefixed placeholders) is used to substitute portions of the schema and table names. For example, consider the following configuration:

my_generator:
  class_name: TableBatchKwargsGenerator
  assets:
    my_table:
      schema: $schema
      table: my_table

In that case, the asset my_datasource/my_generator/my_table will refer to a table called my_table in a schema defined in batch_kwargs.
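
The substitution can be reproduced with Python's string.Template, which uses the same $name placeholder syntax. This is a sketch of the mechanism only; the resolve_table helper and the shape of its arguments are hypothetical.

```python
from string import Template

asset_config = {"schema": "$schema", "table": "my_table"}


def resolve_table(asset_config, parameters):
    """Hypothetical helper: fill $placeholders and return a qualified table name."""
    # safe_substitute leaves unknown placeholders in place instead of raising.
    schema = Template(asset_config["schema"]).safe_substitute(parameters)
    table = Template(asset_config["table"]).safe_substitute(parameters)
    return f"{schema}.{table}"


qualified_name = resolve_table(asset_config, {"schema": "analytics"})
print(qualified_name)
```

With no parameters supplied, the placeholder is left untouched, which makes a missing value easy to spot in the resulting name.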

recognized_batch_parameters = {'limit', 'name', 'offset', 'query_parameters'}
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

SubdirReaderBatchKwargsGenerator

class great_expectations.datasource.generator.subdir_reader_generator.SubdirReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, known_extensions=None, reader_method=None)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

The SubdirReaderBatchKwargsGenerator inspects a filesystem and produces path-based batch_kwargs.

SubdirReaderBatchKwargsGenerator recognizes generator_assets using two criteria:
  • for files directly in ‘base_directory’ with recognized extensions (.csv, .tsv, .parquet, .xls, .xlsx, .json), it uses the name of the file without the extension

  • for other files or directories in ‘base_directory’, it uses the file or directory name

SubdirReaderBatchKwargsGenerator sees all files inside a subdirectory of base_directory as batches of one data asset.

SubdirReaderBatchKwargsGenerator can also include configured reader_options which will be added to batch_kwargs generated by this generator.
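
The two naming criteria above can be sketched as a small function over the entries of base_directory. The helper name is hypothetical, and the sketch works on names only: it does not check whether an entry is a file or a directory.

```python
import os

KNOWN_EXTENSIONS = {".csv", ".tsv", ".parquet", ".xls", ".xlsx", ".json"}


def asset_name_for(entry, known_extensions=KNOWN_EXTENSIONS):
    """Hypothetical helper: derive an asset name from one entry of base_directory."""
    stem, extension = os.path.splitext(entry)
    if extension in known_extensions:
        # Recognized extension: the asset is named after the file without it.
        return stem
    # Anything else (other files, directories) keeps its full name.
    return entry


entries = ["users.csv", "events", "notes.txt", "orders.parquet"]
asset_names = [asset_name_for(entry) for entry in entries]
print(asset_names)
```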

recognized_batch_parameters = {'name', 'partition_id'}
property reader_options
property known_extensions
property reader_method
property base_directory
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

GlobReaderBatchKwargsGenerator

class great_expectations.datasource.generator.glob_reader_generator.GlobReaderBatchKwargsGenerator(name='default', datasource=None, base_directory='/data', reader_options=None, asset_globs=None, reader_method=None)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

GlobReaderBatchKwargsGenerator processes files in a directory according to glob patterns to produce batches of data.

A more interesting asset_glob might look like the following:

daily_logs:
  glob: daily_logs/*.csv
  partition_regex: daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv

The “glob” key ensures that every csv file in the daily_logs directory is considered a batch for this data asset. The “partition_regex” key ensures that files whose basename begins with a date (with components separated by a hyphen, space, forward slash, or period) will be identified by a partition_id equal to just the date portion of their name.
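
To see what the partition_regex extracts, the pattern can be tried directly with Python's re module. This sketch assumes the partition_id is the first match group and that the pattern is anchored at the start of the relative path; the helper name and the sample paths are made up for illustration.

```python
import re

PARTITION_REGEX = re.compile(
    r"daily_logs/((19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]))_(.*)\.csv"
)


def partition_id_for(path):
    """Return the first match group (the date), or None if the path does not match."""
    match = PARTITION_REGEX.match(path)
    return match.group(1) if match else None


dated = partition_id_for("daily_logs/2019-01-15_events.csv")
undated = partition_id_for("daily_logs/readme.csv")
print(dated, undated)
```

A file that matches the glob but not the regex, such as readme.csv here, yields no date-based partition_id in this sketch.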

A fully configured GlobReaderBatchKwargsGenerator in yml might look like the following:

my_datasource:
  class_name: PandasDatasource
  generators:
    my_generator:
      class_name: GlobReaderBatchKwargsGenerator
      base_directory: /var/log
      reader_options:
        sep: "%"
        header: 0
      reader_method: csv
      asset_globs:
        wifi_logs:
          glob: wifi*.log
          partition_regex: wifi-((0[1-9]|1[012])-(0[1-9]|[12][0-9]|3[01])-20\d\d).*\.log
          reader_method: csv
recognized_batch_parameters = {'limit', 'name', 'reader_method', 'reader_options'}
property reader_options
property asset_globs
property reader_method
property base_directory
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

S3GlobReaderBatchKwargsGenerator

class great_expectations.datasource.generator.s3_generator.S3GlobReaderBatchKwargsGenerator(name='default', datasource=None, bucket=None, reader_options=None, assets=None, delimiter='/', reader_method=None, boto3_options=None, max_keys=1000)

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

S3 Generator provides support for generating batches of data from an S3 bucket. For the S3 generator, assets must be individually defined using a prefix and glob, although several additional configuration parameters are available for assets (see below).

Example configuration:

datasources:
  my_datasource:
    ...
    generators:
      my_s3_generator:
        class_name: S3GlobReaderBatchKwargsGenerator
        bucket: my_bucket.my_organization.priv
        reader_method: parquet  # This will be automatically inferred from suffix where possible, but can be explicitly specified as well
        reader_options:  # Note that reader options can be specified globally or per-asset
          sep: ","
        delimiter: "/"  # Note that this is the delimiter for the BUCKET KEYS. By default it is "/"
        boto3_options:
          endpoint_url: $S3_ENDPOINT  # Use the S3_ENDPOINT environment variable to determine which endpoint to use
        max_keys: 100  # The maximum number of keys to fetch in a single list_objects request to s3. When accessing batch_kwargs through an iterator, the iterator will silently refetch if more keys were available
        assets:
          my_first_asset:
            prefix: my_first_asset/
            regex_filter: .*  # The regex filter will filter the results returned by S3 for the key and prefix to only those matching the regex
            directory_assets: True
          access_logs:
            prefix: access_logs
            regex_filter: access_logs/2019.*\.csv\.gz
            sep: "~"
            max_keys: 100
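
The combined effect of prefix and regex_filter can be sketched on a plain list of keys. In S3 the prefix is normally applied by the list request itself; here both steps run client-side for illustration, and the helper name and sample keys are hypothetical.

```python
import re


def filter_keys(keys, prefix, regex_filter):
    """Hypothetical helper: keep keys under `prefix` that match `regex_filter`."""
    pattern = re.compile(regex_filter)
    return [
        key for key in keys
        if key.startswith(prefix) and pattern.match(key)
    ]


keys = [
    "access_logs/2019-01-01.csv.gz",
    "access_logs/2018-12-31.csv.gz",
    "access_logs/2019-01-01.json",
    "my_first_asset/part-0.parquet",
]
selected = filter_keys(keys, "access_logs", r"access_logs/2019.*\.csv\.gz")
print(selected)
```
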
property reader_options
property assets
property bucket
get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names

build_batch_kwargs_from_partition_id(generator_asset, partition_id=None, reader_options=None, limit=None)
get_available_partition_ids(generator_asset)

Applies the current _partitioner to the batches available on generator_asset and returns a list of valid partition_id strings that can be used to identify batches of data.

Parameters

generator_asset – the generator asset whose partitions should be returned.

Returns

A list of partition_id strings

DatabricksTableBatchKwargsGenerator

class great_expectations.datasource.generator.databricks_generator.DatabricksTableBatchKwargsGenerator(name='default', datasource=None, database='default')

Bases: great_expectations.datasource.generator.batch_kwargs_generator.BatchKwargsGenerator

Meant to be used in a Databricks notebook.

get_available_data_asset_names()

Return the list of asset names known by this generator.

Returns

A list of available names