great_expectations.datasource

Subpackages

Package Contents

Classes

DataConnector(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None)

DataConnectors produce identifying information, called “batch_spec” that ExecutionEngines

LegacyDatasource(name, data_context=None, data_asset_type=None, batch_kwargs_generators=None, **kwargs)

A Datasource connects to a compute environment and one or more storage environments and produces batches of data

BaseDatasource(name: str, execution_engine=None, data_context_root_directory: Optional[str] = None)

A Datasource is the glue between an ExecutionEngine and a DataConnector.

Datasource(name: str, execution_engine=None, data_connectors=None, data_context_root_directory: Optional[str] = None)

A Datasource is the glue between an ExecutionEngine and a DataConnector.

PandasDatasource(name='pandas', data_context=None, data_asset_type=None, batch_kwargs_generators=None, boto3_options=None, reader_method=None, reader_options=None, limit=None, **kwargs)

The PandasDatasource produces PandasDataset objects and supports generators capable of

SimpleSqlalchemyDatasource(name: str, connection_string: str = None, url: str = None, credentials: dict = None, engine=None, introspection: dict = None, tables: dict = None)

A specialized Datasource for SQL backends

SparkDFDatasource(name='default', data_context=None, data_asset_type=None, batch_kwargs_generators=None, spark_config=None, **kwargs)

The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with local

SqlAlchemyDatasource(name='default', data_context=None, data_asset_type=None, credentials=None, batch_kwargs_generators=None, **kwargs)

A SqlAlchemyDatasource will provide data_assets converting batch_kwargs using the following rules:

class great_expectations.datasource.DataConnector(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None)

DataConnectors produce identifying information, called a “batch_spec,” that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data, such as time-based partitioning, downsampling, or other techniques appropriate for the Datasource.

For example, a DataConnector could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.

A batch is a sample from a data asset, sliced according to a particular rule: for example, an hourly slice of the Events table, or the “most recent user records.”

A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_spec” assembled by the data connector. While not every Datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.
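As a concrete illustration of the Events-table example above, a hand-assembled dict shaped like such a “batch_spec” might look as follows. The key names here are assumptions for illustration, not the library's canonical schema.

```python
# Illustrative only: a hand-written dict shaped like the "batch_spec"
# a DataConnector might assemble for the Events-table example above.
# The key names are assumptions, not the library's canonical schema.
batch_spec = {
    "data_asset_name": "events",
    "query": (
        "SELECT * FROM events "
        "WHERE ts >= '2012-02-07' AND ts < '2012-02-08'"
    ),
}
```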

property name(self)
property datasource_name(self)
property data_context_root_directory(self)
get_batch_data_and_metadata(self, batch_definition: BatchDefinition)
build_batch_spec(self, batch_definition: BatchDefinition)
abstract _refresh_data_references_cache(self)
abstract _get_data_reference_list(self, data_asset_name: Optional[str] = None)

List objects in the underlying data store to create a list of data_references. This method is used to refresh the cache.

abstract _get_data_reference_list_from_cache_by_data_asset_name(self, data_asset_name: str)

Fetch data_references corresponding to data_asset_name from the cache.

abstract get_data_reference_list_count(self)
abstract get_unmatched_data_references(self)
abstract get_available_data_asset_names(self)

Return the list of asset names known by this data connector.

Returns

A list of available names

abstract get_batch_definition_list_from_batch_request(self, batch_request: BatchRequest)
abstract _map_data_reference_to_batch_definition_list(self, data_reference: Any, data_asset_name: Optional[str] = None)
abstract _map_batch_definition_to_data_reference(self, batch_definition: BatchDefinition)
abstract _generate_batch_spec_parameters_from_batch_definition(self, batch_definition: BatchDefinition)
self_check(self, pretty_print=True, max_examples=3)
_self_check_fetch_batch(self, pretty_print: bool, example_data_reference, data_asset_name: str)
_validate_batch_request(self, batch_request: BatchRequest)
class great_expectations.datasource.LegacyDatasource(name, data_context=None, data_asset_type=None, batch_kwargs_generators=None, **kwargs)

A Datasource connects to a compute environment and one or more storage environments and produces batches of data that Great Expectations can validate in that compute environment.

Each Datasource provides Batches connected to a specific compute environment, such as a SQL database, a Spark cluster, or a local in-memory Pandas DataFrame.

Datasources use Batch Kwargs to specify instructions for how to access data from relevant sources such as an existing object from a DAG runner, a SQL database, S3 bucket, or local filesystem.

To bridge the gap between those worlds, Datasources interact closely with generators, which are aware of a source of data and can produce identifying information, called “batch_kwargs,” that datasources can use to get individual batches of data. They add flexibility in how to obtain data, such as time-based partitioning, downsampling, or other techniques appropriate for the datasource.

For example, a batch kwargs generator could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.

Opinionated DAG managers such as Airflow, dbt, Prefect, and Dagster can also act as datasources and/or batch kwargs generators for a more generic datasource.

When adding custom expectations by subclassing an existing DataAsset type, use the data_asset_type parameter to configure the datasource to load and return DataAssets of the custom type.

Feature Maturity

Datasource - S3 - How-to Guide
Support for connecting to Amazon Web Services S3 as an external datasource.
Maturity: Production
Details:
API Stability: medium
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: None
Documentation Completeness: Minimal/Spotty
Bug Risk: Low
Datasource - Filesystem - How-to Guide
Support for using a mounted filesystem as an external datasource.
Maturity: Production
Details:
API Stability: Medium
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: Partial
Documentation Completeness: Partial
Bug Risk: Low (Moderate for Windows users because of path issues)
Datasource - GCS - How-to Guide
Support for Google Cloud Storage as an external datasource
Maturity: Experimental
Details:
API Stability: Medium (supported via native ‘gs://’ syntax in Pandas and Pyspark; medium because we expect configuration to evolve)
Implementation Completeness: Medium (works via passthrough, not via CLI)
Unit Test Coverage: Minimal
Integration Infrastructure/Test Coverage: Minimal
Documentation Completeness: Minimal
Bug Risk: Moderate
Datasource - Azure Blob Storage - How-to Guide
Support for Microsoft Azure Blob Storage as an external datasource
Maturity: In Roadmap (Sub-Experimental - “Not Impossible”)
Details:
API Stability: N/A (Supported on Databricks Spark via ‘wasb://’ / ‘wasbs://’ url; requires local download first for Pandas)
Implementation Completeness: Minimal
Unit Test Coverage: N/A
Integration Infrastructure/Test Coverage: N/A
Documentation Completeness: Minimal
Bug Risk: Unknown
recognized_batch_parameters
classmethod from_configuration(cls, **kwargs)

Build a new datasource from a configuration dictionary.

Parameters

**kwargs – configuration key-value pairs

Returns

the newly-created datasource

Return type

datasource (Datasource)
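For example, a minimal sketch of the keyword arguments such a call might take. The key names mirror the constructor parameters documented in this module; the datasource name is a placeholder.

```python
# Hypothetical configuration for from_configuration(); the keys mirror
# the constructor parameters documented above, and the name is made up.
config = {
    "name": "my_pandas_datasource",
    "class_name": "PandasDatasource",
    "module_name": "great_expectations.datasource",
}

# With great_expectations installed, this would build the datasource:
# datasource = LegacyDatasource.from_configuration(**config)
```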

classmethod build_configuration(cls, class_name, module_name='great_expectations.datasource', data_asset_type=None, batch_kwargs_generators=None, **kwargs)

Build a full configuration object for a datasource, potentially including batch kwargs generators with defaults.

Parameters
  • class_name – The name of the class for which to build the config

  • module_name – The name of the module in which the datasource class is located

  • data_asset_type – A ClassConfig dictionary

  • batch_kwargs_generators – BatchKwargGenerators configuration dictionary

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

property name(self)

Property for datasource name

property config(self)
property data_context(self)

Property for attached DataContext

_build_generators(self)

Build batch kwargs generator objects from the datasource configuration.

Returns

None

add_batch_kwargs_generator(self, name, class_name, **kwargs)

Add a BatchKwargGenerator to the datasource.

Parameters
  • name (str) – the name of the new BatchKwargGenerator to add

  • class_name – class of the BatchKwargGenerator to add

  • kwargs – additional keyword arguments will be passed directly to the new BatchKwargGenerator’s constructor

Returns

BatchKwargGenerator (BatchKwargGenerator)

_build_batch_kwargs_generator(self, **kwargs)

Build a BatchKwargGenerator using the provided configuration and return the newly-built generator.

get_batch_kwargs_generator(self, name)

Get the (named) BatchKwargGenerator from the datasource.

Parameters

name (str) – name of BatchKwargGenerator (default value is ‘default’)

Returns

BatchKwargGenerator (BatchKwargGenerator)

list_batch_kwargs_generators(self)

List the currently-configured BatchKwargGenerators for this datasource.

Returns

each dictionary includes “name” and “type” keys

Return type

List(dict)

process_batch_parameters(self, limit=None, dataset_options=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters
  • limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

  • dataset_options (dict) – a set of kwargs that will be passed to the constructor of a dataset built using these batch_kwargs

Returns

Result will include both parameters passed via argument and configured parameters.

Return type

batch_kwargs
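A minimal sketch (not the library implementation) of the merge this method describes, assuming parameters passed as arguments take precedence over configured defaults:

```python
# Configured defaults (e.g. from the datasource configuration) merged
# with parameters passed at call time; later entries win on key clashes.
configured_parameters = {"limit": 1000}
passed_parameters = {"limit": 50, "dataset_options": {"caching": True}}

batch_kwargs = {**configured_parameters, **passed_parameters}
```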

abstract get_batch(self, batch_kwargs, batch_parameters=None)

Get a batch of data from the datasource.

Parameters
  • batch_kwargs – the BatchKwargs to use to construct the batch

  • batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.

Returns

Batch
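An illustrative pair of arguments for get_batch; the file path, asset name, and partition id are placeholders:

```python
# Placeholder batch_kwargs identifying how to fetch the data, and
# batch_parameters describing the batch for reference purposes.
batch_kwargs = {"path": "data/events_2012-02-07.csv"}
batch_parameters = {"data_asset_name": "events", "partition_id": "2012-02-07"}

# With a configured datasource this would return a Batch:
# batch = datasource.get_batch(batch_kwargs, batch_parameters)
```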

get_available_data_asset_names(self, batch_kwargs_generator_names=None)

Returns a dictionary of data_asset_names that the specified batch kwarg generator can provide. Note that some batch kwargs generators may not be capable of describing specific named data assets, and some (such as filesystem glob batch kwargs generators) require the user to configure data asset names.

Parameters

batch_kwargs_generator_names – the BatchKwargGenerator for which to get available data asset names.

Returns

{
  generator_name: {
    names: [ (data_asset_1, data_asset_1_type), (data_asset_2, data_asset_2_type) ... ]
  }
  ...
}

Return type

dictionary consisting of sets of generator assets available for the specified generators

build_batch_kwargs(self, batch_kwargs_generator, data_asset_name=None, partition_id=None, **kwargs)
class great_expectations.datasource.BaseDatasource(name: str, execution_engine=None, data_context_root_directory: Optional[str] = None)

A Datasource is the glue between an ExecutionEngine and a DataConnector.

recognized_batch_parameters :set
get_batch_from_batch_definition(self, batch_definition: BatchDefinition, batch_data: Any = None)

Note: this method should not be used when getting a Batch from a BatchRequest, since it does not capture BatchRequest metadata.

get_single_batch_from_batch_request(self, batch_request: BatchRequest)
get_batch_list_from_batch_request(self, batch_request: BatchRequest)

Processes batch_request and returns the (possibly empty) list of batch objects.

Parameters

batch_request – encapsulation of the request parameters necessary to identify the (possibly multiple) batches

Returns

a possibly empty list of batch objects; each batch object contains a dataset and associated metadata
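A sketch of the identifying fields a BatchRequest is typically built around; the concrete values below are placeholders for illustration.

```python
# The three identifying fields a BatchRequest carries; the values are
# placeholders for illustration.
batch_request_fields = {
    "datasource_name": "my_datasource",
    "data_connector_name": "my_data_connector",
    "data_asset_name": "events",
}

# With great_expectations installed:
# batch_list = datasource.get_batch_list_from_batch_request(
#     BatchRequest(**batch_request_fields)
# )
```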

_build_data_connector_from_config(self, name: str, config: Dict[str, Any])

Build a DataConnector using the provided configuration and return the newly-built DataConnector.

get_available_data_asset_names(self, data_connector_names: Optional[Union[list, str]] = None)

Returns a dictionary of data_asset_names that the specified data connector can provide. Note that some data_connectors may not be capable of describing specific named data assets, and some (such as inferred_asset_data_connector) require the user to configure data asset names.

Parameters

data_connector_names – the DataConnector for which to get available data asset names.

Returns

{
  data_connector_name: {
    names: [ (data_asset_1, data_asset_1_type), (data_asset_2, data_asset_2_type) ... ]
  }
  ...
}

Return type

dictionary consisting of sets of data assets available for the specified data connectors

get_available_batch_definitions(self, batch_request: BatchRequest)
self_check(self, pretty_print=True, max_examples=3)
_validate_batch_request(self, batch_request: BatchRequest)
property name(self)

Property for datasource name

property execution_engine(self)
property data_connectors(self)
property config(self)
class great_expectations.datasource.Datasource(name: str, execution_engine=None, data_connectors=None, data_context_root_directory: Optional[str] = None)

Bases: great_expectations.datasource.new_datasource.BaseDatasource

A Datasource is the glue between an ExecutionEngine and a DataConnector.

recognized_batch_parameters :set
_init_data_connectors(self, data_connector_configs: Dict[str, Dict[str, Any]])
class great_expectations.datasource.PandasDatasource(name='pandas', data_context=None, data_asset_type=None, batch_kwargs_generators=None, boto3_options=None, reader_method=None, reader_options=None, limit=None, **kwargs)

Bases: great_expectations.datasource.datasource.LegacyDatasource

The PandasDatasource produces PandasDataset objects and supports generators capable of interacting with the local filesystem (the default subdir_reader generator) and with existing in-memory dataframes.

recognized_batch_parameters
classmethod build_configuration(cls, data_asset_type=None, batch_kwargs_generators=None, boto3_options=None, reader_method=None, reader_options=None, limit=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • data_asset_type – A ClassConfig dictionary

  • batch_kwargs_generators – Generator configuration dictionary

  • boto3_options – Optional dictionary with key-value pairs to pass to boto3 during instantiation.

  • reader_method – Optional default reader_method for generated batches

  • reader_options – Optional default reader_options for generated batches

  • limit – Optional default limit for generated batches

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

process_batch_parameters(self, reader_method=None, reader_options=None, limit=None, dataset_options=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters
  • limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

  • dataset_options (dict) – a set of kwargs that will be passed to the constructor of a dataset built using these batch_kwargs

Returns

Result will include both parameters passed via argument and configured parameters.

Return type

batch_kwargs

get_batch(self, batch_kwargs, batch_parameters=None)

Get a batch of data from the datasource.

Parameters
  • batch_kwargs – the BatchKwargs to use to construct the batch

  • batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.

Returns

Batch

static guess_reader_method_from_path(path)
_infer_default_options(self, reader_fn: Callable, reader_options: dict)

Allows reader options to be customized based on file context before loading to a DataFrame

Parameters
  • reader_method (str) – pandas reader method

  • reader_options – Current options and defaults set to pass to the reader method

Returns

A copy of the reader options post-inference

Return type

dict

_get_reader_fn(self, reader_method=None, path=None)

Static helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.

Parameters
  • reader_method (str) – the name of the reader method to use, if available.

path (str) – the path to use to guess the reader method

Returns

ReaderMethod to use for the filepath
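A simplified sketch of the kind of extension-to-reader-method mapping this helper performs. The real implementation handles more formats (including compressed files), so treat this as illustrative only.

```python
import pathlib

# Minimal illustrative mapping from file extension to pandas reader
# method name; not the library's actual table.
_READER_METHODS = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".json": "read_json",
    ".xlsx": "read_excel",
}

def guess_reader_method(path: str) -> str:
    """Guess a pandas reader method name from a file path's extension."""
    suffix = pathlib.Path(path).suffix.lower()
    try:
        return _READER_METHODS[suffix]
    except KeyError:
        raise ValueError(f"Unable to determine reader method from path: {path}")
```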

class great_expectations.datasource.SimpleSqlalchemyDatasource(name: str, connection_string: str = None, url: str = None, credentials: dict = None, engine=None, introspection: dict = None, tables: dict = None)

Bases: great_expectations.datasource.new_datasource.BaseDatasource

A specialized Datasource for SQL backends

SimpleSqlalchemyDatasource is designed to minimize boilerplate configuration and new concepts.

_init_data_connectors(self, introspection_configs: dict, table_configs: dict)
class great_expectations.datasource.SparkDFDatasource(name='default', data_context=None, data_asset_type=None, batch_kwargs_generators=None, spark_config=None, **kwargs)

Bases: great_expectations.datasource.datasource.LegacyDatasource

The SparkDFDatasource produces SparkDFDatasets and supports generators capable of interacting with the local filesystem (the default subdir_reader batch kwargs generator) and Databricks notebooks.

Accepted Batch Kwargs:
  • PathBatchKwargs (“path” or “s3” keys)

  • InMemoryBatchKwargs (“dataset” key)

  • QueryBatchKwargs (“query” key)
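One placeholder example of each accepted flavor listed above:

```python
# Placeholder examples of the three accepted batch kwargs flavors.
path_batch_kwargs = {"path": "data/events.parquet"}     # PathBatchKwargs
in_memory_batch_kwargs = {"dataset": None}              # InMemoryBatchKwargs ("dataset" would be a Spark DataFrame)
query_batch_kwargs = {"query": "SELECT * FROM events"}  # QueryBatchKwargs
```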

Feature Maturity

Datasource - HDFS - How-to Guide
Use HDFS as an external datasource in conjunction with Spark.
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Unknown
Unit Test Coverage: Minimal (none)
Integration Infrastructure/Test Coverage: Minimal (none)
Documentation Completeness: Minimal (none)
Bug Risk: Unknown
recognized_batch_parameters
classmethod build_configuration(cls, data_asset_type=None, batch_kwargs_generators=None, spark_config=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • data_asset_type – A ClassConfig dictionary

  • batch_kwargs_generators – Generator configuration dictionary

  • spark_config – dictionary of key-value pairs to pass to the spark builder

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

process_batch_parameters(self, reader_method=None, reader_options=None, limit=None, dataset_options=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters
  • limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

  • dataset_options (dict) – a set of kwargs that will be passed to the constructor of a dataset built using these batch_kwargs

Returns

Result will include both parameters passed via argument and configured parameters.

Return type

batch_kwargs

get_batch(self, batch_kwargs, batch_parameters=None)

Class-private implementation of get_data_asset.

static guess_reader_method_from_path(path)
_get_reader_fn(self, reader, reader_method=None, path=None)

Static helper for providing reader_fn

Parameters
  • reader – the base spark reader to use; this should have had reader_options applied already

  • reader_method – the name of the reader_method to use, if specified

  • path (str) – the path to use to guess reader_method if it was not specified

Returns

ReaderMethod to use for the filepath

class great_expectations.datasource.SqlAlchemyDatasource(name='default', data_context=None, data_asset_type=None, credentials=None, batch_kwargs_generators=None, **kwargs)

Bases: great_expectations.datasource.LegacyDatasource

A SqlAlchemyDatasource will provide data_assets converting batch_kwargs using the following rules:
  • if the batch_kwargs include a table key, the datasource will provide a dataset object connected to that table

  • if the batch_kwargs include a query key, the datasource will create a temporary table using that query. The query can be parameterized according to the standard Python Template engine, which uses $parameter, with additional kwargs passed to the get_batch method.
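The $parameter substitution described above follows Python's standard string.Template engine, which can be demonstrated on its own (the query text is a placeholder):

```python
from string import Template

# The query text is a placeholder; $start and $end stand in for the
# kwargs that would be passed to get_batch.
query = Template("SELECT * FROM events WHERE ts >= '$start' AND ts < '$end'")
rendered = query.substitute(start="2012-02-07", end="2012-02-08")
```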

Feature Maturity

Datasource - PostgreSQL - How-to Guide
Support for using the open source PostgreSQL database as an external datasource and execution engine.
Maturity: Production
Details:
API Stability: High
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: Complete
Documentation Completeness: Medium (does not have a specific how-to, but easy to use overall)
Bug Risk: Low
Expectation Completeness: Moderate
Datasource - BigQuery - How-to Guide
Use Google BigQuery as an execution engine and external datasource to validate data.
Maturity: Beta
Details:
API Stability: Unstable (table generator inability to work with triple-dotted, temp table usability, init flow calls setup “other”)
Implementation Completeness: Moderate
Unit Test Coverage: Partial (no test coverage for temp table creation)
Integration Infrastructure/Test Coverage: Minimal
Documentation Completeness: Partial (how-to does not cover all cases)
Bug Risk: High (we know of several bugs, including inability to list tables, SQLAlchemy URL incomplete)
Expectation Completeness: Moderate
Datasource - Amazon Redshift - How-to Guide
Use Amazon Redshift as an execution engine and external datasource to validate data.
Maturity: Beta
Details:
API Stability: Moderate (potential metadata/introspection method special handling for performance)
Implementation Completeness: Complete
Unit Test Coverage: Minimal
Integration Infrastructure/Test Coverage: Minimal (none automated)
Documentation Completeness: Moderate
Bug Risk: Moderate
Expectation Completeness: Moderate
Datasource - Snowflake - How-to Guide
Use Snowflake Computing as an execution engine and external datasource to validate data.
Maturity: Production
Details:
API Stability: High
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: Minimal (manual only)
Documentation Completeness: Complete
Bug Risk: Low
Expectation Completeness: Complete
Datasource - Microsoft SQL Server - How-to Guide
Use Microsoft SQL Server as an execution engine and external datasource to validate data.
Maturity: Experimental
Details:
API Stability: High
Implementation Completeness: Moderate
Unit Test Coverage: Minimal (none)
Integration Infrastructure/Test Coverage: Minimal (none)
Documentation Completeness: Minimal
Bug Risk: High
Expectation Completeness: Low (some required queries do not generate properly, such as related to nullity)
Datasource - MySQL - How-to Guide
Use MySQL as an execution engine and external datasource to validate data.
Maturity: Experimental
Details:
API Stability: Low (no consideration for temp tables)
Implementation Completeness: Low (no consideration for temp tables)
Unit Test Coverage: Minimal (none)
Integration Infrastructure/Test Coverage: Minimal (none)
Documentation Completeness: Minimal (none)
Bug Risk: Unknown
Expectation Completeness: Unknown
Datasource - MariaDB - How-to Guide
Use MariaDB as an execution engine and external datasource to validate data.
Maturity: Experimental
Details:
API Stability: Low (no consideration for temp tables)
Implementation Completeness: Low (no consideration for temp tables)
Unit Test Coverage: Minimal (none)
Integration Infrastructure/Test Coverage: Minimal (none)
Documentation Completeness: Minimal (none)
Bug Risk: Unknown
Expectation Completeness: Unknown
recognized_batch_parameters
classmethod build_configuration(cls, data_asset_type=None, batch_kwargs_generators=None, **kwargs)

Build a full configuration object for a datasource, potentially including generators with defaults.

Parameters
  • data_asset_type – A ClassConfig dictionary

  • batch_kwargs_generators – Generator configuration dictionary

  • **kwargs – Additional kwargs to be part of the datasource constructor’s initialization

Returns

A complete datasource configuration.

_get_sqlalchemy_connection_options(self, **kwargs)
_get_sqlalchemy_key_pair_auth_url(self, drivername, credentials)
get_batch(self, batch_kwargs, batch_parameters=None)

Get a batch of data from the datasource.

Parameters
  • batch_kwargs – the BatchKwargs to use to construct the batch

  • batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.

Returns

Batch

process_batch_parameters(self, query_parameters=None, limit=None, dataset_options=None)

Use datasource-specific configuration to translate any batch parameters into batch kwargs at the datasource level.

Parameters
  • limit (int) – a parameter all datasources must accept to allow limiting a batch to a smaller number of rows.

  • dataset_options (dict) – a set of kwargs that will be passed to the constructor of a dataset built using these batch_kwargs

Returns

Result will include both parameters passed via argument and configured parameters.

Return type

batch_kwargs