great_expectations.execution_engine

Package Contents

Classes

ExecutionEngine(name=None, caching=True, batch_spec_defaults=None, batch_data_dict=None, validator=None)

Helper class that provides a standard way to create an ABC using inheritance.

PandasExecutionEngine(*args, **kwargs)

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

SparkDFExecutionEngine(*args, persist=True, spark_config=None, force_reuse_spark_context=False, **kwargs)

This class holds an attribute spark_df which is a spark.sql.DataFrame.

SqlAlchemyExecutionEngine(name=None, credentials=None, data_context=None, engine=None, connection_string=None, url=None, batch_data_dict=None, create_temp_table=True, **kwargs)

Helper class that provides a standard way to create an ABC using inheritance.

class great_expectations.execution_engine.ExecutionEngine(name=None, caching=True, batch_spec_defaults=None, batch_data_dict=None, validator=None)

Bases: abc.ABC

Helper class that provides a standard way to create an ABC using inheritance.

recognized_batch_spec_defaults
configure_validator(self, validator)

Optionally configure the validator as appropriate for the execution engine.

property active_batch_data_id(self)

The batch id for the default batch data.

When an execution engine is asked to process a compute domain that does not include a specific batch_id, then the data associated with the active_batch_data_id will be used as the default.

property active_batch_data(self)

The data from the currently-active batch.

property loaded_batch_data_dict(self)

The current dictionary of batches.

property loaded_batch_data_ids(self)
property config(self)
property dialect(self)
get_batch_data(self, batch_spec: BatchSpec)

Interprets batch_data and returns the appropriate data.

This method is primarily useful for utility cases (e.g. testing) where data is being fetched without a DataConnector and metadata like batch_markers is unwanted.

Note: this method is currently a thin wrapper for get_batch_data_and_markers. It simply suppresses the batch_markers.
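For example, a minimal sketch of fetching data this way, assuming a Pandas engine and an in-memory RuntimeDataBatchSpec (import paths follow the 0.13.x package layout and may differ in other releases):

    import pandas as pd

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()
    # No DataConnector involved: the batch spec carries the data directly,
    # and batch_markers are suppressed by get_batch_data.
    batch_data = engine.get_batch_data(
        batch_spec=RuntimeDataBatchSpec(batch_data=pd.DataFrame({"a": [1, 2, 3]}))
    )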

abstract get_batch_data_and_markers(self, batch_spec)
load_batch_data(self, batch_id: str, batch_data: Any)

Loads the specified batch_data into the execution engine

_load_batch_data_from_dict(self, batch_data_dict)

Loads all data in batch_data_dict into the execution engine via load_batch_data

resolve_metrics(self, metrics_to_resolve: Iterable[MetricConfiguration], metrics: Dict[Tuple, Any] = None, runtime_configuration: dict = None)

resolve_metrics is the main entrypoint for an execution engine. The execution engine will compute the value of the provided metrics.

Parameters
  • metrics_to_resolve – the metrics to evaluate

  • metrics – already-computed metrics currently available to the engine

  • runtime_configuration – runtime configuration information

Returns

a dictionary with the values for the metrics that have just been resolved.

Return type

resolved_metrics (Dict)
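A minimal sketch of resolving a single metric end-to-end, assuming the 0.13.x layout (where MetricConfiguration lives in great_expectations.validator.validation_graph); note that some metrics declare dependencies that must be resolved first and passed in via the metrics argument, which this simple aggregate omits:

    import pandas as pd

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import PandasExecutionEngine
    from great_expectations.validator.validation_graph import MetricConfiguration

    engine = PandasExecutionEngine()
    engine.load_batch_data(
        batch_id="my_batch",
        batch_data=engine.get_batch_data(
            RuntimeDataBatchSpec(batch_data=pd.DataFrame({"a": [1, 2, 3]}))
        ),
    )

    desired_metric = MetricConfiguration(
        metric_name="column.max",
        metric_domain_kwargs={"column": "a"},
        metric_value_kwargs=None,
    )
    # Resolved values are keyed by the metric's id; here the value should be 3.
    results = engine.resolve_metrics(metrics_to_resolve=(desired_metric,))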

abstract resolve_metric_bundle(self, metric_fn_bundle)

Resolve a bundle of metrics with the same compute domain as part of a single trip to the compute engine.

abstract get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes])

get_compute_domain computes the optimal domain_kwargs for computing metrics based on the given domain_kwargs and specific engine semantics.

Returns

  1. data corresponding to the compute domain;

  2. a modified copy of domain_kwargs describing the domain of the data returned in (1);

  3. a dictionary describing the access instructions for data elements included in the compute domain (e.g. a specific column name).

In general, the union of the compute_domain_kwargs and accessor_domain_kwargs will be the same as the domain_kwargs provided to this method.

Return type

A tuple consisting of three elements
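Continuing the sketch above, unpacking the three-element tuple for a single-column domain looks roughly like this (domain_type accepts either a MetricDomainTypes member or its string form):

    data, compute_domain_kwargs, accessor_domain_kwargs = engine.get_compute_domain(
        domain_kwargs={"column": "a"}, domain_type="column"
    )
    # For a Pandas engine, `data` is the batch DataFrame and
    # accessor_domain_kwargs == {"column": "a"}.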

add_column_row_condition(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)

EXPERIMENTAL

Add a row condition for handling null filter.

Parameters
  • domain_kwargs – the domain kwargs to use as the base and to which to add the condition

  • column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs

  • filter_null – if true, add a filter for null values

  • filter_nan – if true, add a filter for nan values
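A short sketch of this experimental call, reusing the engine from the earlier examples; the method returns a copy of domain_kwargs carrying the added condition:

    filtered_domain_kwargs = engine.add_column_row_condition(
        domain_kwargs={"column": "a"}, filter_null=True
    )
    # Building a compute domain from filtered_domain_kwargs now excludes
    # rows where column "a" is null.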

class great_expectations.execution_engine.PandasExecutionEngine(*args, **kwargs)

Bases: great_expectations.execution_engine.ExecutionEngine

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

For the full API reference, please see Dataset

Notes

  1. Samples and Subsets of PandasDataset have ALL the expectations of the original data frame unless the user specifies discard_subset_failing_expectations=True on the original data frame.

  2. Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).

Feature Maturity

Validation Engine - Pandas - How-to Guide
Use Pandas DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low
Expectation Completeness: Complete
recognized_batch_spec_defaults
configure_validator(self, validator)

Optionally configure the validator as appropriate for the execution engine.

load_batch_data(self, batch_id: str, batch_data: Any)

Loads the specified batch_data into the execution engine

get_batch_data_and_markers(self, batch_spec: BatchSpec)
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
property dataframe(self)

Tests whether or not a Batch has been loaded. If no batch is loaded, raises a ValueError.

_get_reader_fn(self, reader_method=None, path=None)

Static helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.

Parameters
  • reader_method (str) – the name of the reader method to use, if available.

  • path (str) – the path used to guess the correct reader_method

Returns

ReaderMethod to use for the filepath

static guess_reader_method_from_path(path)

Helper method for deciding which reader to use to read in a certain path.

Parameters

path (str) – the path to use to guess the reader method

Returns

ReaderMethod to use for the filepath
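An illustrative call (the exact return shape, a {'reader_method': ...} dictionary versus a bare method name, has varied across releases, so treat this as a sketch rather than a contract):

    from great_expectations.execution_engine import PandasExecutionEngine

    # A .csv extension should map to the pandas read_csv reader.
    reader_method = PandasExecutionEngine.guess_reader_method_from_path("data/events.csv")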

get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)

Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns the data in the format of a Pandas DataFrame. If the domain is a single column, the column name is added to the accessor domain kwargs for later access.

Parameters
  • domain_kwargs (dict) – a dictionary consisting of the domain kwargs specifying which data to obtain

  • domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and are simply transferred with their associated values into accessor_domain_kwargs.

Returns

  • a DataFrame (the data on which to compute)

  • a dictionary of compute_domain_kwargs, describing the DataFrame

  • a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain

Return type

A tuple including
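As a sketch, a row condition with condition_parser "pandas" is applied via DataFrame.query before the domain is returned (engine and batch as in the earlier examples):

    data, compute_kwargs, accessor_kwargs = engine.get_compute_domain(
        domain_kwargs={
            "column": "a",
            "row_condition": "a > 1",
            "condition_parser": "pandas",
        },
        domain_type="column",
    )
    # `data` now contains only the rows where a > 1.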

static _split_on_whole_table(df)
static _split_on_column_value(df, column_name: str, batch_identifiers: dict)
static _split_on_converted_datetime(df, column_name: str, batch_identifiers: dict, date_format_string: str = '%Y-%m-%d')

Convert the values in the named column to the given date_format, and split on that

static _split_on_divided_integer(df, column_name: str, divisor: int, batch_identifiers: dict)

Divide the values in the named column by divisor, and split on that

static _split_on_mod_integer(df, column_name: str, mod: int, batch_identifiers: dict)

Take the mod of the values in the named column, and split on that

static _split_on_multi_column_values(df, column_names: List[str], batch_identifiers: dict)

Split on the joint values in the named columns

static _split_on_hashed_column(df, column_name: str, hash_digits: int, batch_identifiers: dict, hash_function_name: str = 'md5')

Split on the hashed value of the named column

static _sample_using_random(df, p: float = 0.1)

Take a random sample of rows, retaining proportion p

Note: the Random function behaves differently on different dialects of SQL

static _sample_using_mod(df, column_name: str, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

static _sample_using_a_list(df, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')

Hash the values in the named column, and only keep rows matching the given hash_value
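These private static helpers are normally selected through splitter_method / splitter_kwargs and sampling_method / sampling_kwargs entries in a batch spec, but being static they can also be exercised directly, e.g.:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    df = pd.DataFrame({"id": range(20), "group": ["x", "y"] * 10})

    # Keep only the rows whose "group" value matches the batch identifier.
    subset = PandasExecutionEngine._split_on_column_value(
        df, column_name="group", batch_identifiers={"group": "x"}
    )
    # Keep only the rows where id % 5 == 0.
    sample = PandasExecutionEngine._sample_using_mod(df, column_name="id", mod=5, value=0)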

class great_expectations.execution_engine.SparkDFExecutionEngine(*args, persist=True, spark_config=None, force_reuse_spark_context=False, **kwargs)

Bases: great_expectations.execution_engine.ExecutionEngine

This class holds an attribute spark_df which is a spark.sql.DataFrame.
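A minimal sketch of loading a Spark DataFrame batch, assuming a local SparkSession and the 0.13.x RuntimeDataBatchSpec (by default, persist=True caches loaded batch data):

    from pyspark.sql import SparkSession

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import SparkDFExecutionEngine

    spark = SparkSession.builder.getOrCreate()
    spark_df = spark.createDataFrame([(1,), (2,), (3,)], ["a"])

    engine = SparkDFExecutionEngine(force_reuse_spark_context=True)
    batch_data, batch_markers = engine.get_batch_data_and_markers(
        batch_spec=RuntimeDataBatchSpec(batch_data=spark_df)
    )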

Feature Maturity

Validation Engine - pyspark - Self-Managed - How-to Guide
Use Spark DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Moderate
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Databricks - How-to Guide
Use Spark DataFrame in a Databricks cluster to validate data
Maturity: Beta
Details:
API Stability: Stable
Implementation Completeness: Low (dbfs-specific handling)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Minimal (we’ve tested a bit, know others have used it)
Documentation Completeness: Moderate (need docs on managing project configuration via dbfs/etc.)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - EMR - Spark - How-to Guide
Use Spark DataFrame in an EMR cluster to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Low (need to provide guidance on “known good” paths, and we know there are many “knobs” to tune that we have not explored/tested)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Spark - Other - How-to Guide
Use Spark DataFrame to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Other (we haven't tested the possibility; known Glue deployment)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
recognized_batch_definition_keys
recognized_batch_spec_defaults
property dataframe(self)

If a batch has been loaded, returns a Spark DataFrame containing the data within the loaded batch

load_batch_data(self, batch_id: str, batch_data: Any)

Loads the specified batch_data into the execution engine

get_batch_data_and_markers(self, batch_spec: BatchSpec)
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
static guess_reader_method_from_path(path)

Based on a given filepath, decides a reader method. Currently supports tsv, csv, and parquet. If none of these file extensions are used, raises a BatchKwargsError stating that it is unable to determine the reader method from the path.

Parameters
  • path (str) – a given file path

Returns

A dictionary entry of format {'reader_method': reader_method}

_get_reader_fn(self, reader, reader_method=None, path=None)

Static helper for providing reader_fn

Parameters
  • reader – the base spark reader to use; this should have had reader_options applied already

  • reader_method – the name of the reader_method to use, if specified

  • path (str) – the path to use to guess reader_method if it was not specified

Returns

ReaderMethod to use for the filepath

get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)

Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns the data in the format of a Spark DataFrame.

Parameters
  • domain_kwargs (dict) – a dictionary consisting of the domain kwargs specifying which data to obtain

  • domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and are simply transferred with their associated values into accessor_domain_kwargs.

Returns

  • a DataFrame (the data on which to compute)

  • a dictionary of compute_domain_kwargs, describing the DataFrame

  • a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain

Return type

A tuple including

add_column_row_condition(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)

EXPERIMENTAL

Add a row condition for handling null filter.

Parameters
  • domain_kwargs – the domain kwargs to use as the base and to which to add the condition

  • column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs

  • filter_null – if true, add a filter for null values

  • filter_nan – if true, add a filter for nan values

resolve_metric_bundle(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Callable, dict]])

For each metric name in the given metric_fn_bundle, finds the domain of the metric and calculates it using a metric function from the given provider class.

Parameters
  • metric_fn_bundle – a batch containing MetricEdgeKeys and their corresponding functions

  • metrics (dict) – a dictionary containing metrics and corresponding parameters

Returns

A dictionary of the collected metrics over their respective domains

head(self, n=5)

Returns the first n rows of the dataframe (default: 5)

static _split_on_whole_table(df)
static _split_on_column_value(df, column_name: str, batch_identifiers: dict)
static _split_on_converted_datetime(df, column_name: str, batch_identifiers: dict, date_format_string: str = 'yyyy-MM-dd')
static _split_on_divided_integer(df, column_name: str, divisor: int, batch_identifiers: dict)

Divide the values in the named column by divisor, and split on that

static _split_on_mod_integer(df, column_name: str, mod: int, batch_identifiers: dict)

Take the mod of the values in the named column, and split on that

static _split_on_multi_column_values(df, column_names: list, batch_identifiers: dict)

Split on the joint values in the named columns

static _split_on_hashed_column(df, column_name: str, hash_digits: int, batch_identifiers: dict, hash_function_name: str = 'sha256')

Split on the hashed value of the named column

static _sample_using_random(df, p: float = 0.1, seed: int = 1)

Take a random sample of rows, retaining proportion p

static _sample_using_mod(df, column_name: str, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

static _sample_using_a_list(df, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')

class great_expectations.execution_engine.SqlAlchemyExecutionEngine(name=None, credentials=None, data_context=None, engine=None, connection_string=None, url=None, batch_data_dict=None, create_temp_table=True, **kwargs)

Bases: great_expectations.execution_engine.ExecutionEngine

Helper class that provides a standard way to create an ABC using inheritance.
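Per the signature above, the engine can be built from a connection string, a URL, credentials, or an existing SQLAlchemy engine; an in-memory SQLite database keeps the sketch self-contained:

    import sqlalchemy as sa

    from great_expectations.execution_engine import SqlAlchemyExecutionEngine

    engine = SqlAlchemyExecutionEngine(connection_string="sqlite://")
    # ...or hand over an engine you have already configured:
    engine = SqlAlchemyExecutionEngine(engine=sa.create_engine("sqlite://"))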

property credentials(self)
property connection_string(self)
property url(self)
_build_engine(self, credentials, **kwargs)

Using a set of given credentials, constructs an ExecutionEngine, connecting to a database using a URL or a private key path.

_get_sqlalchemy_key_pair_auth_url(self, drivername: str, credentials: dict)

Utilizing a private key path and a passphrase in a given credentials dictionary, attempts to encode the provided values into a private key. If the passphrase is incorrect, this will fail and an exception is raised.

Parameters
  • drivername (str) –

  • credentials (dict) –

Returns

a tuple consisting of a url with the serialized key-pair authentication, and a dictionary of engine kwargs.
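A hypothetical credentials dictionary for key-pair authentication (the key names below follow the private-key-path/passphrase description above, e.g. for Snowflake, but the exact keys may differ by release and dialect):

    credentials = {
        "drivername": "snowflake",
        "username": "my_user",
        "host": "my_account",
        "database": "my_db",
        "private_key_path": "/path/to/rsa_key.p8",  # path to the private key file
        "private_key_passphrase": "my_passphrase",  # used to decrypt the key
    }
    engine = SqlAlchemyExecutionEngine(credentials=credentials)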

get_compute_domain(self, domain_kwargs: Dict, domain_type: Union[str, MetricDomainTypes], accessor_keys: Optional[Iterable[str]] = None)

Uses a given batch dictionary and domain kwargs to obtain a SqlAlchemy column object.

Parameters
  • domain_kwargs (dict) – a dictionary consisting of the domain kwargs specifying which data to obtain

  • domain_type (str or MetricDomainTypes) – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and are simply transferred with their associated values into accessor_domain_kwargs.

Returns

SqlAlchemy column

resolve_metric_bundle(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Any, dict, dict]])

For every metric in a set of Metrics to resolve, obtains necessary metric keyword arguments and builds bundles of the metrics into one large query dictionary so that they are all executed simultaneously. Will fail if bundling the metrics together is not possible.

Parameters
  • metric_fn_bundle (Iterable[Tuple[MetricConfiguration, Callable, dict]]) – an iterable of tuples, each containing a MetricConfiguration (the metric's unique identifier), its metric provider function (the function that actually executes the metric), and the arguments to pass to the metric provider function

  • metrics (Dict[Tuple, Any]) – a dictionary of metrics defined in the registry and their corresponding arguments

Returns

A dictionary of metric names and their corresponding now-queried values.

_split_on_whole_table(self, table_name: str, batch_identifiers: dict)

‘Split’ by returning the whole table

_split_on_column_value(self, table_name: str, column_name: str, batch_identifiers: dict)

Split using the values in the named column

_split_on_converted_datetime(self, table_name: str, column_name: str, batch_identifiers: dict, date_format_string: str = '%Y-%m-%d')

Convert the values in the named column to the given date_format, and split on that

_split_on_divided_integer(self, table_name: str, column_name: str, divisor: int, batch_identifiers: dict)

Divide the values in the named column by divisor, and split on that

_split_on_mod_integer(self, table_name: str, column_name: str, mod: int, batch_identifiers: dict)

Take the mod of the values in the named column, and split on that

_split_on_multi_column_values(self, table_name: str, column_names: List[str], batch_identifiers: dict)

Split on the joint values in the named columns

_split_on_hashed_column(self, table_name: str, column_name: str, hash_digits: int, batch_identifiers: dict)

Split on the hashed value of the named column

_sample_using_random(self, p: float = 0.1)

Take a random sample of rows, retaining proportion p

Note: the Random function behaves differently on different dialects of SQL

_sample_using_mod(self, column_name, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

_sample_using_a_list(self, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches

_sample_using_md5(self, column_name: str, hash_digits: int = 1, hash_value: str = 'f')

Hash the values in the named column, and only keep rows matching the given hash_value

_build_selectable_from_batch_spec(self, batch_spec)
get_batch_data_and_markers(self, batch_spec: BatchSpec)