great_expectations.execution_engine

Package Contents

Classes

ExecutionEngine(name=None, caching=True, batch_spec_defaults=None, batch_data_dict=None, validator=None)

PandasExecutionEngine(*args, **kwargs)

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

SparkDFExecutionEngine(*args, **kwargs)

This class holds an attribute spark_df which is a spark.sql.DataFrame.

SqlAlchemyExecutionEngine(name=None, credentials=None, data_context=None, engine=None, connection_string=None, url=None, batch_data_dict=None, **kwargs)

class great_expectations.execution_engine.ExecutionEngine(name=None, caching=True, batch_spec_defaults=None, batch_data_dict=None, validator=None)
recognized_batch_spec_defaults
configure_validator(self, validator)

Optionally configure the validator as appropriate for the execution engine.

property active_batch_data_id(self)

The batch id for the default batch data.

When an execution engine is asked to process a compute domain that does not include a specific batch_id, then the data associated with the active_batch_data_id will be used as the default.

property active_batch_data(self)

The data from the currently-active batch.

property loaded_batch_data_dict(self)

The current dictionary of batches.

property config(self)
get_batch_data(self, batch_spec: BatchSpec)

Interprets batch_data and returns the appropriate data.

This method is primarily useful for utility cases (e.g. testing) where data is being fetched without a DataConnector and metadata like batch_markers is unwanted

Note: this method is currently a thin wrapper for get_batch_data_and_markers. It simply suppresses the batch_markers.

load_batch_data(self, batch_id: str, batch_data: Any)

Loads the specified batch_data into the execution engine

_load_batch_data_from_dict(self, batch_data_dict)

Loads all data in batch_data_dict via load_batch_data

_get_typed_batch_data(self, batch_data)
resolve_metrics(self, metrics_to_resolve: Iterable[MetricConfiguration], metrics: Dict[Tuple, Any] = None, runtime_configuration: dict = None)

resolve_metrics is the main entrypoint for an execution engine. The execution engine will compute the value of the provided metrics.

Parameters
  • metrics_to_resolve – the metrics to evaluate

  • metrics – already-computed metrics currently available to the engine

  • runtime_configuration – runtime configuration information

Returns

a dictionary with the values for the metrics that have just been resolved.

Return type

resolved_metrics (Dict)
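A minimal sketch of resolving a single metric directly against a PandasExecutionEngine. The import path for MetricConfiguration and the assumption that "table.row_count" resolves without pre-computed dependency metrics reflect the 0.13-era API documented here and may differ in other releases:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine
    from great_expectations.validator.validation_graph import MetricConfiguration

    engine = PandasExecutionEngine()
    # load_batch_data is assumed to accept a raw pandas DataFrame and wrap it
    # in the engine's typed batch data internally.
    engine.load_batch_data("my_batch", pd.DataFrame({"a": [1, 2, 3]}))

    row_count = MetricConfiguration(
        metric_name="table.row_count",
        metric_domain_kwargs={"batch_id": "my_batch"},
        metric_value_kwargs=None,
    )
    resolved = engine.resolve_metrics(metrics_to_resolve=[row_count])
    print(resolved[row_count.id])  # expected: 3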

abstract resolve_metric_bundle(self, metric_fn_bundle)

Resolve a bundle of metrics with the same compute domain as part of a single trip to the compute engine.

abstract get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'])

get_compute_domain computes the optimal domain_kwargs for computing metrics based on the given domain_kwargs and specific engine semantics.

Returns

  1. data corresponding to the compute domain;

  2. a modified copy of domain_kwargs describing the domain of the data returned in (1);

  3. a dictionary describing the access instructions for data elements included in the compute domain

    (e.g. specific column name).

In general, the union of the compute_domain_kwargs and accessor_domain_kwargs will be the same as the domain_kwargs provided to this method.

Return type

A tuple consisting of three elements

add_column_row_condition(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)

EXPERIMENTAL

Add a row condition for handling null filter.

Parameters
  • domain_kwargs – the domain kwargs to use as the base and to which to add the condition

  • column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs

  • filter_null – if true, add a filter for null values

  • filter_nan – if true, add a filter for nan values
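A minimal sketch of the experimental null-filter helper, assuming (as the description above suggests) that it returns an updated copy of the domain kwargs; the resulting kwargs would then be supplied wherever domain kwargs are consumed, for example get_compute_domain:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()
    engine.load_batch_data("my_batch", pd.DataFrame({"a": [1.0, None, 3.0]}))

    domain_kwargs = {"column": "a", "batch_id": "my_batch"}
    # Assumed to return a copy of domain_kwargs extended with a row_condition /
    # condition_parser entry that filters out null values in column "a".
    filtered_kwargs = engine.add_column_row_condition(domain_kwargs, column_name="a")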

class great_expectations.execution_engine.PandasExecutionEngine(*args, **kwargs)

Bases: great_expectations.execution_engine.execution_engine.ExecutionEngine

PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.

For the full API reference, please see Dataset

Notes

  1. Samples and Subsets of PandasDataset have ALL the expectations of the original data frame unless the user specifies the discard_subset_failing_expectations = True property on the original data frame.

  2. Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).

Feature Maturity

Validation Engine - Pandas - How-to Guide
Use Pandas DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low
Expectation Completeness: Complete
recognized_batch_spec_defaults
configure_validator(self, validator)

Optionally configure the validator as appropriate for the execution engine.

get_batch_data_and_markers(self, batch_spec: BatchSpec)
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
_get_typed_batch_data(self, batch_data)
property dataframe(self)

Tests whether or not a batch has been loaded. If no batch has been loaded, raises a ValueError.
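A small sketch of the behaviour described above, assuming the property returns the data of the active batch once one has been loaded:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()
    try:
        engine.dataframe  # no batch loaded yet, so a ValueError is expected
    except ValueError:
        pass

    engine.load_batch_data("my_batch", pd.DataFrame({"a": [1, 2, 3]}))
    df = engine.dataframe  # data for the currently active batch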

_get_reader_fn(self, reader_method=None, path=None)

Static helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.

Parameters
  • reader_method (str) – the name of the reader method to use, if available.

  • path (str) – the path used to guess the reader method

Returns

ReaderMethod to use for the filepath

static guess_reader_method_from_path(path)

Helper method for deciding which reader to use to read in a certain path.

Parameters

path (str) – the path to use to guess the reader method

Returns

ReaderMethod to use for the filepath
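For example (a sketch: the helper inspects only the file extension, so the file need not exist, and the exact return shape, a reader-method name or a small dict wrapping it, may vary by release):

    from great_expectations.execution_engine import PandasExecutionEngine

    # "data/events.csv" is a hypothetical path used only for illustration.
    reader_method = PandasExecutionEngine.guess_reader_method_from_path("data/events.csv")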

get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = [])

Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns in the format of a Pandas DataFrame. If the domain is a single column, this is added to ‘accessor domain kwargs’ and used for later access

Parameters
  • domain_kwargs (dict) – the domain kwargs specifying which data to obtain

  • domain_type (str or "MetricDomainTypes") – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and are simply transferred with their associated values into accessor_domain_kwargs.

Returns

  • a DataFrame (the data on which to compute)

  • a dictionary of compute_domain_kwargs, describing the DataFrame

  • a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain

Return type

A tuple including
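A minimal sketch of the three-element return value for a single-column domain, assuming (per the description above) that the string "column" is accepted in place of the MetricDomainTypes enum:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()
    engine.load_batch_data("my_batch", pd.DataFrame({"a": [1, 2, 3]}))

    data, compute_kwargs, accessor_kwargs = engine.get_compute_domain(
        domain_kwargs={"column": "a", "batch_id": "my_batch"},
        domain_type="column",
    )
    # `data` is the DataFrame on which to compute; the column name is expected
    # to appear in accessor_kwargs (e.g. {"column": "a"}) rather than in
    # compute_kwargs, per the description above.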

static _split_on_whole_table(df)
static _split_on_column_value(df, column_name: str, partition_definition: dict)
static _split_on_converted_datetime(df, column_name: str, partition_definition: dict, date_format_string: str = '%Y-%m-%d')

Convert the values in the named column to the given date_format, and split on that

static _split_on_divided_integer(df, column_name: str, divisor: int, partition_definition: dict)

Divide the values in the named column by divisor, and split on that

static _split_on_mod_integer(df, column_name: str, mod: int, partition_definition: dict)

Take the mod of the values in the named column, and split on that

static _split_on_multi_column_values(df, column_names: List[str], partition_definition: dict)

Split on the joint values in the named columns

static _split_on_hashed_column(df, column_name: str, hash_digits: int, partition_definition: dict, hash_function_name: str = 'md5')

Split on the hashed value of the named column

static _sample_using_random(df, p: float = 0.1)

Take a random sample of rows, retaining proportion p

Note: the Random function behaves differently on different dialects of SQL

static _sample_using_mod(df, column_name: str, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

static _sample_using_a_list(df, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')

Hash the values in the named column, and split on that
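These splitters and samplers are private helpers that are normally selected through a batch spec rather than called directly; the snippet below calls two of them directly, purely as an illustrative sketch of their assumed behaviour:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

    # Assumed to keep only the rows whose "group" value matches the
    # partition_definition entry.
    subset = PandasExecutionEngine._split_on_column_value(
        df, column_name="group", partition_definition={"group": "a"}
    )

    # Assumed to keep only the rows where value % 2 == 1.
    sampled = PandasExecutionEngine._sample_using_mod(
        df, column_name="value", mod=2, value=1
    )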

class great_expectations.execution_engine.SparkDFExecutionEngine(*args, **kwargs)

Bases: great_expectations.execution_engine.execution_engine.ExecutionEngine

This class holds an attribute spark_df which is a spark.sql.DataFrame.

Feature Maturity

Validation Engine - pyspark - Self-Managed - How-to Guide
Use Spark DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Moderate
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Databricks - How-to Guide
Use Spark DataFrame in a Databricks cluster to validate data
Maturity: Beta
Details:
API Stability: Stable
Implementation Completeness: Low (dbfs-specific handling)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Minimal (we’ve tested a bit, know others have used it)
Documentation Completeness: Moderate (need docs on managing project configuration via dbfs/etc.)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - EMR - Spark - How-to Guide
Use Spark DataFrame in an EMR cluster to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Low (need to provide guidance on “known good” paths, and we know there are many “knobs” to tune that we have not explored/tested)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Spark - Other - How-to Guide
Use Spark DataFrame to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Other (we haven’t tested possibility, known glue deployment)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
recognized_batch_definition_keys
recognized_batch_spec_defaults
property dataframe(self)

If a batch has been loaded, returns a Spark DataFrame containing the data within the loaded batch
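A hedged sketch of loading a Spark DataFrame into the engine and reading it back. It assumes a local pyspark installation; the no-argument constructor and the acceptance of a raw Spark DataFrame by load_batch_data are assumptions for this release:

    from pyspark.sql import SparkSession

    from great_expectations.execution_engine import SparkDFExecutionEngine

    spark = SparkSession.builder.getOrCreate()
    engine = SparkDFExecutionEngine()
    engine.load_batch_data("my_batch", spark.createDataFrame([(1,), (2,)], ["a"]))

    spark_df = engine.dataframe  # the Spark DataFrame for the loaded batch
    engine.head(n=2)             # first rows of that DataFrame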

get_batch_data_and_markers(self, batch_spec: BatchSpec)
_apply_splitting_and_sampling_methods(self, batch_spec, batch_data)
static guess_reader_method_from_path(path)

Based on a given filepath, decides a reader method. Currently supports tsv, csv, and parquet. If none of these file extensions is used, raises a BatchKwargsError stating that it is unable to determine a reader method for the path.

Parameters

path (str) – a given file path

Returns

A dictionary entry of the format {'reader_method': reader_method}

_get_reader_fn(self, reader, reader_method=None, path=None)

Static helper for providing reader_fn

Parameters
  • reader – the base spark reader to use; this should have had reader_options applied already

  • reader_method – the name of the reader_method to use, if specified

  • path (str) – the path to use to guess reader_method if it was not specified

Returns

ReaderMethod to use for the filepath

get_compute_domain(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = [])

Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns in the format of a Pandas Series if only a single column is desired, or otherwise a Data Frame.

Parameters
  • domain_kwargs (dict) – the domain kwargs specifying which data to obtain

  • domain_type (str or "MetricDomainTypes") – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and are simply transferred with their associated values into accessor_domain_kwargs.

Returns

  • a DataFrame (the data on which to compute)

  • a dictionary of compute_domain_kwargs, describing the DataFrame

  • a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain

Return type

A tuple including

add_column_row_condition(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)

EXPERIMENTAL

Add a row condition for handling null filter.

Parameters
  • domain_kwargs – the domain kwargs to use as the base and to which to add the condition

  • column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs

  • filter_null – if true, add a filter for null values

  • filter_nan – if true, add a filter for nan values

resolve_metric_bundle(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Callable, dict]])

For each metric name in the given metric_fn_bundle, finds the domain of the metric and calculates it using a metric function from the given provider class.

Args:

metric_fn_bundle – a batch containing MetricEdgeKeys and their corresponding functions

metrics (dict) – a dictionary containing metrics and corresponding parameters

Returns:

A dictionary of the collected metrics over their respective domains

head(self, n=5)

Returns dataframe head. Default is 5

static _split_on_whole_table(df)
static _split_on_column_value(df, column_name: str, partition_definition: dict)
static _split_on_converted_datetime(df, column_name: str, partition_definition: dict, date_format_string: str = 'yyyy-MM-dd')
static _split_on_divided_integer(df, column_name: str, divisor: int, partition_definition: dict)

Divide the values in the named column by divisor, and split on that

static _split_on_mod_integer(df, column_name: str, mod: int, partition_definition: dict)

Take the mod of the values in the named column, and split on that

static _split_on_multi_column_values(df, column_names: list, partition_definition: dict)

Split on the joint values in the named columns

static _split_on_hashed_column(df, column_name: str, hash_digits: int, partition_definition: dict, hash_function_name: str = 'sha256')

Split on the hashed value of the named column

static _sample_using_random(df, p: float = 0.1, seed: int = 1)

Take a random sample of rows, retaining proportion p

static _sample_using_mod(df, column_name: str, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

static _sample_using_a_list(df, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches

static _sample_using_hash(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')

class great_expectations.execution_engine.SqlAlchemyExecutionEngine(name=None, credentials=None, data_context=None, engine=None, connection_string=None, url=None, batch_data_dict=None, **kwargs)

Bases: great_expectations.execution_engine.ExecutionEngine

property credentials(self)
property connection_string(self)
property url(self)
_build_engine(self, credentials, **kwargs)

Using a set of given credentials, constructs an Execution Engine, connecting to a database using a URL or a private key path.
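A minimal construction sketch, assuming the connection_string argument is passed straight through to SQLAlchemy's create_engine (an in-memory SQLite URL is used here for illustration):

    from great_expectations.execution_engine import SqlAlchemyExecutionEngine

    sql_engine = SqlAlchemyExecutionEngine(connection_string="sqlite://")
    # Alternatively, a pre-built SQLAlchemy Engine can be supplied via the
    # `engine` argument, or a credentials dictionary via `credentials=`.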

_get_sqlalchemy_key_pair_auth_url(self, drivername: str, credentials: dict)

Utilizing a private key path and a passphrase in a given credentials dictionary, attempts to encode the provided values into a private key. If the passphrase is incorrect, this will fail and an exception is raised.

Parameters
  • drivername (str) – the driver name used to construct the connection URL

  • credentials (dict) – a dictionary of credentials, including the private key path and passphrase

Returns

a tuple consisting of a url with the serialized key-pair authentication, and a dictionary of engine kwargs.
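A hedged sketch of the credentials dictionary such key-pair authentication might use; the specific key names shown (private_key_path, private_key_passphrase, and so on) are illustrative assumptions and may differ from the exact keys this release expects:

    # Illustrative only; not a verified configuration.
    credentials = {
        "drivername": "snowflake",
        "username": "my_user",
        "host": "my_account",
        "database": "MY_DATABASE",
        "private_key_path": "/path/to/rsa_key.p8",
        "private_key_passphrase": "my_passphrase",
    }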

get_compute_domain(self, domain_kwargs: Dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = None)

Uses a given batch dictionary and domain kwargs to obtain a SqlAlchemy column object.

Parameters
  • domain_kwargs (dict) – the domain kwargs specifying which data to obtain

  • domain_type (str or "MetricDomainTypes") – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.

  • accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain, and are simply transferred with their associated values into accessor_domain_kwargs.

Returns

SqlAlchemy column

resolve_metric_bundle(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Any, dict, dict]])

For every metric in a set of metrics to resolve, obtains the necessary metric keyword arguments and bundles the metrics into one large query dictionary so that they are all executed simultaneously. Will fail if bundling the metrics together is not possible.

Args:

metric_fn_bundle (Iterable[Tuple[MetricConfiguration, Callable, dict]]): an iterable of tuples, each containing a MetricProvider’s MetricConfiguration (its unique identifier), its metric provider function (the function that actually executes the metric), and the arguments to pass to the metric provider function.

metrics (Dict[Tuple, Any]): a dictionary of metrics defined in the registry and corresponding arguments

Returns:

A dictionary of metric names and their corresponding now-queried values.

_split_on_whole_table(self, table_name: str, partition_definition: dict)

‘Split’ by returning the whole table

_split_on_column_value(self, table_name: str, column_name: str, partition_definition: dict)

Split using the values in the named column

_split_on_converted_datetime(self, table_name: str, column_name: str, partition_definition: dict, date_format_string: str = '%Y-%m-%d')

Convert the values in the named column to the given date_format, and split on that

_split_on_divided_integer(self, table_name: str, column_name: str, divisor: int, partition_definition: dict)

Divide the values in the named column by divisor, and split on that

_split_on_mod_integer(self, table_name: str, column_name: str, mod: int, partition_definition: dict)

Take the mod of the values in the named column, and split on that

_split_on_multi_column_values(self, table_name: str, column_names: List[str], partition_definition: dict)

Split on the joint values in the named columns

_split_on_hashed_column(self, table_name: str, column_name: str, hash_digits: int, partition_definition: dict)

Split on the hashed value of the named column

_sample_using_random(self, p: float = 0.1)

Take a random sample of rows, retaining proportion p

Note: the Random function behaves differently on different dialects of SQL

_sample_using_mod(self, column_name, mod: int, value: int)

Take the mod of named column, and only keep rows that match the given value

_sample_using_a_list(self, column_name: str, value_list: list)

Match the values in the named column against value_list, and only keep the matches

_sample_using_md5(self, column_name: str, hash_digits: int = 1, hash_value: str = 'f')

Hash the values in the named column, and split on that

_build_selectable_from_batch_spec(self, batch_spec)
get_batch_data_and_markers(self, batch_spec)