great_expectations.execution_engine¶
Submodules¶
Package Contents¶
Classes¶
ExecutionEngine
PandasExecutionEngine – PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.
SparkDFExecutionEngine – This class holds an attribute spark_df which is a spark.sql.DataFrame.
SqlAlchemyExecutionEngine
-
class
great_expectations.execution_engine.
ExecutionEngine
(name=None, caching=True, batch_spec_defaults=None, batch_data_dict=None, validator=None)¶ -
recognized_batch_spec_defaults
¶
-
configure_validator
(self, validator)¶ Optionally configure the validator as appropriate for the execution engine.
-
property
active_batch_data_id
(self)¶ The batch id for the default batch data.
When an execution engine is asked to process a compute domain that does not include a specific batch_id, then the data associated with the active_batch_data_id will be used as the default.
-
property
active_batch_data
(self)¶ The data from the currently-active batch.
-
property
loaded_batch_data_dict
(self)¶ The current dictionary of batches.
-
property
config
(self)¶
-
get_batch_data
(self, batch_spec: BatchSpec)¶ Interprets batch_data and returns the appropriate data.
This method is primarily useful for utility cases (e.g. testing) where data is being fetched without a DataConnector and metadata like batch_markers is unwanted.
Note: this method is currently a thin wrapper for get_batch_data_and_markers. It simply suppresses the batch_markers.
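For illustration, a minimal sketch of fetching in-memory data without a DataConnector, assuming RuntimeDataBatchSpec is importable from great_expectations.core.batch_spec (the exact path may vary between versions):

    import pandas as pd

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()
    batch_spec = RuntimeDataBatchSpec(batch_data=pd.DataFrame({"a": [1, 2, 3]}))

    # get_batch_data returns only the batch data ...
    batch_data = engine.get_batch_data(batch_spec=batch_spec)

    # ... while get_batch_data_and_markers also returns the generated batch markers.
    batch_data, batch_markers = engine.get_batch_data_and_markers(batch_spec=batch_spec)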
-
load_batch_data
(self, batch_id: str, batch_data: Any)¶ Loads the specified batch_data into the execution engine
-
_load_batch_data_from_dict
(self, batch_data_dict)¶ Loads all data in batch_data_dict into the execution engine, using load_batch_data for each entry
-
_get_typed_batch_data
(self, batch_data)¶
-
resolve_metrics
(self, metrics_to_resolve: Iterable[MetricConfiguration], metrics: Dict[Tuple, Any] = None, runtime_configuration: dict = None)¶ resolve_metrics is the main entrypoint for an execution engine. The execution engine will compute the value of the provided metrics.
- Parameters
metrics_to_resolve – the metrics to evaluate
metrics – already-computed metrics currently available to the engine
runtime_configuration – runtime configuration information
- Returns
a dictionary with the values for the metrics that have just been resolved.
- Return type
resolved_metrics (Dict)
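A minimal sketch of resolving a single metric directly against a concrete engine. It assumes MetricConfiguration is importable from great_expectations.validator.validation_graph, that importing the package registers the built-in metric providers, and that the batch id "my_batch" is a hypothetical label; all of these may vary between versions:

    import great_expectations  # importing the package registers the built-in metric providers
    import pandas as pd

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import PandasExecutionEngine
    from great_expectations.validator.validation_graph import MetricConfiguration

    engine = PandasExecutionEngine()
    batch_data = engine.get_batch_data(
        RuntimeDataBatchSpec(batch_data=pd.DataFrame({"a": [1, 2, 3, None]}))
    )
    engine.load_batch_data(batch_id="my_batch", batch_data=batch_data)

    row_count = MetricConfiguration(
        metric_name="table.row_count",
        metric_domain_kwargs={},  # empty domain kwargs -> the active batch is used
        metric_value_kwargs=None,
        metric_dependencies={},   # assume no pre-computed dependencies are needed here
    )
    resolved = engine.resolve_metrics(metrics_to_resolve=[row_count])
    print(resolved[row_count.id])  # the row count of the active batch (4 for this frame)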
-
abstract
resolve_metric_bundle
(self, metric_fn_bundle)¶ Resolve a bundle of metrics with the same compute domain as part of a single trip to the compute engine.
-
abstract
get_compute_domain
(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'])¶ get_compute_domain computes the optimal domain_kwargs for computing metrics based on the given domain_kwargs and specific engine semantics.
- Returns
data corresponding to the compute domain;
a modified copy of domain_kwargs describing the domain of the data returned in (1);
a dictionary describing the access instructions for data elements included in the compute domain (e.g. a specific column name).
In general, the union of the compute_domain_kwargs and accessor_domain_kwargs will be the same as the domain_kwargs provided to this method.
- Return type
A tuple consisting of three elements
-
add_column_row_condition
(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)¶ EXPERIMENTAL
Add a row condition for handling null filter.
- Parameters
domain_kwargs – the domain kwargs to use as the base and to which to add the condition
column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs
filter_null – if true, add a filter for null values
filter_nan – if true, add a filter for nan values
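For illustration, a sketch of deriving null-filtering domain kwargs on a concrete engine; the exact row_condition syntax that is produced is engine-specific, so the example only prints the result:

    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()

    # Start from a column domain and ask the engine to add a null filter to it.
    conditioned_kwargs = engine.add_column_row_condition({"column": "a"}, filter_null=True)

    # The result is a copy of the original kwargs carrying a row_condition /
    # condition_parser pair that downstream metrics apply before computing.
    print(conditioned_kwargs)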
-
-
class
great_expectations.execution_engine.
PandasExecutionEngine
(*args, **kwargs)¶ Bases:
great_expectations.execution_engine.execution_engine.ExecutionEngine
PandasExecutionEngine instantiates the great_expectations Expectations API as a subclass of a pandas.DataFrame.
For the full API reference, please see
Dataset
Notes
Samples and subsets of PandasDataset have ALL the expectations of the original data frame unless the user specifies the
discard_subset_failing_expectations = True
property on the original data frame. Concatenations, joins, and merges of PandasDatasets contain NO expectations (since no autoinspection is performed by default).
Validation Engine - Pandas - How-to Guide
Use Pandas DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Complete
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low
Expectation Completeness: Complete
-
recognized_batch_spec_defaults
¶
-
configure_validator
(self, validator)¶ Optionally configure the validator as appropriate for the execution engine.
-
get_batch_data_and_markers
(self, batch_spec: BatchSpec)¶
-
_apply_splitting_and_sampling_methods
(self, batch_spec, batch_data)¶
-
_get_typed_batch_data
(self, batch_data)¶
-
property
dataframe
(self)¶ Returns the loaded pandas DataFrame. If no batch has been loaded, raises a ValueError exception.
-
_get_reader_fn
(self, reader_method=None, path=None)¶ Static helper for parsing reader types. If reader_method is not provided, path will be used to guess the correct reader_method.
- Parameters
reader_method (str) – the name of the reader method to use, if available.
path (str) – the path used to guess the reader method if reader_method is not provided
- Returns
ReaderMethod to use for the filepath
-
static
guess_reader_method_from_path
(path)¶ Helper method for deciding which reader to use to read in a certain path.
- Parameters
path (str) – the path used to guess the reader method
- Returns
ReaderMethod to use for the filepath
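For illustration, a sketch of guessing the reader method from a file extension; the precise return shape (a bare method name versus a small dict such as {'reader_method': 'read_csv'}) varies between versions, so treat it as an assumption:

    from great_expectations.execution_engine import PandasExecutionEngine

    # Static helper: no engine instance or loaded batch is needed.
    print(PandasExecutionEngine.guess_reader_method_from_path("events_2021.csv"))
    print(PandasExecutionEngine.guess_reader_method_from_path("events_2021.parquet"))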
-
get_compute_domain
(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = [])¶ Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns the data in the format of a Pandas DataFrame. If the domain is a single column, the column is added to the accessor domain kwargs for later access.
- Parameters
domain_kwargs (dict) – the domain kwargs to use as the starting point for identifying the compute domain
domain_type (str or "MetricDomainTypes") – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain; they are simply transferred with their associated values into accessor_domain_kwargs.
- Returns
a DataFrame (the data on which to compute)
a dictionary of compute_domain_kwargs, describing the DataFrame
a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain
- Return type
A tuple including
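A minimal sketch of requesting a column domain from a loaded in-memory batch, passing the domain type as a string (import paths are assumptions and may vary between versions):

    import pandas as pd

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import PandasExecutionEngine

    engine = PandasExecutionEngine()
    batch_data = engine.get_batch_data(
        RuntimeDataBatchSpec(batch_data=pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}))
    )
    engine.load_batch_data(batch_id="demo", batch_data=batch_data)

    data, compute_kwargs, accessor_kwargs = engine.get_compute_domain(
        domain_kwargs={"column": "a"},
        domain_type="column",
    )
    print(type(data))       # the pandas DataFrame on which metrics will be computed
    print(accessor_kwargs)  # how to reach the requested column within that frame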
-
static
_split_on_whole_table
(df)¶
-
static
_split_on_column_value
(df, column_name: str, partition_definition: dict)¶
-
static
_split_on_converted_datetime
(df, column_name: str, partition_definition: dict, date_format_string: str = '%Y-%m-%d')¶ Convert the values in the named column to the given date_format, and split on that
-
static
_split_on_divided_integer
(df, column_name: str, divisor: int, partition_definition: dict)¶ Divide the values in the named column by divisor, and split on that
-
static
_split_on_mod_integer
(df, column_name: str, mod: int, partition_definition: dict)¶ Take the values in the named column modulo mod, and split on the result
-
static
_split_on_multi_column_values
(df, column_names: List[str], partition_definition: dict)¶ Split on the joint values in the named columns
-
static
_split_on_hashed_column
(df, column_name: str, hash_digits: int, partition_definition: dict, hash_function_name: str = 'md5')¶ Split on the hashed value of the named column
-
static
_sample_using_random
(df, p: float = 0.1)¶ Take a random sample of rows, retaining proportion p
Note: the Random function behaves differently on different dialects of SQL
-
static
_sample_using_mod
(df, column_name: str, mod: int, value: int)¶ Take the mod of the named column, and only keep rows that match the given value
-
static
_sample_using_a_list
(df, column_name: str, value_list: list)¶ Match the values in the named column against value_list, and only keep the matches
-
static
_sample_using_hash
(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')¶ Hash the values in the named column, and only keep rows whose hashed value matches the given hash_value
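For illustration, a sketch of the splitter and sampler helpers applied directly to a pandas frame; these are static methods, so no batch needs to be loaded first. The assumption here is that partition_definition is keyed by the column name:

    import pandas as pd

    from great_expectations.execution_engine import PandasExecutionEngine

    df = pd.DataFrame({"group": ["a", "a", "b", "b"], "id": [1, 2, 3, 4]})

    # Keep only the rows belonging to one partition of the "group" column.
    partition = PandasExecutionEngine._split_on_column_value(
        df, column_name="group", partition_definition={"group": "a"}
    )

    # Keep the rows whose id is congruent to 0 modulo 2.
    sample = PandasExecutionEngine._sample_using_mod(df, column_name="id", mod=2, value=0)

    print(len(partition), len(sample))  # 2 rows each under the assumptions above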
-
class
great_expectations.execution_engine.
SparkDFExecutionEngine
(*args, **kwargs)¶ Bases:
great_expectations.execution_engine.execution_engine.ExecutionEngine
This class holds an attribute spark_df which is a spark.sql.DataFrame.
Validation Engine - pyspark - Self-Managed - How-to Guide
Use Spark DataFrame to validate data
Maturity: Production
Details:
API Stability: Stable
Implementation Completeness: Moderate
Unit Test Coverage: Complete
Integration Infrastructure/Test Coverage: N/A -> see relevant Datasource evaluation
Documentation Completeness: Complete
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Databricks - How-to Guide
Use Spark DataFrame in a Databricks cluster to validate data
Maturity: Beta
Details:
API Stability: Stable
Implementation Completeness: Low (dbfs-specific handling)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Minimal (we’ve tested a bit, know others have used it)
Documentation Completeness: Moderate (need docs on managing project configuration via dbfs/etc.)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - EMR - Spark - How-to Guide
Use Spark DataFrame in an EMR cluster to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Low (need to provide guidance on “known good” paths, and we know there are many “knobs” to tune that we have not explored/tested)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
Validation Engine - Spark - Other - How-to Guide
Use Spark DataFrame to validate data
Maturity: Experimental
Details:
API Stability: Stable
Implementation Completeness: Other (we haven’t tested possibility, known glue deployment)
Unit Test Coverage: N/A -> implementation not different
Integration Infrastructure/Test Coverage: Unknown
Documentation Completeness: Low (must install specific/latest version but do not have docs to that effect or of known useful paths)
Bug Risk: Low/Moderate
Expectation Completeness: Moderate
-
recognized_batch_definition_keys
¶
-
recognized_batch_spec_defaults
¶
-
property
dataframe
(self)¶ If a batch has been loaded, returns a Spark DataFrame containing the data within the loaded batch.
-
get_batch_data_and_markers
(self, batch_spec: BatchSpec)¶
-
_apply_splitting_and_sampling_methods
(self, batch_spec, batch_data)¶
-
static
guess_reader_method_from_path
(path)¶ Based on a given filepath, decides a reader method. Currently supports tsv, csv, and parquet. If none of these file extensions are used, returns a BatchKwargsError stating that the reader method cannot be determined from the path.
- Parameters
path – a given file path
- Returns
A dictionary entry of the format {'reader_method': reader_method}
-
_get_reader_fn
(self, reader, reader_method=None, path=None)¶ Static helper for providing reader_fn
- Parameters
reader – the base spark reader to use; this should have had reader_options applied already
reader_method – the name of the reader_method to use, if specified
path (str) – the path to use to guess reader_method if it was not specified
- Returns
ReaderMethod to use for the filepath
-
get_compute_domain
(self, domain_kwargs: dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = [])¶ Uses a given batch dictionary and domain kwargs (which include a row condition and a condition parser) to obtain and/or query a batch. Returns the data in the format of a Spark DataFrame.
- Parameters
domain_kwargs (dict) – the domain kwargs to use as the starting point for identifying the compute domain
domain_type (str or "MetricDomainTypes") – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain; they are simply transferred with their associated values into accessor_domain_kwargs.
- Returns
a DataFrame (the data on which to compute)
a dictionary of compute_domain_kwargs, describing the DataFrame
a dictionary of accessor_domain_kwargs, describing any accessors needed to identify the domain within the compute domain
- Return type
A tuple including
-
add_column_row_condition
(self, domain_kwargs, column_name=None, filter_null=True, filter_nan=False)¶ EXPERIMENTAL
Add a row condition for handling null filter.
- Parameters
domain_kwargs – the domain kwargs to use as the base and to which to add the condition
column_name – if provided, use this name to add the condition; otherwise, will use “column” key from table_domain_kwargs
filter_null – if true, add a filter for null values
filter_nan – if true, add a filter for nan values
-
resolve_metric_bundle
(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Callable, dict]])¶ For each metric name in the given metric_fn_bundle, finds the domain of the metric and calculates it using a metric function from the given provider class.
- Args:
metric_fn_bundle – a batch containing MetricEdgeKeys and their corresponding functions
metrics (dict) – a dictionary containing metrics and corresponding parameters
- Returns:
A dictionary of the collected metrics over their respective domains
-
head
(self, n=5)¶ Returns the first n rows of the loaded dataframe. The default is 5.
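A minimal sketch of loading an in-memory Spark DataFrame and peeking at the active batch, assuming a local SparkSession is available and that RuntimeDataBatchSpec accepts a Spark DataFrame as batch_data (import paths may vary between versions):

    from pyspark.sql import SparkSession

    from great_expectations.core.batch_spec import RuntimeDataBatchSpec
    from great_expectations.execution_engine import SparkDFExecutionEngine

    spark = SparkSession.builder.master("local[*]").getOrCreate()
    spark_df = spark.createDataFrame([(1, "x"), (2, "y"), (3, "z")], ["id", "label"])

    engine = SparkDFExecutionEngine()
    batch_data, batch_markers = engine.get_batch_data_and_markers(
        batch_spec=RuntimeDataBatchSpec(batch_data=spark_df)
    )
    engine.load_batch_data(batch_id="spark_demo", batch_data=batch_data)

    print(engine.head(n=2))  # the first two rows of the loaded batch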
-
static
_split_on_whole_table
(df)¶
-
static
_split_on_column_value
(df, column_name: str, partition_definition: dict)¶
-
static
_split_on_converted_datetime
(df, column_name: str, partition_definition: dict, date_format_string: str = 'yyyy-MM-dd')¶
-
static
_split_on_divided_integer
(df, column_name: str, divisor: int, partition_definition: dict)¶ Divide the values in the named column by divisor, and split on that
-
static
_split_on_mod_integer
(df, column_name: str, mod: int, partition_definition: dict)¶ Take the values in the named column modulo mod, and split on the result
-
static
_split_on_multi_column_values
(df, column_names: list, partition_definition: dict)¶ Split on the joint values in the named columns
-
static
_split_on_hashed_column
(df, column_name: str, hash_digits: int, partition_definition: dict, hash_function_name: str = 'sha256')¶ Split on the hashed value of the named column
-
static
_sample_using_random
(df, p: float = 0.1, seed: int = 1)¶ Take a random sample of rows, retaining proportion p
-
static
_sample_using_mod
(df, column_name: str, mod: int, value: int)¶ Take the mod of the named column, and only keep rows that match the given value
-
static
_sample_using_a_list
(df, column_name: str, value_list: list)¶ Match the values in the named column against value_list, and only keep the matches
-
static
_sample_using_hash
(df, column_name: str, hash_digits: int = 1, hash_value: str = 'f', hash_function_name: str = 'md5')¶
-
-
class
great_expectations.execution_engine.
SqlAlchemyExecutionEngine
(name=None, credentials=None, data_context=None, engine=None, connection_string=None, url=None, batch_data_dict=None, **kwargs)¶ Bases:
great_expectations.execution_engine.ExecutionEngine
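For illustration, a sketch of constructing the engine directly from a SQLAlchemy connection string; the SQLite path below is hypothetical, and any SQLAlchemy URL for which a driver is installed will do:

    from great_expectations.execution_engine import SqlAlchemyExecutionEngine

    # Connect the engine to a local SQLite database via a SQLAlchemy URL.
    engine = SqlAlchemyExecutionEngine(connection_string="sqlite:///my_data.db")

    print(engine.connection_string)  # the URL the engine was configured with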
-
property
credentials
(self)¶
-
property
connection_string
(self)¶
-
property
url
(self)¶
-
_build_engine
(self, credentials, **kwargs)¶ Using a set of given credentials, constructs an Execution Engine, connecting to a database using a URL or a private key path.
-
_get_sqlalchemy_key_pair_auth_url
(self, drivername: str, credentials: dict)¶ Utilizing a private key path and a passphrase in a given credentials dictionary, attempts to encode the provided values into a private key. If the passphrase is incorrect, this will fail and an exception is raised.
- Parameters
drivername (str) –
credentials (dict) –
- Returns
a tuple consisting of a url with the serialized key-pair authentication, and a dictionary of engine kwargs.
-
get_compute_domain
(self, domain_kwargs: Dict, domain_type: Union[str, 'MetricDomainTypes'], accessor_keys: Optional[Iterable[str]] = None)¶ Uses a given batch dictionary and domain kwargs to obtain a SqlAlchemy column object.
- Parameters
domain_kwargs (dict) – the domain kwargs to use as the starting point for identifying the compute domain
domain_type (str or "MetricDomainTypes") – an Enum value indicating which metric domain the user would like to be using, or a corresponding string value representing it. String types include "identity", "column", "column_pair", "table" and "other". Enum types include capitalized versions of these from the class MetricDomainTypes.
accessor_keys (str iterable) – keys that are part of the compute domain but should be ignored when describing the domain; they are simply transferred with their associated values into accessor_domain_kwargs.
- Returns
SqlAlchemy column
-
resolve_metric_bundle
(self, metric_fn_bundle: Iterable[Tuple[MetricConfiguration, Any, dict, dict]])¶ For each metric in a set of metrics to resolve, obtains the necessary metric keyword arguments and bundles the metrics into one large query so that they are all executed simultaneously. Will fail if bundling the metrics together is not possible.
- Args:
metric_fn_bundle (Iterable[Tuple[MetricConfiguration, Callable, dict]]) – a bundle containing each metric's MetricConfiguration (its unique identifier), its metric provider function (the function that actually executes the metric), and the arguments to pass to the metric provider function.
metrics (Dict[Tuple, Any]) – a dictionary of metrics defined in the registry and corresponding arguments
- Returns:
A dictionary of metric names and their corresponding now-queried values.
-
_split_on_whole_table
(self, table_name: str, partition_definition: dict)¶ ‘Split’ by returning the whole table
-
_split_on_column_value
(self, table_name: str, column_name: str, partition_definition: dict)¶ Split using the values in the named column
-
_split_on_converted_datetime
(self, table_name: str, column_name: str, partition_definition: dict, date_format_string: str = '%Y-%m-%d')¶ Convert the values in the named column to the given date_format, and split on that
-
_split_on_divided_integer
(self, table_name: str, column_name: str, divisor: int, partition_definition: dict)¶ Divide the values in the named column by divisor, and split on that
-
_split_on_mod_integer
(self, table_name: str, column_name: str, mod: int, partition_definition: dict)¶ Take the values in the named column modulo mod, and split on the result
-
_split_on_multi_column_values
(self, table_name: str, column_names: List[str], partition_definition: dict)¶ Split on the joint values in the named columns
-
_split_on_hashed_column
(self, table_name: str, column_name: str, hash_digits: int, partition_definition: dict)¶ Split on the hashed value of the named column
-
_sample_using_random
(self, p: float = 0.1)¶ Take a random sample of rows, retaining proportion p
Note: the Random function behaves differently on different dialects of SQL
-
_sample_using_mod
(self, column_name, mod: int, value: int)¶ Take the mod of the named column, and only keep rows that match the given value
-
_sample_using_a_list
(self, column_name: str, value_list: list)¶ Match the values in the named column against value_list, and only keep the matches
-
_sample_using_md5
(self, column_name: str, hash_digits: int = 1, hash_value: str = 'f')¶ Hash the values in the named column with md5, and only keep rows whose hashed value matches the given hash_value
-
_build_selectable_from_batch_spec
(self, batch_spec)¶
-
get_batch_data_and_markers
(self, batch_spec)¶
-
property