DataContext Module

DataContext

class great_expectations.data_context.BaseDataContext(project_config, context_root_dir=None)

Bases: object

This class implements most of the functionality of DataContext, with a few exceptions.

  1. BaseDataContext does not attempt to keep its project_config in sync with a file on disk.

  2. BaseDataContext doesn’t attempt to “guess” paths or object types. Instead, that logic is pushed into the DataContext class.

Together, these changes make the BaseDataContext class more testable.

PROFILING_ERROR_CODE_TOO_MANY_DATA_ASSETS = 2
PROFILING_ERROR_CODE_SPECIFIED_DATA_ASSETS_NOT_FOUND = 3
PROFILING_ERROR_CODE_NO_GENERATOR_FOUND = 4
PROFILING_ERROR_CODE_MULTIPLE_GENERATORS_FOUND = 5
UNCOMMITTED_DIRECTORIES = ['data_docs', 'validations']
GE_UNCOMMITTED_DIR = 'uncommitted'
BASE_DIRECTORIES = ['expectations', 'notebooks', 'plugins', 'uncommitted']
NOTEBOOK_SUBDIRECTORIES = ['pandas', 'spark', 'sql']
GE_DIR = 'great_expectations'
GE_YML = 'great_expectations.yml'
GE_EDIT_NOTEBOOK_DIR = 'uncommitted'
classmethod validate_config(project_config)
add_store(store_name, store_config)

Add a new Store to the DataContext and (for convenience) return the instantiated Store object.

Parameters
  • store_name (str) – a key for the new Store in self._stores

  • store_config (dict) – a config for the Store to add

Returns

store (Store)
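
For example, a store_config might look like the following. The class names shown are assumptions about what this version of Great Expectations provides, not guaranteed identifiers:

```python
# Hypothetical store configuration for add_store; class names are assumptions.
store_config = {
    "class_name": "ValidationsStore",
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "uncommitted/validations/",
    },
}

# Usage (sketch):
# store = context.add_store("my_validations_store", store_config)
```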

add_validation_operator(validation_operator_name, validation_operator_config)

Add a new ValidationOperator to the DataContext and (for convenience) return the instantiated object.

Parameters
  • validation_operator_name (str) – a key for the new ValidationOperator in self._validation_operators

  • validation_operator_config (dict) – a config for the ValidationOperator to add

Returns

validation_operator (ValidationOperator)

get_docs_sites_urls(resource_identifier=None)

Get URLs for a resource for all data docs sites.

This function will return URLs for any configured site even if the sites have not been built yet.

Parameters

resource_identifier – optional. An identifier of an ExpectationSuite, a ValidationResult, or another resource that has a typed identifier. If not provided, the method will return the URLs of the index page.

Returns

a list of URLs. Each item is the URL for the resource for a data docs site

open_data_docs(resource_identifier=None)

A stdlib cross-platform way to open a file in a browser.

Parameters

resource_identifier – ExpectationSuiteIdentifier, ValidationResultIdentifier or any other type’s identifier. The argument is optional - when not supplied, the method returns the URL of the index page.

property root_directory

The root directory for configuration objects in the data context; the location in which great_expectations.yml is located.

property plugins_directory

The directory in which custom plugin modules should be placed.

property stores

A single holder for all Stores in this context

property datasources

A single holder for all Datasources in this context

property expectations_store_name
get_config_with_variables_substituted(config=None)
save_config_variable(config_variable_name, value)

Save config variable value

Parameters
  • config_variable_name – name of the property

  • value – the value to save for the property

Returns

None

get_available_data_asset_names(datasource_names=None, generator_names=None)

Inspect datasources and generators to provide available data_asset objects.

Parameters
  • datasource_names – list of datasources for which to provide available data_asset_name objects. If None, return available data assets for all datasources.

  • generator_names – list of generators for which to provide available data_asset_name objects.

Returns

Dictionary describing available data assets

{
  datasource_name: {
    generator_name: [ data_asset_1, data_asset_2, ... ]
    ...
  }
  ...
}

Return type

data_asset_names (dict)
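
The nested structure above can be flattened into fully qualified names with a simple traversal. This sketch uses illustrative sample data in place of a real context:

```python
# Sketch: flatten the dict returned by get_available_data_asset_names into
# datasource/generator/asset names. Sample data below is illustrative.
available = {
    "my_postgres": {
        "default": ["public.orders", "public.customers"],
    },
    "my_files": {
        "subdir_reader": ["daily_logs"],
    },
}

qualified_names = [
    f"{datasource}/{generator}/{asset}"
    for datasource, generators in available.items()
    for generator, assets in generators.items()
    for asset in assets
]
```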

build_batch_kwargs(datasource, generator, name=None, partition_id=None, **kwargs)

Builds batch kwargs using the provided datasource, generator, and batch_parameters.

Parameters
  • datasource (str) – the name of the datasource for which to build batch_kwargs

  • generator (str) – the name of the generator to use to build batch_kwargs

  • name (str) – an optional data asset name to include in the batch_parameters

  • partition_id (str) – an optional partition_id to include in the batch_parameters

  • **kwargs – additional batch_parameters

Returns

BatchKwargs

get_batch(batch_kwargs, expectation_suite_name, data_asset_type=None, batch_parameters=None)

Build a batch of data using batch_kwargs, and return a DataAsset with expectation_suite_name attached. If batch_parameters are included, they will be available as attributes of the batch.

Parameters
  • batch_kwargs – the batch_kwargs to use; must include a datasource key

  • expectation_suite_name – The ExpectationSuite or the name of the expectation_suite to get

  • data_asset_type – the type of data_asset to build, with associated expectation implementations. This can generally be inferred from the datasource.

  • batch_parameters – optional parameters to store as the reference description of the batch. They should reflect parameters that would provide the passed BatchKwargs.

Returns

DataAsset
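
Taken together, build_batch_kwargs and get_batch form the basic batch-fetching workflow. The Great Expectations calls below are shown as comments; the datasource, generator, and suite names are illustrative assumptions:

```python
# Sketch of the batch-fetching workflow:
#
# import great_expectations as ge
# context = ge.data_context.DataContext()
# batch_kwargs = context.build_batch_kwargs("my_datasource", "my_generator", name="orders")
# batch = context.get_batch(batch_kwargs, "orders.warning")

# However batch_kwargs are built, the dict passed to get_batch must include a
# "datasource" key identifying the datasource that will load the data:
batch_kwargs = {"datasource": "my_datasource", "table": "orders"}
assert "datasource" in batch_kwargs
```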

run_validation_operator(validation_operator_name, assets_to_validate, run_id=None, evaluation_parameters=None, **kwargs)

Run a validation operator to validate data assets and to perform the business logic around validation that the operator implements.

Parameters
  • validation_operator_name – name of the operator, as appears in the context’s config file

  • assets_to_validate – a list that specifies the data assets that the operator will validate. The members of the list can be either batches, or a tuple that will allow the operator to fetch the batch: (batch_kwargs, expectation_suite_name)

  • run_id – The run_id for the validation; if None, a default value will be used

  • **kwargs – Additional kwargs to pass to the validation operator

Returns

ValidationOperatorResult
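
As described above, members of assets_to_validate may be batches or (batch_kwargs, expectation_suite_name) tuples. A sketch with illustrative names:

```python
# Each member is either an already-loaded batch, or a tuple the operator can
# use to fetch the batch itself. Names below are illustrative assumptions.
assets_to_validate = [
    ({"datasource": "my_datasource", "table": "orders"}, "orders.warning"),
]

# Usage (sketch; the operator name must appear in the context's config file):
# results = context.run_validation_operator(
#     "action_list_operator", assets_to_validate=assets_to_validate
# )
```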

list_validation_operator_names()
add_datasource(name, initialize=True, **kwargs)

Add a new datasource to the data context, with configuration provided as kwargs.

Parameters
  • name – the name for the new datasource to add

  • initialize – if False, add the datasource to the config, but do not initialize it, for example if a user needs to debug database connectivity.

  • kwargs (keyword arguments) – the configuration for the new datasource

Returns

datasource (Datasource)
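
A minimal configuration sketch; "PandasDatasource" is an assumption about the datasource classes available in this version:

```python
# Hypothetical kwargs for add_datasource (class name is an assumption).
datasource_config = {
    "class_name": "PandasDatasource",
}

# Usage (sketch):
# context.add_datasource("my_pandas_datasource", **datasource_config)
#
# To record the config without connecting (e.g. while debugging connectivity):
# context.add_datasource("my_db", initialize=False, class_name="SqlAlchemyDatasource")
```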

add_generator(datasource_name, generator_name, class_name, **kwargs)

Add a generator to the named datasource, using the provided configuration.

Parameters
  • datasource_name – name of datasource to which to add the new generator

  • generator_name – name of the generator to add

  • class_name – class of the generator to add

  • **kwargs – generator configuration, provided as kwargs

Returns

get_config()
get_datasource(datasource_name='default')

Get the named datasource

Parameters

datasource_name (str) – the name of the datasource from the configuration

Returns

datasource (Datasource)

list_expectation_suites()

Return a list of available expectation suite names.

list_datasources()

List currently-configured datasources on this context.

Returns

each dictionary includes “name” and “class_name” keys

Return type

List(dict)

create_expectation_suite(expectation_suite_name, overwrite_existing=False)

Build a new expectation suite and save it into the data_context expectation store.

Parameters
  • expectation_suite_name – The name of the expectation_suite to create

  • overwrite_existing (boolean) – Whether to overwrite expectation suite if expectation suite with given name already exists.

Returns

A new (empty) expectation suite.

get_expectation_suite(expectation_suite_name)

Get the named expectation suite from the configured expectations store.

Parameters

expectation_suite_name (str) – the name for the expectation suite

Returns

expectation_suite

save_expectation_suite(expectation_suite, expectation_suite_name=None)

Save the provided expectation suite into the DataContext.

Parameters
  • expectation_suite – the suite to save

  • expectation_suite_name – the name of this expectation suite. If no name is provided the name will be read from the suite

Returns

None
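
The three methods above form the suite lifecycle. The calls are shown as comments; the suite name is an illustrative assumption:

```python
# Sketch of the expectation suite lifecycle:
#
# suite = context.create_expectation_suite("orders.warning")
# suite = context.get_expectation_suite("orders.warning")
# context.save_expectation_suite(suite)  # name is read from the suite itself
#
# If a suite with the chosen name may already exist, overwrite explicitly:
# context.create_expectation_suite("orders.warning", overwrite_existing=True)
suite_name = "orders.warning"
```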

store_validation_result_metrics(requested_metrics, validation_results, target_store_name)
store_evaluation_parameters(validation_results, target_store_name=None)
property evaluation_parameter_store
property evaluation_parameter_store_name
property validations_store_name
property validations_store
get_validation_result(expectation_suite_name, run_id=None, batch_identifier=None, validations_store_name=None, failed_only=False)

Get validation results from a configured store.

Parameters
  • expectation_suite_name – expectation_suite name for which to get validation result

  • run_id – run_id for which to get validation result (if None, fetch the latest result by alphanumeric sort)

  • batch_identifier – the batch_identifier for which to get the validation result

  • validations_store_name – the name of the store from which to get validation results

  • failed_only – if True, filter the result to return only failed expectations

Returns

validation_result

update_return_obj(data_asset, return_obj)

Helper called by data_asset.

Parameters
  • data_asset – The data_asset whose validation produced the current return object

  • return_obj – the return object to update

Returns

the return object, potentially changed into a widget by the configured expectation explorer

Return type

return_obj

build_data_docs(site_names=None, resource_identifiers=None)

Build Data Docs for your project.

These make it simple to visualize data quality in your project. They include Expectations, Validations & Profiles, and are built for all Datasources from JSON artifacts in the local repo, including validations & profiles from the uncommitted directory.

Parameters
  • site_names – if specified, build data docs only for these sites, otherwise, build all the sites specified in the context’s config

  • resource_identifiers – a list of resource identifiers (ExpectationSuiteIdentifier, ValidationResultIdentifier). If specified, rebuild HTML (or other views the data docs sites are rendering) only for the resources in this list. This supports incremental build of data docs sites (e.g., when a new validation result is created) and avoids full rebuild.

Returns

A dictionary with the names of the updated data documentation sites as keys and the location info of their index.html files as values
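
The return value maps site names to index page locations. A sketch with illustrative sample values:

```python
# Sketch of the dict returned by build_data_docs (sample values illustrative):
index_page_locations = {
    "local_site": "file:///great_expectations/uncommitted/data_docs/local_site/index.html",
}

for site_name, index_url in index_page_locations.items():
    print(f"{site_name}: {index_url}")
```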

profile_datasource(datasource_name, generator_name=None, data_assets=None, max_data_assets=20, profile_all_data_assets=True, profiler=<class 'great_expectations.profile.basic_dataset_profiler.BasicDatasetProfiler'>, dry_run=False, run_id='profiling', additional_batch_kwargs=None)

Profile the named datasource using the named profiler.

Parameters
  • datasource_name – the name of the datasource for which to profile data_assets

  • generator_name – the name of the generator to use to get batches

  • data_assets – list of data asset names to profile

  • max_data_assets – if the number of data assets the generator yields exceeds max_data_assets, profile_all_data_assets=True is required to profile all of them

  • profile_all_data_assets – when True, all data assets are profiled, regardless of their number

  • profiler – the profiler class to use

  • dry_run – when True, the method checks its arguments and reports whether profiling can proceed, or specifies which arguments are missing

  • additional_batch_kwargs – Additional keyword arguments to be provided to get_batch when loading the data asset.

Returns

A dictionary:

{
    "success": True/False,
    "results": List of (expectation_suite, EVR) tuples for each of the data_assets found in the datasource
}

When success = False, the error details are under the “error” key
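
Handling the result dict can be sketched as follows; the sample data, and the exact shape of the error details, are assumptions for illustration:

```python
# Sketch of handling profile_datasource's result (sample data illustrative;
# the inner structure of the "error" value is an assumption).
profiling_results = {
    "success": False,
    "error": {"code": 2},  # e.g. PROFILING_ERROR_CODE_TOO_MANY_DATA_ASSETS
}

if profiling_results["success"]:
    suites = [suite for suite, _evr in profiling_results["results"]]
else:
    error_code = profiling_results["error"]["code"]
```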

profile_data_asset(datasource_name, generator_name=None, data_asset_name=None, batch_kwargs=None, expectation_suite_name=None, profiler=<class 'great_expectations.profile.basic_dataset_profiler.BasicDatasetProfiler'>, run_id='profiling', additional_batch_kwargs=None)

Profile a data asset

Parameters
  • datasource_name – the name of the datasource to which the profiled data asset belongs

  • generator_name – the name of the generator to use to get batches (only if batch_kwargs are not provided)

  • data_asset_name – the name of the profiled data asset

  • batch_kwargs – optional - if set, the method will use the value to fetch the batch to be profiled. If not passed, the generator (generator_name arg) will choose a batch

  • profiler – the profiler class to use

  • run_id – optional - if set, the validation result created by the profiler will be under the provided run_id

  • additional_batch_kwargs – Additional keyword arguments to be provided to get_batch when loading the data asset.

Returns

A dictionary:

{
    "success": True/False,
    "results": List of (expectation_suite, EVR) tuples for each of the data_assets found in the datasource
}

When success = False, the error details are under the “error” key

class great_expectations.data_context.DataContext(context_root_dir=None)

Bases: great_expectations.data_context.data_context.BaseDataContext

A DataContext represents a Great Expectations project. It organizes storage and access for expectation suites, datasources, notification settings, and data fixtures.

The DataContext is configured via a yml file stored in a directory called great_expectations; the configuration file as well as managed expectation suites should be stored in version control.

Use the create classmethod to create a new empty config, or instantiate the DataContext by passing the path to an existing data context root directory.

DataContexts use data sources you’re already familiar with. Generators help introspect data stores and data execution frameworks (such as airflow, Nifi, dbt, or dagster) to describe and produce batches of data ready for analysis. This enables fetching, validation, profiling, and documentation of your data in a way that is meaningful within your existing infrastructure and work environment.

DataContexts use a datasource-based namespace, where each accessible type of data has a three-part normalized data_asset_name, consisting of datasource/generator/generator_asset.

  • The datasource actually connects to a source of materialized data and returns Great Expectations DataAssets connected to a compute environment and ready for validation.

  • The Generator knows how to introspect datasources and produce identifying “batch_kwargs” that define particular slices of data.

  • The generator_asset is a specific name – often a table name or other name familiar to users – that generators can slice into batches.

An expectation suite is a collection of expectations ready to be applied to a batch of data. Since in many projects it is useful to have different expectations evaluate in different contexts–profiling vs. testing; warning vs. error; high vs. low compute; ML model or dashboard–suites provide a namespace option for selecting which expectations a DataContext returns.

In many simple projects, the datasource or generator name may be omitted and the DataContext will infer the correct name when there is no ambiguity.

Similarly, if no expectation suite name is provided, the DataContext will assume the name “default”.

classmethod create(project_root_dir=None)

Build a new great_expectations directory and DataContext object in the provided project_root_dir.

create will create a new “great_expectations” directory in the provided folder, provided one does not already exist. Then, it will initialize a new DataContext in that folder and write the resulting config.

Parameters

project_root_dir – path to the root directory in which to create a new great_expectations directory

Returns

DataContext

classmethod all_uncommitted_directories_exist(ge_dir)

Check if all uncommitted directories exist.

classmethod config_variables_yml_exist(ge_dir)

Check if the config_variables.yml file exists.

classmethod write_config_variables_template_to_disk(uncommitted_dir)
classmethod write_project_template_to_disk(ge_dir)
classmethod scaffold_directories(base_dir)

Safely create GE directories for a new project.

classmethod scaffold_custom_data_docs(plugins_dir)

Copy custom data docs templates

classmethod scaffold_notebooks(base_dir)

Copy template notebooks into the notebooks directory for a project.

list_expectation_suite_names()

Lists the available expectation suite names

add_store(store_name, store_config)

Add a new Store to the DataContext and (for convenience) return the instantiated Store object.

Parameters
  • store_name (str) – a key for the new Store in self._stores

  • store_config (dict) – a config for the Store to add

Returns

store (Store)

add_datasource(name, **kwargs)

Add a new datasource to the data context, with configuration provided as kwargs.

Parameters
  • name – the name for the new datasource to add

  • initialize – if False, add the datasource to the config, but do not initialize it, for example if a user needs to debug database connectivity.

  • kwargs (keyword arguments) – the configuration for the new datasource

Returns

datasource (Datasource)

classmethod find_context_root_dir()
classmethod find_context_yml_file(search_start_dir=None)

Search for the yml file starting here and moving upward.

classmethod does_config_exist_on_disk(context_root_dir)

Return True if the great_expectations.yml exists on disk.

classmethod is_project_initialized(ge_dir)

Return True if the project is initialized.

To be considered initialized, all of the following must be true:

  • all project directories exist (including uncommitted directories)

  • a valid great_expectations.yml is on disk

  • a config_variables.yml is on disk

  • the project has at least one datasource

  • the project has at least one suite

classmethod does_project_have_a_datasource_in_config_file(ge_dir)
great_expectations.data_context.util.safe_mmkdir(directory, exist_ok=True)

Simple wrapper around os.makedirs, since the exist_ok argument is not available in Python 2.

great_expectations.data_context.util.load_class(class_name, module_name)

Dynamically load a class from strings or raise a helpful error.

great_expectations.data_context.util.instantiate_class_from_config(config, runtime_environment, config_defaults=None)

Build a GE class from configuration dictionaries.

great_expectations.data_context.util.format_dict_for_error_message(dict_)
great_expectations.data_context.util.substitute_config_variable(template_str, config_variables_dict)

This method takes a string, and if it contains a pattern ${SOME_VARIABLE} or $SOME_VARIABLE, returns a string where the pattern is replaced with the value of SOME_VARIABLE, otherwise returns the string unchanged.

If the environment variable SOME_VARIABLE is set, the method uses its value for substitution. If it is not set, the value of SOME_VARIABLE is looked up in the config variables store (file). If it is not found there, the input string is returned as is.

Parameters
  • template_str – a string that might or might not be of the form ${SOME_VARIABLE} or $SOME_VARIABLE

  • config_variables_dict – a dictionary of config variables. It is loaded from the config variables store (by default, the “uncommitted/config_variables.yml” file)

Returns

a string with the variable substituted, or the original string if no substitution was made

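The lookup order described above (environment variable first, then the config variables file, then the input unchanged) can be sketched in plain Python. This is an illustrative reimplementation, not the library's actual code:

```python
import os
import re

def substitute_config_variable_sketch(template_str, config_variables_dict):
    """Illustrative sketch of the substitution logic; the real implementation
    may differ in details such as error handling and partial substitution."""
    match = re.match(r"^\$\{(.*?)\}$|^\$([_a-zA-Z][_a-zA-Z0-9]*)$", template_str)
    if match is None:
        return template_str
    variable_name = match.group(1) or match.group(2)
    # environment variables take precedence over the config variables file
    if variable_name in os.environ:
        return os.environ[variable_name]
    # fall back to the config variables store; return the input unchanged
    # when the variable is not found there either
    return config_variables_dict.get(variable_name, template_str)
```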
great_expectations.data_context.util.substitute_all_config_variables(data, replace_variables_dict)

Substitute all config variables of the form ${SOME_VARIABLE} in a dictionary-like config object for their values.

The method traverses the dictionary recursively.

Parameters
  • data

  • replace_variables_dict

Returns

a dictionary with all the variables replaced with their values

great_expectations.data_context.util.file_relative_path(dunderfile, relative_path)

This function is useful when you need to load a file relative to the position of the current file (such as when you encode a configuration file path in a source file and want it runnable from any current working directory).

It is meant to be used like the following: file_relative_path(__file__, 'path/relative/to/file')

H/T https://github.com/dagster-io/dagster/blob/8a250e9619a49e8bff8e9aa7435df89c2d2ea039/python_modules/dagster/dagster/utils/__init__.py#L34
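
The behavior can be sketched as a thin wrapper over os.path; this is an illustrative equivalent, not the library's exact implementation:

```python
import os

def file_relative_path_sketch(dunderfile, relative_path):
    """Resolve relative_path against the directory containing the given
    source file (pass __file__ as dunderfile)."""
    return os.path.join(os.path.dirname(dunderfile), relative_path)

# Usage (sketch): config_path = file_relative_path_sketch(__file__, "conf/settings.yml")
```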