great_expectations.datasource.data_connector.data_connector

Module Contents

Classes

DataConnector(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None, batch_spec_passthrough: Optional[dict] = None)

DataConnectors produce identifying information, called a “batch_spec”, that ExecutionEngines can use to get individual batches of data.

great_expectations.datasource.data_connector.data_connector.logger
class great_expectations.datasource.data_connector.data_connector.DataConnector(name: str, datasource_name: str, execution_engine: Optional[ExecutionEngine] = None, batch_spec_passthrough: Optional[dict] = None)

DataConnectors produce identifying information, called a “batch_spec”, that ExecutionEngines can use to get individual batches of data. They add flexibility in how to obtain data, such as through time-based partitioning, downsampling, or other techniques appropriate for the Datasource.

For example, a DataConnector could produce a SQL query that logically represents “rows in the Events table with a timestamp on February 7, 2012,” which a SqlAlchemyDatasource could use to materialize a SqlAlchemyDataset corresponding to that batch of data and ready for validation.

A batch is a sample from a data asset, sliced according to a particular rule. For example, an hourly slice of the Events table, or the “most recent user records.”

A Batch is the primary unit of validation in the Great Expectations DataContext. Batches include metadata that identifies how they were constructed: the same “batch_spec” assembled by the data connector. While not every Datasource will enable re-fetching a specific batch of data, GE can store snapshots of batches or store metadata from an external data version control system.
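
A hedged construction sketch: the concrete subclass, directory path, regex, and all names below are illustrative assumptions, not part of this module. It shows how a DataConnector subclass is typically instantiated with a name, its parent datasource_name, an ExecutionEngine, and subclass-specific options:

    from great_expectations.execution_engine import PandasExecutionEngine
    from great_expectations.datasource.data_connector import (
        InferredAssetFilesystemDataConnector,
    )

    # Illustrative only: base_directory and default_regex describe assumed local CSV files.
    data_connector = InferredAssetFilesystemDataConnector(
        name="my_filesystem_connector",
        datasource_name="my_datasource",
        execution_engine=PandasExecutionEngine(),
        base_directory="/path/to/data",
        default_regex={
            "pattern": r"(.+)_(\d{8})\.csv",
            "group_names": ["data_asset_name", "timestamp"],
        },
    )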

property batch_spec_passthrough(self)
property name(self)
property datasource_name(self)
property data_context_root_directory(self)
get_batch_data_and_metadata(self, batch_definition: BatchDefinition)

Uses the batch_definition to retrieve batch_data and batch_markers: builds a batch_spec from the batch_definition, then uses the execution_engine to return the batch_data and batch_markers, as sketched below.

Parameters

batch_definition (BatchDefinition) – required batch_definition parameter for retrieval
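
A minimal usage sketch, assuming the illustrative connector constructed above; the asset name and batch identifiers are likewise assumptions chosen to match that example’s regex group names:

    from great_expectations.core.batch import BatchDefinition
    from great_expectations.core.id_dict import IDDict

    batch_definition = BatchDefinition(
        datasource_name="my_datasource",
        data_connector_name="my_filesystem_connector",
        data_asset_name="events",
        batch_identifiers=IDDict({"timestamp": "20120207"}),
    )

    # Bundles the batch_data and batch_markers retrieved via the execution_engine.
    result = data_connector.get_batch_data_and_metadata(batch_definition=batch_definition)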

build_batch_spec(self, batch_definition: BatchDefinition)

Builds a batch_spec from the batch_definition by generating batch_spec parameters and adding any passthrough parameters (batch_spec_passthrough); a short usage sketch follows.

Parameters

batch_definition (BatchDefinition) – required batch_definition parameter for retrieval

Returns

BatchSpec object built from BatchDefinition
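
For debugging, the same (assumed) batch_definition from the sketch above can be used to inspect the BatchSpec the connector would hand to its ExecutionEngine:

    # The concrete BatchSpec subclass (path-based, query-based, ...) depends on the connector.
    batch_spec = data_connector.build_batch_spec(batch_definition=batch_definition)
    print(batch_spec)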

abstract _refresh_data_references_cache(self)
abstract _get_data_reference_list(self, data_asset_name: Optional[str] = None)

List objects in the underlying data store to create a list of data_references. Classes that extend this base DataConnector class use this method to refresh the data_references cache; a simplified subclass sketch follows the parameter list.

Parameters

data_asset_name (str) – optional data_asset_name to retrieve more specific results
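
A simplified sketch of how a filesystem-backed subclass might implement this method, assuming a base_directory attribute; the shipped file-path connectors add glob and regex filtering that is omitted here:

    import os
    from typing import List, Optional

    from great_expectations.datasource.data_connector import DataConnector


    class MyFilesystemDataConnector(DataConnector):
        # The other abstract methods are omitted from this sketch.
        def _get_data_reference_list(
            self, data_asset_name: Optional[str] = None
        ) -> List[str]:
            # Every file under base_directory becomes one data_reference.
            data_references = sorted(os.listdir(self.base_directory))
            if data_asset_name is not None:
                data_references = [
                    ref for ref in data_references if ref.startswith(data_asset_name)
                ]
            return data_references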

abstract _get_data_reference_list_from_cache_by_data_asset_name(self, data_asset_name: str)

Fetch data_references corresponding to data_asset_name from the cache.

abstract get_data_reference_list_count(self)
abstract get_unmatched_data_references(self)
abstract get_available_data_asset_names(self)

Return the list of asset names known by this data connector.

Returns

A list of available names
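
For example, a quick look at what the (assumed) connector from the earlier sketch has discovered:

    for asset_name in data_connector.get_available_data_asset_names():
        print(asset_name)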

abstract get_batch_definition_list_from_batch_request(self, batch_request: BatchRequest)
abstract _map_data_reference_to_batch_definition_list(self, data_reference: Any, data_asset_name: Optional[str] = None)
abstract _map_batch_definition_to_data_reference(self, batch_definition: BatchDefinition)
abstract _generate_batch_spec_parameters_from_batch_definition(self, batch_definition: BatchDefinition)
self_check(self, pretty_print=True, max_examples=3)

Checks the configuration of the current DataConnector by doing the following:

  1. refresh or create the data_reference_cache

  2. print the batch_definition_count and example_data_references for each data_asset_name

  3. print any unmatched data_references, so the user can adjust the regex or glob configuration if necessary

  4. select a random data_reference and attempt to retrieve and print the first few rows to the user

When used as part of the test_yaml_config() workflow, this check lets the user confirm that the data_connector is properly configured and that the associated execution_engine can retrieve data using that configuration (a sketch of this workflow follows the parameter list below).

Parameters
  • pretty_print (bool) – should the output be printed?

  • max_examples (int) – how many data_references should be printed?
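
A sketch of running this check, either directly or through the test_yaml_config() workflow mentioned above. The DataContext project, datasource name, directory, and regex are illustrative assumptions; test_yaml_config() instantiates the configured datasource, whose self-check output surfaces each data connector’s self_check() results:

    # Direct call, using the parameters documented above:
    data_connector.self_check(pretty_print=True, max_examples=3)

    # Or via the test_yaml_config() workflow (assumes an existing DataContext project):
    from great_expectations.data_context import DataContext

    context = DataContext()
    yaml_config = r"""
    name: my_datasource
    class_name: Datasource
    execution_engine:
      class_name: PandasExecutionEngine
    data_connectors:
      my_filesystem_connector:
        class_name: InferredAssetFilesystemDataConnector
        base_directory: /path/to/data
        default_regex:
          pattern: (.+)_(\d{8})\.csv
          group_names:
            - data_asset_name
            - timestamp
    """
    context.test_yaml_config(yaml_config=yaml_config)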

_self_check_fetch_batch(self, pretty_print: bool, example_data_reference: Any, data_asset_name: str)

Helper function for self_check() that retrieves a batch using example_data_reference and data_asset_name, printing helpful messages along the way. The first 5 rows of batch_data are printed by default.

Parameters
  • pretty_print (bool) – print to console?

  • example_data_reference (Any) – data_reference to retrieve

  • data_asset_name (str) – data_asset_name to retrieve

_validate_batch_request(self, batch_request: BatchRequest)
Validate the batch_request by checking the following (an illustrative sketch follows the parameter list):
  1. that the configured datasource_name matches the batch_request’s datasource_name

  2. that the current data_connector’s name matches the batch_request’s data_connector_name

Parameters

batch_request (BatchRequest) – batch_request to validate
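
An illustrative sketch of a BatchRequest that would pass this validation for the (assumed) connector from the earlier examples; a mismatched datasource_name or data_connector_name is expected to fail validation:

    from great_expectations.core.batch import BatchRequest

    batch_request = BatchRequest(
        datasource_name="my_datasource",                # must match the connector's datasource_name
        data_connector_name="my_filesystem_connector",  # must match the connector's name
        data_asset_name="events",                       # assumed asset name
    )

    # Batch definitions that match the request, as resolved by the connector.
    batch_definition_list = data_connector.get_batch_definition_list_from_batch_request(
        batch_request=batch_request
    )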