Skip to main content

How to configure a DataConnector to introspect and partition tables in SQL

This guide will help you introspect and partition tables in an SQL database using SimpleSqlalchemyDatasource, which operates as a proxy to InferredAssetSqlDataConnector and ConfiguredAssetSqlDataConnector. For background, please see the Datasource specific guides in the "Connecting to your data" section.

The SQL database introspection and partitioning are useful for:

  • Exploring the schema and column metadata of the tables in your SQL database, and
  • Organizing the tables into data assets according to the partitioning considerations informed by this exploration.

Partitioning enables you to select the desired subsets of your dataset for Validation.

Prerequisites: This how-to guide assumes you have:

We will use the "Yellow Taxi" dataset to walk you through the configuration of SimpleSqlalchemyDatasource, where the introspection section characterizes InferredAssetSqlDataConnector objects and the tables section characterizes ConfiguredAssetSqlDataConnector objects. Starting with the bare-bones version of either the introspection section or the tables section of the SimpleSqlalchemyDatasource configuration, we gradually build out the configuration to achieve the introspection of your SQL database with the semantics consistent with your goals.

info

Only introspection and tables are the legal top-level keys in the SimpleSqlalchemyDatasource configuration.

To learn more about Datasources, Data Connectors, and Batch(es), please see our Datasources Core Concepts Guide in the Core Concepts reference guide.

Preliminary Steps#

1. Instantiate your project's DataContext#

Import Great Expectations.

import great_expectations as ge

2. Obtain DataContext#

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

Configuring Introspection and Tables#

1. Configure your SimpleSqlalchemyDatasource for introspection#

Start with an elementary SimpleSqlalchemyDatasource configuration, containing just a basic introspection component:

datasource_yaml = f"""name: taxi_datasourceclass_name: SimpleSqlalchemyDatasourceconnection_string: <CONNECTION_STRING>
introspection:  # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).    whole_table: {{}}  # Any alphanumeric key name is acceptable."""

Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Notice that the output reports the presence of exactly one InferredAssetSqlDataConnector (called whole_table, as per the configuration).

An integral part of the recommended approach, illustrated as part of this exercise, will be the use of the internal Great Expectations utility

context.test_yaml_config(    yaml_string, pretty_print: bool = True,    return_mode: str = "instantiated_class",    shorten_tracebacks: bool = False,)

to ensure the correctness of the proposed YAML configuration prior to incorporating it and trying to use it.

For instance, try the following erroneous SimpleSqlalchemyDatasource configuration (it contains an illegal top-level key):

datasource_yaml = f"""  # buggy datasource_yaml configurationname: mis_configured_datasourceclass_name: SimpleSqlalchemyDatasourceconnection_string: <CONNECTION_STRING>
introspecting:  # illegal top-level key name    whole_table: {{}}"""

Then specify the connection string for your database, and again run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Notice that the output reports an empty Data Connectors list, signaling a misconfiguration.

Feel free to experiment with the arguments to

context.test_yaml_config(    yaml_string, pretty_print: bool = True,    return_mode: str = "instantiated_class",    shorten_tracebacks: bool = False,)

For instance, running

context.test_yaml_config(yaml_string, return_mode="report_object")

will return the information appearing in standard output converted to the Python dictionary format.

Any structural errors (e.g., indentation, typos in class and configuration key names, etc.) will result in an exception raised and sent to standard error. This can be converted to an exception trace by running

context.test_yaml_config(yaml_string, shorten_tracebacks=True)

showing the line numbers, where the exception occurred, most likely caused by the failure of the required class (in this case InferredAssetSqlDataConnector) from being successfully instantiated.

2. Customize the introspection configuration to fit your needs#

SimpleSqlalchemyDatasource supports a number of configuration options to assist you with the introspection of your SQL database:

  • the database views will included in the list of identified Data References (by setting the include_views flag to true)
  • if any exceptions occur during the introspection operation, then the process will continue (by setting the skip_inapplicable_tables flag to true)
  • specifying excluded_tables will have the effect of excluding only the tables on this list, while including the rest
  • specifying included_tables will have the effect of including only the tables on this list, while excluding the rest

The following YAML configurqation example utilizes several of these configuration directives:

datasource_yaml = f"""name: taxi_datasourceclass_name: SimpleSqlalchemyDatasourceconnection_string: <CONNECTION_STRING>
introspection:  # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).    whole_table:        introspection_directives:            include_views: true        skip_inapplicable_tables: true  # skip and continue upon encountering introspection errors        excluded_tables:  # a list of tables to ignore when inferring data asset_names            - main.yellow_tripdata_sample_2019_03  # format: schema_name.table_name"""

3. Save the Datasource configuration to your DataContext#

Once the SimpleSqlalchemyDatasource configuration is error-free and satisfies your requirements, save it into your DataContext by using the add_datasource() function.

context.add_datasource(**yaml.load(datasource_yaml))

4. Get names of available Data Assets#

Getting names of available data assets using an InferredAssetSqlDataConnector affords you the visibility into types and naming structures of tables in your SQL database:

available_data_asset_names = context.datasources[    "taxi_datasource"].get_available_data_asset_names(data_connector_names="whole_table")["whole_table"]assert len(available_data_asset_names) == 2

To view the full script used in this page, see it on GitHub: