How to configure a DataConnector to introspect and partition tables in SQL
This guide will help you introspect and partition tables in an SQL database using SimpleSqlalchemyDatasource, which operates as a proxy to InferredAssetSqlDataConnector and ConfiguredAssetSqlDataConnector. For background, please see the Datasource-specific guides in the "Connecting to your data" section of our documentation.
The SQL database introspection and partitioning are useful for:
- Exploring the schema and column metadata of the tables in your SQL database, and
- Organizing the tables into Data Assets according to the partitioning considerations informed by this exploration.
Partitioning enables you to select the desired subsets of your dataset for Validation.
Prerequisites: This how-to guide assumes you have:
- Completed the Getting Started Tutorial
- A working installation of Great Expectations
- Configured and loaded a Data Context
- Configured a Datasource and Data Connector
We will use the "Yellow Taxi" dataset to walk you through the configuration of SimpleSqlalchemyDatasource
, where
the introspection
section characterizes InferredAssetSqlDataConnector
objects and the tables
section characterizes
ConfiguredAssetSqlDataConnector
objects. Starting with the bare-bones version of either the introspection
section
or the tables
section of the SimpleSqlalchemyDatasource
configuration, we gradually build out the configuration to
achieve the introspection of your SQL database with the semantics consistent with your goals.
Only introspection
and tables
are the legal top-level keys in the SimpleSqlalchemyDatasource
configuration.
To learn more about Datasources, Data Connectors, and Batch(es), please see our Datasources Guide.
Preliminary Steps
1. Instantiate your project's DataContext
Import Great Expectations.
import great_expectations as gx
2. Obtain DataContext
Load your DataContext into memory using the get_context() method.
context = gx.get_context()
Configuring Introspection and Tables
- Introspection (InferredAssetSqlDataConnector)
- Tables (ConfiguredAssetSqlDataConnector)
1. Configure your SimpleSqlalchemyDatasource for introspection
Start with an elementary SimpleSqlalchemyDatasource
configuration, containing just a basic introspection
component:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <connection_string>
introspection: # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).
    whole_table: {{}} # Any alphanumeric key name is acceptable.
"""
Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports the presence of exactly one InferredAssetSqlDataConnector (called whole_table, as per the configuration).
An integral part of the recommended approach, illustrated throughout this exercise, is the use of the internal Great Expectations utility
context.test_yaml_config(
yaml_string, pretty_print: bool = True,
return_mode: str = "instantiated_class",
shorten_tracebacks: bool = False,
)
to ensure the correctness of the proposed YAML configuration prior to incorporating it and trying to use it.
For instance, try the following erroneous SimpleSqlalchemyDatasource configuration (it contains an illegal top-level key):
datasource_yaml = f""" # buggy datasource_yaml configuration
name: mis_configured_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <connection_string>
introspecting: # illegal top-level key name
    whole_table: {{}}
"""
Then specify the connection string for your database, and again run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports an empty Data Connectors list, signaling a misconfiguration.
Feel free to experiment with the arguments to
context.test_yaml_config(
yaml_string, pretty_print: bool = True,
return_mode: str = "instantiated_class",
shorten_tracebacks: bool = False,
)
For instance, running
context.test_yaml_config(yaml_string, return_mode="report_object")
will return the information appearing in standard output, converted to a Python dictionary.
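For example, a minimal sketch of capturing the report for programmatic inspection (the exact structure of the returned dictionary depends on your Great Expectations version):
report = context.test_yaml_config(datasource_yaml, return_mode="report_object")
print(report)  # the same information shown in standard output, as a Python dictionary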
Any structural errors (e.g., indentation, or typos in class and configuration key names) will raise an exception that is sent to standard error. This can be converted to a shortened exception trace by running
context.test_yaml_config(yaml_string, shorten_tracebacks=True)
which shows the line numbers where the exception occurred; the most likely cause is a failure to successfully instantiate the required class (in this case InferredAssetSqlDataConnector).
2. Customize the introspection configuration to fit your needs
SimpleSqlalchemyDatasource supports a number of configuration options to assist you with the introspection of your SQL database:
- database views will be included in the list of identified Data References (by setting the include_views flag to true)
- if any exceptions occur during the introspection operation, the process will continue (by setting the skip_inapplicable_tables flag to true)
- specifying excluded_tables will exclude only the tables on this list, while including the rest
- specifying included_tables will include only the tables on this list, while excluding the rest
The following YAML configuration example utilizes several of these directives:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <connection_string>
introspection: # Each key in the "introspection" section is the name of an InferredAssetSqlDataConnector (key name "introspection" in "SimpleSqlalchemyDatasource" configuration is reserved).
    whole_table:
        include_schema_name: true
        introspection_directives:
            include_views: true
        skip_inapplicable_tables: true # skip and continue upon encountering introspection errors
        excluded_tables: # a list of tables to ignore when inferring data asset_names
            - main.yellow_tripdata_sample_2019_03 # format: schema_name.table_name
"""
3. Save the Datasource configuration to your DataContext
Once the SimpleSqlalchemyDatasource configuration is error-free and satisfies your requirements, save it into your DataContext by using the add_datasource() function.
context.add_datasource(**yaml.load(datasource_yaml))
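Note that this snippet assumes a YAML parser named yaml is available in your script. A minimal sketch, using the ruamel package (any YAML parser that returns a dictionary will work):
from ruamel import yaml  # assumed import; provides the yaml.load() used above

context.add_datasource(**yaml.load(datasource_yaml))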
4. Get names of available Data Assets
Getting the names of available Data Assets using an InferredAssetSqlDataConnector gives you visibility into the types and naming structures of the tables in your SQL database:
available_data_asset_names = context.datasources[
"taxi_datasource"
].get_available_data_asset_names(data_connector_names="whole_table")["whole_table"]
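As a quick check, you can print the inferred names (a minimal sketch; the output depends on the contents of your database):
print(available_data_asset_names)  # list of Data Asset names inferred by the "whole_table" Data Connector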
1. Configure your SimpleSqlalchemyDatasource to characterize tables
Start with an elementary SimpleSqlalchemyDatasource configuration, containing just a basic tables component:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <connection_string>
tables: # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
    yellow_tripdata_sample_2019_01: # Must match table name exactly.
        partitioners: # Each key in the "partitioners" sub-section is the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).
            whole_table: {{}}
"""
Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports the presence of exactly one ConfiguredAssetSqlDataConnector (called whole_table, as per the configuration) and that, under Available data_asset_names (1 of 1), the name of the single Data Asset is yellow_tripdata_sample_2019_01.
2. Enhance your SimpleSqlalchemyDatasource with the ability to attribute metadata to tables
Add the Data Asset Name identification attributes (data_asset_name_prefix and data_asset_name_suffix) and set the include_schema_name flag in your ConfiguredAssetSqlDataConnector (named whole_table) configuration section. These directives will result in the reported properties of your table containing annotations customized for your purposes:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <connection_string>
tables: # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
    yellow_tripdata_sample_2019_01: # Must match table name exactly.
        partitioners: # Each key in the "partitioners" sub-section is the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).
            whole_table:
                include_schema_name: true
                schema_name: main
                data_asset_name_prefix: taxi__
                data_asset_name_suffix: __asset
"""
Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:
context.test_yaml_config(datasource_yaml)
Notice that the output reports the presence of exactly one ConfiguredAssetSqlDataConnector (called whole_table, as per the configuration) and that, under Available data_asset_names (1 of 1), the name of the single Data Asset is this time taxi__yellow_tripdata_sample_2019_01__asset, correctly reflecting the enhanced configuration directives.
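Beyond whole_table, you can add further partitioners on the same table to slice it into Batches. The configuration below is an illustrative sketch only: the partitioner name by_passenger_count is hypothetical, and it assumes the splitter_method / splitter_kwargs options of ConfiguredAssetSqlDataConnector and a passenger_count column in the table; verify the exact option names against your Great Expectations version:
datasource_yaml = f"""
name: taxi_datasource
class_name: SimpleSqlalchemyDatasource
connection_string: <connection_string>
tables:
    yellow_tripdata_sample_2019_01:
        partitioners:
            whole_table: {{}}
            by_passenger_count: # hypothetical partitioner name
                splitter_method: _split_on_column_value # one Batch per distinct column value
                splitter_kwargs:
                    column_name: passenger_count
"""
context.test_yaml_config(datasource_yaml)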
Finally, once your Data Connector configuration satisfies your requirements, save the enclosing Datasource into your DataContext using
context.add_datasource(**yaml.load(datasource_yaml))
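With the Datasource saved, you can request data from the configured table. Here is a minimal sketch, assuming the datasource_name, data_connector_name, and data_asset_name from the configuration above:
from great_expectations.core.batch import BatchRequest

batch_request = BatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="whole_table",
    data_asset_name="taxi__yellow_tripdata_sample_2019_01__asset",
)
batch_list = context.get_batch_list(batch_request=batch_request)  # a single Batch covering the whole table
print(len(batch_list))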
To view the full script used in this page, see it on GitHub: