Skip to main content

How to configure a DataConnector for splitting and sampling tables in SQL

This guide will help you configure Splitting and Sampling for working with tables in an SQL database using SimpleSqlalchemyDatasource, which operates as a proxy to InferredAssetSqlDataConnector and ConfiguredAssetSqlDataConnector.

We will use the tables section of the SimpleSqlalchemyDatasource configuration, which exercises the ConfiguredAssetSqlDataConnector, to showcase Splitting and Sampling (the same Splitting and Sampling configuration options can be readily applied to the introspection section of the SimpleSqlalchemyDatasource configuration, which exercises the InferredAssetSqlDataConnector).

The Splitting and Sampling mechanisms provided by Great Expectations serve as additional tools for Partitioning your data at various levels of granularity:

  • Splitting provides the means of focusing the batch data on the values of certain dimensions of the data of interest.
  • Sampling provides a means for reducing the amount of data in the retrieved batch to facilitate data analysis.
Prerequisites: This how-to guide assumes you have:

This guide will use the tables section that is part of the following SimpleSqlalchemyDatasource configuration as an example:

datasource_yaml = f"""name: taxi_datasourceclass_name: SimpleSqlalchemyDatasourceconnection_string: <CONNECTION_STRING>
introspection:  # Each key in the "introspection" section is an InferredAssetSqlDataConnector    whole_table:        introspection_directives:            include_views: true        skip_inapplicable_tables: true  # skip and continue upon encountering introspection errors        excluded_tables:  # a list of tables to ignore when inferring data asset_names            - main.yellow_tripdata_sample_2019_03  # format: schema_name.table_name
    daily:        introspection_directives:            include_views: true        skip_inapplicable_tables: true  # skip and continue upon encountering introspection errors        included_tables:  # only include tables in this list when inferring data asset_names            - main.yellow_tripdata_sample_2019_01  # format: schema_name.table_name        splitter_method: _split_on_converted_datetime        splitter_kwargs:            column_name: pickup_datetime            date_format_string: "%Y-%m-%d"
    hourly:        introspection_directives:            include_views: true        skip_inapplicable_tables: true        included_tables:  # only include tables in this list when inferring data asset_names            - main.yellow_tripdata_sample_2019_01  # format: schema_name.table_name        splitter_method: _split_on_converted_datetime        splitter_kwargs:            column_name: pickup_datetime            date_format_string: "%Y-%m-%d %H"
tables:  # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).    # data_asset_name is: concatenate(data_asset_name_prefix, table_name, data_asset_name_suffix)    yellow_tripdata_sample_2019_01:  # Must match table name exactly.        partitioners:  # Each key in the "partitioners" sub-section the name of a ConfiguredAssetSqlDataConnector (key name "partitioners" in "SimpleSqlalchemyDatasource" configuration is reserved).            whole_table:                include_schema_name: True                data_asset_name_prefix: taxi__                data_asset_name_suffix: __asset
            by_num_riders:                include_schema_name: True                data_asset_name_prefix: taxi__                data_asset_name_suffix: __asset                splitter_method: _split_on_column_value                splitter_kwargs:                    column_name: passenger_count
            by_num_riders_random_sample:                include_schema_name: True                data_asset_name_prefix: taxi__                data_asset_name_suffix: __asset                splitter_method: _split_on_column_value                splitter_kwargs:                    column_name: passenger_count                sampling_method: _sample_using_random                sampling_kwargs:                    p: 1.0e-1"""

Preliminary Steps#

1. Instantiate your project's DataContext#

Import these necessary packages and modules.

import great_expectations as ge

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

2. Configure your Datasource#

Using the above example configuration, specify the connection string for your database. Then run this code to test your configuration:

context.test_yaml_config(datasource_yaml)

Feel free to adjust your configuration and re-run test_yaml_config() as needed.

3. Save the Datasource configuration to your DataContext#

Save the configuration into your DataContext by using the add_datasource() function.

context.add_datasource(**yaml.load(datasource_yaml))

Splitting and Sampling#

To configure Splitting, specify a dimension (i.e., column_name or column_names), the method of Splitting, and parameters to be used by the specified Splitting method. In the present example, the Data Connectors named by_num_riders and by_num_riders_random_sample split the table yellow_tripdata_sample_2019_01 on the column name passenger_count using the _split_on_column_value method of Splitting.

To configure Sampling, specify the method of Sampling, and parameters to be used by the specified Sampling method. In the present example, the Data Connector named by_num_riders_random_sample samples the table yellow_tripdata_sample_2019_01 using the _sample_using_random method of Sampling, configured to return 10% of the rows sampled at random, which is specified by the parameter p (stands for "proportion") set to the value 0.1.

Finally, confirm the expected number of batches was retrieved and the reduced size of a batch (due to Sampling):

batch_request = BatchRequest(    datasource_name="taxi_datasource",    data_connector_name="by_num_riders_random_sample",    data_asset_name="<YOUR_DATA_ASSET_NAME>",)

(set data_asset_name to "taxi__yellow_tripdata_sample_2019_01__asset" for the present example).

batch_list = context.get_batch_list(batch_request=batch_request)assert len(batch_list) == 6  # ride occupancy ranges from 1 passenger to 6 passengers
batch_data = batch_list[1].data  # 2-passenger sample of batch datanum_rows = batch_data.execution_engine.engine.execute(    sa.select([sa.func.count()]).select_from(batch_data.selectable)).scalar()assert num_rows < 200

Additional Notes#

Available Splitting methods and their configuration parameters:

+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| **Method**                        | **Parameters**                                                                                                                        | **Returned Batch Data**                                                                                              |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| _split_on_whole_table             | N/A                                                                                                                                   | identical to original                                                                                                |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| _split_on_column_value            | table_name='table', column_name='col'                                                                                                 | rows where value of column_name are same                                                                             |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| _split_on_converted_datetime      | table_name='table', column_name='col', date_format_string=<'%Y-%m-%d'>                                                                | rows where value of column_name converted to datetime using the given date_format_string are same                    |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| _split_on_divided_integer         | table_name='table', column_name='col', divisor=<int>                                                                                  | rows where value of column_name divided (using integral division) by the given divisor are same                      |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| _split_on_mod_integer             | table_name='table', column_name='col', mod=<int>                                                                                      | rows where value of column_name divided (using modular division) by the given mod are same                           |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| _split_on_multi_column_values     | table_name='table', column_names='<list[col]>'                                                                                        | rows where values of column_names are same                                                                           |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------+-----------------+| _split_on_hashed_column           | column_name='col',                                                                                                                    | rows where value of column_name hashed (using "md5" hash function) are same (experimental)                           |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+

Available Sampling methods and their configuration parameters:

+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+| **Method**                        | **Parameters**                                                                                                                        | **Returned Batch Data**                                                                                              |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+| _sample_using_limit               | n=num_rows                                                                                                                            | first up to to n (specific limit parameter) rows of batch                                                            | +-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+| _sample_using_random              | p=fraction                                                                                                                            | rows selected at random, whose number amounts to selected fraction of total number of rows in batch                  |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+| _sample_using_mod                 | column_name='col', mod=<int>                                                                                                          | take the mod of named column, and only keep rows that match the given value                                          |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+| _sample_using_a_list              | column_name='col', value_list=<list[val]>                                                                                             | match the values in the named column against value_list, and only keep the matches                                   |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+| _sample_using_hash                | column_name='col', hash_digits=<int>, hash_value=<str>                                                                                | hash the values in the named column (using "md5" hash function), and only keep rows that match the given hash_value  |+-----------------------------------+---------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------+

To view the full script used in this page, see it on GitHub: