How to configure a Snowflake Datasource

This guide shows how to connect to a Snowflake Datasource.

Great Expectations supports three authentication mechanisms for Snowflake:

  • User / password

  • Single sign-on (SSO)

  • Key pair

Steps

Docs for Stable API (up to 0.12.x)

Prerequisites: This how-to guide assumes you have already set up a working deployment of Great Expectations and obtained credentials for your Snowflake instance.

To add a Snowflake datasource, for all authentication mechanisms:

  1. Install the required modules

    If you haven’t already, install these modules for connecting to Snowflake.

    pip install sqlalchemy
    
    pip install snowflake-connector-python
    
    pip install snowflake-sqlalchemy
    
  2. Run datasource new

    From the command line, run:

    great_expectations datasource new
    
  3. Choose “Relational database (SQL)”

    What data would you like Great Expectations to connect to?
        1. Files on a filesystem (for processing with Pandas or Spark)
        2. Relational database (SQL)
    : 2
    
  4. Choose Snowflake

    Which database backend are you using?
        1. MySQL
        2. Postgres
        3. Redshift
        4. Snowflake
        5. BigQuery
        6. other - Do you have a working SQLAlchemy connection string?
    : 4
    
  5. Give your Datasource a name

    When prompted, provide a custom name for your Snowflake data source, or hit Enter to accept the default.

    Give your new Datasource a short name.
     [my_snowflake_db]:
    
  6. Choose an authentication mechanism

    What authentication method would you like to use?
    
    1. User and Password
    2. Single sign-on (SSO)
    3. Key pair authentication
    
  7. Enter connection information

    Next, you will be asked for information common to all authentication mechanisms.

    Next, we will configure database credentials and store them in the `my_snowflake_db` section
    of this config file: great_expectations/uncommitted/config_variables.yml:
    
    What is the user login name for the snowflake connection? []: myusername
    What is the account name for the snowflake connection (include region -- ex 'ABCD.us-east-1')? []: xyz12345.us-east-1
    What is database name for the snowflake connection? (optional -- leave blank for none) []: MY_DATABASE
    What is schema name for the snowflake connection? (optional -- leave blank for none) []: MY_SCHEMA
    What is warehouse name for the snowflake connection? (optional -- leave blank for none) []: MY_COMPUTE_WH
    What is role name for the snowflake connection? (optional -- leave blank for none) []: MY_ROLE
    
  8. For “User and Password”: provide password

    Next, you will be asked to supply the password for your Snowflake instance:

    What is the password for the snowflake connection?:
    

    Great Expectations will store these secrets privately on your machine. They will not be committed to git.

  9. For “Single sign-on (SSO)”: provide SSO information

    Next, you will be asked to enter single sign-on information:

    Valid okta URL or 'externalbrowser' used to connect through SSO: externalbrowser
    
  10. For “Key pair authentication”: provide key pair information

    Next, you will be asked to enter key pair authentication information:

    Path to the private key used for authentication: ~/.ssh/my_snowflake.p8
    Passphrase for the private key used for authentication (optional -- leave blank for none): mypass
    

    Great Expectations will store these secrets privately on your machine. They will not be committed to git.

  11. Wait to verify your connection

    You will then see the following message on your terminal screen:

    Attempting to connect to your database. This may take a moment...
    

    For SSO, a browser tab will additionally open; follow the authentication process there and close the tab once the following message is displayed:

    Your identity was confirmed and propagated to Snowflake PythonConnector. You can close this window now and go back where you started from.
    

    If all goes well, it will be followed by the message:

    Great Expectations connected to your database!
    

    If you run into an error, you will see something like:

    Cannot connect to the database.
      - Please check your environment and the configuration you provided.
      - Database Error: Cannot initialize datasource my_snowflake_db, error: (snowflake.connector.errors.DatabaseError) 250001 (08001): Failed to connect to DB: oca29081.us-east-1.snowflakecomputing.com:443. Incorrect username or password was specified.
    
    (Background on this error at: http://sqlalche.me/e/4xp6)
    Enter the credentials again? [Y/n]:
    

    In this case, please check your credentials, ports, firewall, etc. and try again.

  12. Save your new configuration

    Finally, you’ll be asked to confirm that you want to save your configuration:

    Great Expectations will now add a new Datasource 'my_snowflake_db' to your deployment, by adding this entry to your great_expectations.yml:
    
      my_snowflake_db:
        credentials: ${my_snowflake_db}
        data_asset_type:
          class_name: SqlAlchemyDataset
          module_name: great_expectations.dataset
        class_name: SqlAlchemyDatasource
    
    The credentials will be saved in uncommitted/config_variables.yml under the key 'my_snowflake_db'
    
    Would you like to proceed? [Y/n]:
    

    After this confirmation, you can proceed with exploring the data sets in your new Snowflake Datasource.
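The answers you give the CLI are assembled into a SQLAlchemy connection URL and stored in uncommitted/config_variables.yml (the URL format appears in the Additional Notes). As a rough sketch of that mapping using only the Python standard library (snowflake_url is a hypothetical helper, not part of Great Expectations):

```python
from urllib.parse import quote_plus

def snowflake_url(user, password, account, database, schema, warehouse, role):
    # URL-encode the password so special characters survive in the URL.
    return (
        f"snowflake://{user}:{quote_plus(password)}@{account}"
        f"/{database}/{schema}?warehouse={warehouse}&role={role}"
    )

# The example answers from steps 7 and 8 (the password is made up):
print(snowflake_url("myusername", "p@ss word", "xyz12345.us-east-1",
                    "MY_DATABASE", "MY_SCHEMA", "MY_COMPUTE_WH", "MY_ROLE"))
```

Note that user, warehouse, and role would also need URL-encoding if they contained special characters.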

Docs for Experimental API (0.13)

Prerequisites: This how-to guide assumes you have already set up a working deployment of Great Expectations and obtained credentials for your Snowflake instance.

To add a Snowflake datasource, do the following:

  1. Install the required modules

    If you haven’t already, install these modules for connecting to Snowflake.

    pip install sqlalchemy
    pip install snowflake-connector-python
    pip install snowflake-sqlalchemy
    
  2. Instantiate a DataContext

    import great_expectations as ge
    context = ge.get_context()
    
  3. Create or copy a yaml config

    Parameters can be set as strings or passed in as environment variables. The following example shows a yaml config for a SimpleSqlalchemyDatasource with username-and-password authentication: username, password, and host are supplied as environment variables, while database and query are set as strings. SimpleSqlalchemyDatasource is a subclass of Datasource that automatically configures a SqlDataConnector, and is the one you will most likely want when connecting to data that lives in a SQL database. More information on Datasources in GE 0.13 can be found in the Core Great Expectations Concepts document.

    This example also uses introspection to configure the datasource, so that each table in the database is associated with its own data_asset. A deeper explanation of the different modes of building data_assets from data (introspective/inferred vs. configured) can also be found in the Core Great Expectations Concepts document.

    Additional examples of yaml configurations for various filesystems and databases can be found in the following document: How to configure DataContext components using test_yaml_config. Examples of yaml configurations for key pair and SSO authentication can be found in the Additional Notes section below.

    config = """
        class_name: SimpleSqlalchemyDatasource
        credentials:
            drivername: snowflake
            username: ${snowflake_username}
            password: ${snowflake_pw}
            host: ${snowflake_host}
            database: TEST
            query:
                schema: KAGGLE_MOVIE_DATASET
    
        introspection:
            whole_table:
                data_asset_name_suffix: __whole_table
        """
    
  4. Run context.test_yaml_config

    context.test_yaml_config(
        name="my_snowflake_datasource",
        yaml_config=config
    )
    

    When executed, test_yaml_config will instantiate the component and run through a self_check procedure to verify that the component works as expected.

    The output will look something like this:

     Attempting to instantiate class from config...
     Instantiating as a DataSource, since class_name is SimpleSqlalchemyDatasource
     Successfully instantiated SimpleSqlalchemyDatasource
    
     Execution engine: SqlAlchemyExecutionEngine
     Data connectors:
         whole_table : InferredAssetSqlDataConnector
    
         Available data_asset_names (1 of 1):
             imdb_100k_main__whole_table (1 of 1): [{}]
    
         Unmatched data_references (0 of 0): []
    
         Choosing an example data reference...
             Reference chosen: {}
    
             Fetching batch data...
             [(50832,)]
    
     Showing 5 rows
        movieid                               title                                       genres
     0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy
     1        2                      Jumanji (1995)                   Adventure|Children|Fantasy
     2        3             Grumpier Old Men (1995)                               Comedy|Romance
     3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance
     4        5  Father of the Bride Part II (1995)                                       Comedy
    

    This means all has gone well and you can proceed with exploring the data sets in your new Snowflake datasource.

    Note: In the current example, the yaml config creates a connection to the datasource for the current session only. After you exit Python, the datasource and its configuration will be gone. To make them persistent, add the configuration to great_expectations.yml in your great_expectations/ directory.
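The ${snowflake_username}, ${snowflake_pw}, and ${snowflake_host} placeholders in the yaml config are resolved by Great Expectations from environment variables (or from config_variables.yml). As a rough stdlib-only illustration of that substitution, Python's string.Template uses the same ${...} syntax (the variable names and values here are just examples):

```python
import os
from string import Template

# Normally these would be exported in your shell, not set in Python.
os.environ["snowflake_username"] = "myusername"
os.environ["snowflake_host"] = "xyz12345.us-east-1"

config = """
class_name: SimpleSqlalchemyDatasource
credentials:
    drivername: snowflake
    username: ${snowflake_username}
    host: ${snowflake_host}
    database: TEST
"""

# string.Template mimics Great Expectations' ${VAR} resolution.
resolved = Template(config).substitute(os.environ)
print(resolved)
```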

Additional Notes

  1. When using the Snowflake dialect, SqlAlchemyDataset may create a transient table instead of a temporary table when you pass in query Batch Kwargs or provide custom_sql to its constructor. Consequently, you may provide a snowflake_transient_table in addition to the query parameter. Any existing table with that name will be overwritten. By default, if no snowflake_transient_table is passed in Batch Kwargs, SqlAlchemyDataset will create a temporary table instead.

  2. snowflake_transient_table and table Batch Kwargs do not currently accept a fully qualified table name (i.e. database.schema.table) - only the table name alone. Queries generated by Great Expectations are scoped to the schema and database specified in your datasource configuration, including the creation of the transient table specified in snowflake_transient_table. If you need to use custom SQL but want to isolate transient table creation to a schema separate from the rest of your warehouse, you can fully qualify your custom SQL and let the transient table be created using the database and schema specified in your datasource configuration.

  3. Should you need to modify your connection string, you can manually edit the great_expectations/uncommitted/config_variables.yml file.

  4. You can edit the great_expectations/uncommitted/config_variables.yml file to accomplish the connection configuration without using the CLI. The entry would have the following format:

    For “User and password authentication”:

    my_snowflake_db:
        url: "snowflake://<user_login_name>:<password>@<account_name>/<database_name>/<schema_name>?warehouse=<warehouse_name>&role=<role_name>"
    

    For “Single sign-on authentication”:

    my_snowflake_db:
        url: "snowflake://<myuser%40mydomain.com>:<password>@<account_name>/<database_name>/<schema_name>?authenticator=<externalbrowser or valid URL encoded okta url>&warehouse=<warehouse_name>&role=<role_name>"
    

    For “Key pair authentication”:

    my_snowflake_db:
        drivername: snowflake
        username: <user_login_name>
        host: <account_name>
        database: <database_name>
        query:
            schema: <schema_name>
            warehouse: <warehouse_name>
            role: <role_name>
        private_key_path: </path/to/key.p8>
        private_key_passphrase: <pass_phrase or ''>
    
  5. For Snowflake SSO authentication, by default, one browser tab will be opened per connection. You can enable token caching at the account level to re-use tokens and minimize the number of browser tabs opened.

    To do so, run the following SQL on Snowflake:

    alter account set allow_id_token = true;
    

    Also make sure the version of your snowflake-connector-python library is >= 2.2.8.

  6. Single sign-on (SSO) Authentication for Experimental API (0.13)

    Add connect_args and authenticator to credentials in the yaml configuration. The value for authenticator can be externalbrowser or a valid Okta URL.

    config = """
        class_name: SimpleSqlalchemyDatasource
        credentials:
            drivername: snowflake
            username: ${snowflake_username}
            host: ${snowflake_host}
            database: TEST
            connect_args:
                authenticator: externalbrowser
            query:
                schema: KAGGLE_MOVIE_DATASET
        introspection:
            whole_table:
                data_asset_name_suffix: __whole_table
        """
    

    Note: This feature is still experimental, so please leave us a comment below if you run into any problems.

  7. Key pair Authentication for Experimental API (0.13)

    Add private_key_path and optional private_key_passphrase to credentials in the yaml configuration.

    • private_key_path must be set to the path of the private key used for authentication (e.g. ~/.ssh/my_snowflake.p8).

    • private_key_passphrase is the optional passphrase for the private key (e.g. mypass).

    config = """
        class_name: SimpleSqlalchemyDatasource
        credentials:
            drivername: snowflake
            username: ${snowflake_username}
            private_key_path: ~/.ssh/my_snowflake.p8
            private_key_passphrase: mypass
            host: ${snowflake_host}
            database: TEST
            query:
                schema: KAGGLE_MOVIE_DATASET
        introspection:
            whole_table:
                data_asset_name_suffix: __whole_table
        """
    

    Note: This feature is still experimental, so please leave us a comment below if you run into any problems.
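Regarding the version requirement in note 5, a quick numeric check of dotted version strings can be sketched with the standard library (at_least is a hypothetical helper; it ignores pre-release suffixes):

```python
def at_least(version: str, minimum: str) -> bool:
    # Compare dotted numeric versions, e.g. "2.2.8" -> (2, 2, 8).
    return tuple(map(int, version.split("."))) >= tuple(map(int, minimum.split(".")))

# For the installed connector you would compare its reported version:
# import snowflake.connector
# at_least(snowflake.connector.__version__, "2.2.8")
print(at_least("2.3.1", "2.2.8"))  # True
print(at_least("2.2.7", "2.2.8"))  # False
```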
