How to configure a MySQL Datasource

This guide shows how to connect to a MySql Datasource. Great Expectations uses SqlAlchemy to connect to MySQL.

Steps

Show Docs for Stable API (up to 0.12.x)

Prerequisites: This how-to guide assumes you have already:

  1. Install the required modules

    If you have not already done so, install required modules for connecting to MySql.

    pip install sqlalchemy
    pip install PyMySQL
    
  2. Run datasource new

    From the command line, run:

    great_expectations datasource new
    
  3. Choose “Relational database (SQL)”

    What data would you like Great Expectations to connect to?
        1. Files on a filesystem (for processing with Pandas or Spark)
        2. Relational database (SQL)
    : 2
    
  4. Choose MySQL

    Which database backend are you using?
        1. MySQL
        2. Postgres
        3. Redshift
        4. Snowflake
        5. BigQuery
        6. other - Do you have a working SQLAlchemy connection string?
    : 1
    
  5. Give your Datasource a name

    When prompted, provide a custom name for your Snowflake data source, or hit Enter to accept the default.

    Give your new Datasource a short name.
     [my_mysql_db]:
    
  6. Enter connection information

    When prompted, provide a custom name for your new Datasource, or hit Enter to accept the default.

    Next, we will configure database credentials and store them in the `my_mysql_db` section
    of this config file: great_expectations/uncommitted/config_variables.yml:
    
    What is the host for the MySQL connection? [localhost]:
    What is the port for the MySQL connection? [3306]:
    What is the username for the MySQL connection? []: root
    What is the password for the MySQL connection?:
    What is the database name for the MySQL connection? []: test_ci
    Attempting to connect to your database. This may take a moment...
    
  7. Save your new configuration

    Great Expectations will now add a new Datasource 'my_mysql_db' to your deployment, by adding this entry to your great_expectations.yml:
    
      my_mysql_db:
        credentials: ${my_mysql_db}
        data_asset_type:
          class_name: SqlAlchemyDataset
          module_name: great_expectations.dataset
        class_name: SqlAlchemyDatasource
        module_name: great_expectations.datasource
    
    The credentials will be saved in uncommitted/config_variables.yml under the key 'my_mysql_db'
    

Show Docs for Experimental API (0.13)

Prerequisites: This how-to guide assumes you have already:

To add a MySql datasource, do the following:

  1. Install the required modules

    If you have not already done so, install required modules for connecting to MySql.

    pip install sqlalchemy
    pip install PyMySQL
    
  2. Instantiate a DataContext

    import great_expectations as ge
    context = ge.get_context()
    
  3. Create or copy a yaml config

    Parameters can be set as strings, or passed in as environment variables. In the following example, a yaml config is configured for a SimpleSqlalchemyDatasource with associated credentials passed in as environment variables. SimpleSqlalchemyDatasource is a sub-class of Datasource that automatically configures a SqlDataConnector, and is one you will probably want to use in connecting data living in a sql database. More information on Datasources in GE 0.13 can found in Core Great Expectations Concepts document.

    This example also uses introspection to configure the datasource, where each table in the database is associated with its own data_asset. A deeper explanation on the different modes of building data_asset from data (introspective / inferred vs configured) can be found in the Core Great Expectations Concepts document.

    Also, additional examples of yaml configurations for various filesystems and databases can be found in the following document: How to configure DataContext components using test_yaml_config

    config = f"""
        class_name: SimpleSqlalchemyDatasource
        credentials:
            drivername: mysql+pymysql
            host: {host}
            port: {port}
            username: {user_name}
            password: {password}
            database: {db_name}
        introspection:
            whole_table:
                data_asset_name_suffix: __whole_table
        """
    
  4. Run context.test_yaml_config.

    context.test_yaml_config(
        name="mysql_datasource",
        yaml_config=my_config
    )
    

    When executed, test_yaml_config will instantiate the component and run through a self_check procedure to verify that the component works as expected.

    The resulting output will look something like this:

       Attempting to instantiate class from config...
       Instantiating as a DataSource, since class_name is SimpleSqlalchemyDatasource
       Successfully instantiated SimpleSqlalchemyDatasource
    
       Execution engine: SqlAlchemyExecutionEngine
       Data connectors:
           whole_table : InferredAssetSqlDataConnector
    
           Available data_asset_names (1 of 1):
               imdb_100k_main__whole_table (1 of 1): [{}]
    
           Unmatched data_references (0 of 0): []
    
           Choosing an example data reference...
               Reference chosen: {}
    
               Fetching batch data...
       [(58098,)]
    
               Showing 5 rows
          movieId                               title                                         genres
       0        1                    Toy Story (1995)  Adventure|Animation|Children|Comedy|Fantasy\r
       1        2                      Jumanji (1995)                   Adventure|Children|Fantasy\r
       2        3             Grumpier Old Men (1995)                               Comedy|Romance\r
       3        4            Waiting to Exhale (1995)                         Comedy|Drama|Romance\r
       4        5  Father of the Bride Part II (1995)                                       Comedy\r
    
    This means all has went well and you can proceed with exploring the data sets in your new MySql datasource.
    

    Note : In the current example, the yaml config will only create a connection to the datasource for the current session. After you exit python, the datasource and configuration will be gone. To make the datasource and configuration persistent, please add information to great_expectations.yml in your great_expectations/ directory.

Additional notes

Comments