This guide will help you add an Athena instance (or a database) as a DatasourceProvides a standard API for accessing and interacting with data from a wide variety of source systems.. This will allow you to ValidateThe act of applying an Expectation Suite to a Batch. tables and queries within this instance. When you use an Athena Datasource, the validation is done in Athena itself. Your data is not downloaded.
Prerequisites: This how-to guide assumes you have:
1. Run the following CLI command to begin the interactive Datasource creation process:
great_expectations datasource new
When prompted to choose from the list of database engines, chose
2. Identify your connection string
In order for Great Expectations to connect to Athena, you will need to provide a connection string. To determine your connection string, reference the examples below and the PyAthena documentation.
The following urls don't include credentials as it is recommended to use either the instance profile or the boto3 configuration file.
If you want Great Expectations to connect to your Athena instance (without specifying a particular database), the URL should be:
Note the url parameter "s3_staging_dir" needed for storing query results in S3.
If you want Great Expectations to connect to a particular database inside your Athena, the URL should be:
credentials key uses a dictionary to provide the elements of your connection string as separate, individual values. For information on how to populate the
credentials dictionary and how to configure your
great_expectations.yml project config file to populate credentials from either a YAML file or a secret manager, please see our guide on How to configure credentials.
After providing your connection string, you will then be presented with a Jupyter Notebook.
3. Follow the steps in the Jupyter Notebook
The Jupyter Notebook will guide you through the remaining steps of creating a Datasource. Follow the steps in the presented notebook, including entering the connection string in the yaml configuration.
Environment variables can be used to store the SQLAlchemy URL instead of the file, if preferred - search documentation for "Managing Environment and Secrets".