Version: 0.18.17

Use Great Expectations with Amazon Web Services using Athena

Use the information provided here to learn how to use Great Expectations (GX) with AWS and cloud storage. You'll configure a local GX project to store Expectations, Validation Results, and Data Docs in Amazon S3 buckets. You'll also configure Great Expectations to access data stored in an Athena database.

Prerequisites

    Ensure that the AWS CLI is ready for use

    1. Run the following code to verify that the AWS CLI is installed:

      Terminal command
      aws --version

      If this code does not return the AWS CLI version information, you may need to install the AWS CLI or troubleshoot your current installation. See Install or update the latest version of the AWS CLI.

    2. Run the following command in the AWS CLI to verify that your AWS credentials are properly configured:

      Terminal command
      aws sts get-caller-identity

      When your credentials are properly configured, your UserId, Account, and Arn are returned. If your credentials are not configured correctly, an error message appears. If you received an error message, or you couldn't verify your credentials, see Configuring the AWS CLI.

    Prepare a local installation of Great Expectations

    1. Run the following code to check what version of Python is currently installed:

      Terminal command
      python --version

      Great Expectations supports Python versions 3.8 to 3.11. If a Python 3 version number is not returned, run the following code:

      Terminal command
      python3 --version

      If you do not have Python 3 installed, go to python.org for the current downloads and installation guidance.

    2. After you have confirmed that Python 3 is installed locally, you can create a virtual environment with venv before installing your packages with pip. The following examples use venv for virtual environments because it is included with Python 3. You can use alternate tools such as virtualenv and pyenv to install GX in virtual environments.

      Run one of the following code blocks to create your virtual environment:

      Terminal command
      python -m venv my_venv

      or

      Terminal command
      python3 -m venv my_venv

      A new directory named my_venv containing your virtual environment is created.

      Run the following code to activate the virtual environment:

      Terminal command
      source my_venv/bin/activate
      tip

      To change the name of your virtual environment, replace my_venv in the example code.

    3. After you've activated your virtual environment, you should ensure that you have the latest version of pip installed. Pip is a tool that is used to easily install Python packages.

      Run the following code to ensure that you have the latest version of pip installed:

      Terminal command
      python -m ensurepip --upgrade

      or

      Terminal command
      python3 -m ensurepip --upgrade
    4. Python interacts with AWS through the boto3 library. Great Expectations makes use of this library in the background when working with AWS. Although you won't use boto3 directly, you'll need to install it in your virtual environment.

      Run one of the following pip commands to install boto3 in your virtual environment:

      Terminal command
      python -m pip install boto3

      or

      Terminal command
      python3 -m pip install boto3

      To set up boto3 with AWS, and use boto3 within Python, see the Boto3 documentation.

    5. Run one of the following code blocks to use pip to install Great Expectations:

      Terminal command
      python -m pip install great_expectations

      or

      Terminal command
      python3 -m pip install great_expectations
    6. Run the following code to confirm the GX installation is working:

      Terminal command
      great_expectations --version

      Version information similar to the following is returned:

      Terminal output
      great_expectations, version 0.18.9
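
    Optionally, you can also confirm from Python that boto3 can reach AWS with the credentials you configured in the Prerequisites. This is a minimal sketch that mirrors the aws sts get-caller-identity check:

    Python
    import boto3

    # Mirrors `aws sts get-caller-identity`: prints the account ID and ARN
    # associated with your configured credentials.
    sts = boto3.client("sts")
    identity = sts.get_caller_identity()
    print(identity["Account"], identity["Arn"])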

    Create your Data Context

    These steps assume you have an empty folder in which to initialize the Filesystem Data Context. For example:

    Python code
    path_to_empty_folder = '/my_gx_project/'

    Provide the path to the empty folder as the project_root_dir parameter of the GX library's FileDataContext.create(...) method. When you provide the path to the empty folder, the Filesystem Data Context is initialized in that location.

    For convenience, the FileDataContext.create(...) method instantiates and returns the initialized Data Context, which you can keep in a Python variable. For example:

    Python code
    from great_expectations.data_context import FileDataContext

    context = FileDataContext.create(project_root_dir=path_to_empty_folder)
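
    On later runs, after the Data Context exists on disk, you can reload it instead of recreating it. A minimal sketch, assuming the same folder path as above:

    Python
    import great_expectations as gx

    # Reload the existing Filesystem Data Context from the same project folder.
    context = gx.get_context(project_root_dir=path_to_empty_folder)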

    Configure your Expectations Store on Amazon S3

    Your Expectation Store (a connector for storing and retrieving information about collections of verifiable assertions about data) configuration is in your Data Context (the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components).

    The following section in your Data Context's great_expectations.yml file tells Great Expectations to look for Expectations in a Store named expectations_store:

    YAML
    stores:
      expectations_store:
        class_name: ExpectationsStore
        store_backend:
          class_name: TupleFilesystemStoreBackend
          base_directory: expectations/

    expectations_store_name: expectations_store

    The default base_directory for expectations_store is expectations/.

    1. To manually add an Expectations Store to your configuration, add the following configuration to the stores section of your great_expectations.yml file:

      YAML
      stores:
        expectations_S3_store:
          class_name: ExpectationsStore
          store_backend:
            class_name: TupleS3StoreBackend
            bucket: '<your_s3_bucket_name>'
            prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores

      expectations_store_name: expectations_S3_store

      Change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Expectations are located.

      The following example shows the additional options that are available to customize TupleS3StoreBackend:

      File contents: great_expectations.yml
      class_name: ExpectationsStore
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: '<your_s3_bucket_name>'
        prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores
        boto3_options:
          endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
          region_name: '<your_aws_region_name>'

      In the previous example, the Store name is expectations_S3_store. If you use a personalized Store name, you must also update the value of the expectations_store_name key to match the Store name. For example:

      File contents: great_expectations.yml
      expectations_store_name: expectations_S3_store

      When you update the expectations_store_name key value, Great Expectations uses the new Store for Expectations.

      Add the following code to great_expectations.yml to configure the IAM user:

      File contents: great_expectations.yml
      class_name: ExpectationsStore
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: '<your_s3_bucket_name>'
        prefix: '<your_s3_bucket_folder_name>'
        boto3_options:
          aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
          aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
          aws_session_token: ${AWS_SESSION_TOKEN}
      Add the following code to great_expectations.yml to configure the IAM Assume Role:

      File contents: great_expectations.yml
      class_name: ExpectationsStore
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: '<your_s3_bucket_name>'
        prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores
        boto3_options:
          assume_role_arn: '<your_role_to_assume>'
          region_name: '<your_aws_region_name>'
          assume_role_duration: session_duration_in_seconds
      caution

      If you're storing Validation Results in S3 or Data Docs in S3, make sure that the prefix values are disjoint and one is not a substring of the other.

    2. Run the following code to verify that your Stores are properly configured:

      Terminal command
      great_expectations store list

      A list of configured Stores that Great Expectations can access appears. If you added a new S3 Expectations Store, the output should include the following ExpectationsStore entry:

      Terminal output
      - name: expectations_S3_store
        class_name: ExpectationsStore
        store_backend:
          class_name: TupleS3StoreBackend
          bucket: '<your_s3_bucket_name>'
          prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores

      Only one Expectation Store is listed. Your configuration contains the original expectations_store on the local filesystem and the expectations_S3_store you configured, but the great_expectations store list command lists only your active stores. For your Expectation Store, this is the one you set as the value of the expectations_store_name key in great_expectations.yml.

    3. If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Expectations saved that you want to transfer to your S3 bucket.

      Run the following aws s3 sync command to copy Expectations into Amazon S3:

      Terminal command
      aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'

      The base_directory is set to expectations/ by default.

      In the following example, the Expectations exp1 and exp2 are copied to Amazon S3 and a confirmation message is returned:

      Terminal output
      upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
      upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json
    4. If you copied your existing Expectation Suites to the S3 bucket, run the following Python code to confirm that Great Expectations can find them:

      Python
      import great_expectations as gx

      context = gx.get_context()
      context.list_expectation_suite_names()

      The Expectations you copied to S3 are returned as a list. Expectations that weren't copied to the new Store aren't listed.
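
    If you want to double-check the copy from step 3 outside of Great Expectations, the following minimal boto3 sketch lists the objects under your Store's prefix. The bucket and prefix values are placeholders for the ones in your store_backend configuration:

    Python
    import boto3

    # Placeholders: use the bucket and prefix from your store_backend configuration.
    bucket = "<your_s3_bucket_name>"
    prefix = "<your_s3_bucket_folder_name>"

    s3 = boto3.client("s3")
    response = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    for obj in response.get("Contents", []):
        print(obj["Key"])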

    Configure your Validation Results Store on Amazon S3

    Your Validation Results Store (a connector for storing and retrieving information about objects generated when data is Validated against an Expectation Suite) configuration is in your Data Context.

    The following section in your Data Context's great_expectations.yml file tells Great Expectations to look for Validation Results in a Store named validations_store. It also creates a ValidationsStore named validations_store that is backed by a Filesystem and stores Validation Results under the base_directory uncommitted/validations (the default).

    YAML
    stores:
      validations_store:
        class_name: ValidationsStore
        store_backend:
          class_name: TupleFilesystemStoreBackend
          base_directory: uncommitted/validations/

    validations_store_name: validations_store
    1. To manually add a Validation Results Store, add the following configuration to the stores section of your great_expectations.yml file:

      YAML
      stores:
        validations_S3_store:
          class_name: ValidationsStore
          store_backend:
            class_name: TupleS3StoreBackend
            bucket: '<your_s3_bucket_name>'
            prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores

      As shown in the previous example, you need to change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Validation Results are located.

      The following example shows the additional options that are available to customize TupleS3StoreBackend:

      File contents: great_expectations.yml
      class_name: ValidationsStore
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: '<your_s3_bucket_name>'
        prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores
        boto3_options:
          endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
          region_name: '<your_aws_region_name>'

      In the previous example, the Store name is validations_S3_store. If you use a personalized Store name, you must also update the value of the validations_store_name key to match the Store name. For example:

      File contents: great_expectations.yml
      validations_store_name: validations_S3_store

      When you update the validations_store_name key value, Great Expectations uses the new Store for Validation Results.

      Add the following code to great_expectations.yml to configure the IAM user:

      File contents: great_expectations.yml
      class_name: ValidationsStore
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: '<your_s3_bucket_name>'
        prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores
        boto3_options:
          aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
          aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
          aws_session_token: ${AWS_SESSION_TOKEN}

      Add the following code to great_expectations.yml to configure the IAM Assume Role:

      File contents: great_expectations.yml
      class_name: ValidationsStore
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: '<your_s3_bucket_name>'
        prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores
        boto3_options:
          assume_role_arn: '<your_role_to_assume>'
          region_name: '<your_aws_region_name>'
          assume_role_duration: session_duration_in_seconds
      caution

      If you are also storing Expectations in S3 (see How to configure an Expectation store to use Amazon S3) or Data Docs in S3 (see How to host and share Data Docs), make sure the prefix values are disjoint and one is not a substring of the other.

    2. To make Great Expectations look for Validation Results in an S3 bucket, set the validations_store_name variable in great_expectations.yml to the name of your S3 Validations Store, as shown in the following example:

      File contents: great_expectations.yml
      validations_store_name: validations_S3_store

    3. If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Validation Results saved that you want to transfer to your S3 bucket.

      To copy Validation Results into Amazon S3, use the aws s3 sync command as shown in the following example:

      Terminal input
      aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'

      The base_directory is set to uncommitted/validations/ by default.

      In the following example, the Validation Results val1 and val2 are copied to Amazon S3 and a confirmation message is returned:

      Terminal output
      upload: uncommitted/validations/val1/val1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val1.json
      upload: uncommitted/validations/val2/val2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val2.json
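
    As a quick programmatic check, you can confirm which Validation Results Store is active. Assuming you followed the examples above, this should print validations_S3_store:

    Python
    import great_expectations as gx

    context = gx.get_context()
    # Prints the value of the validations_store_name key from great_expectations.yml.
    print(context.validations_store_name)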

    Configure Data Docs for hosting and sharing from Amazon S3

    1. In the AWS CLI, run the following command to create an S3 bucket in a specific region. Modify the bucket name and region for your environment.

      Terminal input
      > aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
      {
          "Location": "/data-docs.my_org"
      }
    2. The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your environment. After you have customized the example policy to suit your situation, name the file ip-policy.json and save it in your local directory.

      caution

      Your policy should limit access to authorized users. Data Docs sites can include sensitive information and should not be publicly accessible.

      File content: ip-policy.json
      {
        "Version": "2012-10-17",
        "Statement": [
          {
            "Sid": "Allow only based on source IP",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": [
              "arn:aws:s3:::data-docs.my_org",
              "arn:aws:s3:::data-docs.my_org/*"
            ],
            "Condition": {
              "IpAddress": {
                "aws:SourceIp": [
                  "192.168.0.1/32",
                  "2001:db8:1234:1234::/64"
                ]
              }
            }
          }
        ]
      }
      tip

      Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.

      REMINDER

      Amazon S3 buckets are a third-party service. For more information about configuring AWS S3 bucket policies, see Using bucket policies.

    3. Run the following AWS CLI command to apply the policy:

      Terminal input
      > aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
    4. The following example shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. To maintain a single S3 Data Docs site, remove the default local_site configuration and replace it with the new s3_site configuration.

      YAML
      data_docs_sites:
        local_site:
          class_name: SiteBuilder
          show_how_to_buttons: true
          store_backend:
            class_name: TupleFilesystemStoreBackend
            base_directory: uncommitted/data_docs/local_site/
          site_index_builder:
            class_name: DefaultSiteIndexBuilder
        S3_site: # this is a user-selected name - you may select your own
          class_name: SiteBuilder
          store_backend:
            class_name: TupleS3StoreBackend
            bucket: '<your_s3_bucket_name>'
          site_index_builder:
            class_name: DefaultSiteIndexBuilder
    5. Run the following code to build your newly configured S3 Data Docs site:

      Python
      context.build_data_docs()
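
      After building, you can optionally open the site in your default browser from the same session:

      Python
      # Opens the most recently built Data Docs site.
      context.open_data_docs()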

    Optional settings

    • Run the following code to update static hosting settings for your bucket to enable AWS to automatically serve your index.html file or a custom error file:

      Terminal input
      > aws s3 website s3://data-docs.my_org/ --index-document index.html
    • To host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet immediately after the bucket property.

    • To host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store (a connector for storing and retrieving Data Docs: human-readable documentation generated from Great Expectations metadata detailing Expectations, Validation Results, and so on). The following example configures an S3 site with base_public_path set to www.mydns.com. Data Docs are still written to the configured location on S3 (for example, https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you can access the pages from your DNS (http://www.mydns.com/index.html in this example).

      YAML
      data_docs_sites:
        s3_site: # this is a user-selected name - you may select your own
          class_name: SiteBuilder
          store_backend:
            class_name: TupleS3StoreBackend
            bucket: data-docs.my_org # UPDATE the bucket name here to match the bucket you configured above.
            base_public_path: http://www.mydns.com
          site_index_builder:
            class_name: DefaultSiteIndexBuilder
            show_cta_footer: true

    Connect to data

    Previously, you were required to manually edit configuration files when you added configurations for Amazon S3 buckets. Now, it is recommended that you use the Great Expectations Python API in a Python interpreter to set up your Data Source configurations. When you use this methodology, you'll receive immediate feedback when you run your code. Alternatively, you can use a Python script in the integrated development environment (IDE) of your choice.

    1. Run the following code to import the packages and modules:

      Python
      import great_expectations as gx

      Run the following code to load your DataContext:

      Python
      context = gx.get_context()
    2. To connect Great Expectations to Athena, you need to provide a connection string. To determine your connection string, reference the following examples and the PyAthena documentation.

      caution

      As of this writing, Great Expectations is not compatible with SQLAlchemy version 2 or greater. We recommend using the latest non-version-2 release.

      The following URLs don't include credentials; it is recommended that you use either an instance profile or the boto3 configuration file.

      To connect Great Expectations to your Athena instance (without specifying a particular database), the URL is:

      Connection string
      awsathena+rest://@athena.{region}.amazonaws.com/?s3_staging_dir={s3_path}

      Note the URL parameter s3_staging_dir, which is needed for storing query results in S3.

      To connect Great Expectations to a specific Athena database, the URL is:

      Connection string
      awsathena+rest://@athena.{region}.amazonaws.com/{database}?s3_staging_dir={s3_path}
      Tip: Using credentials instead of connection_string

      The credentials key uses a dictionary to provide the elements of your connection string as separate, individual values. For information on how to populate the credentials dictionary and how to configure your great_expectations.yml project config file to populate credentials from either a YAML file or a secret manager, see Configure credentials.
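
      As a minimal sketch of assembling the connection string in Python (the region, database, and staging-directory values here are hypothetical placeholders; substitute your own):

      Python
      # Hypothetical values; replace with your own region, database, and staging directory.
      region = "us-east-1"
      database = "my_athena_db"
      s3_path = "s3://my-athena-query-results/gx/"

      connection_string = (
          f"awsathena+rest://@athena.{region}.amazonaws.com/"
          f"{database}?s3_staging_dir={s3_path}"
      )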

    3. To configure a SQL Data Source, see Connect to a SQL database Data Source.

    4. Run the following code to connect to Athena, add an asset for your table, and build a batch request for your table:

      Python
      # `connection_string` is the Athena connection string you determined in step 2.
      athena_source = context.sources.add_or_update_sql(
          "my_awsathena_datasource", connection_string=connection_string
      )
      athena_table = athena_source.add_table_asset("taxitable", table_name="taxitable")

      batch_request = athena_table.build_batch_request()

      Run the following code to prepare an empty Expectation suite:

      Python
      expectation_suite_name = "my_awsathena_expectation_suite"
      suite = context.add_or_update_expectation_suite(
          expectation_suite_name=expectation_suite_name
      )

      Run the following code to load data into a Validator:

      Python
      validator = context.get_validator(
          batch_request=batch_request,
          expectation_suite_name=expectation_suite_name,
      )
      validator.head(n_rows=5, fetch_all=False)

      When the code executes successfully, it indicates that your Data Source is working.

    Create Expectations

    There are many Expectations available for you to use. The following examples use the Validator you defined earlier to create two of them:

    Python
    validator.expect_column_values_to_not_be_null(column="passenger_count")
    validator.expect_column_values_to_be_between(
        column="congestion_surcharge", min_value=0, max_value=1000
    )

    Each time you evaluate an Expectation with validator.expect_*, the Expectation is immediately Validated against your provided Batch of data. This instant feedback helps you identify unexpected data quickly. The Expectation configuration is stored in the Expectation Suite you provided when the Validator was initialized.
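
    Each validator.expect_* call also returns its result object directly, so you can inspect the outcome programmatically. A small sketch reusing the first Expectation above:

    Python
    # The returned object reports whether the Batch met the Expectation.
    result = validator.expect_column_values_to_not_be_null(column="passenger_count")
    print(result.success)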

    To find out more about the available Expectations, see the Expectations Gallery.

    When you have run all of the Expectations you want for this dataset, you can call validator.save_expectation_suite() to save the Expectation Suite (all of the unique Expectation Configurations from each run of validator.expect_*) for later use in a Checkpoint.

    Python
    validator.save_expectation_suite(discard_failed_expectations=False)

    Validate data

    To validate and run post-validation Actions (Python classes with a run method that take a Validation Result and do something with it), you create and store a Checkpoint (the primary means for validating data in a production deployment of Great Expectations) for your Batch.

    Checkpoints can be preconfigured with a Batch Request and Expectation Suite, or they can take them in as parameters at runtime. They can also execute numerous Actions based on the Validation Results that are returned when the Checkpoint runs.

    tip

    To preconfigure a Checkpoint with a Batch Request and Expectation Suite, see Manage Checkpoints.

    1. Run the following code to create the Checkpoint configuration:

      Python
      checkpoint = context.add_or_update_checkpoint(
          name="my_checkpoint",
          validations=[
              {
                  "batch_request": batch_request,
                  "expectation_suite_name": expectation_suite_name,
              },
          ],
      )
    2. Run the following code to add the Checkpoint to your Data Context:

      Python
      context.add_or_update_checkpoint(checkpoint=checkpoint)
    3. Run the following code to run the Checkpoint:

      Python
      checkpoint_result = checkpoint.run()
    4. The Checkpoint contains an UpdateDataDocsAction, which renders Data Docs from the generated Validation Results. The Data Docs store contains a new entry for the rendered Validation Result.

      tip

      For more information on Actions that Checkpoints can perform and how to add them, see Configure Actions.

      Run the following code to view the new entry for the rendered Validation Result:

      Python
      context.open_data_docs()
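
      If you want to check the outcome programmatically instead of through Data Docs, the Checkpoint result object reports overall success. A minimal sketch:

      Python
      # True only if every Validation in the Checkpoint run succeeded.
      print(checkpoint_result.success)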