Version: 0.18.17

Use Great Expectations with Amazon Web Services using Redshift

Great Expectations can work within many frameworks. In this guide you will be shown a workflow for using Great Expectations with AWS and cloud storage. You will configure a local Great Expectations project to store Expectations, Validation Results, and Data Docs in Amazon S3 buckets. You will further configure Great Expectations to access data from a Redshift database.

This guide will demonstrate each of the steps necessary to go from installing a new instance of Great Expectations to Validating your data for the first time and viewing your Validation Results as Data Docs.

Prerequisites

An installation of Python, version 3.8 to 3.11. To download and install Python, see Python downloads.
The AWS CLI. To download and install the AWS CLI, see Installing or updating the latest version of the AWS CLI.
AWS credentials. See Configuring the AWS CLI.
Permissions to install the Python packages (boto3 and great_expectations) with pip.
An S3 bucket and prefix to store Expectations and Validation Results.

Steps

Part 1: Setup

1.1 Ensure that the AWS CLI is ready for use

1.1.1 Verify that the AWS CLI is installed

Run the following code to verify that the AWS CLI is installed:

Terminal command

aws --version

If this command does not return the AWS CLI version information, you may need to install the AWS CLI or troubleshoot your current installation. See Install or update the latest version of the AWS CLI.

1.1.2 Verify that your AWS credentials are properly configured

Run the following command in the AWS CLI to verify that your AWS credentials are properly configured:

Terminal command

aws sts get-caller-identity

When your credentials are properly configured, your UserId, Account, and Arn are returned. If your credentials are not configured correctly, an error message appears. If you received an error message, or you couldn't verify your credentials, see Configuring the AWS CLI.

1.2 Prepare a local installation of Great Expectations

1.2.1 Verify that your Python version meets requirements

Run the following code to check what version of Python is currently installed:

Terminal command

python --version

Great Expectations supports Python versions 3.8 to 3.11. If a Python 3 version number is not returned, run the following code:

Terminal command

python3 --version

If you do not have Python 3 installed, go to python.org for the current downloads and installation guidance.
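
The same check can be made from inside Python. The following is a minimal sketch, assuming the supported range of 3.8 to 3.11 stated above; the helper name is_supported_python is hypothetical and is not part of Great Expectations.

```python
import sys
from typing import Tuple

# Supported range taken from this guide (Python 3.8 to 3.11, inclusive).
MIN_SUPPORTED = (3, 8)
MAX_SUPPORTED = (3, 11)


def is_supported_python(version: Tuple[int, int]) -> bool:
    """Return True when a (major, minor) pair falls inside the supported range."""
    return MIN_SUPPORTED <= version <= MAX_SUPPORTED


# Check the interpreter that is running this script.
current = (sys.version_info.major, sys.version_info.minor)
print(f"Python {current[0]}.{current[1]} supported: {is_supported_python(current)}")
```

Running this with the interpreter you plan to use for the project confirms that python or python3 points at a supported version.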

1.2.2 Create a virtual environment for your Great Expectations project

After you have confirmed that Python 3 is installed locally, you can create a virtual environment with venv before installing your packages with pip. The following examples use venv for virtual environments because it is included with Python 3. You can use alternate tools such as virtualenv and pyenv to install GX in virtual environments.

Run one of the following code blocks to create your virtual environment:

Terminal command

python -m venv my_venv

Terminal command

python3 -m venv my_venv

A new directory named my_venv is created in your current working directory. It contains your virtual environment.

Run the following code to activate the virtual environment:

Terminal command

source my_venv/bin/activate

tip

To change the name of your virtual environment, replace my_venv in the example code.
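
To confirm that the activation worked, you can ask the interpreter whether it is running inside a virtual environment. This is a small sketch that relies on the standard sys.prefix and sys.base_prefix attributes; the function name in_virtual_environment is hypothetical, and the check is intended for environments created with venv.

```python
import sys


def in_virtual_environment() -> bool:
    """Return True when the interpreter runs inside a venv-style environment.

    Inside a venv, sys.prefix points at the environment directory while
    sys.base_prefix still points at the base Python installation.
    """
    return sys.prefix != sys.base_prefix


print(f"Virtual environment active: {in_virtual_environment()}")
print(f"Interpreter prefix: {sys.prefix}")
```

If the first line prints False after you ran the activate command, the shell is probably still using a different Python interpreter.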

1.2.3 Ensure you have the latest version of pip

After you've activated your virtual environment, you should ensure that you have the latest version of pip installed. Pip is a tool that is used to easily install Python packages.

Run one of the following code blocks to ensure that you have the latest version of pip installed:

Terminal command

python -m ensurepip --upgrade

Terminal command

python3 -m ensurepip --upgrade

1.2.4 Install boto3

Python interacts with AWS through the boto3 library. Great Expectations makes use of this library in the background when working with AWS. Although you won't use boto3 directly, you'll need to install it in your virtual environment.

Run one of the following pip commands to install boto3 in your virtual environment:

Terminal command

python -m pip install boto3

Terminal command

python3 -m pip install boto3

To set up boto3 with AWS, and use boto3 within Python, see the Boto3 documentation.
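
Once boto3 is installed, the credential check from step 1.1.2 can also be made from Python. The sketch below assumes that boto3 can find the same credentials as the AWS CLI; the helper names summarize_caller_identity and default_sts_client are hypothetical. boto3 is imported lazily so the summary helper can be exercised with any object that exposes get_caller_identity().

```python
from typing import Any, Dict


def summarize_caller_identity(sts_client: Any) -> Dict[str, str]:
    """Return the UserId, Account, and Arn reported by an STS client."""
    response = sts_client.get_caller_identity()
    return {key: response[key] for key in ("UserId", "Account", "Arn")}


def default_sts_client() -> Any:
    """Create a real STS client using the default boto3 credential chain."""
    import boto3  # imported here so the module loads even without boto3

    return boto3.client("sts")


# Example usage (requires valid AWS credentials and network access):
# print(summarize_caller_identity(default_sts_client()))
```

If the call raises a credentials error, revisit Configuring the AWS CLI before continuing.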

1.2.5 Install Great Expectations

Run one of the following code blocks to use pip to install Great Expectations:

Terminal command

python -m pip install great_expectations

Terminal command

python3 -m pip install great_expectations

1.2.6 Verify that Great Expectations installed successfully

Run the following code to confirm the GX installation is working:

Terminal command

great_expectations --version

Version information similar to the following is returned:

Terminal output

great_expectations, version 0.18.9
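
If the great_expectations command is not on your PATH, you can read the installed version from package metadata instead. This is a sketch using the standard library importlib.metadata module; the helper name installed_version is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# Report the packages this guide has installed so far.
for name in ("great_expectations", "boto3"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A result of "not installed" usually means that pip installed the package into a different environment than the one running the script.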

1.2.7 Install additional dependencies for Redshift

To connect to your Redshift database, Great Expectations requires additional dependencies. You can install them with pip by running the following from your terminal:

Terminal input
pip install sqlalchemy sqlalchemy-redshift psycopg2

# or if on macOS:
pip install sqlalchemy sqlalchemy-redshift psycopg2-binary

caution

As of this writing, Great Expectations is not compatible with SQLAlchemy version 2 or greater. We recommend using the latest non-version-2 release.
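
One way to confirm that the installed SQLAlchemy release is below version 2 is to inspect its major version number. The following is a minimal sketch; the helper name is_pre_v2 is hypothetical, and the check only looks at the leading component of the version string.

```python
from importlib import metadata


def is_pre_v2(version: str) -> bool:
    """Return True when the major component of a version string is below 2."""
    major = int(version.split(".", 1)[0])
    return major < 2


try:
    found = metadata.version("sqlalchemy")
    print(f"SQLAlchemy {found}: compatible (pre-2.0) = {is_pre_v2(found)}")
except metadata.PackageNotFoundError:
    print("SQLAlchemy is not installed in this environment")
```

If the check reports an incompatible release, pip accepts a version specifier such as pip install "sqlalchemy<2" to pin an earlier release.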

1.3 Create your Data Context

The simplest way to create a new Data Context is by using the create() method.

From a notebook or script where you want to deploy Great Expectations, run the following command. Here, full_path_to_project_directory can be an empty directory where you intend to build your Great Expectations configuration:

Python
import great_expectations as gx

context = gx.data_context.FileDataContext.create(full_path_to_project_directory)

1.4 Configure your Expectations Store on Amazon S3

1.4.1 Identify your Data Context Expectations Store

Your Expectation Store configuration is in your Data Context.

YAML
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: expectations/

expectations_store_name: expectations_store

The default base_directory for expectations_store is expectations/.

1.4.2 Update your configuration file to include a new Store for Expectations on Amazon S3

To manually add an Expectations Store to your configuration, add the following configuration to the stores section of your great_expectations.yml file:

YAML
stores:
  expectations_S3_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your>'
      prefix: '<your>'  # Bucket and prefix in combination must be unique across all stores

expectations_store_name: expectations_S3_store

Change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Expectations are located.

The following example shows the additional options that are available to customize TupleS3StoreBackend:

File contents: great_expectations.yml
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'

In the previous example, the Store name is expectations_S3_store. If you use a personalized Store name, you must also update the value of the expectations_store_name key to match the Store name. For example:

File contents: great_expectations.yml
expectations_store_name: expectations_S3_store

When you update the expectations_store_name key value, Great Expectations uses the new Store for Expectations.

Add the following code to great_expectations.yml to configure the IAM user:

File contents: great_expectations.yml
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  
  boto3_options:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    aws_session_token: ${AWS_SESSION_TOKEN}
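
Because the configuration reads credentials from environment variables, it can help to confirm that they are set before loading the Data Context. The sketch below uses the standard AWS variable names AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN; the helper name missing_variables is hypothetical. The session token is normally needed only for temporary credentials, so adjust the list for your setup.

```python
import os
from typing import Iterable, List, Mapping

# Standard AWS credential variables; drop AWS_SESSION_TOKEN if you use
# long-lived credentials rather than temporary ones.
REQUIRED = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN")


def missing_variables(names: Iterable[str], environ: Mapping[str, str]) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]


missing = missing_variables(REQUIRED, os.environ)
if missing:
    print("Unset environment variables: " + ", ".join(missing))
else:
    print("All credential variables are set")
```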

Add the following code to great_expectations.yml to configure the IAM Assume Role:

File contents: great_expectations.yml
class_name: ExpectationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    assume_role_arn: '<your_role_to_assume>'
    region_name: '<your_aws_region_name>'
    assume_role_duration: session_duration_in_seconds

caution

If you're storing Validations in S3 or DataDocs in S3, make sure that the prefix values are disjoint and one is not a substring of the other.
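
The substring rule can be checked mechanically before you edit great_expectations.yml. This is a minimal sketch, assuming all Stores share one bucket; the function name overlapping_prefixes and the example prefixes are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def overlapping_prefixes(prefixes: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of store names whose prefixes contain one another."""
    clashes = []
    for (name_a, prefix_a), (name_b, prefix_b) in combinations(prefixes.items(), 2):
        # A clash exists when either prefix appears inside the other.
        if prefix_a in prefix_b or prefix_b in prefix_a:
            clashes.append((name_a, name_b))
    return clashes


# Hypothetical prefixes for three Stores in the same bucket.
planned = {
    "expectations": "gx/expectations",
    "validations": "gx/validations",
    "data_docs": "gx/docs",
}
print(overlapping_prefixes(planned))
```

An empty list means that no prefix is contained in another. An empty prefix is reported as clashing with every other prefix, because the empty string is a substring of any string.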

1.4.3 (Optional) Copy existing Expectation JSON files to the Amazon S3 bucket

If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Expectations saved that you want to transfer to your S3 bucket.

Run the following aws s3 sync command to copy Expectations into Amazon S3:

Terminal command

aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'

The base_directory is set to expectations/ by default.
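
Before running the sync, you may want to preview which object keys the local files would map to. The following sketch only approximates that mapping by keeping each file's path relative to the base directory; it does not upload anything and is not a replacement for aws s3 sync. The function name plan_uploads is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def plan_uploads(base_directory: str, prefix: str) -> List[Tuple[str, str]]:
    """Pair each local file under base_directory with its intended S3 key."""
    base = Path(base_directory)
    if not base.is_dir():
        return []
    clean_prefix = prefix.strip("/")
    plan = []
    for path in sorted(base.rglob("*")):
        if not path.is_file():
            continue
        # Keep the path relative to the base directory, using "/" separators.
        relative = path.relative_to(base).as_posix()
        key = f"{clean_prefix}/{relative}" if clean_prefix else relative
        plan.append((str(path), key))
    return plan


# Preview the default Expectations directory with a placeholder prefix.
for local_path, key in plan_uploads("expectations", "<your_s3_bucket_folder_name>"):
    print(f"{local_path} -> {key}")
```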

In the following example, the Expectations exp1 and exp2 are copied to Amazon S3 and a confirmation message is returned:

Terminal output

upload: ./exp1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp1.json
upload: ./exp2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/exp2.json

1.4.4 (Optional) Verify that copied Expectations can be accessed from Amazon S3

If you copied your existing Expectation Suites to the S3 bucket, run the following Python code to confirm that Great Expectations can find them:

Python
import great_expectations as gx

context = gx.get_context()
context.list_expectation_suite_names()

The Expectations you copied to S3 are returned as a list. Expectations that weren't copied to the new Store aren't listed.

1.5 Configure your Validation Results Store on Amazon S3

1.5.1 Identify your Data Context's Validation Results Store

Your Validation Results Store configuration is in your Data Context.

The following section in your Data Context great_expectations.yml file tells Great Expectations to look for Validation Results in a Store named validations_store. It also creates a ValidationsStore named validations_store that is backed by a Filesystem and stores Validation Results under the base_directory uncommitted/validations (the default).

YAML
stores:
  validations_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/validations/

validations_store_name: validations_store

1.5.2 Update your configuration file to include a new Store for Validation Results on Amazon S3

To manually add a Validation Results Store, add the following configuration to the stores section of your great_expectations.yml file:

YAML
stores:
  validations_S3_store:
    class_name: ValidationsStore
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your>'
      prefix: '<your>'  # Bucket and prefix in combination must be unique across all stores

As shown in the previous example, you need to change the default store_backend settings to make the Store work with S3. The class_name is set to TupleS3StoreBackend, bucket is the address of your S3 bucket, and prefix is the folder in your S3 bucket where Validation Results are located.

The following example shows the additional options that are available to customize TupleS3StoreBackend:

File contents: great_expectations.yml
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>'  # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    endpoint_url: ${S3_ENDPOINT} # Uses the S3_ENDPOINT environment variable to determine which endpoint to use.
    region_name: '<your_aws_region_name>'

In the previous example, the Store name is validations_S3_store. If you use a personalized Store name, you must also update the value of the validations_store_name key to match the Store name. For example:

YAML
validations_store_name: validations_S3_store

When you update the validations_store_name key value, Great Expectations uses the new Store for Validation Results.

Add the following code to great_expectations.yml to configure the IAM user:

File contents: great_expectations.yml
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    aws_access_key_id: ${AWS_ACCESS_KEY_ID} # Uses the AWS_ACCESS_KEY_ID environment variable to get aws_access_key_id.
    aws_secret_access_key: ${AWS_SECRET_ACCESS_KEY}
    aws_session_token: ${AWS_SESSION_TOKEN}

Add the following code to great_expectations.yml to configure the IAM Assume Role:

File contents: great_expectations.yml
class_name: ValidationsStore
store_backend:
  class_name: TupleS3StoreBackend
  bucket: '<your_s3_bucket_name>'
  prefix: '<your_s3_bucket_folder_name>' # Bucket and prefix in combination must be unique across all stores
  boto3_options:
    assume_role_arn: '<your_role_to_assume>'
    region_name: '<your_aws_region_name>'
    assume_role_duration: session_duration_in_seconds

caution

If you are also storing Expectations in S3 (see How to configure an Expectation store to use Amazon S3) or Data Docs in S3 (see How to host and share Data Docs), make sure that the prefix values are disjoint and one is not a substring of the other.

1.5.3 (Optional) Copy existing Validation results to the Amazon S3 bucket

If you are converting an existing local Great Expectations deployment to one that works in AWS, you might have Validation Results saved that you want to transfer to your S3 bucket.

To copy Validation Results into Amazon S3, use the aws s3 sync command as shown in the following example:

Terminal input

aws s3 sync '<base_directory>' s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'

The base_directory is set to uncommitted/validations/ by default.

In the following example, the Validation Results val1 and val2 are copied to Amazon S3 and a confirmation message is returned:

Terminal output

upload: uncommitted/validations/val1/val1.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val1.json
upload: uncommitted/validations/val2/val2.json to s3://'<your_s3_bucket_name>'/'<your_s3_bucket_folder_name>'/val2.json

1.6 Configure Data Docs for hosting and sharing from Amazon S3

1.6.1 Create an Amazon S3 bucket for your Data Docs

In the AWS CLI, run the following command to create an S3 bucket configured for a specific location. Modify the bucket name and region for your environment.

Terminal input
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
    "Location": "/data-docs.my_org"
}

1.6.2 Configure your bucket policy to enable appropriate access

The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your environment. After you have customized the example policy to suit your situation, name the file ip-policy.json and save it in your local directory.

caution

Your policy should limit access to authorized users. Data Docs sites can include sensitive information and should not be publicly accessible.

File content: ip-policy.json
  {
    "Version": "2012-10-17",
    "Statement": [{
      "Sid": "Allow only based on source IP",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": [
        "arn:aws:s3:::data-docs.my_org",
        "arn:aws:s3:::data-docs.my_org/*"
      ],
      "Condition": {
        "IpAddress": {
          "aws:SourceIp": [
            "192.168.0.1/32",
            "2001:db8:1234:1234::/64"
          ]
        }
      }
    }
    ]
  }

tip

Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
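
If you prefer to generate the policy rather than edit it by hand, the document above can be built in Python. This is a sketch that mirrors the example policy; the function name build_ip_policy is hypothetical, and the bucket name and CIDR ranges are the placeholder values from this guide. Each range is validated with the standard ipaddress module before it is added.

```python
import ipaddress
import json
from typing import Any, Dict, List


def build_ip_policy(bucket: str, allowed_cidrs: List[str]) -> Dict[str, Any]:
    """Build a bucket policy that allows s3:GetObject only from the given ranges."""
    for cidr in allowed_cidrs:
        # Raises ValueError for anything that is not a valid IPv4/IPv6 network.
        ipaddress.ip_network(cidr, strict=False)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Allow only based on source IP",
                "Effect": "Allow",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"IpAddress": {"aws:SourceIp": list(allowed_cidrs)}},
            }
        ],
    }


policy = build_ip_policy(
    "data-docs.my_org", ["192.168.0.1/32", "2001:db8:1234:1234::/64"]
)
print(json.dumps(policy, indent=2))
```

Saving the printed JSON as ip-policy.json produces the file that the next step applies to the bucket.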

REMINDER

Amazon Web Service's S3 buckets are a third party utility. For more information about configuring AWS S3 bucket policies, see Using bucket policies.

1.6.3 Apply the access policy to your Data Docs' Amazon S3 bucket

Run the following AWS CLI command to apply the policy:

Terminal input

> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json

1.6.4 Add a new Amazon S3 site to the `data_docs_sites` section of your `great_expectations.yml`

The following example shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. To maintain a single S3 Data Docs site, remove the default local_site configuration and replace it with the new s3_site configuration.

YAML
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your>'
    site_index_builder:
      class_name: DefaultSiteIndexBuilder

1.6.5 Test that your Data Docs configuration is correct by building the site

Run the following code to build your newly configured S3 Data Docs site:

Python
context.build_data_docs()

Additional notes on hosting Data Docs from an Amazon S3 bucket

Run the following code to update static hosting settings for your bucket to enable AWS to automatically serve your index.html file or a custom error file:
Terminal input
```
> aws s3 website s3://data-docs.my_org/ --index-document index.html
```
To host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet immediately after the bucket property.
To host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store. The following example configures an S3 site with the base_public_path set to www.mydns.com. Data Docs are still written to the configured location on S3 (for example, https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you can access the pages from your DNS (http://www.mydns.com/index.html in this example).
YAML
```
data_docs_sites:
  s3_site:  # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org  # UPDATE the bucket name here to match the bucket you configured above.
      base_public_path: http://www.mydns.com
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
      show_cta_footer: true
```

Part 2: Connect to data

2.1 Instantiate your project's DataContext

Python
import great_expectations as gx

context = gx.data_context.FileDataContext.create(full_path_to_project_directory)

If you have already instantiated your DataContext in a previous step, this step can be skipped.

2.1.1 Determine your connection string

For this guide we will use a connection_string like this:

Connection string

redshift+psycopg2://<USER_NAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE>?sslmode=<SSLMODE>

Note: Depending on your Redshift cluster configuration, you may or may not need the sslmode parameter. For more details, please refer to Amazon's documentation for configuring security options on Amazon Redshift.
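
Passwords that contain characters such as @ or / need to be URL-encoded before they are placed in the connection string. The following is a minimal sketch that assembles the string from its parts; the function name build_redshift_url and all example values are hypothetical, and 5439 is shown only because it is the default Redshift port.

```python
from typing import Optional
from urllib.parse import quote_plus


def build_redshift_url(
    user: str,
    password: str,
    host: str,
    port: int,
    database: str,
    sslmode: Optional[str] = None,
) -> str:
    """Assemble a redshift+psycopg2 connection string with escaped credentials."""
    url = (
        f"redshift+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )
    if sslmode:
        # Append sslmode only when the cluster configuration requires it.
        url += f"?sslmode={sslmode}"
    return url


# Example values only; replace them with your own cluster details.
print(
    build_redshift_url(
        "gx_user", "p@ss/word", "example.redshift.amazonaws.com", 5439, "dev", "prefer"
    )
)
```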

Is there a more secure way to store my credentials than plain text in a connection string?

We recommend that database credentials be stored in the config_variables.yml file, which is located in the uncommitted/ folder by default, and is not part of source control.

For additional options on configuring the config_variables.yml file or additional environment variables, please see our guide on how to configure credentials.

2.2 Add Data Source to your DataContext

Creating a Redshift Data Source is as simple as providing the add_or_update_sql(...) method a name by which to reference it in the future and the connection_string with which to access it.

Python
datasource_name = "my_redshift_datasource"
connection_string = "redshift+psycopg2://<user_name>:<password>@<host>:<port>/<database>?sslmode=<sslmode>"

With these two values, we can create our Data Source:

Python
datasource = context.sources.add_or_update_sql(
    name=datasource_name,
    connection_string=connection_string,
)

2.3. Connect to a specific set of data with a Data Asset

Now that our Data Source has been created, we will use it to connect to a specific set of data in the database it is configured for. This is done by defining a Data Asset in the Data Source. A Data Source may contain multiple Data Assets, each of which will serve as the interface between GX and the specific set of data it has been configured for.

With SQL databases, there are two types of Data Assets that can be used. The first is a Table Data Asset, which connects GX to the data contained in a single table in the source database. The other is a Query Data Asset, which connects GX to the data returned by a SQL query. We will demonstrate how to create both of these in the following steps.

How many Data Assets can my Data Source contain?

Although there is no set maximum number of Data Assets you can define for a Data Source, there is a functional minimum. In order for GX to retrieve data from your Data Source you will need to create at least one Data Asset.

We will indicate a table to connect to with a Table Data Asset. This is done by providing the add_table_asset(...) method a name by which we will reference the Data Asset in the future and a table_name to specify the table we wish the Data Asset to connect to.

Python
table_asset = datasource.add_table_asset(name="my_table_asset", table_name="taxi_data")

To indicate the query that provides the data to connect to, we will define a Query Data Asset. This is done by providing the add_query_asset(...) method a name by which we will reference the Data Asset in the future and a query that returns the data we wish the Data Asset to connect to.

Python
query_asset = datasource.add_query_asset(
    name="my_query_asset", query="SELECT * from taxi_data"
)

2.4 Test your new Data Source

Verify your new Data Source by loading data from it into a Validator using a Batch Request.

Python
request = table_asset.build_batch_request()

context.add_or_update_expectation_suite(expectation_suite_name="test_suite")

validator = context.get_validator(
    batch_request=request, expectation_suite_name="test_suite"
)

print(validator.head())

Part 3: Create Expectations

3.1: Prepare a Batch Request, empty Expectation Suite, and Validator

When we tested our Data Source in step 2.4: Test your new Data Source, we also created all of the components we need to begin creating Expectations: a Batch Request to provide sample data we can test our new Expectations against, an empty Expectation Suite to contain our new Expectations, and a Validator to create those Expectations with.

We can reuse those components now. Alternatively, you may follow the same process that we did before and define a new Batch Request, Expectation Suite, and Validator if you wish to use a different Batch of data as the reference sample when you are creating Expectations or if you wish to use a different name than test_suite for your Expectation Suite.

3.2: Use a Validator to add Expectations to the Expectation Suite

There are many Expectations available for you to use. To demonstrate the creation of an Expectation through the use of the Validator you defined earlier, here are examples of the process for two of them:

Python
validator.expect_column_values_to_not_be_null(column="passenger_count")
validator.expect_column_values_to_be_between(
    column="congestion_surcharge", min_value=0, max_value=1000
)

Each time you evaluate an Expectation with validator.expect_*, the Expectation is immediately Validated against your provided Batch of data. This instant feedback helps you identify unexpected data quickly. The Expectation configuration is stored in the Expectation Suite you provided when the Validator was initialized.

To find out more about the available Expectations, see the Expectations Gallery.

3.3: Save the Expectation Suite

When you have run all of the Expectations you want for this dataset, you can call validator.save_expectation_suite() to save the Expectation Suite (all of the unique Expectation Configurations from each run of validator.expect_*) for later use in a Checkpoint.

Python
validator.save_expectation_suite(discard_failed_expectations=False)

Part 4: Validate Data

4.1: Create and run a Checkpoint

To validate your data and run post-validation Actions, you create and store a Checkpoint for your Batch.

Checkpoints can be preconfigured with a Batch Request and Expectation Suite, or they can take them in as parameters at runtime. They can also execute numerous Actions based on the Validation Results that are returned when the Checkpoint runs.

tip

To preconfigure a Checkpoint with a Batch Request and Expectation Suite, see Manage Checkpoints.

4.1.1 Create a Checkpoint

Run the following code to create the Checkpoint:

Python
checkpoint = context.add_or_update_checkpoint(
    name="my_checkpoint",
    validations=[{"batch_request": request, "expectation_suite_name": "test_suite"}],
)

The Checkpoint you created is named my_checkpoint. It includes a Validation that uses the Batch Request you created earlier and the Expectation Suite test_suite, which contains two Expectations.

4.1.2 Run the Checkpoint

Run the following code to run the Checkpoint:

Python
checkpoint_result = checkpoint.run()
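
In a script or scheduled job, you will usually want to act on whether the run passed. The sketch below assumes only that the returned result exposes a boolean success attribute, as the Checkpoint result in this version does; the helper name require_success is hypothetical.

```python
from typing import Any


def require_success(checkpoint_result: Any) -> None:
    """Raise an error when a Checkpoint run reports that validation failed."""
    if not checkpoint_result.success:
        raise RuntimeError(
            "Checkpoint validation failed; review the Data Docs for details."
        )


# Example usage after running the Checkpoint:
# require_success(checkpoint_result)
```

Raising an exception makes a failed validation stop the surrounding pipeline instead of passing silently.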

4.2: Build and view Data Docs

The Checkpoint contains an UpdateDataDocsAction, which renders the Data Docs from the generated Validation Results. The Data Docs Store contains a new entry for the rendered Validation Result.

tip

For more information on Actions that Checkpoints can perform and how to add them, see Configure Actions.

Run the following code to view the new entry for the rendered Validation Result:

Python
context.open_data_docs()
