Host and share Data Docs
Data Docs translate Expectations, Validation Results, and other metadata into human-readable documentation. Automatically compiling your data documentation from your data tests in the form of Data Docs keeps your documentation current. Use the information provided here to host and share Data Docs stored on a filesystem or a Data Source.
- Amazon S3
- Microsoft Azure Blob Storage
- Google Cloud Storage
- Filesystem
Amazon S3
Host and share Data Docs on AWS S3.
Prerequisites
- A working deployment of Great Expectations
- The AWS CLI installed and configured with permissions to create and configure S3 buckets
Create an S3 bucket
In the AWS CLI, run the following command to create an S3 bucket in a specific AWS Region. Modify the bucket name and Region for your environment.
> aws s3api create-bucket --bucket data-docs.my_org --region us-east-1
{
"Location": "/data-docs.my_org"
}
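Note that for Regions other than us-east-1, the AWS CLI also requires an explicit location constraint. For example (eu-west-1 is used here purely as an illustration):
> aws s3api create-bucket --bucket data-docs.my_org --region eu-west-1 --create-bucket-configuration LocationConstraint=eu-west-1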
Configure your bucket policy
The example policy below enforces IP-based access. Modify the bucket name and IP addresses for your environment. After you have customized the example policy to suit your situation, name the file ip-policy.json and save it in your local directory.
Your policy should limit access to authorized users. Data Docs sites can include sensitive information and should not be publicly accessible.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Allow only based on source IP",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": [
                "arn:aws:s3:::data-docs.my_org",
                "arn:aws:s3:::data-docs.my_org/*"
            ],
            "Condition": {
                "IpAddress": {
                    "aws:SourceIp": [
                        "192.168.0.1/32",
                        "2001:db8:1234:1234::/64"
                    ]
                }
            }
        }
    ]
}
Because Data Docs include multiple generated pages, it is important to include the arn:aws:s3:::{your_data_docs_site}/* path in the Resource list along with the arn:aws:s3:::{your_data_docs_site} path that permits access to your Data Docs' front page.
Amazon Web Services' S3 buckets are a third-party utility. For more information about configuring AWS S3 bucket policies, see Using bucket policies.
Apply the policy
Run the following AWS CLI command to apply the policy:
> aws s3api put-bucket-policy --bucket data-docs.my_org --policy file://ip-policy.json
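To confirm that the policy is attached, you can optionally retrieve it with the following command (the bucket name matches the example above):
> aws s3api get-bucket-policy --bucket data-docs.my_org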
Add a new S3 site to great_expectations.yml
The following example shows the default local_site configuration that you will find in your great_expectations.yml file, followed by the s3_site configuration that you will need to add. To maintain a single S3 Data Docs site, remove the default local_site configuration and replace it with the new s3_site configuration.
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  s3_site: # this is a user-selected name - you may select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: '<your_bucket_name>' # UPDATE the bucket name here to match the bucket you configured above.
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
Test your configuration
Run the following code to build and open your newly configured S3 Data Docs site:
context.build_data_docs()
context.open_data_docs()
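These calls assume a Data Context object named context is already in scope. If it is not, a minimal sketch for a standalone script (assuming you run it from the project containing the great_expectations.yml edited above) looks like this:
import great_expectations as gx

# Load the Data Context from the current project directory
context = gx.get_context()

# Build every configured Data Docs site, including the new s3_site
context.build_data_docs()

# Open the generated documentation in a browser
context.open_data_docs()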
Additional notes
- Run the following command to update static hosting settings for your bucket so that AWS automatically serves your index.html file or a custom error file:
  > aws s3 website s3://data-docs.my_org/ --index-document index.html
- To host a Data Docs site in a subfolder of an S3 bucket, add the prefix property to the configuration snippet immediately after the bucket property (see the sketch after this list).
- To host a Data Docs site through a private DNS, you can configure a base_public_path for the Data Docs Store. The following example configures an S3 site with the base_public_path set to www.mydns.com. Data Docs will still be written to the configured location on S3 (for example, https://s3.amazonaws.com/data-docs.my_org/docs/index.html), but you can access the pages from your DNS (http://www.mydns.com/index.html in our example):
  data_docs_sites:
    s3_site: # this is a user-selected name - you may select your own
      class_name: SiteBuilder
      store_backend:
        class_name: TupleS3StoreBackend
        bucket: data-docs.my_org # UPDATE the bucket name here to match the bucket you configured above.
        base_public_path: http://www.mydns.com
      site_index_builder:
        class_name: DefaultSiteIndexBuilder
        show_cta_footer: true
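For reference, here is a sketch of the prefix option mentioned above. The docs/ value is an arbitrary example subfolder, so Data Docs would be written under s3://data-docs.my_org/docs/:
data_docs_sites:
  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: data-docs.my_org
      prefix: docs/ # hypothetical subfolder name - choose your own
    site_index_builder:
      class_name: DefaultSiteIndexBuilder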
Microsoft Azure Blob Storage
Host and share Data Docs on Azure Blob Storage. Data Docs are served using an Azure Blob Storage static website with restricted access.
Prerequisites
- A working deployment of Great Expectations
- Permissions to create and configure an Azure Storage account
Install Azure Storage Blobs client library for Python
Run the following pip command in a terminal to install the Azure Storage Blobs client library and its dependencies:
pip install azure-storage-blob
Create an Azure Blob Storage static website
1. Create a storage account.
2. In Settings, select Static website.
3. Select Enabled to enable static website hosting for the storage account.
4. Enter "index.html" in the Index document name field.
5. Record the Primary endpoint URL. Your team will use this URL to view the Data Docs. A container named $web is added to your storage account to help you map a custom domain to this endpoint.
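If you prefer a scripted setup, static website hosting can also be enabled with the Azure CLI. This is a sketch, with the storage account name as a placeholder:
> az storage blob service-properties update --account-name <YOUR-STORAGE-ACCOUNT-NAME> --static-website --index-document index.html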
Configure the config_variables.yml file
GX recommends storing Azure Storage credentials in the config_variables.yml file, which is located in the uncommitted/ folder by default and is not part of source control.
To review additional options for configuring the config_variables.yml file or additional environment variables, see Configure credentials.
1. Get the Connection string of the storage account you created.
2. Open the config_variables.yml file and then add the following entry:
   AZURE_STORAGE_CONNECTION_STRING: "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=<YOUR-STORAGE-ACCOUNT-NAME>;AccountKey=<YOUR-STORAGE-ACCOUNT-KEY==>"
Add a new Azure site to the data_docs_sites section of your great_expectations.yml
Open the great_expectations.yml file and add the following entry:
data_docs_sites:
  local_site:
    class_name: SiteBuilder
    show_how_to_buttons: true
    store_backend:
      class_name: TupleFilesystemStoreBackend
      base_directory: uncommitted/data_docs/local_site/
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
  new_site_name: # this is a user-selected name - you can select your own
    class_name: SiteBuilder
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: \$web
      connection_string: ${AZURE_STORAGE_CONNECTION_STRING}
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
Optional. To maintain a single Azure Data Docs site, remove the default local_site configuration and keep only the new Azure site.
Since the container is named $web, setting container: $web in great_expectations.yml would cause GX to unsuccessfully try to find the web variable in config_variables.yml. Use an escape character \ before the $ so that substitute_config_variable can locate the $web container.
You can also configure GX to store your Expectations and Validation Results in the Azure Storage account.
See Configure Expectation Stores and Configure Validation Result Stores. Make sure you set container: \$web correctly.
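As an illustration, a minimal sketch of an Expectations Store pointed at the same storage account might look like the following. The expectations prefix is an arbitrary choice to keep store files separate from the site, not a required name:
stores:
  expectations_store:
    class_name: ExpectationsStore
    store_backend:
      class_name: TupleAzureBlobStoreBackend
      container: \$web
      prefix: expectations # hypothetical prefix - choose your own
      connection_string: ${AZURE_STORAGE_CONNECTION_STRING}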
The following options are available:
- container: The name of the Azure Blob container to store your data in.
- connection_string: The Azure Storage connection string. This can also be supplied by setting the AZURE_STORAGE_CONNECTION_STRING environment variable.
- prefix: All paths on blob storage will be prefixed with this string.
- account_url: The URL to the blob storage account. Any other entities included in the URL path (e.g. container or blob) will be discarded. This URL can optionally be authenticated with a SAS token. It can only be used if you don't configure the connection_string. You can also configure this by setting the AZURE_STORAGE_ACCOUNT_URL environment variable.
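For example, a sketch of a store backend that relies on account_url and DefaultAzureCredential rather than a connection string; the storage account name is a placeholder:
store_backend:
  class_name: TupleAzureBlobStoreBackend
  container: \$web
  account_url: https://<YOUR-STORAGE-ACCOUNT-NAME>.blob.core.windows.net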
The following authentication methods are supported:
- SAS token authentication: append the SAS token to account_url or make sure it is set in the connection_string.
- Account key authentication: include the account key in the connection_string.
- When neither of the above authentication methods is specified, DefaultAzureCredential is used, which supports most common authentication methods. You still need to provide the account URL, either through the config file or the environment variable.
Build the Azure Blob Data Docs site
You can create or modify an Expectation Suite and this will build the Data Docs website.
Run the following Python code to build and open your Data Docs:
site_name = "new_site_name"
context.build_data_docs(site_names=site_name)
context.open_data_docs(site_name=site_name)