Configure Data Docs
Data Docs translate Expectations, Validation Results, and other metadata into human-readable documentation that is saved as static web pages. Because Data Docs are compiled automatically from your data tests, your documentation stays current. This guide covers how to configure additional locations where Data Docs should be created.
Prerequisites:
- Python version 3.9 to 3.11.
- An installation of GX Core.
- A preconfigured File Data Context. This guide assumes the variable `context` contains your Data Context.
To host Data Docs in an environment other than a local or networked filesystem, you will also need to install the appropriate dependencies and configure access credentials accordingly:
- Optional. An installation of GX Core with support for Amazon S3 dependencies and credentials configured.
- Optional. An installation of GX Core with support for Google Cloud Storage dependencies and credentials configured.
- Optional. An installation of GX Core with support for Azure Blob Storage dependencies and credentials configured.
Procedure
- Instructions
- Sample code
- Define a configuration dictionary for your new Data Docs site.
The main component that requires customization in a Data Docs site configuration is its `store_backend`. The `store_backend` is a dictionary that tells GX where the Data Docs site will be hosted and how to access that location when the site is updated.
The specifics of the `store_backend` depend on the environment in which the Data Docs will be created. GX Core supports generating Data Docs in local or networked filesystems, Amazon S3, Google Cloud Storage, and Azure Blob Storage.
To create a Data Docs site configuration, select one of the following environments and follow the corresponding instructions.
- Filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
A local or networked filesystem Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For a local or networked filesystem this will be `TupleFilesystemStoreBackend`.
- `base_directory`: A path to the folder where the static sites should be created. This can be an absolute path, or a path relative to the root folder of the Data Context.
To define a Data Docs site configuration for a local or networked filesystem environment, update the value of `base_directory` in the following code and execute it:
Python
base_directory = "uncommitted/data_docs/local_site/"  # This is the default path (relative to the root folder of the Data Context) but can be changed as required.
site_config = {
    "class_name": "SiteBuilder",
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": base_directory,
    },
}
An Amazon S3 Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Amazon S3 this will be `TupleS3StoreBackend`.
- `bucket`: The name of the Amazon S3 bucket that will host the Data Docs site.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the Amazon S3 bucket. The combination of `bucket` and `prefix` must be unique across all Stores used by a Data Context.
- `boto3_options`: The credentials for accessing your Amazon S3 account. Amazon S3 supports multiple methods of providing credentials, such as use of an endpoint URL, access key, or role assignment. For more information on how to configure your Amazon S3 credentials, see Amazon's documentation for how to Configure the AWS CLI.
The `boto3_options` dictionary can contain the following keys, depending on how you have configured your credentials in the AWS CLI (an access-key example is sketched after this list):
- `endpoint_url`: An AWS endpoint for service requests. Using this also requires `region_name` to be included in the `boto3_options`.
- `region_name`: The AWS region to send requests to. This must be included in the `boto3_options` if `endpoint_url` or `assume_role_arn` are used.
- `aws_access_key_id`: An AWS access key associated with an IAM account. Using this also requires `aws_secret_access_key` to be provided.
- `aws_secret_access_key`: The secret key associated with the access key. This is required if your `boto3_options` use the `aws_access_key_id` key, and can be considered the "password" for the access key specified by `aws_access_key_id`.
- `aws_session_token`: The value of the session token you retrieve directly from AWS STS operations when using temporary credentials.
- `assume_role_arn`: The Amazon Resource Name (ARN) of an IAM role with your access credentials. Using this also requires `assume_role_duration` to be included in the `boto3_options`.
- `assume_role_duration`: The duration of your session, measured in seconds. This is required if your `boto3_options` use the `assume_role_arn` key.
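For example, a minimal sketch of a `boto3_options` dictionary that authenticates with an access key rather than an endpoint URL might look like the following. The environment variable names are assumptions, and the values are pulled in through the same string substitution shown in the example below:
Python
# Hypothetical access-key-based boto3_options. AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY are assumed environment variable names; the values
# are resolved through GX's string substitution at runtime.
boto3_options = {
    "aws_access_key_id": "${AWS_ACCESS_KEY_ID}",
    "aws_secret_access_key": "${AWS_SECRET_ACCESS_KEY}",
}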
To define a Data Docs site configuration for S3, update `bucket`, `prefix`, and `boto3_options` in the following code and execute it:
Python
bucket = "my_s3_bucket"
prefix = "data_docs/"
boto3_options = {
    "endpoint_url": "${S3_ENDPOINT}",  # Uses string substitution to get the endpoint URL from the S3_ENDPOINT environment variable.
    "region_name": "<your>",  # Use the name of your AWS region.
}
site_config = {
    "class_name": "SiteBuilder",
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    "store_backend": {
        "class_name": "TupleS3StoreBackend",
        "bucket": bucket,
        "prefix": prefix,
        "boto3_options": boto3_options,
    },
}
An Azure Blob Storage Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Azure Blob Storage this will be `TupleAzureBlobStoreBackend`.
- `container`: The name of the Azure Blob Storage container that will host the Data Docs site.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the Azure Blob Storage container. The combination of `container` and `prefix` must be unique across all Stores used by a Data Context.
- `connection_string`: The connection string for your Azure Blob Storage. For more information on how to securely store your connection string, see Configure credentials.
To define a Data Docs site configuration in Azure Blob Storage, update the values of `container`, `prefix`, and `connection_string` in the following code and execute it:
Python
container = "my_abs_container"
prefix = "data_docs/"
connection_string = "${AZURE_STORAGE_CONNECTION_STRING}"  # This uses string substitution to get the actual connection string from an environment variable or config file.
site_config = {
    "class_name": "SiteBuilder",
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    "store_backend": {
        "class_name": "TupleAzureBlobStoreBackend",
        "container": container,
        "prefix": prefix,
        "connection_string": connection_string,
    },
}
A Google Cloud Storage Data Docs site requires the following `store_backend` information:
- `class_name`: The name of the class to implement for accessing the target environment. For Google Cloud Storage this will be `TupleGCSStoreBackend`.
- `project`: The name of the GCS project that will host the Data Docs site.
- `bucket`: The name of the bucket that will contain the Data Docs pages.
- `prefix`: The path of the folder that will contain the Data Docs pages relative to the root of the GCS bucket. The combination of `bucket` and `prefix` must be unique across all Stores used by a Data Context.
To define a Data Docs site configuration for Google Cloud Storage, update the values of `project`, `bucket`, and `prefix` in the following code and execute it:
Python
project = "my_project"
bucket = "my_gcs_bucket"
prefix = "data_docs_site/"
site_config = {
    "class_name": "SiteBuilder",
    "site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
    "store_backend": {
        "class_name": "TupleGCSStoreBackend",
        "project": project,
        "bucket": bucket,
        "prefix": prefix,
    },
}
For GX to access your Google Cloud Storage environment, you will also need to configure the appropriate credentials. By default, GCS credentials are handled through the gcloud command line tool and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable. The gcloud command line tool is used to set up authentication credentials, and the `GOOGLE_APPLICATION_CREDENTIALS` environment variable provides the path to the JSON file with those credentials.
For more information on using the gcloud command line tool, see Google Cloud's Cloud Storage client libraries documentation.
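If you prefer to set the credentials path from within Python rather than in your shell, a minimal sketch looks like the following; the key file path is a placeholder:
Python
import os

# Point GOOGLE_APPLICATION_CREDENTIALS at a service account key file before
# creating the Data Context or building Data Docs. The path is hypothetical.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service_account_key.json"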
- Add your configuration to your Data Context.
All Data Docs sites have a unique name within a Data Context. Once your Data Docs site configuration has been defined, add it to the Data Context by updating the value of `site_name` in the following code to something more descriptive and then executing it:
Python
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
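If you later need to modify a site that has already been added, the Data Context also provides an `update_data_docs_site` method that accepts the same arguments; this is a sketch and assumes the method is available in your version of GX Core:
Python
# Sketch: apply a revised configuration to an existing Data Docs site.
# Assumes update_data_docs_site is available on your Data Context.
context.update_data_docs_site(site_name=site_name, site_config=site_config)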
- Optional. Build your Data Docs sites manually.
You can manually build a Data Docs site by executing the following code:
Python
context.build_data_docs(site_names=site_name)
- Optional. Automate Data Docs site updates with Checkpoint Actions.
You can automate the creation and update of Data Docs sites by including the `UpdateDataDocsAction` in your Checkpoints. This Action will automatically trigger a Data Docs site build whenever the Checkpoint it is included in completes its `run()` method.
Python
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
    gx.checkpoint.actions.UpdateDataDocsAction(
        name="update_my_site", site_names=[site_name]
    )
]
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=[validation_definition],
        actions=actions,
    )
)
result = checkpoint.run()
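After the Checkpoint runs, you can check whether its validations passed by inspecting the returned result object; a minimal example:
Python
# The Checkpoint result reports whether all validations in the run passed.
print(result.success)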
- Optional. View your Data Docs.
Once your Data Docs have been created, you can view them with:
Python
context.open_data_docs()
GX Core supports Data Docs configurations for the following environments:
- Filesystem
- Amazon S3
- Azure Blob Storage
- Google Cloud Storage
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs site configuration dictionary
base_directory = "uncommitted/data_docs/local_site/" # this is the default path (relative to the root folder of the Data Context) but can be changed as required
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleFilesystemStoreBackend",
"base_directory": base_directory,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
    gx.checkpoint.actions.UpdateDataDocsAction(
        name="update_my_site", site_names=[site_name]
    )
]
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=[validation_definition],
        actions=actions,
    )
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
import great_expectations as gx
context = gx.get_context(mode="file")
# Build Data Docs configuration dictionary
bucket = "my_s3_bucket"
prefix = "data_docs/"
boto3_options = {
"endpoint_url": "${S3_ENDPOINT}", # Uses string substitution to get the endpoint url form the S3_ENDPOINT environment variable.
"region_name": "<your>", # Use the name of your AWS region.
}
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleS3StoreBackend",
"bucket": bucket,
"prefix": prefix,
"boto3_options": boto3_options,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
    gx.checkpoint.actions.UpdateDataDocsAction(
        name="update_my_site", site_names=[site_name]
    )
]
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=[validation_definition],
        actions=actions,
    )
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs configuration dictionary
container = "my_abs_container"
prefix = "data_docs/"
connection_string = "${AZURE_STORAGE_CONNECTION_STRING}" # This uses string substitution to get the actual connection string from an environment variable or config file.
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleAzureBlobStoreBackend",
"container": container,
"prefix": prefix,
"connection_string": connection_string,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
    gx.checkpoint.actions.UpdateDataDocsAction(
        name="update_my_site", site_names=[site_name]
    )
]
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=[validation_definition],
        actions=actions,
    )
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()
import great_expectations as gx
context = gx.get_context(mode="file")
# Define a Data Docs configuration dictionary
project = "my_project"
bucket = "my_gcs_bucket"
prefix = "data_docs_site/"
site_config = {
"class_name": "SiteBuilder",
"site_index_builder": {"class_name": "DefaultSiteIndexBuilder"},
"store_backend": {
"class_name": "TupleGCSStoreBackend",
"project": project,
"bucket": bucket,
"prefix": prefix,
},
}
# Add the Data Docs configuration to the Data Context
site_name = "my_data_docs_site"
context.add_data_docs_site(site_name=site_name, site_config=site_config)
# Manually build the Data Docs
context.build_data_docs(site_names=site_name)
# Automate Data Docs updates with a Checkpoint Action
checkpoint_name = "my_checkpoint"
validation_definition_name = "my_validation_definition"
validation_definition = context.validation_definitions.get(validation_definition_name)
actions = [
    gx.checkpoint.actions.UpdateDataDocsAction(
        name="update_my_site", site_names=[site_name]
    )
]
checkpoint = context.checkpoints.add(
    gx.Checkpoint(
        name=checkpoint_name,
        validation_definitions=[validation_definition],
        actions=actions,
    )
)
result = checkpoint.run()
# View the Data Docs
context.open_data_docs()