Data Docs compiles raw Great Expectations objects, including Expectations and Validations, into structured documents, such as HTML documentation, that display key characteristics of a dataset. Together, Data Docs, Profiling, and Validation are the three core services offered by GE.
Data Docs is implemented in the
HTML documentation takes expectation suites and validation results and produces clear, functional, and self-healing documentation of expected and observed data characteristics. Together with profiling, it can help to rapidly create a clearer picture of your data, and keep your entire team on the same page as data evolves.
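To make the idea of compiling validation results into HTML concrete, here is a minimal sketch. It is not GE's renderer: the function name render_validation_result is hypothetical, and the input dict is a simplified stand-in loosely modeled on a validation result (a list of results, each with an expectation_config and a success flag).

```python
from html import escape


def render_validation_result(result: dict) -> str:
    """Render a simplified validation-result dict as an HTML table.

    Hypothetical helper for illustration only; GE's own Data Docs
    use dedicated renderer and view classes instead.
    """
    header = "<tr><th>Expectation</th><th>Arguments</th><th>Status</th></tr>"
    rows = []
    for item in result["results"]:
        config = item["expectation_config"]
        # Show keyword arguments in a stable, readable order.
        kwargs = ", ".join(
            f"{key}={value!r}" for key, value in sorted(config["kwargs"].items())
        )
        status = "passed" if item["success"] else "failed"
        rows.append(
            "<tr>"
            f"<td>{escape(config['expectation_type'])}</td>"
            f"<td>{escape(kwargs)}</td>"
            f"<td>{status}</td>"
            "</tr>"
        )
    return "<table>\n" + "\n".join([header, *rows]) + "\n</table>"


# Assumed sample data: two expectations, one passing and one failing.
sample_result = {
    "results": [
        {
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "user_id"},
            },
            "success": True,
        },
        {
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "age", "min_value": 0, "max_value": 120},
            },
            "success": False,
        },
    ]
}

html_page = render_validation_result(sample_result)
```

Writing html_page to a file and opening it in a browser shows one row per expectation, with the expected characteristics (the arguments) next to the observed outcome (the status).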
For example, the default BasicDatasetProfiler in GE will produce validation_results which compile to a page for each table or DataFrame including an overview section:
And then detailed statistics for each column:
The GE DataContext uses a configurable “data documentation site” to define which artifacts to compile and how to render them as documentation. Multiple sites can be configured inside a project, each suitable for a particular data documentation use case.
For example, we have identified three common use cases for using documentation in a data project. They are to:
Visualize all Great Expectations artifacts in the local repo of a project as HTML: expectation suites, validation results and profiling results.
Maintain a “shared source of truth” for a team working on a data project. This documentation renders all the artifacts committed in the source control system (expectation suites and profiling results) and a continuously updating data quality report, built from a chronological list of validations by run id.
Share a spec of a dataset with a client or a partner. This is similar to API documentation in software development. This documentation would include profiling results of the dataset to give the reader a quick way to grasp what the data looks like, and one or more expectation suites that encode what is expected from the data to be considered valid.
To support these (and possibly other) use cases, GE has the concept of a “data documentation site”.
Here is an example of a site:
The behavior of a site is controlled by configuration in the DataContext’s great_expectations.yml file.
Users can specify:
which datasources to document (by default, all)
whether to include expectations, validations and profiling results sections
where the expectations and validations should be read from (filesystem, S3, or GCS)
where the HTML files should be written (filesystem, S3, or GCS)
which renderer and view class should be used to render each section
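The options above can be pictured as a small data model. The sketch below is an illustration in plain Python, not the schema of great_expectations.yml: the class names (StoreLocation, SectionConfig, SiteConfig), field names, renderer and view names, and paths are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List

SUPPORTED_BACKENDS = {"filesystem", "s3", "gcs"}


@dataclass
class StoreLocation:
    """Where artifacts are read from or HTML is written to (hypothetical)."""
    backend: str
    path: str

    def __post_init__(self) -> None:
        if self.backend not in SUPPORTED_BACKENDS:
            raise ValueError(f"unsupported backend: {self.backend}")


@dataclass
class SectionConfig:
    """Whether a section is built, and with which renderer and view."""
    enabled: bool = True
    renderer: str = "DefaultRenderer"  # placeholder name, not a GE class
    view: str = "DefaultView"          # placeholder name, not a GE class


def default_sections() -> Dict[str, SectionConfig]:
    return {
        "expectations": SectionConfig(),
        "validations": SectionConfig(),
        "profiling": SectionConfig(),
    }


@dataclass
class SiteConfig:
    """One data documentation site (illustrative model only)."""
    name: str
    source: StoreLocation  # where expectations and validations are read
    target: StoreLocation  # where the HTML files are written
    datasources: List[str] = field(default_factory=lambda: ["*"])  # "*" = all
    sections: Dict[str, SectionConfig] = field(default_factory=default_sections)

    def enabled_sections(self) -> List[str]:
        return [name for name, cfg in self.sections.items() if cfg.enabled]


# A local site that visualizes every artifact in the project.
local_site = SiteConfig(
    name="local_site",
    source=StoreLocation("filesystem", "expectations/"),
    target=StoreLocation("filesystem", "uncommitted/data_docs/local_site/"),
)

# A dataset spec for a client: no validations section, published to S3.
client_sections = default_sections()
client_sections["validations"].enabled = False
client_site = SiteConfig(
    name="client_spec",
    source=StoreLocation("filesystem", "expectations/"),
    target=StoreLocation("s3", "example-bucket/client-spec/"),
    datasources=["warehouse"],
    sections=client_sections,
)
```

Keeping each site as its own object mirrors the idea that several sites can coexist in one project, each tuned to a different audience.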