
How to Use Great Expectations in Databricks

Great Expectations works well with many types of Databricks workflows. This guide will help you run Great Expectations in Databricks.

Prerequisites: This how-to guide assumes you have:
  • Completed the Getting Started Tutorial
  • Completed Databricks setup, including a running Databricks cluster with an attached notebook
  • Set up DBFS, if you are using the file-based version of this guide

There are several ways to set up Databricks; this guide centers on an AWS deployment using Databricks Data Science & Engineering Notebooks and Jobs. If you use Databricks on GCP or Azure and any steps in this guide don't work for you, please reach out to us.

We will cover a simple configuration to get you up and running quickly, and link to our other guides for more customized configurations. For example:

  • If you want to validate files stored in DBFS, select one of the "File" tabs below. You can also watch our video walkthrough of these steps.
    • If you are using a different file store (e.g. S3, GCS, ABS), take a look at our how-to guides in the "Cloud" section of "Connecting to Your Data" for example configurations.
  • If you already have a Spark dataframe loaded, select one of the "Dataframe" tabs below.

This guide parallels the notebook workflows from the Great Expectations CLI, so you can optionally prototype your setup with a local sample batch before moving to Databricks. You can also reuse examples and code from the notebooks that the CLI generates; indeed, many of the examples that follow closely mirror those notebooks.

1. Install Great Expectations

Install Great Expectations as a notebook-scoped library by running the following command in your notebook:

  %pip install great-expectations
What is a notebook-scoped library?
A notebook-scoped library is exactly what it sounds like: a custom Python environment that is specific to a single notebook. You can also install a library at the cluster or workspace level. See the Databricks documentation on Libraries for more information.

After that, we will take care of some imports that will be used later. The imports you need depend on the configuration options you choose.
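For example, the dataframe-based workflow with an in-code Data Context used in the remainder of this guide needs roughly the following imports (a sketch, not an exhaustive list):

```python
from great_expectations.core.batch import RuntimeBatchRequest
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    FilesystemStoreBackendDefaults,
)
```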

2. Set up Great Expectations

In this guide, we will use the Databricks File Store (DBFS) for your Metadata Stores and Data Docs store. This is a simple way to get up and running within the Databricks environment without configuring external resources. For other storage options, see the "Metadata Stores" and "Data Docs" sections in the "How to Guides" for "Setting up Great Expectations."

What is DBFS?
Paraphrased from the Databricks docs: DBFS is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. Files on DBFS can be written and read as if they were on a local filesystem, just by adding the /dbfs/ prefix to the path. It is also persisted to object storage, so you won't lose data after you terminate a cluster. See the Databricks documentation for best practices, including mounting object stores.

Run the following code to set up a Data Context in code using the appropriate defaults:

What is an "in code" Data Context?
When you don't have easy access to a file system, instead of defining your Data Context via great_expectations.yml you can do so by instantiating a BaseDataContext with a config. Take a look at our how-to guide to learn more: How to instantiate a Data Context without a yml file. In Databricks you can do either, since you do have access to a filesystem; we've shown the in-code version here for simplicity.
What do we mean by "root_directory" in the below code?
The root_directory here refers to the directory that will hold the data for your Metadata Stores (e.g. Expectations Store, Validations Store, Data Docs Store). We are using the FilesystemStoreBackendDefaults since DBFS acts sufficiently like a filesystem that we can simplify our configuration with these defaults. These are all more configurable than is shown in this simple guide, so for other options please see our "Metadata Stores" and "Data Docs" sections in the "How to Guides" for "Setting up Great Expectations."
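Putting that together, here is a minimal sketch of an in-code Data Context backed by DBFS; the root_directory value is a placeholder you should replace with your own DBFS path:

```python
# Store Expectations, Validation Results, and Data Docs under a single DBFS root.
# The path below is a placeholder -- point it at whatever DBFS location you prefer.
root_directory = "/dbfs/great_expectations/"

data_context_config = DataContextConfig(
    store_backend_defaults=FilesystemStoreBackendDefaults(
        root_directory=root_directory
    ),
)
context = BaseDataContext(project_config=data_context_config)
```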

3. Prepare your data
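If you are following the dataframe path, preparing your data simply means loading it into a Spark dataframe. A minimal sketch, assuming a CSV file on DBFS (the path and options below are placeholders for your own data):

```python
# `spark` is the SparkSession that Databricks provides in every notebook.
# The path below is a placeholder -- substitute the location of your own data.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("/example/path/to/your_data.csv")
)
```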

4. Connect to your data
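One way to connect to an in-memory dataframe is to add a Datasource with a SparkDFExecutionEngine and a RuntimeDataConnector, then wrap the dataframe from the previous step in a RuntimeBatchRequest. The sketch below uses placeholder names throughout:

```python
# Register a Datasource that can accept in-memory Spark dataframes at runtime.
my_spark_datasource_config = {
    "name": "my_spark_datasource",  # placeholder name
    "class_name": "Datasource",
    "execution_engine": {"class_name": "SparkDFExecutionEngine"},
    "data_connectors": {
        "my_runtime_data_connector": {  # placeholder name
            "class_name": "RuntimeDataConnector",
            "batch_identifiers": ["run_id"],
        }
    },
}
context.add_datasource(**my_spark_datasource_config)

# Wrap the dataframe from the previous step in a RuntimeBatchRequest.
batch_request = RuntimeBatchRequest(
    datasource_name="my_spark_datasource",
    data_connector_name="my_runtime_data_connector",
    data_asset_name="my_data_asset",  # any meaningful name for this data
    batch_identifiers={"run_id": "manual_run"},
    runtime_parameters={"batch_data": df},
)
```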

🚀🚀 Congratulations! 🚀🚀 You successfully connected Great Expectations with your data.

Now let's keep going to create an Expectation Suite and validate our data.

5. Create Expectations

Here we will use a Validator to interact with our batch of data and generate an Expectation Suite.

Each time we evaluate an Expectation (e.g. via validator.expect_*), the Expectation configuration is stored in the Validator. When you have run all of the Expectations you want for this dataset, you can call validator.save_expectation_suite() to save all of your Expectation configurations into an Expectation Suite for later use in a checkpoint.
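As a sketch, assuming the batch_request from the previous step (the suite name and column names below are placeholders):

```python
# Create (or overwrite) an empty Expectation Suite and get a Validator for our batch.
expectation_suite_name = "my_expectation_suite"  # placeholder name
context.create_expectation_suite(
    expectation_suite_name=expectation_suite_name, overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name,
)

# Evaluate a couple of Expectations interactively; the column names are placeholders.
validator.expect_column_values_to_not_be_null(column="passenger_count")
validator.expect_column_values_to_be_between(
    column="fare_amount", min_value=0, max_value=1000
)

# Persist all accumulated Expectation configurations as an Expectation Suite.
validator.save_expectation_suite(discard_failed_expectations=False)
```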

This is the same method of interactive Expectation Suite editing used in the CLI interactive mode notebook accessed via great_expectations --v3-api suite new --interactive. For more information, see our documentation on How to create and edit Expectations with instant feedback from a sample Batch of data. You can also create Expectation Suites using a profiler to automatically create expectations based on your data or manually using domain knowledge and without inspecting data directly.

6. Validate your data
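A minimal sketch, assuming the batch_request and expectation_suite_name defined above and using a SimpleCheckpoint (the Checkpoint name and run name template are placeholders):

```python
# Add a SimpleCheckpoint, which bundles validation with sensible default actions
# (including updating Data Docs).
my_checkpoint_name = "my_databricks_checkpoint"  # placeholder name
checkpoint_config = {
    "name": my_checkpoint_name,
    "config_version": 1.0,
    "class_name": "SimpleCheckpoint",
    "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
}
my_checkpoint = context.add_checkpoint(**checkpoint_config)

# Run the Checkpoint against the batch and Expectation Suite defined above.
checkpoint_result = context.run_checkpoint(
    checkpoint_name=my_checkpoint_name,
    validations=[
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
)
```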

7. Build and view Data Docs

Since we used a SimpleCheckpoint, our Checkpoint already contained an UpdateDataDocsAction which rendered our Data Docs from the validation we just ran. That means our Data Docs store will contain a new rendered validation result.

How do I customize these actions?
Check out our docs on "Validating your data" for more info on how to customize your Checkpoints. Also, to see the full Checkpoint configuration, you can run: `print(my_checkpoint.get_substituted_config().to_yaml_str())`

Since we used DBFS for our Data Docs store, we need to download our Data Docs locally to view them. If you use a different store, you can host your Data Docs in a place where they can be accessed directly by your team. To learn more, see our documentation on Data Docs for other locations, e.g. filesystem, S3, GCS, ABS.

Run the following Databricks CLI command to download your data docs (replacing the paths as appropriate), then open the local copy of index.html to view your updated Data Docs:

databricks fs cp -r dbfs:/great_expectations/uncommitted/data_docs/local_site/ great_expectations/uncommitted/data_docs/local_site/

8. Congratulations!

You've successfully validated your data with Great Expectations using Databricks and viewed the resulting human-readable Data Docs. Check out our other guides for more customization options and happy validating!

View the full scripts used in this page on GitHub: