How to instantiate a Data Context on Databricks Spark cluster

This guide will help you instantiate a Data Context on a Databricks Spark cluster.

The guide demonstrates the recommended path for instantiating a Data Context without a full configuration directory and without using the Great Expectations command line interface (CLI).

Prerequisites: This how-to guide assumes you have already:

  • Followed the Getting Started tutorial and have a basic familiarity with the Great Expectations configuration.

Steps

This how-to guide assumes that you are using a Databricks notebook and the Databricks File System (DBFS) as the Metadata Store and Data Docs store. DBFS is a file store that is native to Databricks clusters and notebooks. Files on DBFS can be written and read as if they were on a local filesystem, simply by adding the /dbfs/ prefix to the path. For information on how to configure Databricks for filesystems on Azure and AWS, please see the associated documentation in the Additional Notes section below.
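
As a quick illustration of the /dbfs/ prefix (the directory below is only a placeholder, not a required location), a file written under /dbfs/ is persisted to DBFS, while the same path without the prefix would land on the Spark driver node's local disk:

import os

# Hypothetical example path; use any DBFS location you have access to.
dbfs_path = "/dbfs/FileStore/example/hello.txt"
os.makedirs(os.path.dirname(dbfs_path), exist_ok=True)

# Written through the /dbfs/ mount, so the file is stored in DBFS rather than
# on the driver node's local filesystem.
with open(dbfs_path, "w") as f:
    f.write("stored on DBFS")

with open(dbfs_path) as f:
    print(f.read())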

  1. Install Great Expectations on your Databricks Spark cluster.

    Copy this code snippet into a cell in your Databricks Spark notebook and run it:

    dbutils.library.installPyPI("great_expectations")
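
    Depending on your Databricks Runtime version, a notebook-scoped pip install is an alternative (this assumes your runtime supports the %pip magic):

    %pip install great_expectations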
    
  2. Configure a Data Context in Memory.

The following snippet shows Python code that instantiates and configures a Data Context in memory. Copy this snippet into a cell in your Databricks Spark notebook, replace the TODO stubs with paths to your stores, and run.

Note

If you are using DBFS for your stores, prepend each base_directory with /dbfs/ as in the examples below, so that you write to DBFS rather than to the Spark driver node's local filesystem.

import great_expectations.exceptions as ge_exceptions
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import BaseDataContext

project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {
            "data_asset_type": {
                "class_name": "SparkDFDataset",
                "module_name": "great_expectations.dataset",
            },
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource",
            "batch_kwargs_generators": {},
        }
    },
    stores={
        "expectations_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "/dbfs/FileStore/expectations/",  # TODO: replace with the path to your Expectations Store on DBFS
            },
        },
        "validations_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "/dbfs/FileStore/validations/",  # TODO: replace with the path to your Validations Store on DBFS
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_store",
    validations_store_name="validations_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    data_docs_sites={
        "local_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleFilesystemStoreBackend",
                "base_directory": "/dbfs/FileStore/docs/",  # TODO: replace with the path to your Data Docs Store on DBFS
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                    "action": {"class_name": "StoreEvaluationParametersAction"},
                },
                {
                    "name": "update_data_docs",
                    "action": {"class_name": "UpdateDataDocsAction"},
                },
            ],
        }
    },
    anonymous_usage_statistics={"enabled": True},
)

context = BaseDataContext(project_config=project_config)

  3. Test your configuration.

    Execute the cell with the snippet above.

    Then copy this code snippet into a cell in your Databricks Spark notebook, run it and verify that no error is displayed:

    context.list_datasources()
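
    As an optional extra check (a sketch; the output location depends on the base_directory values you configured above), you can also build Data Docs and confirm that HTML files appear under your DBFS Data Docs path:

    # Builds the "local_site" configured above; with the example paths, the
    # site index should appear under /dbfs/FileStore/docs/ after this call.
    context.build_data_docs()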
    

Additional notes

  • If you’re continuing to work in a Databricks notebook, the following code snippet can be used to load a CSV file that lives in DBFS and run Expectations against it; a short follow-up sketch for validating the batch and building Data Docs appears after the snippet.

from great_expectations.data_context import BaseDataContext

file_location = "/FileStore/tables/dc_wikia_data.csv"
file_type = "csv"

# CSV options
infer_schema = "false"
first_row_is_header = "false"
delimiter = ","

# The applied options are for CSV files. For other file types, these will be ignored.
df = spark.read.format(file_type) \
  .option("inferSchema", infer_schema) \
  .option("header", first_row_is_header) \
  .option("sep", delimiter) \
  .load(file_location)

context = BaseDataContext(project_config=project_config)
context.create_expectation_suite("my_new_suite")

my_batch = context.get_batch({
   "dataset": df,
   "datasource": "my_local_datasource",
}, "my_new_suite")

my_batch.expect_table_row_count_to_equal(140)
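
To persist results from the snippet above, one possible follow-up (a sketch reusing the suite, batch, and validation operator names defined earlier in this guide) is to save the Expectation Suite, run the configured action_list_operator against the batch, and rebuild Data Docs:

# Sketch: persist the suite, validate the batch with the validation operator
# configured in project_config above, and refresh Data Docs so the results
# appear under the configured DBFS site.
context.save_expectation_suite(my_batch.get_expectation_suite(), "my_new_suite")
results = context.run_validation_operator(
    "action_list_operator", assets_to_validate=[my_batch]
)
context.build_data_docs()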

Additional resources