How to instantiate a Data Context on an EMR Spark cluster

This guide will help you instantiate a Data Context on an EMR Spark cluster.

The guide demonstrates the recommended path for instantiating a Data Context without a full configuration directory and without using the Great Expectations command line interface (CLI).

Steps

  1. Install Great Expectations on your EMR Spark cluster.

    Copy this code snippet into a cell in your EMR Spark notebook and run it:

    sc.install_pypi_package("great_expectations")
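
To confirm that the install took effect before moving on, a quick check along these lines may help. This is a minimal sketch that uses only the Python standard library; the helper name is_package_available is hypothetical and is not part of Great Expectations or EMR.

```python
import importlib.util


def is_package_available(module_name):
    """Return True if the top-level module `module_name` can be located."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


# Expect True once the install step above has completed in this session.
print(is_package_available("great_expectations"))
```

If the check prints False, rerunning the install cell in the same notebook session is a reasonable first thing to try.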
    
  2. Configure a Data Context in memory.

    The snippet below shows Python code that instantiates and configures a Data Context in memory. Copy this snippet into a cell in your EMR Spark notebook.

    Follow the steps below to update the configuration with values that are specific to your environment.

import great_expectations.exceptions as ge_exceptions
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import BaseDataContext


project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {
            "data_asset_type": {
                "class_name": "SparkDFDataset",
                "module_name": "great_expectations.dataset",
            },
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource",
            "batch_kwargs_generators": {},
        }
    },
    stores={
        "expectations_S3_store": {
            "class_name": "ExpectationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "REPLACE ME",  # TODO: replace with your value
                "prefix": "REPLACE ME",  # TODO: replace with your value
            },
        },
        "validations_S3_store": {
            "class_name": "ValidationsStore",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "REPLACE ME",  # TODO: replace with your value
                "prefix": "REPLACE ME",  # TODO: replace with your value
            },
        },
        "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
    },
    expectations_store_name="expectations_S3_store",
    validations_store_name="validations_S3_store",
    evaluation_parameter_store_name="evaluation_parameter_store",
    data_docs_sites={
        "s3_site": {
            "class_name": "SiteBuilder",
            "store_backend": {
                "class_name": "TupleS3StoreBackend",
                "bucket": "REPLACE ME",  # TODO: replace with your value
            },
            "site_index_builder": {
                "class_name": "DefaultSiteIndexBuilder",
                "show_cta_footer": True,
            },
        }
    },
    validation_operators={
        "action_list_operator": {
            "class_name": "ActionListValidationOperator",
            "action_list": [
                {
                    "name": "store_validation_result",
                    "action": {"class_name": "StoreValidationResultAction"},
                },
                {
                    "name": "store_evaluation_params",
                    "action": {"class_name": "StoreEvaluationParametersAction"},
                },
                {
                    "name": "update_data_docs",
                    "action": {"class_name": "UpdateDataDocsAction"},
                },
            ],
        }
    },
    anonymous_usage_statistics={
        "enabled": True
    }
)

context = BaseDataContext(project_config=project_config)
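
The two S3-backed stores in the snippet above share the same shape, so one option is to build them with a small helper and set the bucket and prefix in one place. The sketch below is an illustration only: the function s3_store and the bucket and prefix values are hypothetical, and the dictionaries it returns simply mirror the structure already shown in the snippet.

```python
def s3_store(class_name, bucket, prefix):
    """Build a store config dict shaped like the S3 stores in the snippet above."""
    return {
        "class_name": class_name,
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": bucket,
            "prefix": prefix,
        },
    }


# Hypothetical bucket and prefixes; substitute your own values.
stores = {
    "expectations_S3_store": s3_store("ExpectationsStore", "my-ge-bucket", "expectations"),
    "validations_S3_store": s3_store("ValidationsStore", "my-ge-bucket", "validations"),
    "evaluation_parameter_store": {"class_name": "EvaluationParameterStore"},
}
```

A dictionary built this way could be passed as the stores argument of DataContextConfig in place of the literal dictionary. The line numbers cited in the steps below refer to the original snippet, not to this sketch.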
  1. Configure an Expectation store in Amazon S3.

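    Replace the “REPLACE ME” placeholders on lines 26-27 of the Data Context snippet (the "bucket" and "prefix" values of expectations_S3_store) with your own values. Follow this how-to guide.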

  2. Configure a Validation Result store in Amazon S3.

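    Replace the “REPLACE ME” placeholders on lines 34-35 of the Data Context snippet (the "bucket" and "prefix" values of validations_S3_store) with your own values. Follow this how-to guide.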

  3. Configure a Data Docs website in Amazon S3.

    Replace the “REPLACE ME” placeholder on line 48 of the Data Context snippet (the "bucket" value of s3_site under data_docs_sites) with your own value. Follow this how-to guide.

  4. Test your configuration.

    Execute the cell with the snippet above.

    Then copy this code snippet into a cell in your EMR Spark notebook, run it, and verify that no error is displayed:

    context.list_datasources()
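
Before running the configuration cell, it can be useful to confirm that no “REPLACE ME” placeholder was missed. The following is a minimal sketch that walks a nested dictionary and reports the key paths that still hold the placeholder. The function find_placeholders and the sample values are hypothetical; the sketch assumes you keep the stores or data_docs_sites dictionaries in a variable before passing them to DataContextConfig.

```python
def find_placeholders(config, placeholder="REPLACE ME", path=()):
    """Return dotted key paths whose value still equals `placeholder`."""
    found = []
    if isinstance(config, dict):
        for key, value in config.items():
            found.extend(find_placeholders(value, placeholder, path + (str(key),)))
    elif isinstance(config, (list, tuple)):
        for index, value in enumerate(config):
            found.extend(find_placeholders(value, placeholder, path + (str(index),)))
    elif config == placeholder:
        found.append(".".join(path))
    return found


# Sample dictionary mirroring one store from the configuration snippet.
sample_stores = {
    "expectations_S3_store": {
        "class_name": "ExpectationsStore",
        "store_backend": {
            "class_name": "TupleS3StoreBackend",
            "bucket": "my-ge-bucket",  # hypothetical value
            "prefix": "REPLACE ME",  # deliberately left unreplaced
        },
    },
}

missing = find_placeholders(sample_stores)
print(missing)  # ['expectations_S3_store.store_backend.prefix']
```

An empty list means every placeholder in the checked dictionary has been replaced.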
    
