How to configure a self-managed Spark Datasource

This guide will help you add a managed Spark dataset (a Spark DataFrame created by a Spark SQL query) as a Datasource. This will allow you to run expectations against tables available within your Spark cluster.

When you use a managed Spark Datasource, the validation is done in Spark itself. Your data is not downloaded.

Prerequisites: This how-to guide assumes you have already:

  • Set up a working deployment of Great Expectations

Steps

To enable running Great Expectations against a DataFrame created by a Spark SQL query, follow the steps below:

  1. Run datasource new

    From the command line, run:

    great_expectations datasource new
    
  2. Choose “Files on a filesystem (for processing with Pandas or Spark)”

    What data would you like Great Expectations to connect to?
        1. Files on a filesystem (for processing with Pandas or Spark)
        2. Relational database (SQL)
    
    : 1
    
  3. Choose PySpark

    What are you processing your files with?
        1. Pandas
        2. PySpark
    
    : 2
    
  4. Enter /tmp (it doesn’t matter what you enter as we will replace this in a few steps).

    Enter the path (relative or absolute) of the root directory where the data files are stored.
    
    : /tmp
    
  5. Enter spark_dataframe

    Give your new Datasource a short name.
    [tmp__dir]: spark_dataframe
    
  6. Enter Y

    Would you like to proceed? [Y/n]: Y
    

Stable API (up to 0.12.x)

  1. Replace the following lines in the great_expectations.yml file

datasources:
  spark_dataframe:
    data_asset_type:
      class_name: SparkDFDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: /tmp
    class_name: SparkDFDatasource
    module_name: great_expectations.datasource

with

datasources:
  spark_dataframe:
    data_asset_type:
      class_name: SparkDFDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      spark_sql_query:
        class_name: QueryBatchKwargsGenerator
        queries:
          ${query_name}: ${spark_sql_query}
    module_name: great_expectations.datasource
    class_name: SparkDFDatasource
  2. Fill in the values:

  • query_name - the name by which you want to reference the query. In the following steps we will use the name my_first_query. You will use this name to select the data asset when creating expectations.

  • spark_sql_query - the Spark SQL query that will create the DataFrame against which GE validations will run. In the following steps we will use the query select * from mydb.mytable.

Now, when creating a new expectation suite, the query my_first_query will be available in the list of data assets.
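
For reference, the sketch below shows one way to load a batch of this query from Python with the 0.12.x API. It assumes you run it from inside the project directory and that an expectation suite named my_suite already exists (the suite name is a placeholder; create one with great_expectations suite new if needed).

import great_expectations as ge

# Rough sketch (0.12.x API): load the query-backed data asset as a batch.
# "my_suite" is a placeholder for an existing expectation suite.
context = ge.data_context.DataContext()

batch_kwargs = context.build_batch_kwargs(
    "spark_dataframe",   # datasource name from great_expectations.yml
    "spark_sql_query",   # batch kwargs generator name
    "my_first_query",    # query name configured under "queries"
)
batch = context.get_batch(batch_kwargs, "my_suite")

# The batch is a SparkDFDataset, so the expectation runs inside Spark.
print(batch.expect_table_row_count_to_be_between(min_value=1))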

Experimental API (0.13)

  1. Replace the following lines in the great_expectations.yml file

datasources:
  spark_dataframe:
    data_asset_type:
      class_name: SparkDFDataset
      module_name: great_expectations.dataset
    batch_kwargs_generators:
      subdir_reader:
        class_name: SubdirReaderBatchKwargsGenerator
        base_directory: /tmp
    class_name: SparkDFDatasource
    module_name: great_expectations.datasource

with

datasources:
  spark_dataframe:
    class_name: Datasource
    execution_engine:
      class_name: SparkDFExecutionEngine
    data_connectors:
      simple_filesystem_data_connector:
        class_name: InferredAssetFilesystemDataConnector
        base_directory: /root/directory/containing/data/files
        glob_directive: '*'
        default_regex:
          pattern: (.+)\.csv
          group_names:
          - data_asset_name
  2. Fill in the values:

  • base_directory - either an absolute path or a path relative to the Great Expectations installation directory is acceptable.

  • class_name - a different DataConnector class, with its corresponding configuration parameters, may be substituted into the above snippet as best suits your use case.
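
For reference, a batch from this Datasource can then be requested in code roughly as sketched below. The names my_data_file (a data asset inferred by the data connector from a file such as my_data_file.csv) and my_suite (an existing expectation suite) are placeholders, and argument names may vary slightly between 0.13.x releases.

import great_expectations as ge
from great_expectations.core.batch import BatchRequest

# Rough sketch (0.13 experimental API). "my_data_file" and "my_suite"
# are placeholders for an inferred data asset and an existing suite.
context = ge.data_context.DataContext()

batch_request = BatchRequest(
    datasource_name="spark_dataframe",
    data_connector_name="simple_filesystem_data_connector",
    data_asset_name="my_data_file",
)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)

# Expectations are evaluated by the SparkDFExecutionEngine.
print(validator.expect_table_row_count_to_be_between(min_value=1))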

Additional Notes

Stable API (up to 0.12.x)

  1. Configuring Spark options

To provide custom Spark configuration options, either:

  1. Create a spark-defaults.conf configuration file in the $SPARK_HOME/conf directory, or

  2. Provide a spark_context dictionary in the Datasource config:

    datasources:
      spark_dataframe:
        data_asset_type:
          class_name: SparkDFDataset
          module_name: great_expectations.dataset
        batch_kwargs_generators:
          spark_sql_query:
            class_name: QueryBatchKwargsGenerator
            queries:
              ${query_name}: ${spark_sql_query}
        module_name: great_expectations.datasource
        class_name: SparkDFDatasource
        spark_context:
            spark.master: local[*]
    

The full list of Spark configuration options is available at https://spark.apache.org/docs/latest/configuration.html.
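
To confirm that these options were actually picked up, you can inspect the configuration of the active Spark session once the Datasource has created (or reused) it; a minimal sketch:

from pyspark.sql import SparkSession

# Minimal sketch: inspect the active Spark session's configuration to
# confirm that options such as spark.master were applied.
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.getConf().get("spark.master"))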

Spark catalog

Running SQL queries requires either registering temporary views or enabling a Spark catalog (such as the Hive metastore).

The following configuration options enable the Hive Metastore catalog - the equivalent of calling .enableHiveSupport():

spark.sql.catalogImplementation     hive
spark.sql.warehouse.dir             /tmp/hive
spark.hadoop.hive.metastore.uris    thrift://localhost:9083
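
If you do not run a metastore, the other option mentioned above is to register a temporary view so that the configured SQL query can resolve the table name. A minimal sketch, where the CSV path and the view name mytable_view are placeholders; the view is only visible to Great Expectations if both sides share the same Spark session:

from pyspark.sql import SparkSession

# Minimal sketch: register a temporary view so a Spark SQL query such as
# "select * from mytable_view" resolves without a Hive metastore.
# The view is only visible if GE uses this same Spark session.
spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("/path/to/source_data.csv", header=True, inferSchema=True)
df.createOrReplaceTempView("mytable_view")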

Experimental API (0.13)

  1. Configuring Spark options

To provide custom Spark configuration options, either:

  1. Create a spark-defaults.conf configuration file in the $SPARK_HOME/conf directory, or

  2. Provide a spark_context dictionary in the Datasource config:

datasources:
  spark_dataframe:
    class_name: Datasource
    execution_engine:
      class_name: SparkDFExecutionEngine
      spark_config:
        spark_context:
          spark.master: local[*]
    data_connectors:
      simple_filesystem_data_connector:
        class_name: InferredAssetFilesystemDataConnector
        base_directory: /root/directory/containing/data/files
        glob_directive: '*'
        default_regex:
          pattern: (.+)\.csv
          group_names:
          - data_asset_name

The full list of Spark configuration options is available at https://spark.apache.org/docs/latest/configuration.html.
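
Before committing the configuration to great_expectations.yml, the 0.13 experimental API also lets you sanity-check it from Python with test_yaml_config; a sketch that mirrors the configuration shown above:

import great_expectations as ge

context = ge.data_context.DataContext()

# Sketch (0.13 experimental API): instantiate the Datasource from its YAML
# configuration and report the data assets it can see, without editing
# great_expectations.yml yet.
datasource_yaml = r"""
class_name: Datasource
execution_engine:
  class_name: SparkDFExecutionEngine
  spark_config:
    spark_context:
      spark.master: local[*]
data_connectors:
  simple_filesystem_data_connector:
    class_name: InferredAssetFilesystemDataConnector
    base_directory: /root/directory/containing/data/files
    glob_directive: '*'
    default_regex:
      pattern: (.+)\.csv
      group_names:
        - data_asset_name
"""
context.test_yaml_config(datasource_yaml)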