How to configure a self-managed Spark Datasource

This guide will help you add a managed Spark dataset (a Spark DataFrame created by a Spark SQL query) as a Datasource. This will allow you to run expectations against tables available within your Spark cluster.

When you use a managed Spark Datasource, the validation is done in Spark itself. Your data is not downloaded.

Prerequisites: This how-to guide assumes you have already set up a working Great Expectations project (Data Context) and have access to a Spark cluster.

Steps

To enable running Great Expectations against a DataFrame created by a Spark SQL query, follow the steps below:

  1. Run datasource new

    From the command line, run:

    great_expectations datasource new
    
  2. Choose “Files on a filesystem (for processing with Pandas or Spark)”

    What data would you like Great Expectations to connect to?
        1. Files on a filesystem (for processing with Pandas or Spark)
        2. Relational database (SQL)
    
    : 1
    
  3. Choose PySpark

    What are you processing your files with?
        1. Pandas
        2. PySpark
    
    : 2
    
  4. Enter /tmp (it doesn’t matter what you enter here, as we will replace this value in a later step).

    Enter the path (relative or absolute) of the root directory where the data files are stored.
    
    : /tmp
    
  5. Enter spark_dataframe

    Give your new Datasource a short name.
    [tmp__dir]: spark_dataframe
    
  6. Enter Y

    Would you like to proceed? [Y/n]: Y
    
  7. Replace the following lines in your great_expectations.yml file

    datasources:
      spark_dataframe:
        data_asset_type:
          class_name: SparkDFDataset
          module_name: great_expectations.dataset
        batch_kwargs_generators:
          subdir_reader:
            class_name: SubdirReaderBatchKwargsGenerator
            base_directory: /tmp
        class_name: SparkDFDatasource
        module_name: great_expectations.datasource
    

    with

    datasources:
      spark_dataframe:
        data_asset_type:
          class_name: SparkDFDataset
          module_name: great_expectations.dataset
        batch_kwargs_generators:
          spark_sql_query:
            class_name: QueryBatchKwargsGenerator
            queries:
              ${query_name}: ${spark_sql_query}
        module_name: great_expectations.datasource
        class_name: SparkDFDatasource
    
  8. Fill in the values:

  • query_name - the name by which you want to reference the query. In the following steps we will use the name my_first_query. You will use this name to select the data asset when creating expectations.

  • spark_sql_query - the Spark SQL query that creates the DataFrame against which Great Expectations validations will run. In the following steps we will use the query select * from mydb.mytable.

Now, when creating a new expectation suite, the my_first_query query will be available in the list of data assets to choose from.
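
If you prefer to drive this from Python rather than the CLI, here is a minimal sketch assuming the legacy batch_kwargs-based Great Expectations API that this YAML configuration targets; the suite name my_suite and the row-count expectation are purely illustrative:

    from great_expectations.data_context import DataContext

    # Load the project's Data Context (reads great_expectations.yml).
    context = DataContext()

    # Ask the spark_sql_query generator for batch_kwargs describing my_first_query.
    batch_kwargs = context.build_batch_kwargs(
        "spark_dataframe", "spark_sql_query", "my_first_query"
    )

    # Create a suite and fetch a validation batch -- the batch wraps the Spark
    # DataFrame produced by running the configured Spark SQL query.
    context.create_expectation_suite("my_suite", overwrite_existing=True)  # hypothetical suite name
    batch = context.get_batch(batch_kwargs, "my_suite")

    # Illustrative expectation; replace with expectations that fit your data.
    batch.expect_table_row_count_to_be_between(min_value=1)
    results = batch.validate()
    print(results.success)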

Additional Notes

  1. Configuring Spark options

To provide custom Spark configuration options, either:

  1. Create a spark-defaults.conf configuration file in the $SPARK_HOME/conf directory, or

  2. Provide a spark_context dictionary in the Datasource config:

    datasources:
      spark_dataframe:
        data_asset_type:
          class_name: SparkDFDataset
          module_name: great_expectations.dataset
        batch_kwargs_generators:
          spark_sql_query:
            class_name: QueryBatchKwargsGenerator
            queries:
              ${query_name}: ${spark_sql_query}
        module_name: great_expectations.datasource
        class_name: SparkDFDatasource
        spark_context:
            spark.master: local[*]
    

The full list of Spark configuration options is available here: [https://spark.apache.org/docs/latest/configuration.html](https://spark.apache.org/docs/latest/configuration.html)
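
For orientation, the keys accepted under spark_context are ordinary Spark properties; a minimal sketch showing the same kind of settings applied directly to a SparkSession (the spark.sql.shuffle.partitions value is only an example) would look like this:

    from pyspark.sql import SparkSession

    # The spark_context entries map to standard Spark properties -- the same keys
    # you would otherwise set in spark-defaults.conf or on a SparkSession builder.
    spark = (
        SparkSession.builder
        .master("local[*]")
        .config("spark.sql.shuffle.partitions", "8")  # example property from the Spark docs
        .getOrCreate()
    )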

  2. Spark catalog

Running SQL queries requires either registering temporary views or enabling a Spark catalog (such as the Hive metastore).

These configuration options enable the Hive Metastore catalog, the equivalent of calling .enableHiveSupport():

spark.sql.catalogImplementation     hive
spark.sql.warehouse.dir             /tmp/hive
spark.hadoop.hive.metastore.uris    thrift://localhost:9083
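
If you would rather not run a Hive metastore, the temporary-view alternative mentioned above can be sketched as follows; the file path and view name are hypothetical, and the view must be registered on the same SparkSession that will execute the Spark SQL query:

    from pyspark.sql import SparkSession

    # getOrCreate() returns the SparkSession already active in this process
    # (for example, the one created by the SparkDFDatasource) or starts a new one.
    spark = SparkSession.builder.getOrCreate()

    # Register a temporary view so that a query such as "select * from mytable"
    # resolves without a Hive metastore. The view only lives for the lifetime
    # of this SparkSession.
    df = spark.read.parquet("/path/to/mytable.parquet")  # hypothetical source file
    df.createOrReplaceTempView("mytable")

    spark.sql("select count(*) from mytable").show()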