How to configure a Spark/filesystem Datasource

This guide shows how to configure a Spark Datasource so that Great Expectations can access data stored as files on a local or NFS-mounted filesystem.

Prerequisites: This how-to guide assumes you have already set up a working deployment of Great Expectations.

Steps

To add a filesystem-backed Spark Datasource, follow these steps:

  1. Run datasource new

    From the command line, run:

    great_expectations datasource new
    
  2. Choose “Files on a filesystem (for processing with Pandas or Spark)”

    What data would you like Great Expectations to connect to?
        1. Files on a filesystem (for processing with Pandas or Spark)
        2. Relational database (SQL)
    : 1
    
  3. Choose PySpark

    What are you processing your files with?
        1. Pandas
        2. PySpark
    : 2
    
  4. Specify the directory path for data files

    Enter the path (relative or absolute) of the root directory where the data files are stored.
    : /path/to/directory/containing/your/data/files
    
  5. Give your Datasource a name

    When prompted, provide a custom name for your filesystem-backed Spark Datasource, or hit Enter to accept the default.

    Give your new Datasource a short name.
     [my_data_files_dir]:
    

    Great Expectations will then add a new Datasource ‘my_data_files_dir’ to your deployment by appending the following entry to your great_expectations.yml (a programmatic equivalent is sketched after these steps):

    my_data_files_dir:
      data_asset_type:
        class_name: SparkDFDataset
        module_name: great_expectations.dataset
      spark_config: {}
      batch_kwargs_generators:
        subdir_reader:
          class_name: SubdirReaderBatchKwargsGenerator
          base_directory: /path/to/directory/containing/your/data/files
      class_name: SparkDFDatasource
    
      Would you like to proceed? [Y/n]:
    
  6. Wait for confirmation

    If all goes well, the prompt will be followed by the message:

    A new datasource 'my_data_files_dir' was added to your project.
    

    If you run into an error, you will see something like:

    Error: Directory '/nonexistent/path/to/directory/containing/your/data/files' does not exist.
    
    Enter the path (relative or absolute) of the root directory where the data files are stored.
    :
    

    In this case, check that the data directory path is correct and that you have permission to read it, then try again.

  7. Finally, once you receive the confirmation in your terminal, you can proceed to explore the data sets in your new filesystem-backed Spark Datasource; the sketches below show how to perform the same configuration and a first data load from Python.
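
If you prefer to configure the Datasource from Python instead of answering the CLI prompts, the same entry can be registered through the DataContext API. The snippet below is a minimal sketch assuming a batch_kwargs-style (v2 API) Great Expectations deployment; the Datasource name and base directory are placeholders that mirror the YAML shown in step 5.

    import great_expectations as ge

    # Load the project's Data Context (reads great_expectations.yml).
    context = ge.data_context.DataContext()

    # Register a filesystem-backed Spark Datasource, mirroring the YAML entry
    # the CLI writes. The name and base_directory below are placeholders.
    context.add_datasource(
        "my_data_files_dir",
        class_name="SparkDFDatasource",
        data_asset_type={
            "class_name": "SparkDFDataset",
            "module_name": "great_expectations.dataset",
        },
        batch_kwargs_generators={
            "subdir_reader": {
                "class_name": "SubdirReaderBatchKwargsGenerator",
                "base_directory": "/path/to/directory/containing/your/data/files",
            }
        },
    )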
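
Once the Datasource is registered (via the CLI or as sketched above), you can load a first batch to confirm that the files are readable. This is likewise a sketch against the v2 batch_kwargs API; the suite name, file path, and reader options are assumptions you should replace with your own values.

    import great_expectations as ge

    context = ge.data_context.DataContext()

    # Create an expectation suite to attach the batch to
    # ("my_new_suite" is a placeholder name).
    context.create_expectation_suite("my_new_suite", overwrite_existing=True)

    # batch_kwargs tell the Spark Datasource which file to read; reader_options
    # are passed through to the Spark reader (placeholder path and options).
    batch_kwargs = {
        "datasource": "my_data_files_dir",
        "path": "/path/to/directory/containing/your/data/files/example.csv",
        "reader_options": {"header": True},
    }

    batch = context.get_batch(batch_kwargs, "my_new_suite")

    # The returned batch wraps a Spark DataFrame; show a few rows as a sanity check.
    batch.spark_df.show(5)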

Additional Notes

  1. Relative path locations should be specified from the perspective of the directory in which the

    great_expectations datasource new
    

    command is executed, as the brief illustration below shows.
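
For instance, with a placeholder relative path of data/files, the resolved location is simply the current working directory joined with that path. The tiny sketch below only illustrates that resolution and is not part of the Great Expectations API.

    from pathlib import Path

    # A relative path entered at the CLI prompt (placeholder value) resolves
    # against the directory where `great_expectations datasource new` was run.
    relative_base = "data/files"
    print(Path.cwd() / relative_base)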
