How to connect to data on a filesystem using Pandas

This guide will help you connect to your data stored on a filesystem using pandas. This will allow you to validate and explore your data.

Prerequisites: This how-to guide assumes you have:
  • Completed the Getting Started Tutorial
  • A working installation of Great Expectations
  • Access to data on a filesystem

Steps

1. Choose how to run the code in this guide

Get an environment to run the code in this guide. Please choose an option below.

If you use the Great Expectations CLI, run the command below to automatically generate a pre-configured Jupyter Notebook, then follow along with the YAML-based workflow in this guide:

great_expectations --v3-api datasource new

2. Instantiate your project's DataContext

Import these necessary packages and modules.

from ruamel import yaml
import great_expectations as ge
from great_expectations.core.batch import BatchRequest, RuntimeBatchRequest

Load your DataContext into memory using the get_context() method.

context = ge.get_context()

3. Configure your Datasource

Using this example configuration, replace <PATH_TO_YOUR_DATA_HERE> with the path to a directory that contains some of your data:

datasource_yaml = f"""
name: taxi_datasource
class_name: Datasource
module_name: great_expectations.datasource
execution_engine:
  module_name: great_expectations.execution_engine
  class_name: PandasExecutionEngine
data_connectors:
    default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        batch_identifiers:
            - default_identifier_name
    default_inferred_data_connector_name:
        class_name: InferredAssetFilesystemDataConnector
        base_directory: <PATH_TO_YOUR_DATA_HERE>
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
"""

Run this code to test your configuration.

context.test_yaml_config(datasource_yaml)

If you specified a directory containing CSV files, you will see them listed as Available data_asset_names in the output of test_yaml_config().

Feel free to adjust your configuration and re-run test_yaml_config() as needed.
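
To get a feel for how the default_regex setting turns file names into data_asset_names, the sketch below applies a one-group pattern to the files in a directory using only the Python standard library. It is an approximation for illustration, not the InferredAssetFilesystemDataConnector implementation; the helper name infer_data_asset_names and the sample file names are hypothetical.

```python
import re
import tempfile
from pathlib import Path


def infer_data_asset_names(base_directory, pattern=r"(.*)"):
    """Collect what the first capture group matches for each file name.

    Hypothetical helper: it mimics the idea of a single `data_asset_name`
    group, not the exact matching rules used by Great Expectations.
    """
    regex = re.compile(pattern)
    names = set()
    for path in Path(base_directory).iterdir():
        if not path.is_file():
            continue  # only look at files directly inside base_directory
        match = regex.match(path.name)
        if match:
            names.add(match.group(1))
    return sorted(names)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up sample files standing in for your data directory.
        for name in ("yellow_2019-01.csv", "yellow_2019-02.csv", "notes.txt"):
            (Path(tmp) / name).write_text("a,b\n1,2\n")

        # The catch-all pattern from the example configuration:
        # every file becomes its own asset name.
        print(infer_data_asset_names(tmp))

        # A narrower pattern groups the monthly files under one name
        # and skips files that do not match.
        print(infer_data_asset_names(tmp, r"(.*)_\d{4}-\d{2}\.csv"))
```

With the catch-all pattern (.*), every file name is captured whole, which is why each CSV appears as its own asset. Tightening the pattern is one way to group related files or exclude unrelated ones before re-running test_yaml_config().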

4. Save the Datasource configuration to your DataContext

Save the configuration into your DataContext by using the add_datasource() function.

context.add_datasource(**yaml.load(datasource_yaml))
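
Because the call above unpacks the parsed YAML into keyword arguments, the same configuration can also be expressed as a plain Python dictionary. The sketch below mirrors the YAML example key for key; the helper name build_datasource_config is hypothetical, and the path remains a placeholder you must replace.

```python
def build_datasource_config(base_directory):
    """Return the example Datasource configuration as a dict.

    Hypothetical helper: the keys and values mirror the YAML example in
    this guide, with only the base directory passed in.
    """
    return {
        "name": "taxi_datasource",
        "class_name": "Datasource",
        "module_name": "great_expectations.datasource",
        "execution_engine": {
            "module_name": "great_expectations.execution_engine",
            "class_name": "PandasExecutionEngine",
        },
        "data_connectors": {
            "default_runtime_data_connector_name": {
                "class_name": "RuntimeDataConnector",
                "batch_identifiers": ["default_identifier_name"],
            },
            "default_inferred_data_connector_name": {
                "class_name": "InferredAssetFilesystemDataConnector",
                "base_directory": base_directory,
                "default_regex": {
                    "group_names": ["data_asset_name"],
                    "pattern": "(.*)",
                },
            },
        },
    }


# The placeholder is kept as-is; substitute your own directory.
datasource_config = build_datasource_config("<PATH_TO_YOUR_DATA_HERE>")
print(sorted(datasource_config["data_connectors"]))
```

A dictionary built this way could then be passed as context.add_datasource(**datasource_config), which avoids string formatting when the directory comes from a variable.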

5. Test your new Datasource

Verify your new Datasource by loading data from it into a Validator using a BatchRequest.

Add the path to your CSV in the path key under runtime_parameters in your BatchRequest.

batch_request = RuntimeBatchRequest(
    datasource_name="taxi_datasource",
    data_connector_name="default_runtime_data_connector_name",
    data_asset_name="<YOUR_MEANINGFUL_NAME>",  # This can be anything that identifies this data_asset for you
    runtime_parameters={"path": "<PATH_TO_YOUR_DATA_HERE>"},  # Add your path here.
    batch_identifiers={"default_identifier_name": "default_identifier"},
)
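
A mistyped path in runtime_parameters may only surface later, when the data is actually read. As an optional pre-flight check, the sketch below confirms that the file exists and previews its header and first rows with the standard library alone. The helper name preview_csv and the sample file are hypothetical and are not part of Great Expectations.

```python
import csv
import tempfile
from pathlib import Path


def preview_csv(path, n_rows=3):
    """Return (header, first n_rows data rows) of a CSV file.

    Hypothetical helper: raises FileNotFoundError early if the path
    does not point at a file.
    """
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"No file at {csv_path}")
    with csv_path.open(newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no lines
        rows = [row for _, row in zip(range(n_rows), reader)]
    return header, rows


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Made-up sample file standing in for your CSV.
        sample = Path(tmp) / "sample.csv"
        sample.write_text("vendor_id,fare\n1,9.5\n2,12.0\n")
        header, rows = preview_csv(sample)
        print(header)
        print(rows)
```

If the preview looks right, the same path can be used for the path key of runtime_parameters.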

Then load data into the Validator.

context.create_expectation_suite(
    expectation_suite_name="test_suite", overwrite_existing=True
)
validator = context.get_validator(
    batch_request=batch_request, expectation_suite_name="test_suite"
)
print(validator.head())

🚀🚀 Congratulations! 🚀🚀 You successfully connected Great Expectations with your data.

Additional Notes

If you are working with nonstandard CSVs, read one of these guides:

To view the full scripts used in this page, see them on GitHub:

Next Steps

Now that you've connected to your data, you'll want to work on these core skills: