Version: 1.6.4

Connect to GX Cloud with Python

Learn how to use GX Cloud from a Python script or interpreter, such as a Jupyter Notebook. You'll install Great Expectations, configure your GX Cloud environment variables, connect to sample data, build your first Expectation, validate data, and review the validation results through Python code.

Prerequisites

Prepare your environment

  1. Download and install Python. GX supports Python versions 3.9 to 3.12.

  2. Download and install pip. See the pip documentation.

Install GX

  1. Run the following command in an empty base directory inside a Python virtual environment:

    Terminal input
    pip install great_expectations

    It can take several minutes for the installation to complete.

Get your credentials

You'll need your user access token and organization ID to set your environment variables. Don't commit your access token to your version control software.

  1. In GX Cloud, click Tokens.

  2. In the User access tokens pane, click Create user access token.

  3. In the Token name field, enter a name for the token that will help you quickly identify it.

  4. Click Create.

  5. Copy and then paste the user access token into a temporary file. The token can't be retrieved after you close the dialog.

  6. Click Close.

  7. Copy the value in the Organization ID field into the temporary file with your user access token and then save the file.

If your organization has multiple workspaces, you'll also need your workspace ID.

  1. In GX Cloud, select the relevant Workspace.
  2. Observe the URL in your browser and copy the first segment after /workspaces/. For example, if the URL is app.greatexpectations.io/organizations/my-org/workspaces/abc123/data-health, copy abc123 into the temporary file with your other credentials and then save the file.
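If you prefer to extract the segment programmatically, the following standard-library sketch parses the workspace ID out of the example URL above (the helper name is our own, not part of GX):

```python
from urllib.parse import urlparse

def workspace_id_from_url(url: str) -> str:
    """Return the path segment immediately after /workspaces/ in a GX Cloud URL."""
    segments = urlparse(url).path.strip("/").split("/")
    return segments[segments.index("workspaces") + 1]

url = "https://app.greatexpectations.io/organizations/my-org/workspaces/abc123/data-health"
print(workspace_id_from_url(url))  # abc123
```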

GX recommends deleting the temporary file after you set the environment variables.

Set your credentials as environment variables

Environment variables securely store your GX Cloud access credentials.

  1. Save your credentials as GX_CLOUD_ACCESS_TOKEN and GX_CLOUD_ORGANIZATION_ID environment variables by entering export ENV_VAR_NAME=env_var_value in the terminal or adding the command to your ~/.bashrc or ~/.zshrc file. If your organization has multiple workspaces, set your GX_CLOUD_WORKSPACE_ID as well. For example:

    Terminal input
    export GX_CLOUD_ACCESS_TOKEN=<user_access_token>
    export GX_CLOUD_ORGANIZATION_ID=<organization_id>
    export GX_CLOUD_WORKSPACE_ID=<workspace_id>
    Note

    After you save your credentials as environment variables, you can use Python scripts to access GX Cloud and complete other tasks. See the API reference.

  2. Optional. If you created a temporary file to record your credentials, delete it.
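Before creating a Data Context, you can confirm that the variables are visible to Python. This is a small standard-library sketch; the helper name is our own, not part of GX:

```python
import os

def missing_gx_credentials(env=None):
    """Return the names of required GX Cloud variables that are not set."""
    env = os.environ if env is None else env
    required = ("GX_CLOUD_ACCESS_TOKEN", "GX_CLOUD_ORGANIZATION_ID")
    return [name for name in required if not env.get(name)]

missing = missing_gx_credentials()
if missing:
    print(f"Set these variables before connecting: {missing}")
else:
    print("GX Cloud credentials found.")
```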

Create a Data Context

  1. Run the following Python code to create a Data Context object:

    Python
    import great_expectations as gx

    context = gx.get_context(mode="cloud")

    # Optional. Specify a workspace ID.
    # context = gx.get_context(mode="cloud", workspace_id="abc123")

    The Data Context will detect the previously set environment variables and connect to your GX Cloud account.

    If you are a member of multiple workspaces, note that you can pass a workspace ID in the get_context call to override the workspace ID set in your environment variables.

  2. Optional. Verify that you have a GX Cloud Data Context:

    Python
    print(type(context).__name__)
    # A Cloud-backed context prints: CloudDataContext

Connect to a Data Asset

Working with Data Sources

The Data Context you created includes a built-in pandas_default Data Source, which provides access to all of the read_*(...) methods available in pandas. This lets you connect to a pandas Data Asset without first adding your own Data Source, as demonstrated in this section.

Cloud API instructions for connecting to other Data Sources such as Amazon S3, Azure Blob Storage, Google Cloud Storage, BigQuery, and Spark are under construction. In the meantime, you can refer to the GX Core docs for guidance as the Cloud API uses the same methods for connecting Data Sources.

  • Run the following Python code to connect to existing .csv data stored in the gx_tutorials GitHub repository and create a Batch object:

    Python
    batch = context.data_sources.pandas_default.read_csv(
        "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
    )

    The code example uses the default Data Source for pandas to read the .csv data from the file at the specified URL.

    Alternatively, if you have already configured your data in GX Cloud, you can use it instead. To see your available Data Sources, run:

    Python
    print(context.list_datasources())

    Using the printed information, you can get the name of one of your existing Data Sources, one of its Data Assets, and the name of a Batch Definition on the Data Asset. Then, you can retrieve a Batch of data by updating the values for data_source_name, asset_name, and batch_definition_name in the following code and executing it:

    Python
    data_source_name = "my_data_source"
    asset_name = "my_data_asset"
    batch_definition_name = "my_batch_definition"
    batch = (
        context.data_sources.get(data_source_name)
        .get_asset(asset_name)
        .get_batch_definition(batch_definition_name)
        .get_batch()
    )

Create Expectations

  • Run the following Python code to create an Expectation Suite containing two Expectations:

    Python
    import great_expectations.expectations as gxe

    suite_name = "my_suite"
    suite = gx.ExpectationSuite(name=suite_name)

    suite.add_expectation(
        gxe.ExpectColumnValuesToNotBeNull(column="pickup_datetime", severity="warning")
    )
    suite.add_expectation(
        gxe.ExpectColumnValuesToBeBetween(
            column="passenger_count", min_value=1, max_value=6, severity="info"
        )
    )

    The first Expectation encodes domain knowledge: pickup_datetime should never be null.

    The second Expectation uses explicit keyword arguments (min_value and max_value) to constrain the passenger_count column to values between 1 and 6.

Validate data

  1. Run the following Python code to examine the data and determine if it matches the defined Expectations. This will return Validation Results:

    Python
    results = batch.validate(suite)
  2. Run the following Python code to print a JSON-formatted summary of the Validation Results:

    Python
    print(results.describe())