Quickstart
Use this quickstart to install GX OSS, connect to sample data, build your first Expectation, validate data, and review the validation results. This is a great place to start if you're new to GX OSS and aren't sure if it's the right solution for you or your organization. If you're using Databricks or SQL to store data, see Get Started with GX and Databricks or Get Started with GX and SQL.
Windows support for GX OSS is currently unavailable. If you're using GX OSS in a Windows environment, you might experience errors or performance issues.
Data validation workflow
The following diagram illustrates the end-to-end GX OSS data validation workflow that you'll implement with this quickstart.
Prerequisites
- An installation of Python, version 3.8 to 3.11. To download and install Python, see Python downloads.
- pip
- An internet browser
Install GX OSS
- Run the following command in an empty base directory inside a Python virtual environment:

  ```bash
  pip install great_expectations
  ```

  It can take several minutes for the installation to complete.

- Run the following Python code to import the `great_expectations` module:

  ```python
  import great_expectations as gx
  ```
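To confirm that the installation succeeded and the module imported correctly, a quick version check is enough. This is a minimal sketch, assuming the package exposes a standard `__version__` attribute.

```python
# Confirm the installation by printing the installed GX OSS version
# (assumes the package exposes a standard __version__ attribute).
import great_expectations as gx

print(gx.__version__)
```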
Create a Data Context
- Run the following command to create a Data Context object. A Data Context is the primary entry point for a Great Expectations deployment, with configurations and methods for all supporting components.

  ```python
  context = gx.get_context()
  ```
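If you're curious what kind of Data Context `gx.get_context()` returned, a minimal sketch like the following prints the context's class name; it assumes nothing beyond the `context` object created above.

```python
# Inspect the Data Context created above. With no existing project
# configuration on disk, GX typically returns an in-memory (ephemeral)
# context; with a configured project, it returns a file-backed one.
print(type(context).__name__)
```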
Connect to data
- Run the following command to connect to existing `.csv` data stored in the `great_expectations` GitHub repository and create a Validator object. A Validator is used to run an Expectation Suite against data.

  ```python
  validator = context.sources.pandas_default.read_csv(
      "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
  )
  ```

  The code example uses the Data Context's default Data Source for Pandas to access the `.csv` data from the file at the specified URL path. A Data Source provides a standard API for accessing and interacting with data from a wide variety of source systems.
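The same default Pandas Data Source can also read a file from disk. The sketch below uses a hypothetical local path and previews the loaded data, assuming the Validator in your GX version exposes a `head()` helper.

```python
# Hypothetical local path; substitute the location of your own .csv file.
validator = context.sources.pandas_default.read_csv(
    "data/yellow_tripdata_sample_2019-01.csv"
)

# Preview the first rows of the Batch the Validator points at
# (assumes Validator.head() is available in your GX version).
print(validator.head())
```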
Create Expectations
- Run the following commands to create two Expectations and save them to the Expectation Suite. An Expectation is a verifiable assertion about data; an Expectation Suite is a collection of these assertions.

  ```python
  validator.expect_column_values_to_not_be_null("pickup_datetime")
  validator.expect_column_values_to_be_between(
      "passenger_count", min_value=1, max_value=6
  )
  validator.save_expectation_suite(discard_failed_expectations=False)
  ```

  The first Expectation uses domain knowledge (the `pickup_datetime` column shouldn't be null). The second Expectation uses explicit kwargs (`min_value` and `max_value`) to constrain the `passenger_count` column.

  The basic workflow when creating an Expectation Suite is to populate it with Expectations that accurately describe the state of the associated data. Therefore, when an Expectation Suite is saved, failed Expectations are not kept by default. However, the `discard_failed_expectations` parameter of `save_expectation_suite(...)` can be used to override this behavior if you have created Expectations that describe the ideal state of your data rather than its current state.
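Each `expect_*` call on the Validator returns a result you can inspect immediately, which is useful for checking an Expectation against the sample data before saving the suite. The sketch below captures that return value and assumes the result object exposes a `success` attribute.

```python
# Run a single Expectation and capture its result for this Batch.
result = validator.expect_column_values_to_be_between(
    "passenger_count", min_value=1, max_value=6
)

# Report whether the Expectation passed on the sample data
# (assumes the returned result exposes a `success` attribute).
print(result.success)
```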
Validate data
- Run the following command to define a Checkpoint and examine the data to determine if it matches the defined Expectations. A Checkpoint is the primary means for validating data in a production deployment of Great Expectations.

  ```python
  checkpoint = context.add_or_update_checkpoint(
      name="my_quickstart_checkpoint",
      validator=validator,
  )
  ```

- Run the following command to return the Validation Results. Validation Results are generated when data is validated against an Expectation or Expectation Suite.

  ```python
  checkpoint_result = checkpoint.run()
  ```
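For a quick pass/fail summary of the run, you can check the overall success flag on the Checkpoint result. This is a minimal sketch, assuming the result object exposes a `success` attribute in your GX version.

```python
# Overall status of the Checkpoint run
# (assumes the Checkpoint result exposes a `success` attribute).
if checkpoint_result.success:
    print("All Expectations passed.")
else:
    print("Some Expectations failed; review the Validation Results.")
```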
- Run the following command to view an HTML representation of the Validation Results in the generated Data Docs. Data Docs are human-readable documentation generated from Great Expectations metadata, detailing Expectations, Validation Results, and other information.

  ```python
  context.view_validation_result(checkpoint_result)
  ```
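If the HTML view doesn't open automatically in your environment, you can also rebuild and open the Data Docs site directly from the Data Context. This sketch assumes the `build_data_docs()` and `open_data_docs()` methods are available in your GX version.

```python
# Rebuild the Data Docs site from the stored Validation Results and
# open it in your default browser (assumes these Data Context methods
# exist in your GX version).
context.build_data_docs()
context.open_data_docs()
```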