Apply Expectation conditions to specific rows within a Batch
By default, Expectations apply to the entire dataset retrieved in a Batch. However, there are instances when an Expectation may not be relevant for every row. Validating every row could lead to false positives or false negatives in the Validation Results.
For example, you might define an Expectation that a column indicating the country of origin for a product should not be null. If this Expectation is only applicable when the product is an import, applying it to every row in the Batch could result in many false negatives when the country of origin column is null for products produced locally.
To address this issue, GX allows you to define Expectation conditions that apply only to a subset of the data retrieved in a Batch.
Create an Expectation condition
Great Expectations allows you to specify conditions for validating rows using the row_condition
argument, which can be applied to all Expectations that assess rows within a Dataset. The row_condition
argument should be a string that represents a boolean expression. Rows will be validated when the row_condition
expression evaluates to True
. Conversely, if the row_condition
evaluates to False
, the corresponding row will not be validated by the Expectation.
Prerequisites
- Python version 3.9 to 3.12.
- An installation of GX Core.
- A preconfigured Data Context.
- Recommended. A preconfigured Data Source and Data Asset connected to your data for testing your customized Expectation.
Procedure
- pandas
- Spark
- SQL
In this procedure, it is assumed that your Data Context is stored in the variable context
, and your Expectation Suite is stored in the variable suite
. The suite
can either be a newly created and empty Expectation Suite or an existing Expectation Suite retrieved from the Data Context.
The examples in this procedure use passenger data from the Titanic, which includes details about the class of ticket held by the passenger and whether or not they survived the journey.
-
Determine the
condition_parser
for yourrow_condition
.The
condition_parser
defines the syntax ofrow_condition
strings. When implementing Expectation conditions with pandas, set this argument to"pandas"
.Note that the Expectation with conditions will fail if the Batch being validated is from a different type of Data Source than indicated by the
condition_parser
.
In this procedure, it is assumed that your Data Context is stored in the variable context
, and your Expectation Suite is stored in the variable suite
. The suite
can either be a newly created and empty Expectation Suite or an existing Expectation Suite retrieved from the Data Context.
The examples in this procedure use passenger data from the Titanic, which includes details about the class of ticket held by the passenger and whether or not they survived the journey.
-
Determine the
condition_parser
for yourrow_condition
.The
condition_parser
defines the syntax ofrow_condition
strings. When implementing Expectation conditions with Spark, set this argument to"great_expectations"
.Note that the Expectation with conditions will fail if the Batch being validated is from a different type of Data Source than indicated by the
condition_parser
.
In this procedure, it is assumed that your Data Context is stored in the variable context
, and your Expectation Suite is stored in the variable suite
. The suite
can either be a newly created and empty Expectation Suite or an existing Expectation Suite retrieved from the Data Context.
The examples in this procedure use passenger data from the Titanic, which includes details about the class of ticket held by the passenger and whether or not they survived the journey.
-
Determine the
condition_parser
for yourrow_condition
.The
condition_parser
defines the syntax ofrow_condition
strings. When implementing Expectation conditions with SQL, set this argument to"great_expectations"
.Note that the Expectation with conditions will fail if the Batch being validated is from a different type of Data Source than indicated by the
condition_parser
.
-
Determine the
row_condition
expression.The
row_condition
argument should be a boolean expression string that is evaluated for each row in the Batch that the Expectation validates. If therow_condition
evaluates toTrue
, the row will be included in the Expectation's validations. If it evaluates toFalse
, the Expectation will be skipped for that row.The syntax of the
row_condition
argument is based on thecondition_parser
specified earlier. -
Create the Expectation.
An Expectation with conditions is created like a regular Expectation, with the addition of the
row_condition
andcondition_parser
parameters alongside the Expectation's other arguments.In pandas, the
row_condition
value is passed topandas.DataFrame.query()
prior to Expectation Validation, and the resulting rows from the evaluated Batch will undergo validation by the Expectation.Pythoncondition_parser="pandas",
row_condition='PClass=="1st"',Do not use single quotes, newlines, or
\n
in the specifiedrow_condition
as shown in the following examples:Pythonrow_condition = "PClass=='1st'" # Don't do this. Single quotes aren't valid!
row_condition="""
PClass=="1st"
""" # Don't do this. Newlines and \n aren't valid!
row_condition = 'PClass=="1st"' # Do this instead.In Spark, the
row_condition
uses custom syntax, which is parsed as a data filter or query prior to Expectation Validation.Pythoncondition_parser="great_expectations",
row_condition='col("PClass")=="1st"',Do not use single quotes, newlines, or
\n
in the specifiedrow_condition
as shown in the following examples:Pythonrow_condition = "col('PClass')=='1st'" # Don't do this. Single quotes aren't valid!
row_condition="""
col("PClass")=="1st"
""" # Don't do this. Newlines and \n aren't valid!
row_condition = 'col("PClass")=="1st"' # Do this instead.In SQL, the
row_condition
uses custom syntax, which is parsed as a data filter or query prior to Expectation Validation.Pythoncondition_parser="great_expectations",
row_condition='col("PClass")=="1st"',Do not use single quotes, newlines, or
\n
in the specifiedrow_condition
as shown in the following examples:Pythonrow_condition = "col('PClass')=='1st'" # Don't do this. Single quotes aren't valid!
row_condition="""
col("PClass")=="1st"
""" # Don't do this. Newlines and \n aren't valid!
row_condition = 'col("PClass")=="1st"' # Do this instead.In pandas, you can reference variables from the environment by prefixing them with
@
. Additionally, when a column name contains spaces, you can specify it by enclosing the name in backticks:`
.Some examples of valid
row_condition
values for pandas include:Pythonrow_condition = '`foo foo`=="bar bar"' # The value of the column "foo foo" is "bar bar"
row_condition = 'foo==@bar' # the value of the foo field is equal to the value of the bar environment variableFor more information on the syntax accepted by pandas
row_condition
values see pandas.DataFrame.query.For Spark, you should specify your columns using the
col()
function.Here are examples of the valid
row_condition
patterns for Spark:Pythonrow_condition='col("foo") == "Two Two"' # foo is 'Two Two'
row_condition='col("foo").notNull()' # foo is not null
row_condition='col("foo") > 5' # foo is greater than 5
row_condition='col("foo") <= 3.14' # foo is less than 3.14
row_condition='col("foo") <= date("2023-03-13")' # foo is earlier than 2023-03-13For SQL, you should specify your columns using the
col()
function.Here are examples of the valid
row_condition
patterns for SQL:Pythonrow_condition='col("foo") == "Two Two"' # foo is 'Two Two'
row_condition='col("foo").notNull()' # foo is not null
row_condition='col("foo") > 5' # foo is greater than 5
row_condition='col("foo") <= 3.14' # foo is less than 3.14
row_condition='col("foo") <= date("2023-03-13")' # foo is earlier than 2023-03-13 -
Optional. Create additional Expectation conditions.
Expectations that have different conditions are treated as unique, even if they belong to the same type and apply to the same column within an Expectation Suite. This approach allows you to create one unconditional Expectation and an unlimited number of Conditional Expectations, each with a distinct condition.
For instance, the following code establishes an Expectation that the value in the
"Survived"
column is either 0 or 1:Pythonexpectation = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived", value_set=[0, 1]
)And this code adds a condition to the Expectation that specifies the value of the
"Survived"
column is1
if the individual was a first-class passenger:Pythonexpectation_with_condition = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived",
value_set=[1],
condition_parser="pandas",
row_condition='PClass=="1st"',
)Pythonexpectation_with_condition = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived",
value_set=[1],
condition_parser="great_expectations",
row_condition='col("PClass")=="1st"',
)Pythonexpectation_with_condition = gx.expectations.ExpectColumnValuesToBeInSet(
column="Survived",
value_set=[1],
condition_parser="great_expectations",
row_condition='col("PClass")=="1st"',
)
Data Docs and Expectation conditions
Expectations with conditions are presented differently from standard Expectations in the Data Docs. Each Expectation with conditions is prefaced with if 'row_condition_string', then values must be... as illustrated in the following image:
If the 'row_condition_string' is a complex expression, it will be divided into several components to enhance readability.
Scope and limitations
While conditions can be applied to most Expectations, the following Expectations cannot be conditioned and do not accept the row_condition
argument:
expect_column_to_exist
expect_table_columns_to_match_ordered_list
expect_table_columns_to_match_set
expect_table_column_count_to_be_between
expect_table_column_count_to_equal
unexpected_rows_expectation