Version: 0.18.9

Create a Custom Regex-Based Column Map Expectation

RegexBasedColumnMapExpectations are a sub-type of ColumnMapExpectation (a verifiable assertion about data) that allows for highly extensible, regex-powered validation of your data.

They are evaluated for a single column and ask a yes/no, regex-based question for every row in that column. Based on the result, they then calculate the percentage of rows that gave a positive answer. If that percentage meets a specified threshold (100% by default), the Expectation considers that data valid. This threshold is configured via the mostly parameter, which can be passed as input to your Custom RegexBasedColumnMapExpectation as a float between 0 and 1.
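The row-by-row question and the `mostly` threshold can be sketched in plain Python. This is an illustrative sketch of the semantics only (with a hypothetical digits-only pattern), not the actual Great Expectations implementation:

```python
import re


def regex_map_success(values, pattern, mostly=1.0):
    """Sketch of the evaluation: each row gets a yes/no answer from the
    regex, and the Expectation succeeds if the fraction of "yes" rows
    meets or exceeds the mostly threshold (1.0, i.e. 100%, by default)."""
    answers = [bool(re.search(pattern, v)) for v in values]
    return sum(answers) / len(answers) >= mostly


# With a digits-only pattern, 3 of 4 rows answer "yes" (75%):
rows = ["123", "42", "7x", "0"]
regex_map_success(rows, r"^\d+$")               # False: 75% < the 100% default
regex_map_success(rows, r"^\d+$", mostly=0.75)  # True: 75% >= 75%
```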

This guide will walk you through the process of creating a Custom RegexBasedColumnMapExpectation.

Prerequisites

Choose a name for your Expectation

First, decide on a name for your own Expectation. By convention, all ColumnMapExpectations, including RegexBasedColumnMapExpectations, start with expect_column_values_. You can see other naming conventions in the Expectations section of the Code Style Guide.

Your Expectation will have two versions of the same name: a CamelCaseName and a snake_case_name. For example, this tutorial will use:

  • ExpectColumnValuesToOnlyContainVowels
  • expect_column_values_to_only_contain_vowels

Copy and rename the template file

By convention, each Expectation is kept in its own Python file, named with the snake_case version of the Expectation's name.

You can find the template file for a custom RegexBasedColumnMapExpectation here. Download the file, place it in the appropriate directory, and rename it to the appropriate name.

cp regex_based_column_map_expectation_template.py /SOME_DIRECTORY/expect_column_values_to_only_contain_vowels.py

Storing Expectation files

During development, you don't need to store Expectation files in a specific location. Expectation files are self-contained and can be executed anywhere as long as GX is installed. However, to use your new Expectation with other GX components, you'll need to make sure the file is stored in one of the following locations:

  • If you're building a Custom Expectation (an extension of the `Expectation` class, developed outside of the Great Expectations library) for personal use, you'll need to put it in the great_expectations/plugins/expectations folder of your GX deployment, and import your Custom Expectation from that directory whenever it will be used. When you instantiate the corresponding DataContext, it will automatically make all Plugins (extensions of Great Expectations' components and/or functionality) in that directory available for use.

  • If you're building a Custom Expectation to contribute to the open source project, you'll need to put it in the repo for the Great Expectations library itself. Most likely, this will be within a package within contrib/: great_expectations/contrib/SOME_PACKAGE/SOME_PACKAGE/expectations/. To use these Expectations, you'll need to install the package.

For more information about Custom Expectations, see Use a Custom Expectation.

Generate a diagnostic checklist for your Expectation

Once you've copied and renamed the template file, you can execute it as follows.

python expect_column_values_to_only_contain_vowels.py

The template file is set up so that this will run the Expectation's print_diagnostic_checklist() method. This will run a diagnostic script on your new Expectation, and return a checklist of steps to get it to full production readiness.

Completeness checklist for ExpectColumnValuesToMatchSomeRegex:
✔ Has a valid library_metadata object
Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
Has at least one positive and negative example case, and all test cases pass
Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
Has basic input validation and type checking
Has both Statement Renderers: prescriptive and diagnostic
Has core logic that passes tests for all applicable Execution Engines and SQL dialects
Has a robust suite of tests, as determined by a code owner
Has passed a manual review by a code owner for code standards and style guides

When in doubt, the next step to implement is the first one that doesn't have a ✔ next to it. This guide covers the first five steps on the checklist.

Change the Expectation class name and add a docstring

Let's start by updating your Expectation's name and docstring.

Replace the Expectation class name

Python
class ExpectColumnValuesToMatchSomeRegex(RegexBasedColumnMapExpectation):

with your real Expectation class name, in upper camel case:

Python
class ExpectColumnValuesToOnlyContainVowels(RegexBasedColumnMapExpectation):

You can also go ahead and write a new one-line docstring, replacing

Python
"""TODO: Add a docstring here"""

with something like:

Python
"""Expect values in this column to only contain vowels."""

Make sure your one-line docstring begins with "Expect " and ends with a period. You'll also need to change the class name at the bottom of the file, by replacing this line:

Python
ExpectColumnValuesToMatchSomeRegex().print_diagnostic_checklist()

with this one:

Python
ExpectColumnValuesToOnlyContainVowels(
    column="only_vowels"
).print_diagnostic_checklist()

Later, you can go back and write a more thorough docstring. See Expectation Docstring Formatting.

At this point you can re-run your diagnostic checklist. You should see something like this:

$ python expect_column_values_to_only_contain_vowels.py

Completeness checklist for ExpectColumnValuesToOnlyContainVowels:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
Has at least one positive and negative example case, and all test cases pass
Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
...

Add example cases

You're going to search for examples = [] in your file, and replace it with at least two test examples. These examples serve the following purposes:

  • They provide test fixtures that Great Expectations can execute automatically with pytest.

  • They help users understand the logic of your Expectation by providing tidy examples of paired input and output. If you contribute your Expectation to open source, these examples will appear in the Gallery.

Your examples will look similar to this example:

Python
examples = [
    {
        "data": {
            "only_vowels": ["a", "e", "I", "O", "U", "y", ""],
            "mixed": ["A", "b", "c", "D", "E", "F", "g"],
            "longer_vowels": ["aei", "YAY", "oYu", "eee", "", "aeIOUY", None],
            "contains_vowels_but_also_other_stuff": [
                "baa",
                "aba",
                "aab",
                "1a1",
                "a a",
                " ",
                "*",
            ],
        },
        "only_for": ["pandas", "spark", "sqlite", "postgresql"],
        "tests": [
            {
                "title": "positive_test",
                "exact_match_out": False,
                "in": {"column": "only_vowels"},
                "out": {
                    "success": True,
                },
                "include_in_gallery": True,
            },
            {
                "title": "negative_test",
                "exact_match_out": False,
                "in": {"column": "mixed"},
                "out": {
                    "success": False,
                    "unexpected_index_list": [1, 2, 3, 5, 6],
                },
                "include_in_gallery": True,
            },
            {
                "title": "another_positive_test",
                "exact_match_out": False,
                "in": {"column": "longer_vowels"},
                "out": {
                    "success": True,
                },
                "include_in_gallery": True,
            },
            {
                "title": "another_negative_test",
                "exact_match_out": False,
                "in": {"column": "contains_vowels_but_also_other_stuff"},
                "out": {
                    "success": False,
                    "unexpected_index_list": [0, 1, 2, 3, 4, 5, 6],
                },
                "include_in_gallery": True,
            },
            {
                "title": "mostly_positive_test",
                "exact_match_out": False,
                "in": {"column": "mixed", "mostly": 0.1},
                "out": {
                    "success": True,
                },
                "include_in_gallery": True,
            },
            {
                "title": "mostly_negative_test",
                "exact_match_out": False,
                "in": {"column": "mixed", "mostly": 0.3},
                "out": {
                    "success": False,
                },
                "include_in_gallery": True,
            },
        ],
    }
]

Here's a quick overview of how to create test cases to populate examples. The overall structure is a list of dictionaries. Each dictionary has two keys:

  • data: defines the input data of the example as a table/data frame. In this example the table has columns named only_vowels, mixed, longer_vowels, and contains_vowels_but_also_other_stuff. All of these columns have 7 rows. (Note: if you define multiple columns, make sure that they have the same number of rows.)
  • tests: a list of test cases to validate against the data frame defined in the corresponding data.
    • title should be a descriptive name for the test case. Make sure it contains no spaces.
    • include_in_gallery: This must be set to True if you want this test case to be visible in the Gallery as an example.
    • in contains exactly the parameters that you want to pass in to the Expectation. "in": {"column": "mixed", "mostly": 0.1} in the example above is equivalent to expect_column_values_to_only_contain_vowels(column="mixed", mostly=0.1).
    • out is based on the Validation Result returned when executing the Expectation.
    • exact_match_out: if you set exact_match_out=False, you don't need to include all the elements of the Validation Result object, only the ones that are important to test.
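The `out` values in a fixture can be derived by hand. As an illustrative sketch (not GX internals), the unexpected_index_list for the negative_test above is just the set of row indices in the mixed column that fail the vowel regex this tutorial defines later:

```python
import re

# The vowel regex this tutorial defines below; rows failing it are "unexpected".
pattern = r"^[aeiouyAEIOUY]*$"
mixed = ["A", "b", "c", "D", "E", "F", "g"]

unexpected_index_list = [
    i for i, value in enumerate(mixed) if not re.search(pattern, value)
]
# Only "A" (index 0) and "E" (index 4) are vowel-only, so the
# unexpected indices are [1, 2, 3, 5, 6], matching the fixture above.
```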

If you run your Expectation file again, you won't see any new checkmarks, as the logic for your Custom Expectation hasn't been implemented yet. However, you should see that the tests you've written are now being caught and reported in your checklist:

$ python expect_column_values_to_only_contain_vowels.py

Completeness checklist for ExpectColumnValuesToOnlyContainVowels:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
...
Has core logic that passes tests for all applicable Execution Engines and SQL dialects
Only 0 / 2 tests for pandas are passing
Only 0 / 2 tests for spark are passing
Only 0 / 2 tests for sqlite are passing
Failing: basic_positive_test, basic_negative_test
...
Passes all linting checks
note

For more information on tests and example cases,
see our guide on how to create example cases for a Custom Expectation.

Define your regex and connect it to your Expectation

This is the stage where you implement the actual business logic for your Expectation.

In the case of your Custom RegexBasedColumnMapExpectation, Great Expectations will handle the actual application of the regex to your data.

To do this, we replace these:

Python
regex_camel_name = "RegexName"
regex = "regex pattern"

with something like this:

Python
regex_camel_name = "Vowel"
regex = "^[aeiouyAEIOUY]*$"
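You can sanity-check the pattern directly with Python's re module before wiring it into the Expectation. The anchors and the `*` quantifier mean a string matches only if every character is a vowel (the empty string matches too):

```python
import re

vowel_regex = r"^[aeiouyAEIOUY]*$"

# Strings made up entirely of vowels match, as does the empty string
# (the * quantifier allows zero characters between the anchors):
bool(re.search(vowel_regex, "aeIOUY"))  # True
bool(re.search(vowel_regex, ""))        # True
# Any string containing a non-vowel character does not match:
bool(re.search(vowel_regex, "baa"))     # False
bool(re.search(vowel_regex, "a a"))     # False
```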

You can optionally specify the plural form of the Semantic Type you're validating; this gives Great Expectations more detail when rendering your Custom Expectation.

For example:

Python
semantic_type_name_plural = None

becomes:

Python
semantic_type_name_plural = "vowels"

Great Expectations will use these values to tell your Custom Expectation to apply your specified regex as a Metric (a computed attribute of data, such as the mean of a column) to be utilized in validating your data.

This is all that you need to define for now. The RegexBasedColumnMapExpectation class has built-in logic to handle all the machinery of data validation, including standard parameters like mostly, generation of Validation Results, etc.
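You can sanity-check the built-in mostly handling against the example fixtures by hand. This is an illustrative calculation, not GX internals: in the mixed column, only 2 of 7 rows are vowel-only, a fraction of about 0.286, which explains why the fixtures expect success at mostly=0.1 but failure at mostly=0.3:

```python
import re

pattern = r"^[aeiouyAEIOUY]*$"
mixed = ["A", "b", "c", "D", "E", "F", "g"]

# Fraction of rows answering "yes" to the regex question:
fraction = sum(bool(re.search(pattern, v)) for v in mixed) / len(mixed)
# Only "A" and "E" are vowel-only, so fraction is 2/7 (about 0.286):
fraction >= 0.1  # True  -> mostly_positive_test expects success
fraction >= 0.3  # False -> mostly_negative_test expects failure
```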

Other parameters

Expectation Success Keys - A tuple of the values that must (or could) be provided by the user; these define how the Expectation evaluates success.

Expectation Default Kwarg Values (Optional) - Default values for success keys and the defined domain, among other values.

Running your diagnostic checklist at this point should return something like this:

$ python expect_column_values_to_only_contain_vowels.py

Completeness checklist for ExpectColumnValuesToOnlyContainVowels:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
✔ Has at least one positive and negative example case, and all test cases pass
✔ Has core logic and passes tests on at least one Execution Engine
Passes all linting checks
...

Linting

Finally, we need to lint our now-functioning Custom Expectation. Our CI system will test your code using black and ruff.

If you've set up your dev environment, these libraries will already be available to you, and can be invoked from your command line to automatically lint your code:

black <PATH/TO/YOUR/EXPECTATION.py>
ruff <PATH/TO/YOUR/EXPECTATION.py> --fix
info

If desired, you can automate this to happen at commit time. See our guidance on linting for more on this process.

Once this is done, running your diagnostic checklist should now reflect your Custom Expectation as meeting our linting requirements:

$ python expect_column_values_to_only_contain_vowels.py

Completeness checklist for ExpectColumnValuesToOnlyContainVowels:
✔ Has a valid library_metadata object
✔ Has a docstring, including a one-line short description that begins with "Expect" and ends with a period
✔ Has at least one positive and negative example case, and all test cases pass
✔ Has core logic and passes tests on at least one Execution Engine
✔ Passes all linting checks
...
note

If you've already built a Custom Expectation of a different type, you may notice that we didn't explicitly implement a _validate method or Metric class here. While we have to explicitly create these for other types of Custom Expectations, the RegexBasedColumnMapExpectation class handles Metric creation and result validation implicitly; no extra work needed!

Contribute (Optional)

This guide will leave you with a Custom Expectation sufficient for contribution back to Great Expectations at an Experimental level.

If you plan to contribute your Expectation to the public open source project, you should update the library_metadata object before submitting your Pull Request. For example:

Python
library_metadata = {
    "tags": [],  # Tags for this Expectation in the Gallery
    "contributors": [  # GitHub handles for all contributors to this Expectation.
        "@your_name_here",  # Don't forget to add your GitHub handle here!
    ],
}

would become

Python
library_metadata = {
    "tags": ["regex"],
    "contributors": ["@joegargery"],
}

This is particularly important because we want to make sure that you get credit for all your hard work!

note

For more information on our code standards and contribution, see our guide on Levels of Maturity for Expectations.

To view the full script used in this page, see it on GitHub: