Validate data schema with GX
Data schema refers to the structural blueprint of a dataset, encompassing elements such as column names, data types, and the overall organization of information. When working with data, ensuring that it adheres to its predefined schema is a critical aspect of data quality management. This process, known as schema validation, is among the top-priority use cases for data quality platforms.
Validating your data's schema is crucial for maintaining data reliability and usability in downstream tasks. This process involves checking that the structure of your dataset conforms to established rules, such as verifying column names, data types, and the presence of required fields. Schema changes, whether planned or unexpected, can significantly impact data integrity and the performance of data-dependent systems.
Great Expectations (GX) provides schema-focused Expectations that allow you to define and enforce the structural integrity of your datasets. These tools enable you to establish robust schema validation within your data pipelines, helping to catch and address schema-related issues before they propagate through your data ecosystem. This guide will walk you through leveraging these Expectations to implement effective schema validation in your data workflows.
Prerequisite knowledge
This article assumes basic familiarity with GX components and workflows. See the GX Overview for additional content on GX fundamentals.
Data preview
Below is a sample of the dataset that is referenced by examples and explanations within this article.
type | sender_account_number | recipient_fullname | transfer_amount | transfer_date |
---|---|---|---|---|
domestic | 244084670977 | Jaxson Duke | 9143.40 | 2024-05-01 01:12 |
domestic | 954005011218 | Nelson O’Connell | 3285.21 | 2024-05-01 05:08 |
This dataset is representative of financial transfers recorded by banking institutions. Its fields include transfer type, sender account number, recipient name, transfer amount, and transfer date.
You can access this dataset from the great_expectations GitHub repo in order to reproduce the code recipes provided in this article.
Key schema Expectations
GX offers a collection of Expectations for schema validation, all of which can be added directly in GX Cloud or GX Core.
The schema Expectations provide basic practical solutions for common validation scenarios and can also be used to satisfy more nuanced validation needs.
Column-level Expectations
Column-level schema Expectations ensure that the individual columns within your dataset adhere to specific criteria. These Expectations are designed to validate various aspects such as data type and permissible value ranges within columns.
Expect Column Values To Be Of Type
Validates that the values within a column are of a specific data type. This is useful for scenarios needing strict type adherence.
Use Case: Handling data transferred using formats that do not embed schema (e.g., CSV), where apparent type changes can occur when new values appear.
gxe.ExpectColumnValuesToBeOfType(column="transfer_amount", type_="DOUBLE_PRECISION")
View ExpectColumnValuesToBeOfType in the Expectation Gallery.
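The CSV use case can be illustrated without GX. Because CSV stores only text, a loader has to infer each column's type from the values it sees. The sketch below uses only Python's standard library and two hypothetical helpers, `infer_type` and `column_values`, which are not part of GX. The `N/A` value is made up for illustration. It shows how the apparent type of `transfer_amount` can change when a new, non-numeric value appears.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest type name that fits every value (hypothetical helper)."""
    for name, caster in (("int", int), ("float", float)):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            # At least one value does not parse as this type; try the next one.
            continue
    return "str"


def column_values(csv_text, column):
    """Read one column from CSV text; every value arrives as a string."""
    return [row[column] for row in csv.DictReader(io.StringIO(csv_text))]


# Two batches of the same feed. The second contains an illustrative,
# non-numeric amount ("N/A") that is not in the sample dataset.
batch_1 = "transfer_amount\n9143.40\n3285.21\n"
batch_2 = "transfer_amount\n9143.40\nN/A\n"

print(infer_type(column_values(batch_1, "transfer_amount")))  # float
print(infer_type(column_values(batch_2, "transfer_amount")))  # str
```

A type Expectation on `transfer_amount` is meant to surface this kind of drift at validation time rather than in a downstream computation.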
Expect Column Values To Be In Type List
Ensures that the values in a specified column are within a specified type list. This Expectation is useful for columns with varied permissible types, such as mixed-type fields often found in legacy databases.
Use Case: Suitable for datasets transitioning from older systems where type consistency might not be strictly enforced, aiding smooth data migration and validation.
gxe.ExpectColumnValuesToBeInTypeList(column="type", type_list=["INTEGER", "STRING"])
View ExpectColumnValuesToBeInTypeList in the Expectation Gallery.
Combine ExpectColumnValuesToBeInTypeList with detailed logging to track which types are most frequently encountered, aiding in eventual standardization efforts.
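One way to approach the logging side of this tip is to tally the types seen in a column and log the counts. The following is a minimal sketch in plain Python, not a GX feature: the `tally_types` helper, the logger name, and the sample values are all hypothetical and chosen only for illustration.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("schema_audit")  # hypothetical logger name


def tally_types(values):
    """Count how often each Python type name occurs among the values."""
    return Counter(type(value).__name__ for value in values)


# Illustrative mixed-type column, as might come from a legacy system.
legacy_column = [1, 2, "domestic", 3, "international"]

counts = tally_types(legacy_column)
for type_name, count in counts.most_common():
    # Logs the most frequent type first.
    logger.info("type=%s count=%d", type_name, count)
```

Reviewing these counts over time can indicate which type a mixed column should eventually be standardized to.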
Table-level Expectations
Table-level schema Expectations focus on the overall structure of your dataset. These Expectations are aimed at ensuring the dataset conforms to predefined schema constraints like the presence of necessary columns, column count, and column order.
Expect Column To Exist
Ensures the presence of a specified column in your dataset. This Expectation is foundational for schema validation, verifying that critical columns are included, thus preventing data processing errors due to missing fields.
Use Case: Ideal during data ingestion or integration of multiple data sources to ensure that essential fields are present before proceeding with downstream processing.
gxe.ExpectColumnToExist(column="sender_account_number")
View ExpectColumnToExist in the Expectation Gallery.
Expect Table Column Count To Equal
Ensures the dataset has an exact number of columns. This precise Expectation is for datasets with a fixed schema structure, providing a strong safeguard against unexpected changes.
Use Case: Perfect for regulatory reporting scenarios where the schema is strictly defined, and any deviation can lead to compliance violations.
gxe.ExpectTableColumnCountToEqual(value=5)
View ExpectTableColumnCountToEqual in the Expectation Gallery.