Validate data uniqueness with GX
Data uniqueness is a fundamental aspect of data quality that ensures distinct values are present only once where expected in a dataset. Uniqueness constraints are often applied to columns that serve as primary keys, composite keys, or other unique identifiers. Validating uniqueness is critical for maintaining data integrity, preventing duplication, and enabling accurate analysis.
Failing to validate uniqueness can lead to various data quality issues:
- Duplicates can skew analytics, leading to incorrect conclusions and flawed decision-making. For example, duplicate transactions could overstate revenue.
- Non-unique identifiers can cause data corruption when merging or joining datasets, resulting in lost data or mismatched records.
- Redundant data wastes storage space and complicates data management. It also slows query performance by unnecessarily increasing table size.
- Inconsistencies from uniqueness violations erode trust in the data. Analysts and executives may doubt data-derived recommendations and hesitate to act on the insights.
Great Expectations (GX) provides a collection of Expectations to validate data uniqueness. By codifying uniqueness rules and continuously validating data against them, organizations can catch issues early and ensure high-quality data for downstream consumption. This article discusses how to leverage GX to implement uniqueness checks as part of your data validation.
Prerequisite knowledge
This article assumes basic familiarity with GX components and workflows. If you're new to GX, start with the GX Cloud and GX Core overviews to familiarize yourself with key concepts and setup procedures.
Data preview
The examples in this article use a sample customer dataset. The data is available from a public Postgres database, as shown in the examples, or can also be accessed in CSV format.
customer_id | first_name | last_name | email_address | secondary_email | phone_number | country_code | government_id
---|---|---|---|---|---|---|---
1 | John | Doe | johndoe@email.com | | +1 123 456-7890 | US | 123-45-6789
2 | Jane | Smith | jsmith@email.com | jsmith@email.com | +1 409 437-3210 | CA | 987 654 321
3 | Bob | Thatcher | bobthatcher@email.com | | | US | 235-98-4389
4 | Jon | Doe | jon.doe@email.com | jon.doe2@email.com | +1 888 999-9999 | US |
5 | J | Doe | jd@email.com | | | US | 123-45-6789
6 | Jenny | Williams | | | | CA | 298 367 456
7 | Johnathan | Doe | johndoe@email.co.uk | | +44 333 981537 | UK | RC 23 94 27 B
8 | Liz | Brown | l2@email.com | lizzeb@email.com | +44 1224 587623 | UK |
9 | Jonathan | Doe | jdoe@email.co.uk | | +44 333 991537 | UK | RC 23 94 27 C
10 | Elizabeth | Brown | lizzeb@email.com | l2@email.com | +44 1224 587623 | UK |
Primary key fields, such as `customer_id`, can uniquely identify a row of data but cannot always uniquely identify the corresponding real-world entity, such as a person. In this dataset, some individual customers are represented by multiple rows, each with slight variations in their registered information. This scenario is common in real-world customer databases and presents a challenge for validating and maintaining data uniqueness.

Uniqueness is particularly crucial for fields like `customer_id`, `email_address`, and `government_id`. However, due to data entry errors, multiple registrations, or system migrations, duplicates can still occur. When combined, fields such as `country_code` and `government_id` might ideally form a unique identifier for each customer. This ensures that each customer is distinctly recorded and prevents issues like fragmented customer profiles or incorrect communications.
Key uniqueness Expectations
Duplicate data can manifest as duplicate rows within a dataset, or as duplicate values within a single row. GX provides Expectations, available in both GX Cloud and GX Core, that validate for uniqueness across rows and within rows.
Expect column values to be unique
This Expectation checks that each value in a column is unique. It is useful to validate that there are no duplicates in a column that should contain only unique values, such as a primary key column.
Example: Expect the `customer_id` column to contain only unique values.

```python
gxe.ExpectColumnValuesToBeUnique(column="customer_id")
```

See `ExpectColumnValuesToBeUnique` in the Expectation Gallery.
Expect compound columns to be unique
This Expectation validates that the combination of values across multiple columns is unique for each row. It can be used to check uniqueness across a set of columns that together form a unique identifier, such as a composite key.
Example: Expect the combination of `country_code` and `government_id` to uniquely identify a customer record.

```python
gxe.ExpectCompoundColumnsToBeUnique(
    column_list=["country_code", "government_id"],
)
```

See `ExpectCompoundColumnsToBeUnique` in the Expectation Gallery.
Expect column proportion of unique values to be between
This Expectation validates that the proportion of unique values in a column is between a specified minimum and maximum value. It is useful for ensuring a certain level of uniqueness in a column without requiring full uniqueness.
Example: Validate that at least 90% of all customer `email_address` values are unique.

```python
gxe.ExpectColumnProportionOfUniqueValuesToBeBetween(
    column="email_address", min_value=0.9, max_value=1.0
)
```

See `ExpectColumnProportionOfUniqueValuesToBeBetween` in the Expectation Gallery.
Expect column unique value count to be between
This Expectation validates that the number of unique values in a column is between a specified minimum and maximum value. It is useful for validating that the quantity of unique values falls within an expected range, for example, checking for a known range of distinct category values in a categorical column.
Example: Ensure that the `country_code` column contains between 1 and 5 unique values.

```python
gxe.ExpectColumnUniqueValueCountToBeBetween(
    column="country_code", min_value=1, max_value=5
)
```

See `ExpectColumnUniqueValueCountToBeBetween` in the Expectation Gallery.
Expect select column values to be unique within record
This Expectation validates that, for each row, the values across a specified set of columns are unique. It is useful to check for the presence of duplicate column values within a single row. Note that this Expectation allows for duplicate rows within the dataset, as long as the specified columns have unique values within each individual row.
Example: Expect each customer record to have a unique `email_address` and `secondary_email`.

```python
gxe.ExpectSelectColumnValuesToBeUniqueWithinRecord(
    column_list=["email_address", "secondary_email"],
)
```

See `ExpectSelectColumnValuesToBeUniqueWithinRecord` in the Expectation Gallery.
- If your data allows for a small number of duplicates, consider using `ExpectColumnProportionOfUniqueValuesToBeBetween` or `ExpectColumnUniqueValueCountToBeBetween` instead of strict uniqueness Expectations. These Expectations let you set a threshold for the proportion or count of unique values, providing more flexibility in cases where perfect uniqueness is not required or where a small number of duplicates is acceptable, as sketched below.
- When validating uniqueness, consider the level of granularity required for your use case. Column-level Expectations like `ExpectColumnValuesToBeUnique` ensure uniqueness within a single column, while row-level Expectations like `ExpectCompoundColumnsToBeUnique` validate uniqueness across multiple columns. Choose the appropriate Expectation based on whether you need to validate a unique identifier, a composite key, or a combination of fields that should be unique within each row.
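As a quick illustration of the first point, here is a minimal sketch contrasting a strict uniqueness check with a tolerance-based check on the sample `email_address` column; the 90% threshold is illustrative and should be tuned to your own tolerance for duplicates.

```python
import great_expectations.expectations as gxe

# Strict: every email_address value must be unique; a single duplicate fails validation.
strict_email_check = gxe.ExpectColumnValuesToBeUnique(column="email_address")

# Tolerant: validation passes as long as at least 90% of email_address values are unique.
tolerant_email_check = gxe.ExpectColumnProportionOfUniqueValuesToBeBetween(
    column="email_address", min_value=0.9, max_value=1.0
)
```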
Example: Identify potential duplicate entities
Context: Primary key fields can be used to uniquely identify rows of data, but cannot always uniquely identify the corresponding real-world entity represented by the data. Preventing duplicate records and maintaining unique data often cannot be accomplished by primary key constraints alone; additional uniqueness checks must be used to validate groups of attributes that can uniquely identify entities.
For instance, in customer datasets, a `customer_id` can be created for all customer records, but duplicate customers (real-world individuals represented by multiple rows of data) can still enter the dataset for a variety of reasons: inconsistent contact information provided by the customer themselves, errors in data entry, or bad data migrations.
Goal: Using uniqueness Expectations in GX Cloud or GX Core, identify potential duplicate entities in the sample customer dataset.
Use the GX Cloud UI to walk through the following steps:
- Create a Postgres Data Asset for the `uniqueness_customers` table using the following connection string:

  ```
  postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality
  ```

- Add an Expectation to validate the uniqueness of the `customer_id` column, which serves as a primary key for the dataset.
  - Expectation: Expect column values to be unique
  - Column: `customer_id`
- Add two Expectations that validate the uniqueness of composite keys that should represent a single real-world entity (the customer).
  - Expect that individual customers have a unique government id, relative to their country.
    - Expectation: Expect compound columns to be unique
    - Column List: `country_code`, `government_id`
    - Ignore Row If: Any value is missing
  - Expect that individual customers have a unique last name and phone number.
    - Expectation: Expect compound columns to be unique
    - Column List: `last_name`, `phone_number`
    - Ignore Row If: Any value is missing
- Validate the `uniqueness_customers` Data Asset with the newly created Expectations.
- Review the Validation Results. Under Batches & run history, select the individual Validation run (not All Runs) to view the sample unexpected values that were identified for failing Expectations.
Result: Based on the sample unexpected values shown for each failing Expectation, you can see that two potential groups of duplicate customer records are identified.
Run the following GX Core workflow:
```python
import great_expectations as gx
import great_expectations.expectations as gxe

# Create a Data Context.
context = gx.get_context(mode="ephemeral")

# Connect to data and create the Data Source, Data Asset, Batch Definition, and Batch.
CONNECTION_STRING = "postgresql+psycopg2://try_gx:try_gx@postgres.workshops.greatexpectations.io/gx_learn_data_quality"

data_source = context.data_sources.add_postgres(
    "postgres database", connection_string=CONNECTION_STRING
)
data_asset = data_source.add_table_asset(
    name="customers", table_name="uniqueness_customers"
)
batch_definition = data_asset.add_batch_definition_whole_table("batch definition")
batch = batch_definition.get_batch()

# Create an Expectation Suite containing uniqueness Expectations.
expectation_suite = context.suites.add(gx.ExpectationSuite(name="expectation suite"))

# Validate uniqueness of the primary key column.
expectation_suite.add_expectation(
    gxe.ExpectColumnValuesToBeUnique(
        column="customer_id",
    )
)

# Validate composite keys that potentially represent a single entity
# (the individual customer) for uniqueness.
expectation_suite.add_expectation(
    gxe.ExpectCompoundColumnsToBeUnique(column_list=["country_code", "government_id"])
)
expectation_suite.add_expectation(
    gxe.ExpectCompoundColumnsToBeUnique(column_list=["last_name", "phone_number"])
)

# Validate the Batch using the uniqueness Expectation Suite.
validation_result = batch.validate(expectation_suite)

# Parse the Validation Result for potential duplicate customers.
for expectation_result in validation_result["results"]:
    expectation = expectation_result["expectation_config"]["type"]
    result = expectation_result["success"]

    if result is False:
        unexpected_sample = expectation_result["result"]["partial_unexpected_list"]
        print(
            f"Expectation: {expectation}, Potential duplicates found:\n{unexpected_sample}\n"
        )
```
Result:

```
Expectation: expect_compound_columns_to_be_unique, Potential duplicates found:
[{'country_code': 'US', 'government_id': '123-45-6789'}, {'country_code': 'US', 'government_id': '123-45-6789'}]

Expectation: expect_compound_columns_to_be_unique, Potential duplicates found:
[{'last_name': 'Brown', 'phone_number': '+44 1224 587623'}, {'last_name': 'Brown', 'phone_number': '+44 1224 587623'}]
```
Parsing the Validation Result object for failing Expectations enables you to access a sample of the unexpected values that caused the validation to fail. These values can be used to identify duplicate rows, based on the definition of uniqueness for an individual customer.
- `country_code`: `US`, `government_id`: `123-45-6789`

  This result suggests that `customer_id` `1` and `5` represent the same customer, John Doe, identified by duplicate rows containing his unique US government id.

- `last_name`: `Brown`, `phone_number`: `+44 1224 587623`

  This result suggests that `customer_id` `8` and `10` represent the same customer, Elizabeth (Liz) Brown, identified by duplicate rows containing her last name and phone number.
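The GX Cloud walkthrough above sets Ignore Row If: Any value is missing on the compound Expectations so that rows with an incomplete composite key are skipped. To mirror that behavior in a GX Core workflow, the compound uniqueness Expectation accepts an `ignore_row_if` argument; the sketch below shows the value that corresponds to the Cloud option, but check the Expectation Gallery entry for the accepted values in your GX version.

```python
import great_expectations.expectations as gxe

# Sketch: skip rows where any column in the composite key is missing,
# mirroring the "Ignore Row If: Any value is missing" setting used in GX Cloud.
gxe.ExpectCompoundColumnsToBeUnique(
    column_list=["country_code", "government_id"],
    ignore_row_if="any_value_is_missing",
)
```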
GX solution: GX Cloud and GX Core both enable validating data uniqueness in flexible ways: within single columns or across compound columns, across records in a dataset or fields within a record. The Validation Results returned by GX can be used to identify duplicate data in addition to monitoring uniqueness.
Scenarios
Detecting duplicate transactions
Context: In financial systems, duplicate transactions can lead to incorrect account balances, unhappy customers, and accounting discrepancies. Monitoring key fields for uniqueness helps detect and prevent such issues.
GX solution: Use `ExpectCompoundColumnsToBeUnique` to validate that the combination of fields that uniquely identify a transaction (e.g., timestamp, sender account, recipient account, amount) is unique across all rows.
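A minimal sketch of this check might look like the following; the column names are hypothetical and should be replaced with the fields that identify a transaction in your schema.

```python
import great_expectations.expectations as gxe

# Hypothetical transaction-identifying columns; adjust to match your schema.
gxe.ExpectCompoundColumnsToBeUnique(
    column_list=["transaction_timestamp", "sender_account", "recipient_account", "amount"]
)
```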
Ensuring integrity of customer records
Context: In a customer database, each customer should have a unique identifier. Duplicate customer IDs can lead to severe data integrity issues, such as incorrectly merged customer profiles, misdirected communications, or inaccurate analytics. If not caught early, resolving duplicate records can become a complex, error-prone, and resource-intensive process.
GX solution: Use `ExpectColumnValuesToBeUnique` to ensure that the customer ID column contains only unique values. If duplicates are found, investigate and resolve them to maintain data integrity.
Detecting anomalies in user agent strings
Context: In web analytics, user agent strings provide information about visitors' browsers and devices. Anomalies in user agent strings, such as an unusually low proportion of unique values, could indicate bot traffic coming from a single source or other potential security issues.
GX solution: Use `ExpectColumnProportionOfUniqueValuesToBeBetween` to check that the proportion of unique user agent strings falls within an expected range. If the proportion is unusually low, investigate the traffic sources and patterns.
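A minimal sketch, assuming the column is named `user_agent` and that at least 5% of values are expected to be distinct (both the column name and the threshold are illustrative):

```python
import great_expectations.expectations as gxe

# Hypothetical column name and thresholds; tune to your observed traffic patterns.
gxe.ExpectColumnProportionOfUniqueValuesToBeBetween(
    column="user_agent", min_value=0.05, max_value=1.0
)
```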
Avoid common uniqueness analysis pitfalls
- Not considering business context: Understand specific uniqueness requirements for each dataset and use case. Blindly applying generic checks can lead to false alarms or missed issues.
- Checking at the wrong granularity: Validate uniqueness at the appropriate level, whether it's individual columns or combinations of columns, based on business requirements.
- Mishandling missing or null values: Decide whether to consider null values as distinct or ignore them in uniqueness validation. Be consistent to avoid skewed results.
- Ignoring subtle differences: Be aware of whitespace, case sensitivity, and type mismatches that could cause false negatives. Clean and normalize data before uniqueness checks, as sketched after this list.
- Not monitoring over time: Continuously monitor uniqueness metrics to detect changes or anomalies. Set up alerts and track unique value counts over time.
- Focusing solely on uniqueness: Combine uniqueness validation with other data quality dimensions, such as completeness, consistency, and validity, for a comprehensive approach.
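To illustrate the normalization point above, here is a minimal sketch using pandas, assuming a DataFrame with an `email_address` column like the sample data; the values shown are hypothetical. Trimming whitespace and lowercasing before validation prevents trivially different strings from slipping past a uniqueness check.

```python
import pandas as pd

# Hypothetical raw records; the first two values differ only in case and trailing whitespace.
df = pd.DataFrame({"email_address": ["JDoe@email.com ", "jdoe@email.com", "liz@email.com"]})

# Normalize before running uniqueness Expectations so that formatting
# differences do not mask true duplicates.
df["email_address"] = df["email_address"].str.strip().str.lower()

print(df["email_address"].is_unique)  # False: the first two rows are now recognized as duplicates
```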
The path forward
Monitoring data for uniqueness is an essential technique in the effort to maintain high-quality data. However, when duplicate data is inadvertently introduced into your organization's systems, uniqueness checks must be paired with effective deduplication and entity resolution to fix the issue.
Uniqueness is an important measure of data quality, but is only one facet of a comprehensive approach to data quality. Effective data quality monitoring and management requires validation of data across other quality dimensions, such as completeness, schema, integrity, volume, and distribution. Explore our data quality use cases for more insights and best practices to expand your data validation to encompass key quality dimensions.