
Validate unstructured data with GX Cloud

Enterprise data often includes large amounts of unstructured data such as PDFs, images, emails, and sensor logs, but validating the quality of that data is difficult. Data quality issues in unstructured data often go unnoticed, leading to downstream problems. For example, an AI model may produce poor outputs if duplicate documents and failed OCR (Optical Character Recognition) are not flagged immediately.

This tutorial provides a working, hands-on example of how to validate unstructured data using sample PDF data and GX Cloud. An OCR process on a PDF doesn't just extract text; it also produces metadata such as confidence scores and word counts. GX Cloud lets you set up data quality checks on this metadata to build confidence in your unstructured data, while also allowing your collaborators to view the results.

Prerequisite knowledge

This article assumes basic familiarity with GX components and workflows. If you're new to GX, start with the GX Cloud and GX Core overviews to familiarize yourself with key concepts and setup procedures.

Prerequisites

Install dependencies

  1. Open a terminal window and navigate to the folder you want to use for this tutorial.

  2. Install poppler and tesseract. Poppler is a PDF rendering library that this tutorial uses to read the PDFs. Tesseract is an open source OCR engine that this tutorial uses to perform OCR on the PDFs. The commands below use Homebrew on macOS; for other platforms, see the note after this list.

    Terminal input
    brew install poppler
    brew install tesseract
  3. Optional. Create a Python virtual environment and activate it.

    Terminal input
    python -m venv my_venv
    source my_venv/bin/activate
  4. Install the Python libraries that you will use in this tutorial, including the Great Expectations library.

    Terminal input
    pip install pandas
    pip install datasets
    pip install pdf2image
    pip install pytesseract
    pip install great_expectations
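
If you are not using Homebrew, poppler and tesseract are available from most system package managers under similar names. For example, on Debian or Ubuntu (an assumed platform, not covered by the original steps), the following commands should be an equivalent substitute:

    Terminal input
    sudo apt-get install poppler-utils
    sudo apt-get install tesseract-ocr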

Import the required Python libraries

  1. Create the Python file for this project.

    Terminal input
    touch gx_unstructured_data.py
  2. Open the Python file in your code editor of choice.

  3. Import the libraries you will be using for data validation in this tutorial.

    Python
    import pandas as pd  # Data manipulation

    import great_expectations as gx # Data validation
    import great_expectations.exceptions.exceptions as gxexceptions # For exceptions
    import great_expectations.expectations as gxe # For Expectations
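
If OCR fails later in the tutorial, a common cause is that pytesseract cannot locate the Tesseract binary. The optional check below (a minimal sketch, separate from the tutorial's script) confirms that the binary you installed earlier is on your PATH:

    Python
    import pytesseract  # Python wrapper around the Tesseract OCR engine

    # Raises TesseractNotFoundError if the tesseract binary cannot be found
    print(pytesseract.get_tesseract_version())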

Load the dataset and convert it into a dataframe

This tutorial uses an open source dataset of PDFs from Hugging Face. You will convert the pages of the first 5 PDFs into images, run OCR on each page, and extract per-document metrics from the results.

  1. Load the dataset.

    Python
    from datasets import load_dataset  # Load PDF OCR dataset from Hugging Face

    ds = load_dataset("broadfield-dev/pdf-ocr-dataset", split="train[:5]")
  2. Iterate through the PDFs, converting each page into an image, running OCR on it, and storing the aggregated metrics.

    Python
    import pytesseract  # OCR engine
    import requests
    from pdf2image import convert_from_bytes # Convert PDF pages to images
    from pytesseract import Output # Structured OCR output

    records = []

    for sample in ds:
        # Download PDF from URL in 'urls' field
        pdf_url = None
        urls = sample.get("urls")
        if isinstance(urls, list) and urls:
            pdf_url = urls[0]
        elif isinstance(urls, str):
            pdf_url = urls

        if not pdf_url:
            print(f"No PDF URL found in sample: {list(sample.keys())}")
            continue

        response = requests.get(pdf_url)
        if response.status_code != 200:
            print(f"Failed to download PDF from {pdf_url}")
            continue

        pdf_bytes = response.content
        print(f"Processing PDF: {sample.get('ids', 'unknown')}")
        pages = convert_from_bytes(pdf_bytes, dpi=200)
        all_ocr_text = []
        all_confidences = []
        all_heights = []
        for image in pages:
            # Run OCR on the PDFs
            ocr_data = pytesseract.image_to_data(image, output_type=Output.DICT)
            ocr_text = pytesseract.image_to_string(image)
            all_ocr_text.append(ocr_text)
            # Collect confidences and heights for each page
            all_confidences.extend(
                [
                    float(c)
                    for t, c in zip(ocr_data["text"], ocr_data["conf"])
                    if t.strip() and c != "-1"
                ]
            )
            all_heights.extend(
                [int(h) for t, h in zip(ocr_data["text"], ocr_data["height"]) if t.strip()]
            )

        full_text = "\n".join(all_ocr_text)
        avg_conf = sum(all_confidences) / len(all_confidences) if all_confidences else 0
        header_count = sum(1 for h in all_heights if h > 20)

        # Store metrics for validation
        records.append(
            {
                "file_name": sample.get("ids", "unknown"),
                "text_length": len(full_text),
                "ocr_confidence": round(avg_conf, 2),
                "num_detected_headers": header_count,
            }
        )
  3. Convert the metrics into a dataframe for validation.

    Python
    df = pd.DataFrame(records)
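
Optionally, you can print the dataframe before validating it to confirm that the OCR metrics were captured; the exact values depend on the PDFs processed.

    Python
    # Optional: inspect the OCR metrics that will be validated
    print(df)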

Connect to GX Cloud and define Expectations

In this tutorial, you will connect to your GX Cloud organization using the GX Cloud API. You will either get or create a pandas Data Source and a dataframe Data Asset. Batch Definitions both organize a Data Asset's records into Batches and provide a method for retrieving those records. The Batch Definition in this tutorial will use the whole dataframe that you created in the previous step.
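
Before running this part of the script, your environment needs credentials for your GX Cloud organization. One way to provide them (assuming you prefer environment variables over passing values in code) is to export your GX Cloud access token and organization ID before starting Python; replace the placeholders below with your own values.

    Terminal input
    export GX_CLOUD_ACCESS_TOKEN=<your_access_token>
    export GX_CLOUD_ORGANIZATION_ID=<your_organization_id>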

  1. Instantiate the GX Data Context and get or create the Data Source, Data Asset, and Batch Definition.

    Python
    context = gx.get_context()
    try:
        datasource = context.data_sources.get("PDF Scans")
    except KeyError:
        datasource = context.data_sources.add_pandas("PDF Scans")

    try:
        asset = datasource.get_asset("OCR Results")
    except LookupError:
        asset = datasource.add_dataframe_asset("OCR Results")

    try:
        batch_definition = asset.get_batch_definition("default")
    except KeyError:
        batch_definition = asset.add_batch_definition_whole_dataframe("default")
  2. Get or create an Expectation Suite and create Expectations to validate the metrics generated from the PDFs. This tutorial uses the ExpectColumnValuesToBeBetween Expectation to validate that the metrics stored in the dataframe fall within the expected ranges. You can also try different Expectations or value ranges; see the example after this code block.

    Python
    try:
        suite = context.suites.get(name="OCR Metrics Suite")
    except gxexceptions.DataContextError:
        suite = gx.ExpectationSuite("OCR Metrics Suite")
        suite = context.suites.add(suite)
        suite.add_expectation(
            gxe.ExpectColumnValuesToBeBetween(column="text_length", min_value=500)
        )  # at least 500 characters
        suite.add_expectation(
            gxe.ExpectColumnValuesToBeBetween(column="ocr_confidence", min_value=70)
        )  # at least 70% confidence
        suite.add_expectation(
            gxe.ExpectColumnValuesToBeBetween(column="num_detected_headers", min_value=2)
        )  # at least 2 headers
        suite.save()
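
As one example of extending the suite, the introduction mentions duplicate documents as a common unstructured data problem. A possible additional Expectation (an optional sketch, not part of the suite defined above) is to require file names to be unique within the batch:

    Python
    # Flag duplicate documents by requiring each file name to appear only once
    suite.add_expectation(gxe.ExpectColumnValuesToBeUnique(column="file_name"))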

Validate your Expectations

GX uses a Validation Definition to link a Batch Definition to an Expectation Suite, and a Checkpoint to execute Validations. The results of the Validations can later be viewed in the GX Cloud UI.

  1. Create the Validation Definition.

    Python
    try:
        vd = context.validation_definitions.get("OCR Results VD")
    except gxexceptions.DataContextError:
        vd = gx.ValidationDefinition(
            data=batch_definition, suite=suite, name="OCR Results VD"
        )
        context.validation_definitions.add(vd)
  2. Create and run the Checkpoint.

    Python
    try:
        checkpoint = context.checkpoints.get("OCR Checkpoint")
    except gxexceptions.DataContextError:
        checkpoint = gx.Checkpoint(name="OCR Checkpoint", validation_definitions=[vd])
        context.checkpoints.add(checkpoint)

    checkpoint.run(batch_parameters={"dataframe": df})
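
checkpoint.run() also returns a result object, so you can get a quick pass/fail signal in the terminal before opening GX Cloud. A minimal sketch (assuming you capture the return value of the run call above) is:

    Python
    # Capture the Checkpoint run result and print an overall pass/fail summary
    result = checkpoint.run(batch_parameters={"dataframe": df})
    print(f"Validation passed: {result.success}")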

Review the results

Now that you have set up the Data Source, Data Asset, and Expectations and have run the Checkpoint, you can view the Validation Results in the GX Cloud UI.

  1. Log in to GX Cloud, navigate to the Data Assets page, and find the OCR Results Data Asset that you created earlier in the tutorial.

    The Data Assets page lists all of the Data Assets that have been created. The list can be filtered by using the search function.

  2. Open the Data Asset and select the Validations tab. Under Expectation Suites, select the OCR Metrics Suite that you created above, and then under Batches & run history, select the Validation you just ran.

    Data Assets can have multiple Expectation Suites. Each Expectation Suite may have many Validation Results.

The path forward

Using this tutorial as a framework, you can plug in your own unstructured data and add other Expectations from the Expectation Gallery to the Expectation Suite. You can also explore validating your unstructured data within a data pipeline by using this code with an orchestrator.

Businesses that rely on unstructured data should take the steps necessary to ensure its quality, but this is only one of many data quality scenarios relevant to an organization. Explore our other data quality use cases for more insights and best practices, and expand your data validation to encompass key quality dimensions.