Skip to main content
Version: 0.18.9

How to organize Batches in a SQL-based Data Asset

In this guide we will demonstrate the ways in which Batches can be organized in a SQL-based Data Asset. We will discuss how to use Splitters to divide the data in a table or query based on the contents of a provided field. We will also show how to add Batch Sorters to a Data Asset in order to specify the order in which Batches are returned.

Prerequisites

    • A working installation of Great Expectations
    • A Data Asset in a SQL-based Data Source

    Steps

    1. Import GX and instantiate a Data Context

    Run the following Python code to import GX and instantiate a Data Context:

    Python
    import great_expectations as gx

    context = gx.get_context()

    2. Retrieve a SQL Data Source and Data Asset

    For this guide, we will use a previously defined SQL Data Source named "my_datasource" with a Table Data Asset called "my_asset" which points to a table with taxi data.

    To retrieve this Data Source, we will supply the get_datasource(...) method of our Data Context with the name of the Data Source we wish to retrieve:

    Python
    my_datasource = context.get_datasource("my_datasource")
    my_table_asset = my_datasource.get_asset(asset_name="my_table_asset")

    3. Add a Splitter to the Data Asset

    Our table has a datetime column called "pickup_datetime" which we will use to split our TableAsset into Batches.

    Python
    my_table_asset.add_splitter_year_and_month(column_name="pickup_datetime")

    4. (Optional) Add Batch Sorters to the Data Asset

    We will now add Batch Sorters to our Data Asset. This will allow us to explicitly state the order in which our Batches are returned when we request data from the Data Asset. To do this, we will pass a list of sorters to the add_sorters(...) method of our Data Asset.

    In this example we split "pickup_datetime" column on "year" and "month", so our list of sorters can have up to two elements. We will add an ascending sorter based on the contents of the splitter group "year" and a descending sorter based on the contents of the splitter group "month":

    Python
    my_asset = my_table_asset.add_sorters(["+year", "-month"])

    5. Use a Batch Request to verify the Data Asset works as desired

    To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that will include all the Batches available in the Data asset. Then we will use that Batch Request to get a list of the returned Batches.

    Python
    my_batch_request = my_table_asset.build_batch_request()
    batches = my_table_asset.get_batch_list_from_batch_request(my_batch_request)

    Because a Batch List contains a lot of metadata, it will be easiest to verify which files were included in the returned Batches if we only look at the batch_spec of each returned Batch:

    Python
    for batch in batches:
    print(batch.batch_spec)

    Next steps

    Now that you have further configured a file-based Data Asset, you may want to look into:

    Requesting Data from a Data Asset

    Using Data Assets to create Expectations