In this guide we will demonstrate the ways in which Batches can be organized in a SQL-based Data Asset. We will discuss how to use Splitters to divide the data in a table or query based on the contents of a provided field. We will also show how to add Batch Sorters to a Data Asset in order to specify the order in which Batches are returned.
- A working installation of Great Expectations
- A Data Asset in a SQL-based Data Source
If you're using a Data Source that was created with the advanced block-config method, see How to configure a SQL Data Source with the block-config method.
1. Import GX and instantiate a Data Context
Run the following Python code to import GX and instantiate a Data Context:
import great_expectations as gx
context = gx.get_context()
2. Retrieve a SQL Data Source and Data Asset
For this guide, we will use a previously defined SQL Data Source named
"my_datasource" with a Table Data Asset called
"my_asset" which points to a table with taxi data.
To retrieve this Data Source, we will supply the
get_datasource(...) method of our Data Context with the name of the Data Source we wish to retrieve:
my_datasource = context.get_datasource("my_datasource")
my_table_asset = my_datasource.get_asset(asset_name="my_table_asset")
3. Add a Splitter to the Data Asset
Our table has a datetime column called "
pickup_datetime" which we will use to split our TableAsset into Batches.
4. (Optional) Add Batch Sorters to the Data Asset
We will now add Batch Sorters to our Data Asset. This will allow us to explicitly state the order in which our Batches are returned when we request data from the Data Asset. To do this, we will pass a list of sorters to the
add_sorters(...) method of our Data Asset.
In this example we split
"pickup_datetime" column on
"month", so our list of sorters can have up to two elements. We will add an ascending sorter based on the contents of the splitter group
"year" and a descending sorter based on the contents of the splitter group
my_asset = my_table_asset.add_sorters(["+year", "-month"])
5. Use a Batch Request to verify the Data Asset works as desired
To verify that our Data Asset will return the desired files as Batches, we will define a quick Batch Request that will include all the Batches available in the Data asset. Then we will use that Batch Request to get a list of the returned Batches.
my_batch_request = my_table_asset.build_batch_request()
batches = my_table_asset.get_batch_list_from_batch_request(my_batch_request)
Because a Batch List contains a lot of metadata, it will be easiest to verify which files were included in the returned Batches if we only look at the
batch_spec of each returned Batch:
for batch in batches:
Now that you have further configured a file-based Data Asset, you may want to look into: