Profiling

Profiling evaluates a data asset and summarizes its observed characteristics. By computing the observed properties of the data, profiling helps you reason about its expected properties when creating expectation suites.

Profiling results are usually rendered into HTML - see Data Documentation. GE ships with a default profiler, BasicDatasetProfiler, which produces an expectation suite and corresponding validation results that compile to a page for each table or DataFrame, including an overview section:

[Image: profiling overview section (movie_db_profiling_screenshot_2.jpg)]

And then detailed statistics for each column:

[Image: detailed per-column statistics (movie_db_profiling_screenshot_1.jpg)]

Profiling is still a beta feature in Great Expectations. Over time, we plan to extend and improve BasicDatasetProfiler and to add more profilers.

Profiling relies on automated inspection of data batches to generate and encode expectations. Together, encoding expectations, testing data, and presenting expectation validation results are the three core services offered by GE.

Warning: BasicDatasetProfiler will evaluate the entire batch without limits or sampling, which may be very time-consuming. As a rule of thumb, we recommend starting with batches smaller than 100MB.
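If a full batch is too large to profile comfortably, one workaround is to sample the data yourself before handing it to the profiler. Below is a minimal sketch for pandas-backed data; the file name and row count are illustrative, not part of any GE configuration:

import pandas as pd
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# read only a manageable slice of the file instead of the whole asset
df = pd.read_csv("ratings.csv", nrows=100000)

# wrap the sample as a GE dataset so the profiler can evaluate it
dataset = ge.dataset.PandasDataset(df)
expectation_suite, validation_result = BasicDatasetProfiler.profile(dataset)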

How to Run Profiling

Run During Init

The great_expectations init command offers to profile a newly added datasource. If you agree, data assets in that datasource will be profiled (e.g., tables in the database). By default, the profiler will select the first 20 data assets.

Expectation suites generated by the profiler will be saved in the configured expectations directory. By default, the expectation suite name is the name of the profiler that generated it. Validation results will be saved in the uncommitted/validations directory by default; the CLI will then offer to move them to the fixtures/validations directory, from which data documentation is built.

Run From Command Line

The GE command-line interface can also profile a datasource:

great_expectations profile DATASOURCE_NAME

Just as when running during init, expectation suites generated by the profiler will be saved in the configured expectations directory, and the expectation suite name defaults to the name of the profiler that generated it. Validation results will be saved in the uncommitted/validations directory by default. The CLI will offer to move the resulting validations to the fixtures/validations directory, from which data documentation is built, and to regenerate the HTML documentation.

See Data Documentation for more information.

Run From Jupyter Notebook

If you want to profile just one data asset in a datasource (e.g., one table in the database), you can do it using Python in a Jupyter notebook:

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# obtain the DataContext object
context = ge.data_context.DataContext()

# load a batch from the data asset
batch = context.get_batch('ratings')

# run the profiler on the batch - this returns an expectation suite and validation results for this suite
expectation_suite, validation_result = BasicDatasetProfiler.profile(batch)

# save the resulting expectation suite with a custom name
context.save_expectation_suite(expectation_suite, "ratings", "my_profiled_expectations")
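
Before saving, you may want a quick look at what the profiler generated. A minimal sketch, assuming the returned suite is a plain dictionary with an "expectations" list, as in the GE versions this page describes:

# list each generated expectation and the column it applies to
for expectation in expectation_suite["expectations"]:
    print(expectation["expectation_type"], expectation["kwargs"].get("column"))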

Custom Profilers

Like most things in Great Expectations, profilers are designed to be extensible. You can develop your own profiler by subclassing DatasetProfiler, or from the parent DataAssetProfiler class itself. For help, advice, and ideas on developing custom profilers, please get in touch on the Great Expectations Slack channel.
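
As an illustration, here is a minimal sketch of a custom profiler. It assumes DatasetProfiler lives in great_expectations.profile.base and that subclasses implement a _profile classmethod returning an expectation suite, which is the pattern BasicDatasetProfiler follows; the class name is hypothetical:

from great_expectations.profile.base import DatasetProfiler

class ColumnsNotNullProfiler(DatasetProfiler):
    """Illustrative profiler: expect every column to contain no null values."""

    @classmethod
    def _profile(cls, dataset):
        # add one expectation per column in the batch
        for column in dataset.get_table_columns():
            dataset.expect_column_values_to_not_be_null(column)
        # return the suite of expectations accumulated on the dataset
        return dataset.get_expectation_suite()

# used exactly like BasicDatasetProfiler:
expectation_suite, validation_result = ColumnsNotNullProfiler.profile(batch)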

Known Issues

When profiling CSV files, the profiler makes assumptions, such as treating the first line as the header. Overriding these assumptions is currently possible only when running profiling in Python, by passing extra arguments to get_batch.
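
For example, to profile a headerless CSV you might pass reader options through get_batch. A hedged sketch: it assumes the extra keyword arguments are forwarded to the underlying pandas.read_csv call, and the column names are hypothetical:

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

context = ge.data_context.DataContext()

# assumption: these keyword arguments are forwarded to pandas.read_csv;
# header=None declares that the file has no header row, and names supplies
# illustrative column names
batch = context.get_batch('ratings', header=None,
                          names=['user_id', 'movie_id', 'rating', 'timestamp'])

expectation_suite, validation_result = BasicDatasetProfiler.profile(batch)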