Profiling evaluates a data asset and summarizes its observed characteristics. By computing the observed properties of data, Profiling helps to reason about the data’s expected properties when creating expectation suites.
Profiling results are usually rendered into HTML - see Data Documentation. GE ships with a default profiler, BasicDatasetProfiler, which produces an expectation_suite and the corresponding validation_results. These compile to a page for each table or DataFrame, with an overview section followed by detailed statistics for each column.
Profiling is still a beta feature in Great Expectations. Over time, we plan to extend and improve the
BasicDatasetProfiler and also add additional profilers.
Profiling relies on automated inspection of data batches to generate and encode expectations. Together, encoding expectations, testing data, and presenting expectation validation results are the three core services offered by GE.
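To make "automated inspection" concrete, here is a minimal sketch in plain Python. It is not how BasicDatasetProfiler is implemented. The helper names (profile_column, suggest_expectations) and the rules for turning observations into candidates are assumptions made for illustration; only the two expectation names are real GE expectations.

```python
from typing import Any, Dict, List


def profile_column(values: List[Any]) -> Dict[str, Any]:
    """Summarize the observed characteristics of one column (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    summary: Dict[str, Any] = {
        "row_count": len(values),
        "null_fraction": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "distinct_count": len(set(non_null)),
    }
    if non_null:
        summary["min"] = min(non_null)
        summary["max"] = max(non_null)
    return summary


def suggest_expectations(column: str, summary: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn observed properties into candidate expectations (illustrative rules only)."""
    suggestions = []
    # Only suggest a not-null expectation when no nulls were observed.
    if summary["row_count"] and summary["null_fraction"] == 0.0:
        suggestions.append({
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": column},
        })
    # Suggest a range check based on the observed minimum and maximum.
    if "min" in summary:
        suggestions.append({
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {
                "column": column,
                "min_value": summary["min"],
                "max_value": summary["max"],
            },
        })
    return suggestions


ratings = [4, 5, 3, None, 5]
summary = profile_column(ratings)
candidates = suggest_expectations("rating", summary)
```

Candidates produced this way describe what the data looked like in one batch, not what it should look like, so they are a starting point to review rather than a finished expectation suite.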
BasicDatasetProfiler will evaluate the entire batch
without limits or sampling, which may be very time consuming. As a rule of thumb, we recommend starting with batches
smaller than 100MB.
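One way to follow this rule of thumb for file-based data is to check the file size first and, if it is too large, profile a smaller extract. The sketch below uses only the Python standard library. The function names and the 10,000-row default are assumptions, not part of GE, and taking the first rows is not a random sample, so it may not be representative of the whole file.

```python
import csv
import itertools
import os

# The 100MB rule of thumb from the text, expressed in bytes.
MAX_PROFILE_BYTES = 100 * 1024 * 1024


def small_enough_to_profile(path: str, limit_bytes: int = MAX_PROFILE_BYTES) -> bool:
    """Return True if the file on disk is below the size limit."""
    return os.path.getsize(path) < limit_bytes


def write_head_sample(src_path: str, dst_path: str, max_rows: int = 10_000) -> int:
    """Copy the header and the first max_rows data rows of a CSV file.

    Returns the number of data rows written (excluding the header).
    """
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader, None)
        if header is None:  # empty source file: nothing to copy
            return 0
        writer.writerow(header)
        written = 0
        for row in itertools.islice(reader, max_rows):
            writer.writerow(row)
            written += 1
        return written
```

A typical use would be to call small_enough_to_profile on the source file and, when it returns False, profile the file produced by write_head_sample instead.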
How to Run Profiling
Run During Init
The great_expectations init command offers to profile a newly added datasource. If you agree, data assets in that
datasource will be profiled (e.g., tables in the database). By default the profiler will select the first 20 data
assets.
Expectation suites generated by the profiler will be saved in the configured expectations directory for expectation
suites. The expectation suite name by default is the name of the profiler that generated it. Validation results will be
saved in the uncommitted/validations directory by default; the CLI will then offer to move them to the
fixtures/validations directory from which data documentation is built.
Run From Command Line
The GE command-line interface can also profile a datasource:
great_expectations profile DATASOURCE_NAME
Just as when running during init, expectation suites generated by the profiler will be saved in the configured
expectations directory for expectation suites. The expectation suite name by default is the name of the profiler
that generated it. Validation results will be saved in the
uncommitted/validations directory by default.
The CLI will offer to move resulting validations to the
fixtures/validations directory from which data documentation is built and to regenerate the HTML documentation.
See Data Documentation for more information.
Run From Jupyter Notebook
If you want to profile just one data asset in a datasource (e.g., one table in the database), you can do it using Python in a Jupyter notebook:
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# obtain the DataContext object
context = ge.data_context.DataContext()

# load a batch from the data asset
batch = context.get_batch('ratings')

# run the profiler on the batch - this returns an expectation suite and validation results for this suite
expectation_suite, validation_result = BasicDatasetProfiler.profile(batch)

# save the resulting expectation suite with a custom name
context.save_expectation_suite(expectation_suite, "ratings", "my_profiled_expectations")
Like most things in Great Expectations, Profilers are designed to be extensible. You can develop your own profiler
by subclassing DatasetProfiler, or the parent DataAssetProfiler class itself. For help, advice, and ideas
on developing custom profilers, please get in touch on the Great Expectations Slack channel.
When profiling CSV files, the profiler makes assumptions, such as considering the first line to be the header. Overriding these assumptions is currently possible only when running profiling in Python by passing extra arguments to get_batch.
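The effect of the header assumption can be seen without GE at all. The sketch below uses the standard library's csv module on a small, made-up headerless file; the column names user_id and rating are hypothetical. It does not call get_batch, because the exact arguments get_batch accepts are not listed here.

```python
import csv
import io

# A headerless CSV: two columns, two data rows (made-up sample data).
raw = "1,5\n2,3\n"

# Assuming the first line is a header silently consumes a data row
# and uses its values ("1" and "5") as column names.
with_header = list(csv.DictReader(io.StringIO(raw)))

# Supplying the column names explicitly keeps both rows as data.
explicit = list(csv.DictReader(io.StringIO(raw), fieldnames=["user_id", "rating"]))

print(len(with_header), len(explicit))
```

If a profiled CSV asset shows numeric-looking column names and one row fewer than expected, a wrong header assumption of this kind is a likely cause.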