How to Run Profiling
Run During Init
The great_expectations init command offers to profile a newly added datasource. If you agree, data assets in that
datasource will be profiled (e.g., tables in the database). By default the profiler will select the first 20 data
Expectation suites generated by the profiler will be saved in the configured
expectations directory for expectation
suites. The expectation suite name by default is the name of the profiler that generated it. Validation results will be
saved in the
uncommitted/validations directory by default; the CLI will then offer to move them to the
fixtures/validations directory from which data documentation is built.
Run From Command Line
The GE command-line interface can also profile a datasource:
great_expectations profile DATASOURCE_NAME
Just as when running during init, expectation suites generated by the profiler will be saved in the configured
expectations directory for expectation suites. The expectation suite name by default is the name of the profiler
that generated it. Validation results will be saved in the
uncommitted/validations directory by default.
The CLI will offer to move resulting validations to the
fixtures/validations directory from which data documentation is built and to regenerate the HTML documentation.
See Data Docs for more information.
Run From Jupyter Notebook
If you want to profile just one data asset in a datasource (e.g., one table in the database), you can do it using Python in a Jupyter notebook:
import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# obtain the DataContext object
context = ge.data_context.DataContext()

# load a batch from the data asset
batch = context.get_batch('ratings')

# run the profiler on the batch - this returns an expectation suite and validation results for this suite
expectation_suite, validation_result = BasicDatasetProfiler.profile(batch)

# save the resulting expectation suite with a custom name
context.save_expectation_suite(expectation_suite, "ratings", "my_profiled_expectations")
Like most things in Great Expectations, profilers are designed to be extensible. You can develop your own profiler
by subclassing DatasetProfiler, or by subclassing the parent
DataAssetProfiler class itself. For help, advice, and ideas
on developing custom profilers, please get in touch on the Great Expectations Slack channel.
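The subclassing pattern can be sketched without Great Expectations installed. The classes below are hypothetical stand-ins, not the library's API: they assume a base class whose profile classmethod delegates to an overridable hook (called _profile here, which is an assumption) and then validates the data against the suite that hook produced. Check the actual DatasetProfiler source for the real method names before writing a custom profiler.

```python
# Hypothetical stand-ins: these classes only imitate the general shape of a
# profiler base class. They are NOT the Great Expectations implementation.
class StandInDatasetProfiler:
    """Base class: profile() delegates to a hook that subclasses override."""

    @classmethod
    def profile(cls, rows):
        # Build the suite, then check the same rows against it.
        suite = cls._profile(rows)
        results = [
            {"column": exp["column"], "success": exp["check"](rows)}
            for exp in suite
        ]
        return suite, results

    @classmethod
    def _profile(cls, rows):
        raise NotImplementedError


class NotNullProfiler(StandInDatasetProfiler):
    """Custom profiler: columns with no missing values are expected to stay complete."""

    @classmethod
    def _profile(cls, rows):
        suite = []
        for column in rows[0].keys():
            if all(row[column] is not None for row in rows):
                suite.append({
                    "type": "expect_column_values_to_not_be_null",
                    "column": column,
                    # Bind the column name now so each check tests its own column.
                    "check": lambda rs, c=column: all(r[c] is not None for r in rs),
                })
        return suite


# Made-up sample rows standing in for a batch of the 'ratings' data asset.
ratings = [
    {"user_id": 1, "rating": 4.0},
    {"user_id": 2, "rating": None},
]
suite, results = NotNullProfiler.profile(ratings)
profiled_columns = [exp["column"] for exp in suite]
```

Only user_id ends up in the suite, because rating contains a missing value. A real custom profiler would add expectations to a Great Expectations dataset instead of building plain dictionaries.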
Inferring Data Types
When profiling CSV files, the profiler makes assumptions, such as considering the first line to be the header. Overriding these assumptions is currently possible only when running profiling in Python by passing extra arguments to get_batch.
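To see why the header assumption matters, here is a minimal sketch that uses only the Python standard library, not Great Expectations. The sample text, the read_rows helper, and its parameters are hypothetical. The extra arguments accepted by get_batch are not shown because they are not named here.

```python
import csv
import io

# Made-up file contents with NO header row.
raw = "1,4.0\n2,3.5\n"


def read_rows(text, first_line_is_header=True, names=None):
    """Parse CSV text, trusting the first line as a header or using explicit names."""
    if first_line_is_header:
        # DictReader takes the column names from the first line.
        reader = csv.DictReader(io.StringIO(text))
    else:
        # Every line is data; column names are supplied by the caller.
        reader = csv.DictReader(io.StringIO(text), fieldnames=names)
    return list(reader)


# Default assumption: the first data row is consumed as column names.
assumed = read_rows(raw)

# Overriding the assumption keeps both rows and gives the columns real names.
overridden = read_rows(raw, first_line_is_header=False, names=["user_id", "rating"])
```

With the default assumption, one row of data is silently lost and the columns are named "1" and "4.0", so anything profiled from that parse would describe the wrong columns.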
Since profiling and expectations are so tightly linked, getting samples of expected data requires a slightly different approach than the normal path for profiling. Stay tuned for more in this area!