Profiling Reference

How to Run Profiling

Run During Init

The great_expectations init command offers to profile a newly added datasource. If you agree, the data assets in that datasource (e.g., tables in a database) will be profiled. By default, the profiler selects the first 20 data assets.

Expectation suites generated by the profiler are saved in the configured expectations directory. By default, an expectation suite is named after the profiler that generated it. Validation results are saved in the uncommitted/validations directory by default; the CLI then offers to move them to the fixtures/validations directory, from which data documentation is built.

Run From Command Line

The GE command-line interface can also profile a datasource:

great_expectations profile DATASOURCE_NAME

As when profiling during init, expectation suites generated by the profiler are saved in the configured expectations directory, and each suite is named by default after the profiler that generated it. Validation results are saved in the uncommitted/validations directory by default. The CLI then offers to move the resulting validations to the fixtures/validations directory, from which data documentation is built, and to regenerate the HTML documentation.

See Data Docs for more information.

Run From Jupyter Notebook

If you want to profile just one data asset in a datasource (e.g., one table in the database), you can do it using Python in a Jupyter notebook:

import great_expectations as ge
from great_expectations.profile.basic_dataset_profiler import BasicDatasetProfiler

# obtain the DataContext object
context = ge.data_context.DataContext()

# load a batch from the data asset
batch = context.get_batch('ratings')

# run the profiler on the batch - this returns an expectation suite and validation results for this suite
expectation_suite, validation_result = BasicDatasetProfiler.profile(batch)

# save the resulting expectation suite with a custom name
context.save_expectation_suite(expectation_suite, "ratings", "my_profiled_expectations")

Custom Profilers

Like most things in Great Expectations, profilers are designed to be extensible. You can develop your own profiler by subclassing DatasetProfiler, or by subclassing the parent DataAssetProfiler class directly. For help, advice, and ideas on developing custom profilers, please get in touch on the Great Expectations Slack channel.

Profiling Limitations

Inferring Data Types

When profiling CSV files, the profiler makes assumptions, such as considering the first line to be the header. Overriding these assumptions is currently possible only when running profiling in Python by passing extra arguments to get_batch.
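To see why the header assumption matters, here is a minimal sketch that uses only Python's standard csv module. It does not use Great Expectations or the get_batch arguments, and the sample data and the column names user_id and rating are hypothetical. It shows how a file without a header row loses its first record when the first line is taken to be the header, and how supplying column names explicitly avoids that.

```python
import csv
import io

# Hypothetical sample data: a CSV file that has NO header row.
raw = "1,5\n2,3\n3,4\n"

# Default assumption: the first line is the header,
# so the first record is consumed as column names.
with_header = list(csv.DictReader(io.StringIO(raw)))

# Override: pass column names explicitly so every line is treated as data.
# The names "user_id" and "rating" are illustrative only.
without_header = list(
    csv.DictReader(io.StringIO(raw), fieldnames=["user_id", "rating"])
)

print(len(with_header), "rows when the first line is read as a header")
print(len(without_header), "rows when column names are supplied")
```

Profiling the first result would describe columns named 1 and 5 and miss one record, which is the kind of skew that overriding the default is meant to prevent.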

Data Samples

Since profiling and expectations are so tightly linked, getting samples of expected data requires a slightly different approach than the normal path for profiling. Stay tuned for more in this area!