great_expectations.dataset.util

Module Contents

Functions

is_valid_partition_object(partition_object)

Tests whether a given object is a valid continuous or categorical partition object.

is_valid_categorical_partition_object(partition_object)

Tests whether a given object is a valid categorical partition object.

is_valid_continuous_partition_object(partition_object)

Tests whether a given object is a valid continuous partition object. See Partition Objects.

categorical_partition_data(data)

Convenience method for creating weights from categorical data.

kde_partition_data(data, estimate_tails=True)

Convenience method for building a partition and weights using a gaussian Kernel Density Estimate and default bandwidth.

partition_data(data, bins='auto', n_bins=10)

continuous_partition_data(data, bins='auto', n_bins=10, **kwargs)

Convenience method for building a partition object on continuous data

build_continuous_partition_object(dataset, column, bins='auto', n_bins=10, allow_relative_error=False)

Convenience method for building a partition object on continuous data from a dataset and column

build_categorical_partition_object(dataset, column, sort='value')

Convenience method for building a partition object on categorical data from a dataset and column

infer_distribution_parameters(data, distribution, params=None)

Convenience method for determining the shape parameters of a given distribution

_scipy_distribution_positional_args_from_dict(distribution, params)

Helper function that returns positional arguments for a scipy distribution using a dict of parameters.

validate_distribution_parameters(distribution, params)

Ensures that the necessary parameters for a distribution are present and that all parameters are valid.

create_multiple_expectations(df, columns, expectation_type, *args, **kwargs)

Creates an identical expectation for each of the given columns with the specified arguments, if any.

get_approximate_percentile_disc_sql(selects: List, sql_engine_dialect: Any)

check_sql_engine_dialect(actual_sql_engine_dialect: Any, candidate_sql_engine_dialect: Any)

great_expectations.dataset.util.logger
great_expectations.dataset.util.is_valid_partition_object(partition_object)

Tests whether a given object is a valid continuous or categorical partition object.

Parameters

partition_object – The partition_object to evaluate

Returns

Boolean

great_expectations.dataset.util.is_valid_categorical_partition_object(partition_object)

Tests whether a given object is a valid categorical partition object.

Parameters

partition_object – The partition_object to evaluate

Returns

Boolean

great_expectations.dataset.util.is_valid_continuous_partition_object(partition_object)

Tests whether a given object is a valid continuous partition object. See Partition Objects.

Parameters

partition_object – The partition_object to evaluate

Returns

Boolean
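
The checks behind such a validator can be illustrated with a rough sketch; this is a hypothetical re-implementation for the categorical case, not the library's exact code:

```python
def is_valid_categorical_partition_sketch(partition_object):
    """Rough sketch of categorical validity checks: a dict with 'values'
    and 'weights' keys of equal length, whose weights sum to ~1."""
    if not isinstance(partition_object, dict):
        return False
    if "values" not in partition_object or "weights" not in partition_object:
        return False
    values = partition_object["values"]
    weights = partition_object["weights"]
    if len(values) != len(weights):
        return False
    # Weights are proportions, so they should sum to 1 (within tolerance).
    return abs(sum(weights) - 1.0) < 1e-6


print(is_valid_categorical_partition_sketch({"values": ["a", "b"], "weights": [0.6, 0.4]}))  # True
print(is_valid_categorical_partition_sketch({"values": ["a"], "weights": [0.5, 0.5]}))       # False
```

The continuous validator additionally checks a "bins" list whose length is one greater than the weights list, since bins hold endpoints.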

great_expectations.dataset.util.categorical_partition_data(data)

Convenience method for creating weights from categorical data.

Parameters

data (list-like) – The data from which to construct the estimate.

Returns

A new partition object:

{
    "values": (list) The categorical values present in the data
    "weights": (list) The weights of the values in the partition.
}

See Partition Objects.
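
A minimal sketch of how such weights can be derived from raw categorical data (illustrative only; the library's implementation may differ):

```python
from collections import Counter


def categorical_partition_sketch(data):
    """Count each distinct value and normalize the counts into weights
    that sum to 1, producing a categorical partition object."""
    counts = Counter(data)
    total = sum(counts.values())
    values = sorted(counts)
    return {"values": values, "weights": [counts[v] / total for v in values]}


print(categorical_partition_sketch(["a", "a", "a", "b"]))
# {'values': ['a', 'b'], 'weights': [0.75, 0.25]}
```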

great_expectations.dataset.util.kde_partition_data(data, estimate_tails=True)

Convenience method for building a partition and weights using a gaussian Kernel Density Estimate and default bandwidth.

Parameters
  • data (list-like) – The data from which to construct the estimate

  • estimate_tails (bool) – Whether to estimate the tails of the distribution to keep the partition object finite

Returns

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}

See Partition Objects.

great_expectations.dataset.util.partition_data(data, bins='auto', n_bins=10)
great_expectations.dataset.util.continuous_partition_data(data, bins='auto', n_bins=10, **kwargs)

Convenience method for building a partition object on continuous data

Parameters
  • data (list-like) – The data from which to construct the estimate.

  • bins (string) – One of 'uniform' (for uniformly spaced bins), 'ntile' (for percentile-spaced bins), or 'auto' (for automatically spaced bins)

  • n_bins (int) – Ignored if bins is 'auto'.

  • kwargs (mapping) – Additional keyword arguments to be passed to numpy.histogram

Returns

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}
See Partition Objects.
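
Conceptually, the binning can be delegated to numpy.histogram and the counts normalized into weights. The sketch below assumes numpy and handles only the 'uniform' and 'auto' strategies; the 'ntile' (percentile-spaced) case is omitted:

```python
import numpy as np


def continuous_partition_sketch(data, bins="auto", n_bins=10):
    """Illustrative sketch: bin the data with numpy.histogram and
    normalize the counts into weights summing to 1."""
    if bins == "uniform":
        # n_bins equally spaced bins over the data's range.
        hist, bin_edges = np.histogram(data, bins=n_bins)
    else:
        # 'auto' (or any bins argument numpy.histogram accepts).
        hist, bin_edges = np.histogram(data, bins=bins)
    weights = hist / hist.sum()
    return {"bins": list(bin_edges), "weights": list(weights)}


po = continuous_partition_sketch([1, 2, 2, 3, 3, 3, 4], bins="uniform", n_bins=3)
print(len(po["bins"]))  # 4  (one more endpoint than there are weights)
```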

great_expectations.dataset.util.build_continuous_partition_object(dataset, column, bins='auto', n_bins=10, allow_relative_error=False)

Convenience method for building a partition object on continuous data from a dataset and column

Parameters
  • dataset (GE Dataset) – the dataset for which to compute the partition

  • column (string) – The name of the column for which to construct the estimate.

  • bins (string) – One of 'uniform' (for uniformly spaced bins), 'ntile' (for percentile-spaced bins), or 'auto' (for automatically spaced bins)

  • n_bins (int) – Ignored if bins is 'auto'.

  • allow_relative_error – Passed to get_column_quantiles. Set to False to require exact values, True to allow approximate values on systems that offer only a binary choice (e.g. Redshift), or a value between zero and one on systems that accept a relative-error specification (e.g. SparkDFDataset).

Returns

A new partition_object:

{
    "bins": (list) The endpoints of the partial partition of reals,
    "weights": (list) The densities of the bins implied by the partition.
}

See Partition Objects.

great_expectations.dataset.util.build_categorical_partition_object(dataset, column, sort='value')

Convenience method for building a partition object on categorical data from a dataset and column

Parameters
  • dataset (GE Dataset) – the dataset for which to compute the partition

  • column (string) – The name of the column for which to construct the estimate.

  • sort (string) – Must be one of "value", "count", or "none":

    - "value": values in the resulting partition object are sorted lexicographically

    - "count": values are sorted by descending count (frequency)

    - "none": values are not sorted

Returns

A new partition_object:

{
    "values": (list) the categorical values for which each weight applies,
    "weights": (list) The densities of the values implied by the partition.
}
See Partition Objects.
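
The three sort options can be illustrated with a hypothetical helper (not the library function itself, which operates on a dataset column rather than a plain list):

```python
from collections import Counter


def sorted_categorical_partition_sketch(data, sort="value"):
    """Sketch of the documented sort behaviors for a categorical partition."""
    counts = Counter(data)
    total = sum(counts.values())
    if sort == "value":
        keys = sorted(counts)                                # lexicographic
    elif sort == "count":
        keys = sorted(counts, key=counts.get, reverse=True)  # descending frequency
    elif sort == "none":
        keys = list(counts)                                  # first-seen order
    else:
        raise ValueError('sort must be one of "value", "count", or "none"')
    return {"values": keys, "weights": [counts[k] / total for k in keys]}


data = ["b", "a", "a", "c", "c", "c"]
print(sorted_categorical_partition_sketch(data, sort="value")["values"])  # ['a', 'b', 'c']
print(sorted_categorical_partition_sketch(data, sort="count")["values"])  # ['c', 'a', 'b']
```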

great_expectations.dataset.util.infer_distribution_parameters(data, distribution, params=None)

Convenience method for determining the shape parameters of a given distribution

Parameters
  • data (list-like) – The data to build shape parameters from.

  • distribution (string) – The scipy distribution name; determines which parameters to build.

  • params (dict or None) – The known parameters. Parameters given here will not be altered. Keep as None to infer all necessary parameters from the data.

Returns

A dictionary of named parameters:

{
    "mean": (float),
    "std_dev": (float),
    "loc": (float),
    "scale": (float),
    "alpha": (float),
    "beta": (float),
    "min": (float),
    "max": (float),
    "df": (float)
}

See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kstest.html#scipy.stats.kstest
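
A sketch of the inference for the 'norm' case only, using the standard library; treating std_dev as the population standard deviation is an assumption here, and pre-supplied parameters are left untouched as the docstring describes:

```python
import statistics


def infer_norm_params_sketch(data, params=None):
    """Fill in 'mean' and 'std_dev' from the data unless the caller
    already supplied them (sketch for the 'norm' distribution only)."""
    params = dict(params or {})
    params.setdefault("mean", statistics.fmean(data))
    params.setdefault("std_dev", statistics.pstdev(data))  # population std dev: an assumption
    return params


print(infer_norm_params_sketch([2, 4, 4, 4, 5, 5, 7, 9]))
# {'mean': 5.0, 'std_dev': 2.0}
```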

great_expectations.dataset.util._scipy_distribution_positional_args_from_dict(distribution, params)

Helper function that returns positional arguments for a scipy distribution using a dict of parameters.

See the cdf() function here https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.beta.html#Methods for an example of scipy's positional arguments. This function returns the arguments specified by scipy.stats.distribution.cdf() for the given distribution.

Parameters
  • distribution (string) – The scipy distribution name.

  • params (dict) – A dict of named parameters.

Raises

AttributeError – If an unsupported distribution is provided.
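
For the 'norm' case, the mapping amounts to translating named parameters into the positional (loc, scale) arguments that scipy.stats.norm.cdf expects; this hypothetical helper shows the idea for that one distribution:

```python
def norm_positional_args_sketch(params):
    """Sketch: scipy.stats.norm.cdf takes (x, loc, scale), so the named
    'mean' and 'std_dev' parameters become loc and scale."""
    return (params["mean"], params["std_dev"])


print(norm_positional_args_sketch({"mean": 40, "std_dev": 5}))  # (40, 5)
```

Each distribution has its own argument order, so the real helper dispatches on the distribution name.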

great_expectations.dataset.util.validate_distribution_parameters(distribution, params)

Ensures that the necessary parameters for a distribution are present and that all parameters are valid.

If parameters necessary to construct a distribution are missing or invalid, this function raises a ValueError with an informative description. Note that 'loc' and 'scale' are optional arguments, and that 'scale' must be positive.

Parameters
  • distribution (string) – The scipy distribution name, e.g. the normal distribution is 'norm'.

  • params (dict or list) –

    The distribution shape parameters in a named dictionary or positional list form following the scipy cdf argument scheme.

    params={'mean': 40, 'std_dev': 5} or params=[40, 5]

Raises

ValueError – With an informative description, usually when necessary parameters are omitted or invalid.
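
The kind of checks described above can be sketched for the 'norm' distribution alone (a simplified illustration, not the library's implementation):

```python
def validate_norm_params_sketch(params):
    """Raise ValueError if 'norm' parameters are missing or invalid."""
    if isinstance(params, dict):
        if "mean" not in params or "std_dev" not in params:
            raise ValueError("norm requires 'mean' and 'std_dev'")
        if "scale" in params and params["scale"] <= 0:
            raise ValueError("'scale' must be positive")
    elif isinstance(params, (list, tuple)):
        if len(params) < 2:
            raise ValueError("positional form requires [mean, std_dev]")
    else:
        raise ValueError("params must be a dict or a list")


validate_norm_params_sketch({"mean": 40, "std_dev": 5})  # ok
validate_norm_params_sketch([40, 5])                     # ok
try:
    validate_norm_params_sketch({"mean": 40})
except ValueError as e:
    print(e)  # norm requires 'mean' and 'std_dev'
```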

great_expectations.dataset.util.create_multiple_expectations(df, columns, expectation_type, *args, **kwargs)

Creates an identical expectation for each of the given columns with the specified arguments, if any.

Parameters
  • df (great_expectations.dataset) – A great expectations dataset object.

  • columns (list) – A list of column names represented as strings.

  • expectation_type (string) – The expectation type.

Raises
  • KeyError – If the provided column does not exist.

  • AttributeError – If the provided expectation type does not exist or df is not a valid great expectations dataset.

Returns

A list of expectation results.

great_expectations.dataset.util.get_approximate_percentile_disc_sql(selects: List, sql_engine_dialect: Any) → str
great_expectations.dataset.util.check_sql_engine_dialect(actual_sql_engine_dialect: Any, candidate_sql_engine_dialect: Any) → bool