autotst package

Submodules

autotst.autotst_types module

autotst.autotst_types.Dataset

Numpy array of any shape and type, used for datasets

autotst.autotst_types.Labels

One-dimensional array of ints, used for labels (with values 0 and 1)

autotst.autotst_types.ListFloats

One-dimensional array of floats, used for weights and predictions

autotst.autotst_types.Samples

Numpy array of any shape and type, used for the distribution’s samples

autotst.functions module

autotst.functions.fit_witness(data_train: NDArray[Any, Ellipsis, Any], label_train: NDArray[Any, Ellipsis, Any], model: Model, **kwargs) None

Calls the fit function of the model on the provided dataset, with weights that compensate for any difference in how often the two labels are represented.

Parameters
  • data_train – training data

  • label_train – one-dimensional array with labels 1 and 0 indicating data coming from one sample or the other

  • model – the model on which the fit function is applied

  • kwargs – keyword arguments passed to the model’s fit function

autotst.functions.get_default_model() Model

Returns an instance of the AutoGluonTabularPredictor, with default parameters

autotst.functions.get_weights(label_train: NDArray[1, UInt[int, unsignedinteger]]) NDArray[1, Float]

Given label_train, a one-dimensional array with labels 1 and 0, returns an array of weights that assigns higher values to the indices of the less represented label.
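A minimal sketch of one weighting scheme that fits this description; it is not necessarily the scheme autotst implements. Each item is weighted by the inverse frequency of its label. The helper name balanced_weights and the normalisation are assumptions, and a plain list stands in for the NumPy array.

```python
from collections import Counter
from typing import List, Sequence


def balanced_weights(labels: Sequence[int]) -> List[float]:
    """Hypothetical helper: inverse-frequency weights for 0/1 labels.

    Every label class receives the same total weight, so items carrying
    the less represented label get a larger individual weight.
    """
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return [n_total / (n_classes * counts[label]) for label in labels]


# Three items with label 1 and a single item with label 0.
labels = [1, 1, 1, 0]
weights = balanced_weights(labels)
```

With this normalisation the weights sum to the number of items, so the overall scale of a weighted loss would stay comparable to the unweighted one.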

autotst.functions.interpret(data_test: NDArray[Any, Ellipsis, Any], predictions: NDArray[1, Float], k: int = 1) Tuple[NDArray[Any, Ellipsis, Any], NDArray[Any, Ellipsis, Any]]

Returns the k most typical examples from the two distributions.

Parameters
  • data_test – dataset with the first items corresponding to the first distribution and the last items to the second distribution

  • predictions – label predictions corresponding to the dataset

  • k – number of items to extract from the dataset, for each distribution

Returns

the k most typical examples from each of the two distributions
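One way to read "most typical" is to rank the test items by their witness prediction and take both extremes. The sketch below does that with plain lists. The helper name most_typical is hypothetical, and it assumes that a high prediction points to the distribution labelled 1 and a low prediction to the one labelled 0; the library may select the examples differently.

```python
from typing import List, Sequence, Tuple


def most_typical(
    data_test: Sequence, predictions: Sequence[float], k: int = 1
) -> Tuple[List, List]:
    """Hypothetical helper: pick the k items at each end of the ranking.

    Assumption: high predictions indicate label 1, low predictions label 0.
    """
    # Indices of the test items, sorted from lowest to highest prediction.
    order = sorted(range(len(predictions)), key=lambda i: predictions[i])
    lowest = [data_test[i] for i in order[:k]]
    # Walk the ranking backwards so the highest prediction comes first.
    highest = [data_test[i] for i in order[::-1][:k]]
    return highest, lowest


data_test = ["a", "b", "c", "d"]
predictions = [0.9, 0.2, 0.7, 0.1]
typical_high, typical_low = most_typical(data_test, predictions, k=2)
```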

autotst.functions.p_value(sample_p: NDArray[Any, Ellipsis, Any], sample_q: NDArray[Any, Ellipsis, Any], model: Optional[Model] = None, split_ratio: float = 0.5, permutations: int = 10000, **fit_kwargs) float

Splits the datasets into a training set and a test set, fits the model on the training set, and uses the test set to compute the p-value.

Parameters
  • sample_p – samples drawn from a first distribution

  • sample_q – samples drawn from a second distribution

  • model – instance of the model used for fitting and prediction. If None (the default), an AutoGluonTabularPredictor is used

  • split_ratio – ratio used to split the data into training and test sets

  • permutations – number of permutations used to estimate the p value

  • fit_kwargs – parameters passed to the model’s fit function

Returns

p value

autotst.functions.p_value_evaluate(model: Model, data_test: NDArray[Any, Ellipsis, Any], labels_test: NDArray[1, UInt[int, unsignedinteger]], permutations: int = 10000) Tuple[NDArray[Any, Ellipsis, Any], float]

Applies the model to generate predictions and uses these predictions to evaluate the p value.

Parameters
  • model – the model used for prediction, assumed to have been fitted

  • data_test – test dataset

  • labels_test – one-dimensional array with labels 1 and 0 indicating data coming from one sample or the other

  • permutations – number of permutations when estimating the p-value

Returns

the predictions and the p value

autotst.functions.permutations_p_value(predictions: NDArray[1, Float], labels: NDArray[1, UInt[int, unsignedinteger]], permutations: int = 10000) float

Compute p value of the witness mean discrepancy test statistic via permutations

Parameters
  • predictions – one-dimensional array with the witness predictions of the test data

  • labels – one-dimensional array with labels 1 and 0 indicating data coming from P or Q

  • permutations (int) – Number of permutations

Returns

p value
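A permutation test of this kind can be sketched with the standard library alone. The version below is an illustration under stated assumptions, not the library’s code: the statistic is taken to be the mean prediction on label-1 items minus the mean prediction on label-0 items, the test is one-sided, and an add-one correction is applied. The seed parameter is hypothetical and only there to make the example reproducible.

```python
import random
from statistics import fmean
from typing import Sequence


def mean_discrepancy(predictions: Sequence[float], labels: Sequence[int]) -> float:
    """Mean prediction on label-1 items minus mean prediction on label-0 items."""
    ones = [p for p, lab in zip(predictions, labels) if lab == 1]
    zeros = [p for p, lab in zip(predictions, labels) if lab == 0]
    return fmean(ones) - fmean(zeros)


def permutation_p_value(
    predictions: Sequence[float],
    labels: Sequence[int],
    permutations: int = 10000,
    seed: int = 0,  # hypothetical parameter, for reproducibility only
) -> float:
    rng = random.Random(seed)
    observed = mean_discrepancy(predictions, labels)
    shuffled = list(labels)
    at_least_as_large = 0
    for _ in range(permutations):
        # Shuffling the labels breaks any link between label and prediction
        # while keeping the number of items per label unchanged.
        rng.shuffle(shuffled)
        if mean_discrepancy(predictions, shuffled) >= observed:
            at_least_as_large += 1
    # Add-one correction: the estimate is never exactly zero.
    return (at_least_as_large + 1) / (permutations + 1)


# Predictions that separate the two labels perfectly.
predictions = [0.9, 0.8, 0.85, 0.95, 0.1, 0.2, 0.15, 0.05]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
p = permutation_p_value(predictions, labels, permutations=2000)
```

More permutations give a finer-grained estimate at a proportionally higher cost, which is the trade-off the permutations parameter controls.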

autotst.model module

class autotst.model.AutoGluonImagePredictor(**kwargs)

Bases: Model

Wrapper model for the Image Classifier of the AutoGluon package. The objective is classification, and the witness function uses the predicted probabilities.

__init__(**kwargs) None
fit(data_train: NDArray[Any, Ellipsis, Any], label_train: NDArray[Any, Ellipsis, Any], weights: NDArray[1, Float], presets: str = 'best_quality', time_limit: int = 60, **kwargs) None

Wrapper around the fit routine.

Parameters
  • data_train – training data, provided as a list of image paths

  • label_train – training labels

  • weights – weights for the loss; ignored by this model

  • presets – AutoGluon preset

  • time_limit – time limit for training, in seconds

  • kwargs – other arguments passed to AutoGluon’s fit routine

Returns

None

predict(data_test: NDArray[Any, Ellipsis, Any]) NDArray[1, Float]
class autotst.model.AutoGluonTabularPredictor(**kwargs)

Bases: Model

Wrapper model for the Tabular Predictor of the AutoGluon package

__init__(**kwargs) None
fit(data_train: NDArray[Any, Ellipsis, Any], label_train: NDArray[Any, Ellipsis, Any], weights: NDArray[1, Float], presets: str = 'best_quality', time_limit: int = 60, verbosity: int = 0, **kwargs) None

Wrapper around the fit routine.

Parameters
  • data_train – training data

  • label_train – training labels

  • weights – weights for the loss

  • presets – AutoGluon preset

  • time_limit – time limit for training, in seconds

  • verbosity – controls the output of AutoGluon

  • kwargs – other arguments passed to AutoGluon’s fit routine

Returns

None

predict(data_test: NDArray[Any, Ellipsis, Any]) NDArray[1, Float]
class autotst.model.Model(**kwargs)

Bases: object

Generic model class for two-sample tests

__init__(**kwargs)
fit(data_train, label_train, weights, **kwargs)
predict(data_test)
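The three methods listed for Model suggest how a custom witness could be plugged in. The sketch below is self-contained, so it defines a stand-in Model class with the same method names instead of importing autotst; the real base class may carry more behaviour. WeightedMeanWitness is a hypothetical toy model for one-dimensional data, not part of the package.

```python
from typing import List, Sequence


class Model:
    """Stand-in exposing the documented interface (assumption: nothing more)."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def fit(self, data_train, label_train, weights, **kwargs):
        raise NotImplementedError

    def predict(self, data_test):
        raise NotImplementedError


class WeightedMeanWitness(Model):
    """Hypothetical witness: compares distances to the two weighted class means."""

    @staticmethod
    def _weighted_mean(data, labels, weights, target: int) -> float:
        rows = [(x, w) for x, lab, w in zip(data, labels, weights) if lab == target]
        return sum(x * w for x, w in rows) / sum(w for _, w in rows)

    def fit(self, data_train, label_train, weights, **kwargs) -> None:
        self.mean_1 = self._weighted_mean(data_train, label_train, weights, 1)
        self.mean_0 = self._weighted_mean(data_train, label_train, weights, 0)

    def predict(self, data_test: Sequence[float]) -> List[float]:
        # Positive when an item is closer to the label-1 mean than to the
        # label-0 mean, negative in the opposite case.
        return [abs(x - self.mean_0) - abs(x - self.mean_1) for x in data_test]


model = WeightedMeanWitness()
model.fit([0.0, 0.2, 1.0, 1.2], [0, 0, 1, 1], [1.0, 1.0, 1.0, 1.0])
scores = model.predict([0.0, 1.0])
```

Going by the signatures in this reference, autotst.functions.p_value takes a model instance, while the AutoTST class takes the model class together with keyword arguments used to initialize it.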

autotst.splitted_sets module

class autotst.splitted_sets.SplittedSets(training_set: NDArray[Any, Ellipsis, Any], test_set: NDArray[Any, Ellipsis, Any], training_labels: NDArray[1, UInt[int, unsignedinteger]], test_labels: NDArray[1, UInt[int, unsignedinteger]])

Bases: object

Class encapsulating datasets and labels, divided into training and testing sets.

__init__(training_set: NDArray[Any, Ellipsis, Any], test_set: NDArray[Any, Ellipsis, Any], training_labels: NDArray[1, UInt[int, unsignedinteger]], test_labels: NDArray[1, UInt[int, unsignedinteger]])
classmethod from_samples(sample_p: NDArray[Any, Ellipsis, Any], sample_q: NDArray[Any, Ellipsis, Any], split_ratio: float = 0.5) object

Creates a labeled dataset that concatenates the samples drawn from the distributions P and Q, and splits it into a training set and a testing set. Labels are binary, with value 1 for samples drawn from P and 0 for samples drawn from Q.

static split(X: NDArray[Any, Ellipsis, Any], Y: NDArray[Any, Ellipsis, Any], split_ratio: float) Tuple[NDArray[Any, Ellipsis, Any], NDArray[Any, Ellipsis, Any], NDArray[1, UInt[int, unsignedinteger]], NDArray[1, UInt[int, unsignedinteger]]]

Creates a labeled dataset that concatenates the samples X and Y (drawn from the distributions P and Q respectively), and splits it into a training set and a testing set. Labels are binary, with value 1 for samples drawn from P and 0 for samples drawn from Q. The returned tuple contains, in order: the training set, the testing set, the labels for the training set, and the labels for the testing set.

test_split() Tuple[int, int]

Similar to training_split, but for the testing set.

training_split() Tuple[int, int]

Returns the numbers p and q of items in the training set that have been drawn from the distributions P and Q, respectively. The first p items of the training set correspond to P, and the following q items correspond to Q.
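The labelling, splitting and counting described for this class can be sketched with plain lists. This is an illustration under assumptions rather than the package’s implementation: split_and_label is a hypothetical helper, split_ratio is taken to be the fraction of each sample that goes to training, sizes are rounded down, and no shuffling is performed. The ordering of P items before Q items follows the description of training_split.

```python
from typing import List, Sequence, Tuple


def split_and_label(
    sample_p: Sequence, sample_q: Sequence, split_ratio: float = 0.5
) -> Tuple[List, List, List[int], List[int]]:
    """Hypothetical helper returning sets and labels in the documented order."""
    # Number of items from each sample that go to the training set.
    n_p = int(len(sample_p) * split_ratio)
    n_q = int(len(sample_q) * split_ratio)
    training_set = list(sample_p[:n_p]) + list(sample_q[:n_q])
    test_set = list(sample_p[n_p:]) + list(sample_q[n_q:])
    # Label 1 marks items drawn from P, label 0 items drawn from Q.
    training_labels = [1] * n_p + [0] * n_q
    test_labels = [1] * (len(sample_p) - n_p) + [0] * (len(sample_q) - n_q)
    return training_set, test_set, training_labels, test_labels


sample_p = [10, 11, 12, 13]
sample_q = [20, 21]
train, test, train_labels, test_labels = split_and_label(sample_p, sample_q, 0.5)

# Counts of P and Q items per set, in the spirit of training_split / test_split.
training_split = (train_labels.count(1), train_labels.count(0))
test_split = (test_labels.count(1), test_labels.count(0))
```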

autotst.test module

class autotst.test.AutoTST(sample_p: NDArray[Any, Ellipsis, Any], sample_q: NDArray[Any, Ellipsis, Any], split_ratio: float = 0.5, model: Type[Model] = AutoGluonTabularPredictor, **model_kwargs)

Bases: object

AutoML Two-Sample Test

Constructor

Parameters
  • sample_p – Sample drawn from P

  • sample_q – Sample drawn from Q

  • split_ratio – Ratio that defines how much data is used for training the witness

  • model – Model used to learn the witness function

  • **model_kwargs

    Keyword arguments to initialize the model

Returns

None

__init__(sample_p: NDArray[Any, Ellipsis, Any], sample_q: NDArray[Any, Ellipsis, Any], split_ratio: float = 0.5, model: Type[Model] = AutoGluonTabularPredictor, **model_kwargs) None

fit_witness(**kwargs) None

Fit witness

Parameters

kwargs – Keyword arguments to be passed to fit method of model

Returns

None

interpret(k=1)

Return the k most typical examples from P and Q.

Returns

Tuple: (k most typical examples from P, k most typical examples from Q)

p_value(permutations: int = 1000, **fit_kwargs)

Run the complete pipeline and return p value with default settings.

Returns

p-value

p_value_evaluate(permutations: int = 10000) float

Evaluate p value.

Parameters

permutations – number of permutations when estimating the p-value

Returns

p value

split_data() SplittedSets

Splits and labels the data using the instance’s splitting ratio. The splits are stored as attributes and also returned.

Module contents