mvtsdatatoolkit.sampling package

Submodules

mvtsdatatoolkit.sampling.input_validator module

mvtsdatatoolkit.sampling.input_validator.validate_sampling_input(class_populations: dict, desired_ratios: dict = None, desired_populations: dict = None)

This function validates its three arguments against the following rules:

  • desired_ratios and desired_populations cannot both be None.

  • desired_ratios and desired_populations cannot both be given.

  • Class labels in desired_populations must match those in class_populations.

  • Class labels in desired_ratios must match those in class_populations.

  • Class populations in desired_populations MUST be either positive or -1.

  • Class ratios in desired_ratios MUST be either positive or -1. The positive values may be larger than 1.0.

Parameters
  • class_populations – A dictionary keyed by the class labels present in the data.

  • desired_ratios – The desired ratios of each class to be sampled.

  • desired_populations – The desired population of each class to be sampled.

Returns

True, if no exception was raised.
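The rules above can be restated in plain Python roughly as follows. This is a minimal sketch, not the library's implementation: the name check_sampling_input is hypothetical, the use of ValueError is an assumption (the documentation does not say which exceptions are raised), and "must match" is read here as "the same set of labels".

```python
def check_sampling_input(class_populations, desired_ratios=None,
                         desired_populations=None):
    """Hypothetical restatement of the documented rules (not the library code)."""
    # Exactly one of the two "desired" dictionaries must be supplied.
    if desired_ratios is None and desired_populations is None:
        raise ValueError("Provide desired_ratios or desired_populations.")
    if desired_ratios is not None and desired_populations is not None:
        raise ValueError("Provide only one of desired_ratios and desired_populations.")

    desired = desired_ratios if desired_ratios is not None else desired_populations

    # Assumption: "match" means the two label sets are identical.
    if set(desired) != set(class_populations):
        raise ValueError("Class labels do not match those in class_populations.")

    # Every value must be positive or the sentinel -1 (ratios may exceed 1.0).
    for label, value in desired.items():
        if value != -1 and value <= 0:
            raise ValueError(f"Invalid value for class {label!r}: {value}")
    return True
```

For instance, check_sampling_input({'A': 100, 'B': 400}, desired_ratios={'A': -1, 'B': 1.5}) passes, while omitting both dictionaries or passing a ratio of 0 raises.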

mvtsdatatoolkit.sampling.input_validator.validate_under_over_sampling_input(class_populations, minority_labels, majority_labels, base_minority=None, base_majority=None)

Validates the arguments of two methods of the class Sampler in sampling.sampler, namely undersample and oversample.

Parameters
  • class_populations – See the corresponding docstring in Sampler.

  • minority_labels – See the corresponding docstring in Sampler.

  • majority_labels – See the corresponding docstring in Sampler.

  • base_majority – See the corresponding docstring in Sampler.

  • base_minority – See the corresponding docstring in Sampler.

Returns

True, if no exception was raised.

mvtsdatatoolkit.sampling.sampler module

class mvtsdatatoolkit.sampling.sampler.Sampler(extracted_features_df: pandas.core.frame.DataFrame, label_col_name)

Bases: object

This class contains several methods that assist with sampling for a number of purposes, the primary one being to remedy the class-imbalance issue.

get_labels() → list

A getter method for the class_labels.

Returns

The class field class_labels; a list of all class labels in the data.

oversample(minority_labels: list, majority_labels: list, base_majority: str)

Oversamples the minority classes to achieve a 1:1 balance between the minority and majority classes. This is done in such a way that the outcome follows these criteria:

  • The majority classes have an equal population, equal to that of the base_majority class.

  • The minority classes have an equal population, such that the next criterion holds.

  • Total population of the minority classes is (oversampled to become) equal to the total population of the majority classes.

Example: Consider mvts data with 5 classes:

|A| = 100, |B| = 400, |C| = 300, |D| = 700, |E| = 2000

where

|A| + |B| = 500, |C| + |D| + |E| = 3000

and given is:

minority_labels = ['A', 'B'], majority_labels = ['C', 'D', 'E'], base_majority = 'D'.

Then, the sampled dataframe would have the following populations:

|A| = 2100/2, |B| = 2100/2, |C| = 700, |D| = 700, |E| = 700

where

|A| + |B| = 2100, |C| + |D| + |E| = 2100.
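The arithmetic of this example can be checked with a short sketch. The helper oversample_targets is hypothetical and only computes target population sizes from the figures above; it does not sample any rows and is not the library's code.

```python
def oversample_targets(class_populations, minority_labels, majority_labels,
                       base_majority):
    """Hypothetical helper: target sizes implied by the oversample example."""
    # Each majority class takes the population of the base_majority class.
    per_majority = class_populations[base_majority]
    total_majority = per_majority * len(majority_labels)
    # The minority classes share the same total equally.
    per_minority = total_majority / len(minority_labels)

    targets = {label: per_majority for label in majority_labels}
    targets.update({label: per_minority for label in minority_labels})
    return targets


populations = {'A': 100, 'B': 400, 'C': 300, 'D': 700, 'E': 2000}
targets = oversample_targets(populations, ['A', 'B'], ['C', 'D', 'E'], 'D')
# C, D, and E each get 700; A and B each get 2100 / 2 = 1050.
```

Note that under these targets class E shrinks from 2000 to 700, so the method name describes the effect on the minority classes rather than on every class.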

sample(desired_populations: dict = None, desired_ratios: dict = None)

Using this method, one can do either undersampling or oversampling in the most generic fashion. That is, the user determines the population sizes or ratios that they would like to get from the MVTS data.

Example: Consider MVTS data with 5 classes:

|A| = 100, |B| = 400, |C| = 300, |D| = 700, |E| = 2000

and given is:

desired_ratios = {'A': -1, 'B': -1, 'C': 0.33, 'D': 0.33, 'E': 0.33}

Then, the populations of classes A and B will not change, while the population size of each of the classes C, D, and E would be about one-third of the total population (3500 * 0.33 = 1155).

Note:
  1. Exactly one of the two arguments must be provided.

  2. The dictionary must contain all class labels present in the mvts dataframe.

  3. The number -1 can be used wherever the population or ratio should not change.

Parameters

  • desired_populations – A dictionary of label-integer pairs, where each integer specifies the desired population of the corresponding class. The integers must be positive, but -1 can be used to indicate that the population of the corresponding class should remain unchanged.

  • desired_ratios – A dictionary of label-float pairs, where each float specifies the desired ratio (with respect to the total sample size) of the corresponding class. The floats must be positive, but -1 can be used to indicate that the ratio of the corresponding class should remain unchanged.
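How the two argument styles translate into per-class target sizes can be sketched as below. The helper resolve_targets is hypothetical. It assumes that ratios are taken with respect to the total population of the original data (3500 in the example above) and that results are rounded to the nearest integer; the library may round differently.

```python
def resolve_targets(class_populations, desired_populations=None,
                    desired_ratios=None):
    """Hypothetical helper: turn desired populations or ratios into target sizes."""
    total = sum(class_populations.values())
    targets = {}
    if desired_populations is not None:
        for label, size in desired_populations.items():
            # -1 keeps the class population unchanged.
            targets[label] = class_populations[label] if size == -1 else size
    else:
        for label, ratio in desired_ratios.items():
            # Assumption: the ratio applies to the total original population.
            targets[label] = (class_populations[label] if ratio == -1
                              else round(ratio * total))
    return targets


populations = {'A': 100, 'B': 400, 'C': 300, 'D': 700, 'E': 2000}
ratios = {'A': -1, 'B': -1, 'C': 0.33, 'D': 0.33, 'E': 0.33}
targets = resolve_targets(populations, desired_ratios=ratios)
```

With these inputs, A and B keep their populations of 100 and 400, and C, D, and E are each assigned 1155.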

sample_each_class(input_dfs: pandas.core.frame.DataFrame, new_sample_size: int) → pandas.core.frame.DataFrame

This method samples new_sample_size instances from a given dataframe. If the desired sample size is larger than the original population (i.e., new_sample_size > input_dfs.shape[0]), then the entire population will be used, as well as the extra samples needed to achieve the desired sample size. The extra instances will be sampled from input_dfs with replacement.

Parameters
  • input_dfs – The input dataset to sample from.

  • new_sample_size – The size of the desired sample.

Returns

The sampled dataframe.
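The behavior described above can be illustrated on a plain list of rows using only the standard library. This is a sketch under assumptions, not the library's code: sample_rows is a hypothetical name, a list stands in for the dataframe, and a request no larger than the population is assumed to be drawn without replacement, which the description does not state.

```python
import random


def sample_rows(rows, new_sample_size, seed=None):
    """Hypothetical list-based analogue of sample_each_class."""
    rows = list(rows)
    rng = random.Random(seed)
    if new_sample_size <= len(rows):
        # Assumption: a smaller request is drawn without replacement.
        return rng.sample(rows, new_sample_size)
    if not rows:
        raise ValueError("Cannot oversample from an empty population.")
    # Keep the entire population, then draw the shortfall with replacement.
    extra = rng.choices(rows, k=new_sample_size - len(rows))
    return rows + extra


oversampled = sample_rows(['r1', 'r2', 'r3'], 5, seed=0)
undersampled = sample_rows(['r1', 'r2', 'r3'], 2, seed=0)
```

Here oversampled contains all three original rows plus two rows drawn again from them, and undersampled contains two distinct rows.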

undersample(minority_labels: list, majority_labels: list, base_minority: str)

Undersamples from the majority classes to achieve a 1:1 balance between the minority and majority classes. This is done in such a way that the outcome follows these criteria:

  • The minority classes have an equal population, equal to that of the base_minority class.

  • The majority classes have an equal population, such that the next criterion holds.

  • Total population of the majority classes is (undersampled to become) equal to the total population of the minority classes.

Example: Consider an mvts dataset with 5 classes, A, B, C, D, and E:

|A| = 100, |B| = 400, |C| = 300, |D| = 700, |E| = 2000

where

|A| + |B| = 500, |C| + |D| + |E| = 3000

and suppose given is:

minority_labels = ['A', 'B'], majority_labels = ['C', 'D', 'E'], base_minority = 'A'

Then, the sampled dataframe would have the following populations:

|A| = 100, |B| = 100, |C| = 200/3, |D| = 200/3, |E| = 200/3

where

|A| + |B| = 200, |C| + |D| + |E| = 200

Parameters
  • minority_labels – A list of class labels considered to be the minority classes.

  • majority_labels – A list of class labels considered to be the majority classes.

  • base_minority – The minority class label whose population sets the target population of every minority class.
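The undersampling example above can be verified with a short calculation. The helper undersample_targets is hypothetical and only computes target sizes; how the library rounds the non-integer value 200/3 to a whole number of rows is not stated in this documentation.

```python
def undersample_targets(class_populations, minority_labels, majority_labels,
                        base_minority):
    """Hypothetical helper: target sizes implied by the undersample example."""
    # Each minority class takes the population of the base_minority class.
    per_minority = class_populations[base_minority]
    total_minority = per_minority * len(minority_labels)
    # The majority classes share the same total equally (may be fractional).
    per_majority = total_minority / len(majority_labels)

    targets = {label: per_minority for label in minority_labels}
    targets.update({label: per_majority for label in majority_labels})
    return targets


populations = {'A': 100, 'B': 400, 'C': 300, 'D': 700, 'E': 2000}
targets = undersample_targets(populations, ['A', 'B'], ['C', 'D', 'E'], 'A')
# A and B each get 100; C, D, and E each get 200 / 3.
```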

Module contents