mvtsdatatoolkit.normalizing package

Submodules

mvtsdatatoolkit.normalizing.normalizer module

mvtsdatatoolkit.normalizing.normalizer.negativeone_one_normalize(df: pandas.core.frame.DataFrame, excluded_colnames: list = None) → pandas.core.frame.DataFrame

Applies the MinMaxScaler from the module sklearn.preprocessing to find the min and max of each column and transforms the values into the range of [-1,1]. The transformation is given by:

X_scaled = scale * X - 1 - X.min(axis=0) * scale
where:

scale = 2 / (X.max(axis=0) - X.min(axis=0))

Note: If the data is split across multiple dataframes (e.g., training and testing partitions), pass all of them to this method at once, as one single dataframe. Otherwise, the normalization will be based on local (as opposed to global) extrema, which is incorrect.

Parameters
  • df – The dataframe to be normalized.

  • excluded_colnames – The names of non-numeric columns (e.g., TimeStamp, ID) that must be excluded before normalization takes place. They will be added back to the normalized data.

Returns

The same dataframe as input, with the label column unchanged, except that now the numerical values are transformed into a [-1, 1] range.
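The behaviour described above can be sketched with scikit-learn directly. This is a minimal illustration, not the package's actual implementation: the helper name scale_to_negative_one_one, the column names, and the sample values are assumptions made for the example. It sets the excluded column aside, fits MinMaxScaler with feature_range=(-1, 1) on the remaining columns, and compares the result with the formula given for this function.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_to_negative_one_one(df, excluded_colnames=None):
    """Hypothetical stand-in: scale every non-excluded column into [-1, 1]."""
    excluded = list(excluded_colnames or [])
    numeric_cols = [c for c in df.columns if c not in excluded]
    out = df.copy()
    scaler = MinMaxScaler(feature_range=(-1, 1))
    # Excluded columns are left untouched in the copy.
    out[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    return out


# Made-up sample data; "ID" plays the role of a non-numeric column.
df = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "flux": [2.0, 4.0, 6.0, 10.0],
    "shear": [-5.0, 0.0, 5.0, 15.0],
})
scaled = scale_to_negative_one_one(df, excluded_colnames=["ID"])

# The same result computed from the documented formula.
X = df[["flux", "shear"]].to_numpy()
scale = 2 / (X.max(axis=0) - X.min(axis=0))
by_formula = scale * X - 1 - X.min(axis=0) * scale

print(scaled)
print(np.allclose(scaled[["flux", "shear"]].to_numpy(), by_formula))
```

Each column's minimum maps to -1 and its maximum to 1, independently of the other columns.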

mvtsdatatoolkit.normalizing.normalizer.robust_standardize(df: pandas.core.frame.DataFrame, excluded_colnames: list = None) → pandas.core.frame.DataFrame

Applies the RobustScaler from the module sklearn.preprocessing by removing the median and scaling the data according to the interquartile range (IQR). This transformation is robust to outliers.

Note: If the data is split across multiple dataframes (e.g., training and testing partitions), pass all of them to this method at once, as one single dataframe. Otherwise, the median and IQR will be computed locally for each partition rather than globally, and will not be representative of the full dataset, which is bad practice.

Parameters
  • df – The dataframe to be normalized.

  • excluded_colnames – The names of non-numeric columns (e.g., TimeStamp, ID) that must be excluded before normalization takes place. They will be added back to the normalized data.

Returns

The same dataframe as input, with the label column unchanged, except that now the numerical values are transformed into a new range determined by the IQR.
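As a rough illustration of why this scaling is robust to outliers, the sketch below applies RobustScaler with its default settings to a small made-up column containing one extreme value, then reproduces the result by hand as (x - median) / IQR. The column names and values are assumptions for the example, and the snippet calls scikit-learn directly rather than this package.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Made-up sample data; the last flux value is a deliberate outlier.
df = pd.DataFrame({
    "TimeStamp": pd.date_range("2020-01-01", periods=5, freq="D"),
    "flux": [1.0, 2.0, 3.0, 4.0, 100.0],
})

numeric_cols = ["flux"]  # "TimeStamp" is excluded from scaling
scaled = df.copy()
scaled[numeric_cols] = RobustScaler().fit_transform(df[numeric_cols])

# Manual check: subtract the median, divide by Q3 - Q1.
median = df["flux"].median()
iqr = df["flux"].quantile(0.75) - df["flux"].quantile(0.25)
manual = (df["flux"] - median) / iqr

print(scaled)
print(np.allclose(scaled["flux"], manual))
```

The median (3.0) and IQR (2.0) are unaffected by the outlier, so the four ordinary values keep a sensible spread around 0 while the outlier simply lands far from them.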

mvtsdatatoolkit.normalizing.normalizer.standardize(df: pandas.core.frame.DataFrame, excluded_colnames: list = None) → pandas.core.frame.DataFrame

Applies the StandardScaler from the module sklearn.preprocessing by removing the mean and scaling to unit variance. The transformation is given by:

\[z = (x - u) / s\]

where x is a feature vector, u is the mean of the vector, and s represents its standard deviation.

Note: If the data is split across multiple dataframes (e.g., training and testing partitions), pass all of them to this method at once, as one single dataframe. Otherwise, the standardization will be based on the local (as opposed to global) mean and standard deviation, which is incorrect.

Parameters
  • df – The dataframe to be normalized.

  • excluded_colnames – The names of non-numeric columns (e.g., TimeStamp, ID) that must be excluded before normalization takes place. They will be added back to the normalized data.

Returns

The same dataframe as input, with the label column unchanged, except that now the numeric values are transformed to have a mean of 0 and unit standard deviation.
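The formula z = (x - u) / s can be checked against StandardScaler on a small example. One detail worth knowing when verifying results by hand: StandardScaler uses the population standard deviation (ddof=0), whereas pandas' Series.std() defaults to the sample standard deviation (ddof=1). The data and column names below are made up for illustration, and the snippet uses scikit-learn directly rather than this package.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up sample data; "ID" is treated as an excluded column.
df = pd.DataFrame({"ID": [1, 2, 3, 4], "flux": [2.0, 4.0, 4.0, 6.0]})

numeric_cols = ["flux"]
scaled = df.copy()
scaled[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# Manual check of z = (x - u) / s with the population standard deviation.
u = df["flux"].mean()
s = df["flux"].std(ddof=0)
manual = (df["flux"] - u) / s

print(scaled)
print(np.allclose(scaled["flux"], manual))
```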

mvtsdatatoolkit.normalizing.normalizer.zero_one_normalize(df: pandas.core.frame.DataFrame, excluded_colnames: list = None) → pandas.core.frame.DataFrame

Applies the MinMaxScaler from the module sklearn.preprocessing to find the min and max of each column and transforms the values into the range of [0,1]. The transformation is given by:

X_scaled = (X - X.min(axis=0)) / range
where:

range = X.max(axis=0) - X.min(axis=0)

Note: If the data is split across multiple dataframes (e.g., training and testing partitions), pass all of them to this method at once, as one single dataframe. Otherwise, the normalization will be based on local (as opposed to global) extrema, which is incorrect.

Parameters
  • df – The dataframe to be normalized.

  • excluded_colnames – The names of non-numeric columns (e.g., TimeStamp, ID) that must be excluded before normalization takes place. They will be added back to the normalized data.

Returns

The same dataframe as input, with the label column unchanged, except that now the numerical values are transformed into a [0, 1] range.
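The note about passing all partitions as one dataframe can be made concrete with a small sketch. The helper zero_one below is a hypothetical pandas-only stand-in that applies the min-max formula above (it is not this package's function and assumes no column is constant), and the train/test frames are made-up data. Normalizing each partition separately maps the same raw value to different outputs; concatenating first, normalizing once, and splitting afterwards avoids that.

```python
import pandas as pd


def zero_one(df, excluded_colnames=None):
    """Hypothetical helper: apply the min-max formula column by column."""
    excluded = list(excluded_colnames or [])
    cols = [c for c in df.columns if c not in excluded]
    out = df.copy()
    col_min = df[cols].min(axis=0)
    col_range = df[cols].max(axis=0) - col_min  # assumes no constant column
    out[cols] = (df[cols] - col_min) / col_range
    return out


# Made-up partitions; note that 10.0 appears in both.
train = pd.DataFrame({"ID": ["a", "b", "c"], "flux": [0.0, 5.0, 10.0]})
test = pd.DataFrame({"ID": ["d", "e"], "flux": [10.0, 20.0]})

# Local extrema: 10.0 becomes 1.0 in train but 0.0 in test.
local_train = zero_one(train, ["ID"])
local_test = zero_one(test, ["ID"])

# Global extrema: concatenate, normalize once, then split again.
combined = pd.concat([train, test], keys=["train", "test"])
global_scaled = zero_one(combined, ["ID"])
global_train = global_scaled.loc["train"]
global_test = global_scaled.loc["test"]

print(local_train["flux"].tolist(), local_test["flux"].tolist())
print(global_train["flux"].tolist(), global_test["flux"].tolist())
```

With global extrema the raw value 10.0 maps to 0.5 in both partitions, so the scaled values remain comparable across them.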

Module contents