mvtsdatatoolkit.features package

Submodules

mvtsdatatoolkit.features.extractor_utils module

mvtsdatatoolkit.features.extractor_utils.calculate_one_mvts(df_mvts: pandas.core.frame.DataFrame, features_list: list) → pandas.core.frame.DataFrame[source]

This method computes a list of F statistical features on the given multivariate time series of P parameters. The output is a dataframe of dimension P X F, that looks like:

-----------------------
    f1    f2    ...
p1  val   val   ...
p2  val   val   ...
... ...   ...   ...
-----------------------

Note: The statistical features will be extracted from all the given columns. If features are needed for only some of the time series, pass in only those columns.

Parameters
  • df_mvts – An MVTS dataframe from which the features are to be extracted.

  • features_list – A list of all callable functions (from features.feature_collection) to be executed on the given MVTS.

Returns

A dataframe with the parameters as rows, and statistical features as columns.
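The P X F layout described above can be illustrated with plain pandas. The sketch below does not call calculate_one_mvts; the helper sketch_one_mvts, the two feature functions, and the sample column names are assumptions made only for this example.

```python
import pandas as pd


def ts_mean(uni_ts: pd.Series) -> float:
    """Hypothetical feature function: arithmetic mean of one time series."""
    return uni_ts.mean()


def ts_max(uni_ts: pd.Series) -> float:
    """Hypothetical feature function: maximum of one time series."""
    return uni_ts.max()


def sketch_one_mvts(df_mvts: pd.DataFrame, features_list: list) -> pd.DataFrame:
    """Apply every callable to every column; rows are parameters, columns are features."""
    data = [[f(df_mvts[param]) for f in features_list] for param in df_mvts.columns]
    feature_names = [f.__name__ for f in features_list]
    return pd.DataFrame(data, index=df_mvts.columns, columns=feature_names)


# Two made-up parameters (time series), four time steps each.
df_mvts = pd.DataFrame({"p1": [1.0, 2.0, 3.0, 4.0], "p2": [4.0, 4.0, 0.0, 0.0]})
df_features = sketch_one_mvts(df_mvts, [ts_mean, ts_max])
print(df_features)
```

Passing a column subset such as df_mvts[["p1"]] restricts the computation to the selected time series, as the note above suggests.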

mvtsdatatoolkit.features.extractor_utils.flatten_to_row_df(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

For a given dataframe of dimension P X F, where the row names (i.e., original_mvts’s indices) are the time series (i.e., parameters’) names, and the column names are the statistical features, this method flattens the given dataframe into a single-row dataframe of dimension 1 X (P X F). The column names in the resultant dataframe are derived from the given dataframe original_mvts by combining its row and column names.

For example, for a given original_mvts like the one below:

-----------------------------------------------
    f1    f2    ...
p1  val   val   ...
p2  val   val   ...
... ...   ...   ...
-----------------------------------------------

the column names in the output dataframe would be:

-----------------------------------------------
    p1_f1   p1_f2   ... p2_f1   p2_f2   ...
1   val     val         val     val
-----------------------------------------------

Parameters

df – The data frame to be flattened.

Returns

A dataframe with one row and P X F columns, with values similar to the given dataframe.
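A minimal sketch of the flattening idea, written with plain pandas rather than the library function itself. The helper name sketch_flatten and the underscore-joined naming are assumptions based on the description above.

```python
import pandas as pd


def sketch_flatten(df: pd.DataFrame) -> pd.DataFrame:
    """Flatten a P X F dataframe into one row with P * F columns."""
    # Combine each row name with each column name, row by row.
    names = [f"{row}_{col}" for row in df.index for col in df.columns]
    # Row-major reshape keeps the values aligned with the names above.
    values = df.to_numpy().reshape(1, -1)
    return pd.DataFrame(values, columns=names)


df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=["p1", "p2"], columns=["f1", "f2"])
flat = sketch_flatten(df)
print(flat)
```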

mvtsdatatoolkit.features.extractor_utils.get_methods_for_names(method_names: list)[source]

For each given method name, it finds the method in feature_collection and returns it as a callable.

Parameters

method_names – Names of the methods of interest that exist in feature_collection.

Returns

The callable methods whose names are given.
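Looking up callables by name can be sketched with getattr. The example below uses the standard library module statistics as a stand-in for feature_collection; the helper sketch_methods_for_names is hypothetical and is not the library's implementation.

```python
import statistics


def sketch_methods_for_names(method_names: list, module=statistics) -> list:
    """Resolve each name to the callable of the same name in the given module."""
    return [getattr(module, name) for name in method_names]


methods = sketch_methods_for_names(["mean", "median"])
# Each resolved callable can now be applied to a time series.
results = [method([1, 2, 3, 10]) for method in methods]
print(results)
```

A name that does not exist in the module raises AttributeError in this sketch.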

mvtsdatatoolkit.features.extractor_utils.split(l: list, n_of_partitions: int) → list[source]

Splits the given list l into n_of_partitions partitions of approximately equal size.

Parameters
  • l – The list to be split.

  • n_of_partitions – Number of partitions.

Returns

A list of the partitions, where each partition is a list itself.
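One possible way to produce partitions of approximately equal size is shown below. This is only a sketch under the assumption that leftover items go to the first partitions; the helper sketch_split is hypothetical and the library may distribute items differently.

```python
def sketch_split(items: list, n_of_partitions: int) -> list:
    """Split items into n_of_partitions contiguous chunks whose sizes differ by at most one."""
    base_size, remainder = divmod(len(items), n_of_partitions)
    partitions = []
    start = 0
    for i in range(n_of_partitions):
        # The first `remainder` partitions receive one extra item.
        size = base_size + (1 if i < remainder else 0)
        partitions.append(items[start:start + size])
        start += size
    return partitions


parts = sketch_split(list(range(7)), 3)
print(parts)
```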

mvtsdatatoolkit.features.feature_collection module

mvtsdatatoolkit.features.feature_collection.get_average_absolute_change(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The average absolute first difference of a univariate time series.
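As an illustration, the average absolute first difference can be computed in pure Python as follows. The helper name is hypothetical, and the sketch assumes a series with at least two values.

```python
def sketch_average_absolute_change(uni_ts: list) -> float:
    """Mean of |x[i+1] - x[i]| over all consecutive pairs."""
    diffs = [abs(b - a) for a, b in zip(uni_ts, uni_ts[1:])]
    return sum(diffs) / len(diffs)


# Consecutive absolute differences are 2, 1 and 4, so the result is 7 / 3.
value = sketch_average_absolute_change([1.0, 3.0, 2.0, 6.0])
print(value)
```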

mvtsdatatoolkit.features.feature_collection.get_average_absolute_derivative_change(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The average absolute first difference of a derivative of univariate time series.

mvtsdatatoolkit.features.feature_collection.get_avg_mono_decrease_slope(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The average slope of monotonically decreasing segments.

mvtsdatatoolkit.features.feature_collection.get_avg_mono_increase_slope(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The average slope of monotonically increasing segments.

mvtsdatatoolkit.features.feature_collection.get_dderivative_kurtosis(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], step_size: int = 1) → numpy.float64[source]
Returns

The kurtosis of the difference derivative of a univariate time series. The derivative is computed with the given step_size (default is 1).

mvtsdatatoolkit.features.feature_collection.get_dderivative_mean(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], step_size: int = 1) → numpy.float64[source]
Returns

The mean of the difference derivative of a univariate time series. The derivative is computed with the given step_size (default is 1).

mvtsdatatoolkit.features.feature_collection.get_dderivative_skewness(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], step_size: int = 1) → numpy.float64[source]
Returns

The skewness of the difference derivative of a univariate time series. The derivative is computed with the given step_size (default is 1).

mvtsdatatoolkit.features.feature_collection.get_dderivative_stddev(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], step_size: int = 1) → numpy.float64[source]
Returns

The standard deviation of the difference derivative of a univariate time series. The derivative is computed with the given step_size (default is 1).

mvtsdatatoolkit.features.feature_collection.get_difference_of_maxs(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The absolute difference between the maximums of the first and the second halves of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_difference_of_means(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The absolute difference between the means of the first and the second halves of a given univariate time series.
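The family of get_difference_of_* features compares a statistic over the two halves of a series. A minimal sketch for the mean is given below; the helper name is hypothetical, and splitting at len // 2 (so that the second half receives the extra point for odd lengths) is an assumption not confirmed by this page.

```python
from statistics import mean


def sketch_difference_of_means(uni_ts: list) -> float:
    """Absolute difference between the means of the first and second halves."""
    mid = len(uni_ts) // 2  # assumed split point
    return abs(mean(uni_ts[:mid]) - mean(uni_ts[mid:]))


# First half has mean 2.0, second half has mean 8.0.
value = sketch_difference_of_means([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
print(value)
```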

mvtsdatatoolkit.features.feature_collection.get_difference_of_medians(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The absolute difference between the medians of the first and the second halves of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_difference_of_mins(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The absolute difference between the minimums of the first and the second halves of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_difference_of_stds(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The absolute difference between the standard deviations of the first and the second halves of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_difference_of_vars(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The absolute difference between the variances of the first and the second halves of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_gderivative_kurtosis(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The kurtosis of the gradient derivative of the univariate time series.

mvtsdatatoolkit.features.feature_collection.get_gderivative_mean(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The mean of the gradient-derivative of univariate time series.

mvtsdatatoolkit.features.feature_collection.get_gderivative_skewness(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The skewness of the gradient derivative of the univariate time series.

mvtsdatatoolkit.features.feature_collection.get_gderivative_stddev(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The standard deviation of the gradient derivative of a univariate time series.

mvtsdatatoolkit.features.feature_collection.get_kurtosis(uni_ts: Union[pandas.core.series.Series, numpy.ndarray])[source]
Returns

The kurtosis of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_last_K(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], k: int) → pandas.core.series.Series[source]
Returns

The last k values in a univariate time series.

mvtsdatatoolkit.features.feature_collection.get_last_value(uni_ts) → numpy.float64[source]
Returns

The last value in a univariate time series. This may seem redundant since get_last_K can return the same value, but it is necessary because this method returns a single scalar rather than the sequence that get_last_K returns. This is especially important if the methods in this module are going to be called from a list.

mvtsdatatoolkit.features.feature_collection.get_linear_weighted_average(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]

Computes the linear weighted average of a univariate time series. For the values x_i in uni_ts, it computes the following:

2/(n*(n+1)) * sum(i * x_i)

where n is the length of the time series.

Returns

The linear weighted average of uni_ts.
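The formula above can be sketched in pure Python. The helper name is hypothetical, and the index i is assumed to run from 1 to n, which is what makes the weights sum to one under the 2/(n*(n+1)) normalization.

```python
def sketch_linear_weighted_average(uni_ts: list) -> float:
    """Compute 2 / (n * (n + 1)) * sum(i * x_i) with i starting at 1."""
    n = len(uni_ts)
    weighted_sum = sum(i * x for i, x in enumerate(uni_ts, start=1))
    return 2.0 * weighted_sum / (n * (n + 1))


# Later values weigh more: (1*1 + 2*2 + 3*3) * 2 / (3 * 4) = 28 / 12.
value = sketch_linear_weighted_average([1.0, 2.0, 3.0])
# A constant series is returned unchanged because the weights sum to one.
constant = sketch_linear_weighted_average([5.0, 5.0, 5.0, 5.0])
print(value, constant)
```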

mvtsdatatoolkit.features.feature_collection.get_longest_monotonic_decrease(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.int64[source]
Returns

The length of the time series segment with the longest monotonic decrease.

mvtsdatatoolkit.features.feature_collection.get_longest_monotonic_increase(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.int64[source]
Returns

The length of the time series segment with the longest monotonic increase.
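One way to find the longest monotonic increase is a single pass that tracks the current run. The sketch below assumes a strict increase and measures length in number of points; both are assumptions, and the library may count differently. The helper name is hypothetical.

```python
def sketch_longest_monotonic_increase(uni_ts: list) -> int:
    """Number of points in the longest strictly increasing contiguous segment."""
    if not uni_ts:
        return 0
    best = current = 1
    for prev, cur in zip(uni_ts, uni_ts[1:]):
        # Extend the run on an increase, otherwise start a new run at this point.
        current = current + 1 if cur > prev else 1
        best = max(best, current)
    return best


# The longest increasing segment is 2, 3, 4, 5.
length = sketch_longest_monotonic_increase([1, 2, 3, 2, 3, 4, 5, 1])
print(length)
```

The decreasing counterpart follows by flipping the comparison to cur < prev.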

mvtsdatatoolkit.features.feature_collection.get_longest_negative_run(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.int64[source]
Returns

The longest negative run in a univariate time series.

mvtsdatatoolkit.features.feature_collection.get_longest_positive_run(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.int64[source]
Returns

The longest positive run in a univariate time series.

mvtsdatatoolkit.features.feature_collection.get_max(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The maximum value of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_mean(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The arithmetic mean value of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_mean_last_K(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], k: int = 10) → numpy.float64[source]
Returns

The mean of the last k values in a univariate time series.

mvtsdatatoolkit.features.feature_collection.get_mean_local_maxima_value(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], only_positive: bool = False) → numpy.float64[source]

Returns the mean of local maxima values.

Parameters
  • uni_ts – Univariate time series.

  • only_positive – Only positive flag for local maxima. When True only positive local maxima are considered. Default is False.

Returns

Mean of local maxima values.

mvtsdatatoolkit.features.feature_collection.get_mean_local_minima_value(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], only_negative: bool = False) → numpy.float64[source]

Returns the mean of local minima values.

Parameters
  • uni_ts – Univariate time series.

  • only_negative – Only negative flag for local minima. When True only negative local minima are considered. Default is False.

Returns

Mean of local minima values.

mvtsdatatoolkit.features.feature_collection.get_median(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The median value of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_min(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The minimum value of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_negative_fraction(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The fraction of negative numbers in uni_ts.

mvtsdatatoolkit.features.feature_collection.get_no_local_extrema(uni_ts: Union[pandas.core.series.Series, numpy.ndarray])[source]
Returns

The number of local extrema in a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_no_local_maxima(uni_ts: Union[pandas.core.series.Series, numpy.ndarray])[source]
Returns

The number of local maxima in a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_no_local_minima(uni_ts: Union[pandas.core.series.Series, numpy.ndarray])[source]
Returns

The number of local minima in a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_no_mean_local_maxima_upsurges(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], only_positive: bool = False) → numpy.int64[source]

Returns the number of values in a given time series whose value is greater than the mean of local maxima values (# of upsurges).

Parameters
  • uni_ts – Univariate time series.

  • only_positive – Only positive flag for mean local maxima. When True only positive local maxima are considered. Default is False.

Returns

Number of points whose value is greater than mean local maxima.

mvtsdatatoolkit.features.feature_collection.get_no_mean_local_minima_downslides(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], only_negative: bool = False) → numpy.int64[source]

Returns the number of values in a given time series whose value is less than the mean of local minima values (# of downslides).

Parameters
  • uni_ts – Univariate time series.

  • only_negative – Only negative flag for mean local minima. When True only negative local minima are considered. Default is False.

Returns

Number of points whose value is less than mean local minima.

mvtsdatatoolkit.features.feature_collection.get_no_zero_crossings(uni_ts: Union[pandas.core.series.Series, numpy.ndarray])[source]
Returns

The number of zero-crossings in a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_positive_fraction(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The fraction of positive numbers in uni_ts.

mvtsdatatoolkit.features.feature_collection.get_quadratic_weighted_average(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]

Computes the quadratic weighted average of a univariate time series. For the values x_i in uni_ts, it computes the following:

6/(n*(n+1)*(2*n+1)) * sum(i^2 * x_i)

where n is the length of the time series.

Returns

The quadratic weighted average of uni_ts.
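A pure Python sketch of the quadratic weighted average follows. As with the linear variant, the helper name is hypothetical and the index i is assumed to start at 1, so that the squared weights are normalized by n*(n+1)*(2*n+1)/6.

```python
def sketch_quadratic_weighted_average(uni_ts: list) -> float:
    """Compute 6 / (n * (n + 1) * (2 * n + 1)) * sum(i^2 * x_i) with i starting at 1."""
    n = len(uni_ts)
    weighted_sum = sum((i ** 2) * x for i, x in enumerate(uni_ts, start=1))
    return 6.0 * weighted_sum / (n * (n + 1) * (2 * n + 1))


# (1*1 + 4*2 + 9*3) * 6 / (3 * 4 * 7) = 216 / 84.
value = sketch_quadratic_weighted_average([1.0, 2.0, 3.0])
# A constant series is returned unchanged because the weights sum to one.
constant = sketch_quadratic_weighted_average([5.0, 5.0, 5.0])
print(value, constant)
```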

mvtsdatatoolkit.features.feature_collection.get_skewness(uni_ts: Union[pandas.core.series.Series, numpy.ndarray])[source]
Returns

The skewness of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_slope_of_longest_mono_decrease(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]

Identifies the longest monotonic decrease and gets the slope.

Returns

The slope of the longest monotonic decrease in uni_ts.

mvtsdatatoolkit.features.feature_collection.get_slope_of_longest_mono_increase(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]

Identifies the longest monotonic increase and gets the slope.

Returns

The slope of the longest monotonic increase in uni_ts.

mvtsdatatoolkit.features.feature_collection.get_stddev(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The standard deviation of a given univariate time series.

mvtsdatatoolkit.features.feature_collection.get_sum_of_last_K(uni_ts: Union[pandas.core.series.Series, numpy.ndarray], k: int = 10) → numpy.float64[source]
Returns

The sum of the last k values in a univariate time series.

mvtsdatatoolkit.features.feature_collection.get_var(uni_ts: Union[pandas.core.series.Series, numpy.ndarray]) → numpy.float64[source]
Returns

The variance of a given univariate time series.

mvtsdatatoolkit.features.feature_extractor module

class mvtsdatatoolkit.features.feature_extractor.FeatureExtractor(path_to_config: str)[source]

Bases: object

An instance of this class can extract a set of given statistical features from a large number of MVTS data, in both sequential and parallel fashions. It loads the configuration file provided by the user and reads the following pieces of information from it.

The configuration entries are:
  • PATH_TO_MVTS: path to where the CSV (MVTS) files are stored.

  • MVTS_PARAMETERS: a list of time series names; only those listed here will be processed.

  • STATISTICAL_FEATURES: a list of statistical features to be computed on each time series.

  • META_DATA_TAGS: a list of tags used in the MVTS file names; to be used for extraction of some metadata from file names.

  • PATH_TO_EXTRACTED_FEATURES: path to a directory where the extracted features (one CSV file) will be stored.

Based on these values, it walks through the directory PATH_TO_MVTS and for each of the MVTS files, it computes the statistical features listed in STATISTICAL_FEATURES on all time series listed in MVTS_PARAMETERS. It uses the tags in META_DATA_TAGS to extract some metadata, such as class label, time stamp, id, etc.

The resultant dataframe (i.e., the extracted features) will have T X F + x columns, where F is the total number of features (i.e., len(STATISTICAL_FEATURES)), T is the total number of time series parameters (i.e., len(MVTS_PARAMETERS)), and x is the number of metadata items extracted from the file names (i.e., len(META_DATA_TAGS)).

In the extracted features dataframe, the column-name of the nominal attributes is of the following structure:

<TIME_SERIES_NAME>_<statistic_name>

For instance, for a time series named DENSITY and the statistical feature mean, the corresponding column-name would be DENSITY_mean.
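The column count and naming scheme described above can be checked with a small sketch. The helper name, the parameter FLUX, and the tags id and label are made up for illustration, and placing the metadata columns first is an assumption.

```python
def sketch_column_names(params: list, features: list, meta_tags: list) -> list:
    """Build the expected columns: x metadata columns followed by T * F feature columns."""
    feature_columns = [f"{param}_{feature}" for param in params for feature in features]
    return list(meta_tags) + feature_columns


columns = sketch_column_names(
    params=["DENSITY", "FLUX"],        # T = 2 time series parameters
    features=["mean", "max", "min"],   # F = 3 statistical features
    meta_tags=["id", "label"],         # x = 2 metadata tags
)
print(len(columns), columns)
```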

Note: In do_extraction_in_parallel, each child process takes a list of file names (not the actual files) that is a partition of the entire dataset, and works independently on the MVTS in that partition. Therefore, the memory consumption of using n child processes is almost equal to n times the amount used in the sequential mode. That is, the parallel mode does not increase memory consumption exponentially with respect to the number of children. The number of partitions is equal to the number of child processes (i.e., n_jobs).

do_extraction(params_name: list = None, params_index: list = None, features_name: list = None, features_index: list = None, first_k: int = None, need_interp: bool = True, partition: list = None, proc_id: int = None, verbose: bool = False, output_list: list = None)[source]

Computes (based on the metadata loaded in the constructor) all of the statistical features on the MVTS data (per time series; column-wise) and stores the results in the public class field df_all_features.

Note that the optional arguments can be skipped only if the configuration file passed to the class constructor contains a list of the desired parameters and features. Please keep the following in mind:

  • For parameters: a selected list of parameters (i.e., column names in MVTS data) must be provided either through the configuration file or the method argument params_name. Also, the argument params_index can be used to work with a smaller list of parameters if a list of parameters is already provided in the config file.

  • For features: a selected list of statistical features (i.e., functions available in features.feature_collection.py) MUST be provided, as mentioned above.

Parameters
  • params_name – (Optional) A list of parameter names of interest that can be used instead of the list MVTS_PARAMETERS given in the config file. If the list in the config file is NOT provided, then either this or params_index MUST be given.

  • params_index – (Optional) A list of column indices of interest that can be used instead of the list MVTS_PARAMETERS given in the config file. If the list in the config file is NOT provided, then either this or params_name MUST be given.

  • features_name – (Optional) A list of statistical features to be calculated on all time series of each MVTS file. The statistical features are the function names present in features.feature_collection.py. If they are not provided in the config file (under STATISTICAL_FEATURES), either this or features_index MUST be given.

  • features_index – (Optional) A list of indices corresponding to the features provided in the configuration file. If they are not provided in the config file (under STATISTICAL_FEATURES), either this or features_name MUST be given.

  • first_k – (Optional) If provided, only the first first_k MVTS files will be processed. This is mainly for getting some preliminary results in case the number of MVTS files is too large.

  • need_interp – True if a linear interpolation is needed to fill in the missing numerical values. This only takes care of the missing values and will not affect the existing ones. Set it to False otherwise. Default is True.

  • partition – (only for internal use)

  • proc_id – (only for internal use)

  • verbose – If set to True, the program prints on the console which files are being processed and what processes (if parallel) are doing the work. The default value is False.

  • output_list – (only for internal use)

Returns

None
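The effect that need_interp describes, filling missing values linearly while leaving existing ones untouched, can be illustrated with pandas' own interpolation. Whether do_extraction uses this exact call internally is an assumption; the sample series is made up.

```python
import pandas as pd

# A made-up univariate time series with two missing values.
uni_ts = pd.Series([1.0, float("nan"), 3.0, float("nan"), 5.0])

# Linear interpolation fills each gap from its neighbours; existing values are kept.
filled = uni_ts.interpolate(method="linear")
print(filled.tolist())
```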

do_extraction_in_parallel(n_jobs: int, params_name: list = None, params_index: list = None, features_name: list = None, features_index: list = None, first_k: int = None, need_interp: bool = True, verbose: bool = False)[source]

This method calls do_extraction in parallel (using the multiprocessing library) with n_jobs processes.

For more info about this method and each of its arguments, see documentation of do_extraction.

Parameters

n_jobs – The number of processes to be employed. This number will be used to partition the dataset in a way that each process gets approximately the same number of files to extract features from.

Returns

None

plot_boxplot(feature_names: list, output_path: str = None)[source]

Generates a plot of box-plots, one for each extracted feature.

Parameters
  • feature_names – A list of feature-names indicating the columns of interest for this visualization.

  • output_path – If given, the generated plot will be stored instead of shown. Otherwise, it will be only shown if the running environment allows it.

Returns

None

plot_correlation_heatmap(feature_names: list, output_path: str = None)[source]

Generates a heat-map for the correlation matrix of all pairs of given features.

Note: Regardless of the range of correlations, the color-map is fixed to [-1, 1]. This is especially important to avoid mapping insignificant changes of values into significant changes of colors.

Parameters
  • feature_names – A list of feature-names indicating the columns of interest for this visualization.

  • output_path – If given, the generated plot will be stored instead of shown. Otherwise, it will be only shown if the running environment allows it.

Returns

None

plot_covariance_heatmap(feature_names: list, output_path: str = None)[source]

Generates a heat-map for the covariance matrix of all pairs of given features.

Note that covariance is not a standardized statistic, and because of this, the color-map might be confusing; when the difference between the largest and smallest covariance is insignificant, the colors may imply a significant difference. To avoid this, the values mapped to the colors (as shown next to the color-map) must be carefully taken into account in the analysis of the covariance.

Parameters
  • feature_names – A list of feature-names indicating the columns of interest for this visualization.

  • output_path – If given, the generated plot will be stored instead of shown. Otherwise, it will be only shown if the running environment allows it.

Returns

None

plot_splom(feature_names: list, output_path: str = None)[source]

Generates a SPLOM, or a scatter plot matrix, for all pairs of features. Note that for a large number of features this may take a while (since each cell of the matrix is a scatter plot on its own), and also the final plot may become very large.

Parameters
  • feature_names – A list of feature-names indicating the columns of interest for this visualization.

  • output_path – If given, the generated plot will be stored instead of shown. Otherwise, it will be only shown if the running environment allows it.

Returns

None

plot_violinplot(feature_names: list, output_path: str = None)[source]

Generates a plot of violin-plots, one for each extracted feature.

Parameters
  • feature_names – A list of feature-names indicating the columns of interest for this visualization.

  • output_path – If given, the generated plot will be stored instead of shown. Otherwise, it will be only shown if the running environment allows it.

Returns

None

store_extracted_features(output_filename: str, verbose: bool = True)[source]

Stores the dataframe of extracted features, calculated in the method do_extraction, as a CSV file. The output directory is read from the configuration file, and the argument output_filename is used as the file name.

If the output directory given in the configuration file does not exist, it will be created recursively.

Parameters
  • output_filename – The name of the output CSV file for the calculated data frame. If the ‘.csv’ extension is not provided, it will be appended to the given name.

  • verbose – Set to False to prevent the output path from being printed on the console. Default is True.
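The two behaviours described for store_extracted_features, appending a missing ‘.csv’ extension and creating the output directory recursively, can be sketched with the standard library. The helper sketch_output_path is hypothetical and is not the library's implementation.

```python
import os
import tempfile


def sketch_output_path(output_dir: str, output_filename: str) -> str:
    """Return the full CSV path, creating output_dir (and any parents) if needed."""
    if not output_filename.endswith(".csv"):
        output_filename += ".csv"
    # exist_ok=True makes the call safe when the directory already exists.
    os.makedirs(output_dir, exist_ok=True)
    return os.path.join(output_dir, output_filename)


with tempfile.TemporaryDirectory() as tmp:
    target_dir = os.path.join(tmp, "a", "b")  # nested directory that does not exist yet
    path = sketch_output_path(target_dir, "features")
    dir_created = os.path.isdir(target_dir)
    print(os.path.basename(path), dir_created)
```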

Module contents