mvtsdatatoolkit.data_analysis package

Submodules

mvtsdatatoolkit.data_analysis.extracted_features_analysis module

class mvtsdatatoolkit.data_analysis.extracted_features_analysis.ExtractedFeaturesAnalysis(extracted_features_df: pandas.core.frame.DataFrame, exclude: list = None)[source]

Bases: object

This class is responsible for data analysis of the extracted statistical features. It takes the extracted features produced by features.feature_extractor.py (or features.feature_extractor_parallel.py) and provides some basic analytics as follows:

  • A histogram of classes,

  • The counts of the missing values,

  • A five-number summary for each extracted feature.

These summaries can be stored in a CSV file as well.

compute_summary()[source]

Using the extracted data, this method computes the basic statistics for each statistical feature (each column of extracted_features_df).

It populates the summary dataframe of the class with all the required data corresponding to each feature.

Below are the column names of the summary dataframe:
  • ‘Feature Name’: Contains the time series statistical feature name,

  • ‘Non-null Count’: Contains the number of non-null entries per feature,

  • ‘Null Count’: Contains the number of null entries per feature,

  • ‘Min’: Contains the minimum value of the feature (without considering the null/nan values),

  • ‘25th’: Contains the first quartile (25%) of the feature values (without considering the null/nan values),

  • ‘Mean’: Contains the mean of the feature values (without considering the null/nan values),

  • ‘50th’: Contains the median of the feature values (without considering the null/nan values),

  • ‘75th’: Contains the third quartile (75%) of the feature values (without considering the null/nan values),

  • ‘Max’: Contains the maximum value of the feature (without considering the null/nan values),

  • ‘Std. Dev’: Contains the standard deviation of the feature (without considering the null/nan values).

The computed summary will be stored in the class field summary.
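The per-feature statistics listed above can be illustrated with a minimal plain-Python sketch. This is not the library's implementation: the helper summarize_feature, the feature name feature_a, and the choice of sample standard deviation and linearly interpolated quartiles are all assumptions made for illustration.

```python
import math
import statistics


def summarize_feature(name, values):
    """Summarize one feature column, skipping None/NaN entries (hypothetical helper)."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    # Quartiles with linear interpolation between data points (an assumption).
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "Feature Name": name,
        "Non-null Count": len(present),
        "Null Count": len(values) - len(present),
        "Min": min(present),
        "25th": q1,
        "Mean": statistics.fmean(present),
        "50th": median,
        "75th": q3,
        "Max": max(present),
        "Std. Dev": statistics.stdev(present),  # sample standard deviation (an assumption)
    }


# 'feature_a' is a made-up feature name with two missing entries.
row = summarize_feature("feature_a", [1.0, 2.0, None, 3.0, float("nan"), 4.0, 5.0])
print(row)
```

Each call yields one row of a summary table; missing entries only affect the two count columns.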

get_class_population(label: str) → pandas.core.frame.DataFrame[source]

Gets the per-class population of the original dataset.

Parameters

label – The name of the column that holds the class labels.

Returns

A dataframe of two columns: the class labels and the class counts.
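As a rough illustration of what a per-class population is, the sketch below counts labels in a list of records using only the standard library. The function class_population, the column name lab, and the sample labels are hypothetical and stand in for the dataframe-based method documented here.

```python
from collections import Counter


def class_population(rows, label):
    """Count how many records carry each class label (hypothetical helper)."""
    counts = Counter(row[label] for row in rows)
    # Sort by label so the output order is deterministic.
    return sorted(counts.items())


# Made-up records; 'lab' plays the role of the class-label column.
rows = [
    {"id": 1, "lab": "N"},
    {"id": 2, "lab": "M"},
    {"id": 3, "lab": "N"},
    {"id": 4, "lab": "X"},
    {"id": 5, "lab": "N"},
]
population = class_population(rows, "lab")
print(population)
```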

get_five_num_summary() → pandas.core.frame.DataFrame[source]

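Returns the summary statistics of each extracted feature: the five-number summary (min, 25th, 50th, 75th, max) together with the mean and the standard deviation. This method does not compute the statistics; it only returns what was already computed by the compute_summary method.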

Returns

A dataframe where the columns are [Feature-Name, mean, std, min, 25th, 50th, 75th, max] and each row corresponds to the statistics on one of the extracted features.

get_missing_values() → pandas.core.frame.DataFrame[source]

Gets the missing-value counts for each extracted feature.

Returns

A dataframe of two columns: the extracted features (i.e., column names of extracted_features_df) and the missing-value counts.

print_summary()[source]

Prints the summary dataframe to the console.

summary_to_csv(output_path, file_name)[source]

Stores the summary statistics.

Parameters
  • output_path – Path to where the summary should be stored.

  • file_name – Name of the CSV file. If the extension is not given, ‘.csv’ will be appended to the given name.
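The extension rule for file_name can be sketched as follows. The helper resolve_csv_path is hypothetical, and the case-insensitive suffix check is an assumption about how "extension not given" might be interpreted, not a description of the library's code.

```python
import os


def resolve_csv_path(output_path, file_name):
    """Join the output folder and file name, appending '.csv' when it is missing."""
    if not file_name.lower().endswith(".csv"):
        file_name += ".csv"
    return os.path.join(output_path, file_name)


print(resolve_csv_path("out", "summary"))      # extension appended
print(resolve_csv_path("out", "summary.csv"))  # name left unchanged
```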

mvtsdatatoolkit.data_analysis.mvts_data_analysis module

class mvtsdatatoolkit.data_analysis.mvts_data_analysis.MVTSDataAnalysis(path_to_config)[source]

Bases: object

This class walks through a directory of CSV files (each being an MVTS) and calculates estimated statistics of each of the parameters.

It will perform the following tasks:
  1. Read each MVTS (.csv file) from the folder where the MVTS dataset is kept, e.g., /pet_datasets/subset_partition3. The folder is the root path listed in the configuration file, whose location (path_to_config) the user provides when creating an instance of this class.

  2. Perform Exploratory Data Analysis (EDA) on the MVTS dataset:
    1. Histogram of classes

    2. Missing-value count

    3. Six-number summary of each physical parameter (estimated values)

  3. Save the summary report as a CSV file in an output folder, e.g., /pet_datasets/mvts_analysis, using the summary_to_csv() method.

This class uses t-digest, a data structure for accurately accumulating rank-based statistics (such as quantiles) over streaming or distributed data. The TDigest module must be installed in order to use this data structure.

compute_summary(params_name: list = None, params_index: list = None, first_k: int = None, partition: list = None, proc_id: int = None, verbose: bool = False, output_list: list = None)[source]

By reading each CSV file from the path listed in the configuration file, this method calculates all the basic statistics with respect to each parameter (each column of the MVTS). As the data is distributed in several CSV files, this method computes the statistics on each MVTS and updates the computed stats in a streaming fashion.

Note: Computing the exact quantiles of the parameters globally would require loading the entire dataset into memory. To avoid this, the TDigest data structure is used to estimate them while loading one MVTS at a time.
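To make the streaming idea concrete, here is a minimal sketch of the statistics that can be updated exactly one chunk at a time (value count, null count, min, max, mean). The class RunningStats is hypothetical and is not part of the package; quantiles are deliberately left out, since those are the part that requires an estimator such as t-digest.

```python
import math


class RunningStats:
    """Accumulate exact count/min/max/mean without keeping the data (hypothetical)."""

    def __init__(self):
        self.count = 0
        self.null_count = 0
        self.total = 0.0
        self.minimum = math.inf
        self.maximum = -math.inf

    def update(self, values):
        """Fold one chunk (e.g., one column of one MVTS) into the running totals."""
        for v in values:
            if v is None or math.isnan(v):
                self.null_count += 1
                continue
            self.count += 1
            self.total += v
            self.minimum = min(self.minimum, v)
            self.maximum = max(self.maximum, v)

    @property
    def mean(self):
        return self.total / self.count if self.count else math.nan


stats = RunningStats()
# Each inner list stands in for the same parameter read from a different file.
for chunk in ([1.0, 4.0, None], [2.0, float("nan")], [9.0]):
    stats.update(chunk)
print(stats.count, stats.null_count, stats.minimum, stats.maximum, stats.mean)
```

Only a handful of numbers are kept per parameter, so memory use does not grow with the number of files.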

As it calculates the statistics, it populates self.summary dataframe of the class with all the required data corresponding to each parameter. Below are the column names of the summary dataframe:

  • ‘Parameter Name’: Contains the time series parameter name,

  • ‘Val-Count’: Contains the count of the values of all processed time series,

  • ‘Null Count’: Contains the number of null entries per parameter,

  • ‘Min’: Contains the minimum value of each parameter (without considering the null/nan values),

  • ‘25th’: Contains the first quartile (25%) of each parameter (without considering the null/nan values),

  • ‘Mean’: Contains the mean of each parameter (without considering the null/nan values),

  • ‘50th’: Contains the median of each parameter (without considering the null/nan values),

  • ‘75th’: Contains the third quartile (75%) of each parameter (without considering the null/nan values),

  • ‘Max’: Contains the maximum value of each parameter (without considering the null/nan values).

Parameters
  • first_k – (Optional) If provided, only the first k MVTS will be processed. This is mainly for getting some preliminary results in case the number of MVTS files is too large.

  • params_name – (Optional) User may specify the list of parameters for which statistical analysis is needed. If no params_name is provided by the user then all existing numeric parameters are included in the list.

  • params_index – (Optional) User may specify the list of indices corresponding to the parameters provided in the configuration file.

  • partition – Only for internal use. Ignore this.

  • proc_id – Only for internal use. Ignore this.

  • verbose – Set it to True if you want to see more details as the function is running. Default is False.

  • output_list – Only for internal use. Ignore this.

Returns

None

compute_summary_in_parallel(n_jobs: int, params_name: list = None, params_index: list = None, first_k: int = None, verbose: bool = False)[source]

This method calls compute_summary in parallel. For more details, see compute_summary’s documentation.

Parameters
  • n_jobs – The number of processes to be employed.

  • params_name – (Optional) User may specify the list of parameters for which statistical analysis is needed. If no params_name is provided by the user then all existing numeric parameters are included in the list.

  • params_index – (Optional) User may specify the list of indices corresponding to the parameters provided in the configuration file.

  • first_k – (Optional) If provided, only the first k MVTS will be processed. This is mainly for getting some preliminary results in case the number of MVTS files is too large.

  • verbose – Set it to True if you want to see more details as the function is running. Default is False.
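One way to picture a parallel run is to split the file list into n_jobs partitions, compute partial statistics per partition, and merge them. The sketch below runs the "workers" sequentially so that it stays deterministic; the helpers split_into_partitions, partial_stats and merge_partials, the file names, and the values are all hypothetical, and the package's actual partitioning scheme may differ.

```python
def split_into_partitions(items, n_jobs):
    """Split items into n_jobs contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_jobs)
    partitions, start = [], 0
    for i in range(n_jobs):
        end = start + base + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def partial_stats(values):
    """Statistics one worker would report for its share of the data."""
    return (len(values), sum(values), min(values), max(values))


def merge_partials(partials):
    """Combine per-worker (count, total, min, max) tuples into global statistics."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return {
        "count": count,
        "mean": total / count,
        "min": min(p[2] for p in partials),
        "max": max(p[3] for p in partials),
    }


# Made-up data: one value per file for a single parameter.
data = {"a.csv": [1.0], "b.csv": [2.0], "c.csv": [3.0], "d.csv": [4.0], "e.csv": [5.0]}
partitions = split_into_partitions(sorted(data), n_jobs=2)
partials = [partial_stats([v for f in part for v in data[f]]) for part in partitions]
merged = merge_partials(partials)
print(partitions, merged)
```

Counts, sums, minima and maxima merge exactly; quantile estimates would need a mergeable summary structure instead.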

get_average_mvts_size()[source]
Returns

The average size (in bytes) of the MVTS files located at the root directory listed in the configuration file.

get_missing_values() → pandas.core.frame.DataFrame[source]

Gets the missing-value counts for each parameter in the MVTS files.

Returns

A dataframe with two columns, namely the parameter names and the counts of the corresponding missing values.

get_number_of_mvts()[source]
Returns

The number of MVTS files located at the root directory listed in the configuration file.

get_six_num_summary() → pandas.core.frame.DataFrame[source]

Gets the six-number summary of each parameter in the MVTS files.

Returns

A dataframe where the rows are mean, min, 25th, 50th, 75th, max and the columns are the parameters of the given dataframe.

get_total_mvts_size()[source]
Returns

The total size (in bytes) of the MVTS files located at the root directory listed in the configuration file.
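The directory-level figures (number of files, total size, average size) can be approximated with the standard library alone. The function directory_stats is a hypothetical stand-in, and filtering on the ‘.csv’ suffix is an assumption; the example builds a temporary folder so that it runs anywhere.

```python
import os
import tempfile


def directory_stats(root):
    """Count the .csv files under root and report their total and average size in bytes."""
    sizes = [
        os.path.getsize(os.path.join(root, name))
        for name in sorted(os.listdir(root))
        if name.endswith(".csv")
    ]
    total = sum(sizes)
    return {
        "count": len(sizes),
        "total_bytes": total,
        "average_bytes": total / len(sizes) if sizes else 0.0,
    }


with tempfile.TemporaryDirectory() as root:
    # Two small CSV files plus one non-CSV file that should be ignored.
    for name, body in (("a.csv", "x,y\n1,2\n"), ("b.csv", "x,y\n"), ("notes.txt", "skip")):
        with open(os.path.join(root, name), "w", newline="") as fh:
            fh.write(body)
    dir_stats = directory_stats(root)

print(dir_stats)
```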

print_stat_of_directory()[source]

Prints a summary of the MVTS files located at the root directory listed in the configuration file.

Returns

None

print_summary()[source]

Prints the summary dataframe to the console.

summary_to_csv(output_path, file_name)[source]

Stores the summary statistics.

Parameters
  • output_path – Path to where the summary should be stored.

  • file_name – Name of the CSV file. If the extension is not given, ‘.csv’ will be appended to the given name.

Returns

None

Module contents