mvtsdatatoolkit.data_analysis package¶
Submodules¶
mvtsdatatoolkit.data_analysis.extracted_features_analysis module¶
-
class
mvtsdatatoolkit.data_analysis.extracted_features_analysis.
ExtractedFeaturesAnalysis
(extracted_features_df: pandas.core.frame.DataFrame, exclude: list = None)[source]¶ Bases:
object
This class is responsible for data analysis of the extracted statistical features. It takes the extracted features produced by features.feature_extractor.py (or features.feature_extractor_parallel.py) and provides some basic analytics as follows:
A histogram of classes,
The counts of the missing values,
A five-number summary for each extracted feature.
These summaries can be stored in a CSV file as well.
-
compute_summary
()[source]¶ Using the extracted data, this method computes the basic statistics for each statistical feature (each column of extracted_features_df).
It populates the summary dataframe of the class with all the required data corresponding to each feature.
- Below are the column names of the summary dataframe:
‘Feature Name’: Contains the time series statistical feature name,
‘Non-null Count’: Contains the number of non-null entries per feature,
‘Null Count’: Contains the number of null entries per feature,
‘Min’: Contains the minimum value of the feature (ignoring null/nan values),
‘25th’: Contains the first quartile (25%) of the feature values (ignoring null/nan values),
‘Mean’: Contains the mean of the feature values (ignoring null/nan values),
‘50th’: Contains the median of the feature values (ignoring null/nan values),
‘75th’: Contains the third quartile (75%) of the feature values (ignoring null/nan values),
‘Max’: Contains the maximum value of the feature (ignoring null/nan values),
‘Std. Dev’: Contains the standard deviation of the feature (ignoring null/nan values).
The computed summary will be stored in the class field summary.
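The columns listed above can be reproduced with plain pandas. The following is only a minimal sketch of that computation, not the toolkit's own implementation; the function name summarize_features and the demo column names are hypothetical.

```python
import numpy as np
import pandas as pd


def summarize_features(features_df: pd.DataFrame, exclude=None) -> pd.DataFrame:
    """Build one summary row per numeric column (hypothetical helper)."""
    exclude = exclude or []
    numeric = (features_df
               .drop(columns=exclude, errors="ignore")
               .select_dtypes(include="number"))
    rows = []
    for name in numeric.columns:
        col = numeric[name]
        # pandas skips NaN by default in all of the reductions below.
        rows.append({
            "Feature Name": name,
            "Non-null Count": int(col.count()),
            "Null Count": int(col.isnull().sum()),
            "Min": col.min(),
            "25th": col.quantile(0.25),
            "Mean": col.mean(),
            "50th": col.quantile(0.50),
            "75th": col.quantile(0.75),
            "Max": col.max(),
            "Std. Dev": col.std(),
        })
    return pd.DataFrame(rows)


# Made-up data standing in for an extracted-features dataframe.
demo = pd.DataFrame({
    "ID": ["a", "b", "c", "d"],
    "PARAM_mean": [1.0, 2.0, np.nan, 3.0],
    "PARAM_max": [10.0, 20.0, 30.0, 40.0],
})
summary = summarize_features(demo, exclude=["ID"])
```

Each row of summary then describes one feature, with null entries counted separately and left out of every statistic.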
-
get_class_population
(label: str) → pandas.core.frame.DataFrame[source]¶ Gets the per-class population of the original dataset.
- Parameters
label – The column-name corresponding to the class_labels.
- Returns
A dataframe of two columns; class_labels and class counts.
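A per-class population of this kind can be sketched with a pandas group-by. This is an illustration under assumptions, not the method's actual code: the helper name class_population, the column name LABEL, and the output column name Count are all hypothetical.

```python
import pandas as pd


def class_population(df: pd.DataFrame, label: str) -> pd.DataFrame:
    """Count the rows of each class in the column `label` (hypothetical helper)."""
    # groupby sorts the class labels and ignores rows whose label is missing.
    return df.groupby(label).size().reset_index(name="Count")


# Made-up labels standing in for the class column of a real dataset.
demo = pd.DataFrame({"LABEL": ["X", "N", "N", "M", "N"]})
population = class_population(demo, "LABEL")
```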
-
get_five_num_summary
() → pandas.core.frame.DataFrame[source]¶ Returns the five-number summary of each extracted feature, together with its mean and standard deviation. This method does not compute the statistics; it only returns what was already computed by the compute_summary method.
- Returns
A dataframe where the columns are [Feature-Name, mean, std, min, 25th, 50th, 75th,
max] and each row corresponds to the statistics on one of the extracted features.
mvtsdatatoolkit.data_analysis.mvts_data_analysis module¶
-
class
mvtsdatatoolkit.data_analysis.mvts_data_analysis.
MVTSDataAnalysis
(path_to_config)[source]¶ Bases:
object
This class walks through a directory of CSV files (each being an MVTS) and calculates estimated statistics of each of the parameters.
- It will perform the following tasks:
Read each MVTS (a .csv file) from the folder where the MVTS dataset is kept, e.g., /pet_datasets/subset_partition3. Parameter path_to_root will be provided by the user at the time of creating the instance of this class.
- Perform Exploratory Data Analysis (EDA) on the MVTS dataset
Histogram of classes
Missing Value count
Six-number summary of each physical parameter (estimated values)
The summary report can be saved as a .csv file in an output folder, e.g., /pet_datasets/mvts_analysis, using the summary_to_csv() method.
This class uses t-digest, a data structure for accurate accumulation of rank-based statistics in distributed systems. The TDigest module must be installed in order to use this data structure.
-
compute_summary
(params_name: list = None, params_index: list = None, first_k: int = None, partition: list = None, proc_id: int = None, verbose: bool = False, output_list: list = None)[source]¶ By reading each CSV file from the path listed in the configuration file, this method calculates all the basic statistics with respect to each parameter (each column of the MVTS). As the data is distributed in several CSV files, this method computes the statistics on each MVTS and updates the computed stats in a streaming fashion.
Note: Computing the quantiles of the parameters exactly would require loading the entire dataset into memory. To avoid this, the TDigest data structure is used to estimate them while loading one MVTS at a time.
As it calculates the statistics, it populates self.summary dataframe of the class with all the required data corresponding to each parameter. Below are the column names of the summary dataframe:
- `Parameter Name`: Contains the time series parameter name,
- `Val-Count`: Contains the count of the values of all processed time series,
- `Null Count`: Contains the number of null entries per parameter,
- `Min`: Contains the minimum value of each parameter (without considering the null/nan values),
- `25th`: Contains the first quartile (25%) of each parameter (without considering the null/nan values),
- `Mean`: Contains the `mean` of each parameter (without considering the null/nan values),
- `50th`: Contains the `median` of each parameter (without considering the null/nan values),
- `75th`: Contains the third quartile (75%) of each parameter (without considering the null/nan values),
- `Max`: Contains the `max` value of each parameter (without considering the null/nan values)
- Parameters
first_k – (Optional) If provided, only the first k MVTS will be processed. This is mainly for getting some preliminary results in case the number of MVTS files is too large.
params_name – (Optional) User may specify the list of parameters for which statistical analysis is needed. If no params_name is provided by the user then all existing numeric parameters are included in the list.
params_index – (Optional) User may specify the list of indices corresponding to the parameters provided in the configuration file.
partition – Only for internal use. Ignore this.
proc_id – Only for internal use. Ignore this.
verbose – Set it to True if you want to see more details as the function is running. Default is False.
output_list – Only for internal use. Ignore this.
- Returns
None
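The streaming update described for compute_summary can be illustrated with the statistics that have an exact running form: value count, null count, minimum, maximum, and mean. The sketch below uses only the standard library and reads one CSV source at a time. It does not estimate quantiles, which the class delegates to t-digest. The class RunningStats, the function summarize_stream, the parameter names TOTUSJH and USFLUX, and the comma-separated layout are assumptions made for illustration only.

```python
import csv
import io
import math


class RunningStats:
    """Accumulates count, null count, min, max and mean one value at a time."""

    def __init__(self):
        self.count = 0
        self.nulls = 0
        self.total = 0.0
        self.min = math.inf
        self.max = -math.inf

    def update(self, raw):
        # Empty, missing, non-numeric and NaN entries all count as nulls.
        try:
            value = float(raw)
        except (TypeError, ValueError):
            self.nulls += 1
            return
        if math.isnan(value):
            self.nulls += 1
            return
        self.count += 1
        self.total += value
        self.min = min(self.min, value)
        self.max = max(self.max, value)

    @property
    def mean(self):
        return self.total / self.count if self.count else math.nan


def summarize_stream(csv_sources, params):
    """Update the running statistics of each parameter, one source at a time."""
    stats = {p: RunningStats() for p in params}
    for source in csv_sources:
        for row in csv.DictReader(source):
            for p in params:
                stats[p].update(row.get(p))
    return stats


# Two tiny in-memory "files" standing in for MVTS files on disk.
mvts_1 = "Timestamp,TOTUSJH,USFLUX\n0,1.0,10\n1,,20\n"
mvts_2 = "Timestamp,TOTUSJH,USFLUX\n0,3.0,30\n1,5.0,nan\n"
stats = summarize_stream([io.StringIO(mvts_1), io.StringIO(mvts_2)],
                         ["TOTUSJH", "USFLUX"])
```

Only one source is open at any moment, so memory use does not grow with the number of files.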
-
compute_summary_in_parallel
(n_jobs: int, params_name: list = None, params_index: list = None, first_k: int = None, verbose: bool = False)[source]¶ This method calls compute_summary in parallel. For more details, see compute_summary’s documentation.
- Parameters
n_jobs – The number of processes to be employed.
params_name – (Optional) User may specify the list of parameters for which statistical analysis is needed. If no params_name is provided by the user then all existing numeric parameters are included in the list.
params_index – (Optional) User may specify the list of indices corresponding to the parameters provided in the configuration file.
first_k – (Optional) If provided, only the first k MVTS will be processed. This is mainly for getting some preliminary results in case the number of MVTS files is too large.
verbose – Set it to True if you want to see more details as the function is running. Default is False.
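A common way to parallelize a computation like this is to split the work into n_jobs partitions, compute partial statistics per partition, and merge the partial results. The sketch below illustrates that pattern only and is not the toolkit's code. It uses ThreadPoolExecutor from the standard library as a stand-in for separate processes, and every helper name in it is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(items, n_jobs):
    """Split `items` into n_jobs contiguous, nearly equal partitions."""
    size, extra = divmod(len(items), n_jobs)
    partitions, start = [], 0
    for i in range(n_jobs):
        end = start + size + (1 if i < extra else 0)
        partitions.append(items[start:end])
        start = end
    return partitions


def partial_stats(values):
    """Statistics of one partition: (count, sum, min, max)."""
    if not values:
        return (0, 0.0, None, None)
    return (len(values), sum(values), min(values), max(values))


def merge(partials):
    """Combine per-partition statistics into global ones."""
    partials = [p for p in partials if p[0] > 0]
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return {
        "count": count,
        "mean": total / count if count else float("nan"),
        "min": min((p[2] for p in partials), default=None),
        "max": max((p[3] for p in partials), default=None),
    }


# Made-up values standing in for the readings of one parameter.
values = [4.0, 8.0, 15.0, 16.0, 23.0, 42.0, 1.0]
partitions = split_into_partitions(values, n_jobs=3)
with ThreadPoolExecutor(max_workers=3) as pool:
    merged = merge(list(pool.map(partial_stats, partitions)))
```

Count, sum, minimum, and maximum merge exactly. Quantiles do not, which is why a mergeable sketch such as t-digest is needed for them.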
-
get_average_mvts_size
()[source]¶ - Returns
The average size (in bytes) of the MVTS files located at the root directory
listed in the configuration file.
-
get_missing_values
() → pandas.core.frame.DataFrame[source]¶ Gets the missing values counts for each parameter in the MVTS files.
- Returns
A dataframe with two columns, namely the parameter names and the counts of the
corresponding missing values.
-
get_number_of_mvts
()[source]¶ - Returns
The number of MVTS files located at the root directory listed in the
configuration file.
-
get_six_num_summary
() → pandas.core.frame.DataFrame[source]¶ Gets the six-number summary of each parameter in the MVTS files.
- Returns
A dataframe where the rows are mean, min, 25th, 50th, 75th, max and
the columns are the parameters of the given dataframe.
-
get_total_mvts_size
()[source]¶ - Returns
The total size (in bytes) of the MVTS files located at the root directory
listed in the configuration file.
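The count, total size, and average size of the files in a directory can be obtained with the standard library alone. The sketch below is an illustration, not the class's implementation; the helper mvts_file_sizes and the temporary demo files are hypothetical, and it assumes that every MVTS file carries a .csv extension.

```python
import os
import tempfile


def mvts_file_sizes(root):
    """Return the size in bytes of every .csv file directly under `root`."""
    return [os.path.getsize(os.path.join(root, name))
            for name in sorted(os.listdir(root))
            if name.endswith(".csv")]


# Build a throwaway directory with two CSV files and one unrelated file.
with tempfile.TemporaryDirectory() as root:
    demo_files = [("a.csv", "x,y\n1,2\n"),
                  ("b.csv", "x,y\n1,2\n3,4\n"),
                  ("notes.txt", "ignored")]
    for name, text in demo_files:
        # newline="" keeps "\n" as a single byte on every platform.
        with open(os.path.join(root, name), "w", newline="") as f:
            f.write(text)
    sizes = mvts_file_sizes(root)

number_of_mvts = len(sizes)
total_size = sum(sizes)
average_size = total_size / number_of_mvts if number_of_mvts else 0.0
```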