Solar Flare Prediction from Time Series of Solar Magnetic Field Parameters

A Track in the IEEE Big Data 2019 Big Data Cup

Description of the Data

As stated in the competition overview, our dataset relies mainly on Space-weather HMI Active Region Patches (SHARPs), available from the Joint Science Operations Center (JSOC). This data product is derived from solar vector magnetograms obtained by the Helioseismic and Magnetic Imager (HMI) onboard the Solar Dynamics Observatory (SDO). The processed dataset provided for this competition is a set of magnetic field parameters calculated from individual SHARPs. We have transformed the SHARPs input data into multivariate time series (MVTS) of magnetic field parameters and sliced the resulting MVTS into records twelve hours long with a sample cadence of twelve minutes.

The sliced MVTS are annotated with class labels, and for the purposes of the Big Data Cup challenge there are only two classes for participants to distinguish. An MVTS slice is labeled as flaring if it comes from a SHARP that has a major solar flare (M- or X-class) within the next twenty-four hours, and as non-flaring otherwise (i.e., the SHARP may have B- or C-class flares or no flares at all).

It is important to note that large flares (M- or X-class), which are the most commonly targeted in predictive analyses, are scarce. Our dataset is heavily imbalanced, with ~4K flaring samples and ~192K non-flaring samples in the training set. It is safe to assume that a similar imbalance holds in the testing data, but the specifics will not be disclosed for the duration of this competition.
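The figures above imply a fixed slice size and a strong class skew. The following is a minimal sketch of that arithmetic, assuming each slice holds exactly 12 h / 12 min time steps and using the approximate training counts quoted above. The helper `balanced_class_weights` is a hypothetical name, and inverse-frequency weighting is only one common way to handle imbalance, not something the competition prescribes.

```python
from datetime import timedelta

# Slice geometry as described in the text.
SLICE_LENGTH = timedelta(hours=12)
CADENCE = timedelta(minutes=12)

# Assumption: a slice contains exactly SLICE_LENGTH / CADENCE samples.
steps_per_slice = SLICE_LENGTH // CADENCE

# Approximate training-set counts quoted in the text (~4K and ~192K).
n_flaring = 4_000
n_non_flaring = 192_000


def balanced_class_weights(counts):
    """Return inverse-frequency weights: total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


weights = balanced_class_weights(
    {"flaring": n_flaring, "non_flaring": n_non_flaring}
)
imbalance_ratio = n_non_flaring / n_flaring

print(steps_per_slice)   # time steps per 12-hour slice
print(imbalance_ratio)   # non-flaring samples per flaring sample
print(weights)
```

With these approximate counts there are roughly 48 non-flaring slices for every flaring one, so plain accuracy is a poor guide: a model that always predicts "non-flaring" would already score about 98%.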

The magnetic field parameters used in this work are described in [Solar flare prediction using SDO/HMI vector magnetic field data with a machine-learning algorithm] and are listed in the table below. Parameter names listed in red are described in that paper but are not available in the SHARPs data accessible at JSOC. Our dataset also includes the maximum X-ray luminosity observed within +/- six minutes of the representative time step. This windowed maximum is used because the GOES satellite reports at a different cadence from the source of the other parameters.
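The windowed maximum described above can be sketched as follows. This is an illustration only, not the organizers' actual pipeline: the function name `windowed_max`, the one-minute cadence of the synthetic series, and the choice to treat the window boundaries as inclusive are all assumptions.

```python
from datetime import datetime, timedelta


def windowed_max(samples, step_time, half_window=timedelta(minutes=6)):
    """Return the largest value whose timestamp lies within
    step_time +/- half_window (boundaries inclusive, by assumption).

    samples: iterable of (timestamp, value) pairs.
    Returns None when no sample falls inside the window.
    """
    in_window = [v for t, v in samples if abs(t - step_time) <= half_window]
    return max(in_window) if in_window else None


# Synthetic higher-cadence series (assumed one sample per minute).
start = datetime(2019, 1, 1, 0, 0)
xray = [(start + timedelta(minutes=i), float(i % 10)) for i in range(30)]

# Representative time step on the 12-minute grid.
step_time = start + timedelta(minutes=12)
peak = windowed_max(xray, step_time)
print(peak)
```

Taking the maximum rather than the mean keeps short-lived peaks visible after resampling onto the coarser twelve-minute grid.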

More details on the file format can be found on the competition's Kaggle data page.

https://www.kaggle.com/c/bigdata2019-flare-prediction/data