[SOLVED] Deciding on how to scale data and which scaler to use?


I am trying to train an MLP with two Dense layers in Keras to make predictions on a small data set of around 100 univariate time series. The model should take the values of 6 days and predict the value for the 7th day. As input to the model, I first concatenate the time series one after another in a dataframe, as follows:

 ts1    var1
 ts1    var2
   ...
 ts1    varN
 ts2    var1
 ts2    var2
   ...
 ts2    varN
 ts3    var1
 ts3    var2
   ...
 ts3    varN
   ...
 ts100  var1
 ts100  var2
   ...
 ts100  varN

What is the best way to scale this data? First, should I scale each time series (ts_n) independently, so that I end up with 100 scalers? Or is it better to scale them all together (a single scaler) so that I don't lose the correlation between them? Or, since all of these time series are treated as the same feature, is there no point in preserving that correlation?

My second question is which scaling method to choose: MinMaxScaler or StandardScaler (both from sklearn)? Some time series behave quite differently from the others, and their values vary widely. If I use a min-max scaler, it will ignore these differences, right? So isn't it better to use StandardScaler, which (hopefully) takes the differences between the time series into account?

P.S. I should mention that after the scaling is done, I will create the timesteps, so the final data will look like this:

        timestep1 | timestep2 | timestep3 | timestep4 | timestep5 | timestep6 | timestep7
 ts1      var1    |   var2    |   var3    |   var4    |   var5    |   var6    |   var7    
 ts1      var2    |   var3    |   var4    |   var5    |   var6    |   var7    |   var8    
 ts1      var3    |   var4    |   var5    |   var6    |   var7    |   var8    |   var9
 ...
 ts2      var1    |   var2    |   var3    |   var4    |   var5    |   var6    |   var7    
 ts2      var2    |   var3    |   var4    |   var5    |   var6    |   var7    |   var8    
 ts2      var3    |   var4    |   var5    |   var6    |   var7    |   var8    |   var9
 ...
 ts100    var1    |   var2    |   var3    |   var4    |   var5    |   var6    |   var7
 ts100    var2    |   var3    |   var4    |   var5    |   var6    |   var7    |   var8
 ts100    var3    |   var4    |   var5    |   var6    |   var7    |   var8    |   var9
 ...

Solution

In general, I've found very little difference in performance between MinMaxScaler and StandardScaler. However, since it appears you'll be scaling your target variable as well, make sure the scaler is consistent with your output activation function. For example, if your output activation function is a ReLU, you won't be able to predict any negative values. In that case, I would lean towards MinMaxScaler, since all of your (training) targets will end up in the interval [0, 1].
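
To make the range argument concrete, here is a minimal sketch that runs both scalers on a made-up series. The numbers are purely illustrative and are not taken from the question; only MinMaxScaler and StandardScaler themselves come from the discussion above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single series. scikit-learn scalers expect a 2-D array
# of shape (n_samples, n_features), hence the reshape.
values = np.array([10.0, 12.0, 11.0, 30.0, 9.0, 13.0]).reshape(-1, 1)

minmax = MinMaxScaler()      # default feature_range is (0, 1)
standard = StandardScaler()  # subtracts the mean, divides by the std

scaled_minmax = minmax.fit_transform(values)
scaled_standard = standard.fit_transform(values)

# Min-max output stays within [0, 1] on the data it was fit on, so a
# non-negative output activation such as ReLU can reach every target.
print(scaled_minmax.min(), scaled_minmax.max())

# Standardized output is centered on 0, so some targets are negative
# and a ReLU output could never match them.
print((scaled_standard < 0).any())

# Predictions made in the scaled space can be mapped back afterwards.
restored = minmax.inverse_transform(scaled_minmax)
```

If you prefer StandardScaler, a linear output activation avoids the problem, since it does not restrict the sign of the prediction.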

Whether to scale the time series together or independently depends on the specific setting. If the series tend to show different time-dependent behaviors, it's likely better to scale them together so that the difference is preserved. If they all follow a similar pattern, scaling them independently will likely work best.
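
The two options can be sketched with pandas as follows. This is only an illustration under assumptions: the column names `series` and `value`, the toy numbers, and the helper `make_windows` are hypothetical, and min-max scaling is written out by hand so both variants are easy to compare.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format frame mirroring the question's layout:
# one row per observation, with a series id and a value.
df = pd.DataFrame({
    "series": ["ts1"] * 8 + ["ts2"] * 8,
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0,
              100.0, 120.0, 140.0, 160.0, 180.0, 200.0, 220.0, 240.0],
})

# Option A: scale everything together (one global min and max).
g_min, g_max = df["value"].min(), df["value"].max()
df["scaled_global"] = (df["value"] - g_min) / (g_max - g_min)

# Option B: scale each series independently (min and max per group).
# Note: a constant series would give a zero range and needs special care.
grouped = df.groupby("series")["value"]
s_min = grouped.transform("min")
s_max = grouped.transform("max")
df["scaled_per_series"] = (df["value"] - s_min) / (s_max - s_min)

def make_windows(values, n_in=6):
    """Slide over one series: n_in inputs followed by 1 target."""
    values = np.asarray(values)
    return np.array([values[i:i + n_in + 1]
                     for i in range(len(values) - n_in)])

# Window each series separately so that no window spans two series.
windows = np.vstack([
    make_windows(group["scaled_per_series"].to_numpy())
    for _, group in df.groupby("series")
])
X, y = windows[:, :6], windows[:, 6]  # 6 days in, 7th day out
```

With global scaling, the small-valued series `ts1` is squeezed into a narrow band near 0, while per-series scaling maps every series onto the full [0, 1] range and discards the difference in magnitude between them. In either case, computing the scaling statistics on the training portion only avoids leaking information from the test data.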

It's also worth considering other network architectures for time-series forecasting, e.g. RNNs.