I am trying to train an MLP model with two Dense layers in Keras to make predictions for a small data set of around 100 univariate time series. The model should take the values of 6 days and predict the value for the 7th day. As input to the model, I first concatenate these time series one after another in a dataframe, as follows:
ts1 var1
ts1 var2
...
ts1 varN
ts2 var1
ts2 var2
...
ts2 varN
ts3 var1
ts3 var2
...
ts3 varN
...
ts100 var1
ts100 var2
...
ts100 varN
What is the best way to scale this data? First, should I scale each time series (ts_n) independently, ending up with 100 scalers? Or is it better to scale them all together (a single scaler) so that I don't lose the correlation between them? Or, since all of these time series are treated as the same feature, is there no point in preserving that correlation?
My second question is which scaling method to choose: MinMaxScaler or StandardScaler (both from sklearn)? Some time series behave quite differently from the others and show large variations in their values. If I use a min-max scaler, it will ignore these differences, right? So isn't it better to use StandardScaler, which (hopefully) takes the differences between the time series into account?
P.S. I should mention that after the scaling is done, I will create timesteps, so the final data will look like this:
timestep1 | timestep2 | timestep3 | timestep4 | timestep5 | timestep6 | timestep7
ts1 var1 | var2 | var3 | var4 | var5 | var6 | var7
ts1 var2 | var3 | var4 | var5 | var6 | var7 | var8
ts1 var3 | var4 | var5 | var6 | var7 | var8 | var9
...
ts2 var1 | var2 | var3 | var4 | var5 | var6 | var7
ts2 var2 | var3 | var4 | var5 | var6 | var7 | var8
ts2 var3 | var4 | var5 | var6 | var7 | var8 | var9
...
ts100 var1 | var2 | var3 | var4 | var5 | var6 | var7
ts100 var2 | var3 | var4 | var5 | var6 | var7 | var8
ts100 var3 | var4 | var5 | var6 | var7 | var8 | var9
...
In general, I've found very little difference in performance between MinMaxScaler and StandardScaler. However, since it appears you'll be scaling your target variable as well, make sure you use a scaler that is consistent with your output activation function. For example, if your output activation function is a ReLU, you won't be able to predict any negative values. In that case, I would lean towards MinMaxScaler, since all of your targets will end up in the interval [0, 1].
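To make the activation point concrete, here is a minimal sketch in plain NumPy. The formulas mirror what MinMaxScaler (with its default range) and StandardScaler compute; the sample values and the helper names `relu` and `inverse_minmax` are made up for illustration and are not from the question.

```python
import numpy as np

# Hypothetical series values, used only for illustration.
series = np.array([12.0, 15.0, 9.0, 20.0, 18.0, 11.0, 14.0])

# Min-max scaling: every value lands in [0, 1].
lo, hi = series.min(), series.max()
minmax_scaled = (series - lo) / (hi - lo)

# Standardization: zero mean and unit variance, so some values are negative.
mean, std = series.mean(), series.std()
standard_scaled = (series - mean) / std


def relu(x):
    """ReLU activation: negative inputs are clipped to zero."""
    return np.maximum(x, 0.0)


# A ReLU output can represent every min-max target,
# but it cannot represent the negative standardized targets.
reachable_minmax = np.allclose(relu(minmax_scaled), minmax_scaled)
reachable_standard = np.allclose(relu(standard_scaled), standard_scaled)


def inverse_minmax(scaled):
    """Map scaled values (e.g. model predictions) back to original units."""
    return scaled * (hi - lo) + lo


restored = inverse_minmax(minmax_scaled)
```

With standardized targets, a linear output activation would avoid the clipping problem. In either case, predictions need to be mapped back to the original units with the inverse of whatever scaling was applied to the target.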
Whether to scale the time series together or independently may depend on the specific setting. If the scales of the series tend to have different time-dependent behaviors, it's likely good to scale them together so the difference is preserved. If they all follow a similar pattern, scaling them independently will likely work best.
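The two options can be compared side by side with a rough sketch like the one below. It uses plain NumPy min-max scaling on two made-up series; the names `minmax` and `make_windows` are hypothetical helpers, and the 6-in/1-out window follows the setup described in the question.

```python
import numpy as np

# Hypothetical toy data: two series on very different scales.
series = {
    "ts1": np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]),
    "ts2": np.array([100.0, 300.0, 200.0, 400.0, 300.0, 500.0, 400.0, 600.0]),
}


def minmax(x, lo, hi):
    """Min-max scale x using the given minimum and maximum."""
    return (x - lo) / (hi - lo)


# Option A: one scaler per series; each series spans [0, 1] on its own.
per_series = {k: minmax(v, v.min(), v.max()) for k, v in series.items()}

# Option B: one scaler fitted on all values; relative magnitudes are kept,
# so the small series is squeezed into a narrow band near zero.
all_values = np.concatenate(list(series.values()))
g_lo, g_hi = all_values.min(), all_values.max()
global_scaled = {k: minmax(v, g_lo, g_hi) for k, v in series.items()}


def make_windows(x, n_in=6):
    """Slice one series into rows of n_in inputs plus one target."""
    rows = np.array([x[i:i + n_in + 1] for i in range(len(x) - n_in)])
    return rows[:, :n_in], rows[:, n_in]


# Window each series separately so that no row mixes two series.
X_parts, y_parts = zip(*(make_windows(v) for v in per_series.values()))
X = np.vstack(X_parts)
y = np.concatenate(y_parts)
```

With per-series scaling, the scaler parameters for each series have to be kept so that predictions for that series can be converted back to its original units.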
It's also worth considering other network architectures for time-series forecasting, e.g. RNNs.