python numpy tensorflow keras scikit-learn

How do I pass actual time series, rather than single values, as input to sklearn's train_test_split?


I want to train an LSTM-based RNN for binary classification, using a TensorFlow Keras model with LSTM layers. To do so, I need training inputs and outputs as well as validation inputs and outputs, which I wanted to generate with sklearn's train_test_split.

def prepare_data(self, satellites):
        """
        Prepare time-series data for RNN.
        """
        feature_sequences = []
        labels = []
        
        for sat in satellites:
            if sat.manoeuvrability is not None:
                # Stack the orbital parameters as time-series features (epochs will be the time dimension)
                features = np.column_stack((
                    sat.apoapses,
                    sat.periapses,
                    sat.inclinations,
                    sat.mean_motions,
                    sat.eccentricities,
                    sat.semimajor_axes,
                    sat.orbital_energy
                ))
                feature_sequences.append(features)
                labels.append(sat.manoeuvrability)
        
        X = np.array(feature_sequences, dtype=object)
        y = np.array(labels)
        
        return train_test_split(X, y, test_size=0.2, random_state=42)

train_test_split returns None for me. Removing the dtype=object argument from the np.array call leads to this error:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (73,) + inhomogeneous part.

How do I properly build my feature array for sklearn's train_test_split if I want to pass whole time series as inputs? The time dependence is important in my case, so I really can't work around this by reducing each time series to its average or something similar.


Solution

  • I simplified your code to this:

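    import numpy as np
    from sklearn.model_selection import train_test_split

    def prepare_data():
        feature_sequences = []
        labels = []

        for i in range(10):
            # wrong? column_stack wraps the two values in a (1, 2) array,
            # so X ends up 3D with shape (10, 1, 2)
            features = np.column_stack((2*i*5, "hello"))
            # correct: a plain tuple gives X the 2D shape (10, 2)
            # features = (2*i*5, "hello")

            feature_sequences.append(features)
            labels.append(i)

        X = np.array(feature_sequences, dtype=object)
        y = np.array(labels)

        return train_test_split(X, y, test_size=0.2, random_state=42)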
    

    The returned split for the features is a 3D array, which it shouldn't be. Just replace the column_stack line with the commented-out tuple assignment, and the resulting split looks better.
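
If the time dimension has to be kept, as the question requires, one possible alternative (not part of the answer above) is to pad every sequence to a common length, so that X becomes a regular 3D float array of shape (samples, timesteps, features) that train_test_split can slice along the first axis. The sketch below is only an illustration: pad_sequences_to_array is a hypothetical helper, the data is random stand-in data rather than real satellite parameters, and zero-padding at the end is an assumption that may not suit every model.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def pad_sequences_to_array(sequences, pad_value=0.0):
    """Hypothetical helper: pad (timesteps, n_features) arrays of differing
    lengths at the end so they fit into one 3D float array."""
    max_len = max(seq.shape[0] for seq in sequences)
    n_features = sequences[0].shape[1]
    padded = np.full((len(sequences), max_len, n_features), pad_value, dtype=float)
    for i, seq in enumerate(sequences):
        # Copy the real time steps; the remaining rows keep pad_value.
        padded[i, :seq.shape[0], :] = seq
    return padded


# Random stand-in data (assumption): 10 sequences, 7 features each,
# with lengths 5, 6, ..., 14 to mimic satellites with different epoch counts.
rng = np.random.default_rng(0)
feature_sequences = [rng.normal(size=(5 + i, 7)) for i in range(10)]
labels = np.array([i % 2 for i in range(10)])

X = pad_sequences_to_array(feature_sequences)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.2, random_state=42
)

print(X.shape, X_train.shape, X_test.shape)
```

With this stand-in data, X has shape (10, 14, 7) and the split yields 8 training and 2 test sequences. In Keras, a Masking layer with mask_value=0.0 placed before the LSTM is one way to have the padded time steps skipped, provided no real time step consists entirely of zeros.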