python keras autoencoder data-integration

How to solve keras fit function error "All input arrays (x) should have the same number of samples"?


I'm following the example of Integrating CITEseq data with Deep Learning. The code worked until the third part of the example, where it is supposed to train the autoencoder. Since I'm new to keras models, I'm basically just copying and pasting the code, so I don't understand why the code on the website works and mine does not.

I've tried changing the fit function from

estimator = autoencoder.fit([X_scRNAseq, X_scProteomics],
                            [X_scRNAseq, X_scProteomics],
                            epochs = 100, batch_size = 128,
                            validation_split = 0.2, shuffle = True, verbose = 1)

to

estimator = autoencoder.fit([X_scRNAseq, X_scRNAseq],
                            [X_scRNAseq, X_scRNAseq],
                            epochs = 100, batch_size = 128,
                            validation_split = 0.2, shuffle = True, verbose = 1)

in order to get around the sample-count mismatch. That runs, but it does not train the autoencoder the way it is supposed to.

Both X_scRNAseq and X_scProteomics are numpy arrays with shapes of (36280, 8617) and (13, 8617), respectively. The model summary is:

Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
scRNAseq (InputLayer)           (None, 8617)         0                                            
__________________________________________________________________________________________________
scProteomics (InputLayer)       (None, 8617)         0                                            
__________________________________________________________________________________________________
Encoder_scRNAseq (Dense)        (None, 50)           430900      scRNAseq[0][0]                   
__________________________________________________________________________________________________
Encoder_scProteomics (Dense)    (None, 10)           86180       scProteomics[0][0]               
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (None, 60)           0           Encoder_scRNAseq[0][0]           
                                                                 Encoder_scProteomics[0][0]       
__________________________________________________________________________________________________
Bottleneck (Dense)              (None, 50)           3050        concatenate_1[0][0]              
__________________________________________________________________________________________________
Concatenate_Inverse (Dense)     (None, 60)           3060        Bottleneck[0][0]                 
__________________________________________________________________________________________________
Decoder_scRNAseq (Dense)        (None, 8617)         525637      Concatenate_Inverse[0][0]        
__________________________________________________________________________________________________
Decoder_scProteomics (Dense)    (None, 8617)         525637      Concatenate_Inverse[0][0]        
==================================================================================================
Total params: 1,574,464
Trainable params: 1,574,464
Non-trainable params: 0
__________________________________________________________________________________________________

The error I get when I try to apply the fit function is:

ValueError: All input arrays (x) should have the same number of samples. Got array shapes: [(36280, 8617), (13, 8617)]

Thank you!


Solution

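  • Keras expects the first axis of each input array to be the number of samples, and every input must have the same number of samples. Your X_scRNAseq has shape (36280, 8617) and X_scProteomics has shape (13, 8617), so the first axes (36280 and 13) disagree. The axis they share is the second one, 8617, which suggests the arrays are stored as (features, samples): 8617 cells, with 36280 genes and 13 proteins.

    The solution, I believe, is to transpose both X_scRNAseq and X_scProteomics so that the samples come first: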

    X_scRNAseq = np.swapaxes(X_scRNAseq, 0, 1)   #(8617, 36280)
    X_scProteomics = np.swapaxes(X_scProteomics, 0, 1)  #(8617, 13)
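    Then rebuild the model so that its input and decoder output layers match the new feature counts (36280 for scRNAseq and 13 for scProteomics; your current summary shows 8617 for both), and fit it: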

    estimator = autoencoder.fit([X_scRNAseq, X_scProteomics],
                                [X_scRNAseq, X_scProteomics],
                                epochs = 100, batch_size = 128,
                                validation_split = 0.2, shuffle = True, verbose = 1)
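
As a sanity check before calling fit, the transpose and the shape test can be sketched with NumPy alone. This is only a minimal illustration on small random stand-in arrays, not the tutorial's code: the helper name to_samples_first, the toy sizes (20 cells, 7 genes, 3 proteins) and the variable names are all made up for the example, and it assumes your arrays really are stored as (features, samples).

```python
import numpy as np

def to_samples_first(*arrays):
    """Transpose (features, samples) arrays to (samples, features) and
    check that they all describe the same number of samples.
    Hypothetical helper, not part of Keras or the tutorial."""
    transposed = [np.asarray(a).T for a in arrays]
    n_samples = {a.shape[0] for a in transposed}
    if len(n_samples) != 1:
        raise ValueError(
            f"Sample counts differ after transposing: {sorted(n_samples)}"
        )
    return transposed

# Toy stand-ins for the real data (assumed layout: features x cells)
rng = np.random.default_rng(0)
rna_features_first = rng.random((7, 20))      # 7 genes, 20 cells
protein_features_first = rng.random((3, 20))  # 3 proteins, 20 cells

X_rna, X_protein = to_samples_first(rna_features_first, protein_features_first)
print(X_rna.shape, X_protein.shape)  # (20, 7) (20, 3)

# Feature counts to use for the model's input (and matching output) layers
input_dims = [X_rna.shape[1], X_protein.shape[1]]
```

With the real arrays, input_dims would come out as [36280, 13], which are the widths the two input layers and their decoders need once the model is rebuilt.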