I am trying to work with nolearn and use the ConcatLayer to combine multiple inputs. It works great as long as every input has the same type and shape. I have three different types of inputs that will eventually produce a single scalar output value.
The first input is an image of dimensions (288,1001)
The second input is a vector of length 87
The third is a single scalar value
I am using Conv2DLayer(s) on the first input. For the second input I use either Conv1DLayer or DenseLayer (not sure which would be better, since I can't get far enough to see what happens). I'm not even sure how the third input should be set up, since it is only a single value I want to feed into the network.
The code blows up at the ConcatLayer with: 'Mismatch: input shapes must be the same except in the concatenation axis'
I would be forever grateful if someone could write out a super simple network structure that can take these types of inputs and output a single scalar value. I have been googling all day and simply cannot figure this one out.
In case it helps, the fit call looks like this; as you can see, I am passing a dictionary with an item for each type of input:
X = {'base_input': X_base, 'header_input': X_headers, 'time_input':X_time}
net.fit(X, y)
It is hard to answer this properly, because it depends. Without knowing what you are trying to do and what data you are working with, we are playing a guessing game here, so I have to fall back on general tips.
First, it is entirely reasonable that ConcatLayer complains: it simply does not make sense to append a scalar to the pixel values of an image. So you should think about what you actually want, which is most likely to combine the information from the three sources.
You are right to suggest processing the image with 2D convolutions and the sequence data with 1D convolutions. Since you want to produce a scalar value, you probably want to use dense layers later on to condense the information. So it is natural to keep the low-level processing of the three branches independent and concatenate them afterwards.
Something along the lines of:
Image -> conv -> ... -> conv -> dense -> ... -> dense -> imValues
Timeseries -> conv -> ... -> conv -> dense -> ... -> dense -> seriesValues
concatLayer([imValues, seriesValues, scalar]) -> dense -> ... -> dense with num_units=1
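A minimal sketch of this first option in Lasagne/nolearn terms (all filter counts and layer sizes here are placeholder assumptions, not tuned values, and I am assuming your arrays are shaped (batch, 1, 288, 1001), (batch, 1, 87) and (batch, 1)):

```python
import lasagne.layers as L
from lasagne.nonlinearities import linear
from nolearn.lasagne import NeuralNet

# Three named inputs; the names must match the keys of your X dict.
im_in  = L.InputLayer((None, 1, 288, 1001), name='base_input')
seq_in = L.InputLayer((None, 1, 87), name='header_input')
sc_in  = L.InputLayer((None, 1), name='time_input')

# Image branch: conv -> pool -> dense -> imValues
im = L.Conv2DLayer(im_in, num_filters=16, filter_size=3)
im = L.MaxPool2DLayer(im, pool_size=4)
im = L.DenseLayer(im, num_units=64)

# Timeseries branch: conv -> dense -> seriesValues
seq = L.Conv1DLayer(seq_in, num_filters=8, filter_size=3)
seq = L.DenseLayer(seq, num_units=16)

# Every branch is now flat (batch, features), so concatenating along
# axis=1 is legal; dense layers then condense everything to one value.
merged = L.ConcatLayer([im, seq, sc_in], axis=1)
hidden = L.DenseLayer(merged, num_units=32)
output = L.DenseLayer(hidden, num_units=1, nonlinearity=linear)

net = NeuralNet(
    layers=output,            # recent nolearn accepts the output layer directly
    regression=True,          # we want a scalar value, not class probabilities
    update_learning_rate=0.01,
    update_momentum=0.9,
    max_epochs=10,
)
```

Your existing net.fit(X, y) call should then work unchanged, as long as the arrays in X match the declared input shapes and y is a float32 array.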
Another, less often reasonable, option would be to inject the information into the low-level processing of the image. This might make sense if the local processing becomes much easier given knowledge of the scalar/timeseries.
This architecture might look like:
concatLayer([seriesValues, scalar]) -> dense -> ... -> reshape((-1, N, 1, 1))
-> Upscale2DLayer(Image.shape[2:]) -> globalInformation
concatLayer([globalInformation, Image]) -> 2D conv with filter_size=1 -> conv -> ... -> conv
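The same idea as a sketch, reusing the layers from the first example (the 8-unit bottleneck stands in for N and is an arbitrary choice):

```python
# Condense the non-image information, tile it across the image plane,
# and merge it with the image as additional channels.
side = L.ConcatLayer([seq, sc_in], axis=1)                # seriesValues + scalar
side = L.DenseLayer(side, num_units=8)                    # (batch, 8)
side = L.ReshapeLayer(side, (-1, 8, 1, 1))                # (batch, 8, 1, 1)
side = L.Upscale2DLayer(side, scale_factor=(288, 1001))   # (batch, 8, 288, 1001)

# Channel-wise concat with the raw image, then a 1x1 conv to mix them.
merged = L.ConcatLayer([im_in, side], axis=1)             # (batch, 9, 288, 1001)
mixed  = L.Conv2DLayer(merged, num_filters=16, filter_size=1)
# ... continue with the usual conv stack from here
```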
Note that you will almost certainly want to go with the first option.
One unrelated thing I noticed is the huge size of your input image. You should reduce it (by resizing or taking patches). Unless you have a gigantic load of data and tons of memory and computing power, you will otherwise either overfit or waste hardware.
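For example, with scikit-image (the target size (72, 250) is an arbitrary choice, roughly a factor-of-four reduction per axis):

```python
import numpy as np
from skimage.transform import resize

# Downscale each (288, 1001) image; pick a target size that still
# preserves the structure your network needs to see.
X_base_small = np.stack(
    [resize(img, (72, 250), preserve_range=True) for img in X_base]
).astype(np.float32)
```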