[SOLVED] Talos.Scan() stops short without error before completing permutations

I have tried several ways to debug this, but I can't get Talos to run more than a few permutations before it stops, with no hint as to the problem. The scenario seems quite simple, so what am I doing wrong?

Input data is available here.

Following is my model function, parameter space and talos.Scan() call. The full code is available here.

# Create, compile and fit network
# This is rewritten for talos hyperparameter optimization
# Removed kernel_initializer='normal' from dense layers from example. Default is glorot_uniform
def createNetworkAndFit(trainVectors, trainLabels, validationVectors, validationLabels, params):
    # Create model
    model = Sequential()
    model.add(Dense(params['first_neuron'], input_dim=trainVectors.shape[1], activation=params['activation']))
    model.add(Dropout(params['dropout']))
    talos.model.layers.hidden_layers(model, params, 1)
    model.add(Dense(1, activation=params['last_activation']))
    # Compile model
    model.compile(loss=params['losses'], optimizer=params['optimizer'](), metrics=['accuracy', fmeasure_acc, 'mean_squared_error'])
    # Fit model
    history = model.fit(trainVectors, trainLabels, validation_data=(validationVectors, validationLabels), batch_size=params['batch_size'], epochs=params['epochs'], verbose=0)
    return history, model

# Define hyperparameter space
# As hidden layers are generated, "last_neuron" is the number of hidden units.
# Does this mean all hidden layers have the same number of hidden units?
p = {'first_neuron': [trainVectors.shape[1]],
    'dropout': [0, 0.25, 0.5],
    'hidden_layers': [2, 3],
    'shapes': ['brick', 'funnel'],
    'batch_size': [trainVectors.shape[0], int(trainVectors.shape[0]/10), int(trainVectors.shape[0]/100), int(trainVectors.shape[0]/1000)],
    'epochs': [300],
    'optimizer': [Nadam, Adam, RMSprop],
    'losses': [binary_crossentropy],
    'activation': [relu, elu],
    'last_activation': ['sigmoid']}

# Hyperparameter Search
experiment = talos.Scan(x=trainVectors,
                        y=trainLabels,
                        model=createNetworkAndFit,
                        grid_downsample=0.01,
                        params=p,
                        dataset_name='15000_talos',
                        experiment_no='1',
                        print_params=True,
                        disable_progress_bar=True,
                        clear_tf_session=True,
                        debug=True)

Here is my output:

Using TensorFlow backend.
{'batch_size': 312, 'hidden_layers': 3, 'activation': <function relu at 0x7f77e75e9510>, 'epochs': 300, 'optimizer': <class 'keras.optimizers.Nadam'>, 'shapes': 'brick', 'last_activation': 'sigmoid', 'losses': <function binary_crossentropy at 0x7f777dee6ae8>, 'first_neuron': 52, 'dropout': 0.25}
2019-06-02 10:46:45.248187: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-02 10:46:45.293153: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-02 10:46:45.293569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: GeForce GTX 780 major: 3 minor: 5 memoryClockRate(GHz): 0.941
pciBusID: 0000:01:00.0
totalMemory: 2.95GiB freeMemory: 2.84GiB
2019-06-02 10:46:45.293595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-06-02 10:46:45.478345: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-02 10:46:45.478378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-06-02 10:46:45.478395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-06-02 10:46:45.478491: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2560 MB memory) -> physical GPU (device: 0, name: GeForce GTX 780, pci bus id: 0000:01:00.0, compute capability: 3.5)
{'batch_size': 3120, 'hidden_layers': 3, 'activation': <function elu at 0x7f77e75e92f0>, 'epochs': 300, 'optimizer': <class 'keras.optimizers.RMSprop'>, 'shapes': 'brick', 'last_activation': 'sigmoid', 'losses': <function binary_crossentropy at 0x7f777dee6ae8>, 'first_neuron': 52, 'dropout': 0.5}
2019-06-02 10:46:56.373641: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-06-02 10:46:56.373692: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-02 10:46:56.373707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0 
2019-06-02 10:46:56.373712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N 
2019-06-02 10:46:56.373799: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2560 MB memory) -> physical GPU (device: 0, name: GeForce GTX 780, pci bus id: 0000:01:00.0, compute capability: 3.5)

EDIT1

I noticed some of my params in p were not used in the model function. After changing that, the search still stops short. I've edited the code above.

Solution

The problem was my choice of grid_downsample (0.01). That fraction was too small for the number of possible permutations in the grid, so only a handful of permutations were sampled and the scan then finished normally rather than failing. It would be great if Talos provided more feedback on the size of the grid in relation to random downsampling. This is the Scan() call I ended up with:

# Hyperparameter Search
experiment = talos.Scan(x=trainVectors,
                        y=trainLabels,
                        model=createNetworkAndFit,
                        grid_downsample=1,
                        params=p,
                        dataset_name='15000_talos',
                        experiment_no='1',
                        print_params=True,
                        disable_progress_bar=True,
                        clear_tf_session=True,
                        debug=True)
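
To see why 0.01 was too small, it helps to count the permutations before calling Scan(). The following is a minimal sketch in plain Python, not part of Talos: grid_size() and expected_runs() are hypothetical helper names, and the stand-in dictionary only mirrors the number of values per key in p above (the actual values do not matter for the count). How Talos rounds the downsampled count internally is an assumption here, so treat the second number as an estimate.

```python
from math import prod  # requires Python 3.8+

# Stand-in for the parameter space `p` above: same keys and the same
# number of candidate values per key; the values are placeholders.
p_example = {
    'first_neuron': ['n_features'],
    'dropout': [0, 0.25, 0.5],
    'hidden_layers': [2, 3],
    'shapes': ['brick', 'funnel'],
    'batch_size': ['N', 'N/10', 'N/100', 'N/1000'],
    'epochs': [300],
    'optimizer': ['Nadam', 'Adam', 'RMSprop'],
    'losses': ['binary_crossentropy'],
    'activation': ['relu', 'elu'],
    'last_activation': ['sigmoid'],
}


def grid_size(params):
    """Number of permutations in the full grid (product of list lengths)."""
    return prod(len(values) for values in params.values())


def expected_runs(params, grid_downsample):
    """Rough number of permutations left after random downsampling.

    Assumption: the downsample fraction is applied to the full grid size;
    the exact rounding used inside Talos may differ.
    """
    return grid_size(params) * grid_downsample


print(grid_size(p_example))                        # 288
print(f"{expected_runs(p_example, 0.01):.2f}")     # 2.88
```

With 288 permutations in total, a fraction of 0.01 leaves only about two or three runs, which is consistent with the two parameter sets printed in the output above. Checking this number first makes it easier to pick a grid_downsample value that yields a useful number of runs.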