python, azure, tensorflow, azure-databricks, vgg-net

Saving Custom TableNet Model (VGG19 based) for table extraction - Azure Databricks


I have a model based on TableNet and VGG19. Both the training data (the Marmot dataset) and the save path are mapped to Data Lake storage (using Azure).

I'm trying to save it in the following ways and get the following errors on Databricks:

  1. First approach:

    import pickle
    pickle.dump(model, open(filepath, 'wb'))
    

    This saves the model and gives the following output:

    WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 5 of 31). These functions will not be directly callable after loading.
    

    Now when I try to reload the model using:

    loaded_model = pickle.load(open(filepath, 'rb'))
    

    I get the following error (Databricks also shows the entire stderr and stdout in addition to this, but this is the gist):

    ValueError: Unable to restore custom object of type _tf_keras_metric. Please make sure that any custom layers are included in the `custom_objects` arg when calling `load_model()` and make sure that all layers implement `get_config` and `from_config`.
    
  2. Second approach:

    model.save(filepath)
    

    and I get the following error:

    Fatal error: The Python kernel is unresponsive.
    The Python process exited with exit code 139 (SIGSEGV: Segmentation fault).
    
    The last 10 KB of the process's stderr and stdout can be found below. See driver logs for full logs.
    ---------------------------------------------------------------------------
    Last messages on stderr:
    Mon Jan  9 08:04:31 2023 Connection to spark from PID  1285
    Mon Jan  9 08:04:31 2023 Initialized gateway on port 36597
    Mon Jan  9 08:04:31 2023 Connected to spark.
    2023-01-09 08:05:53.221618: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
    

    and much more. It's hard to locate the actual error in all of that output: Databricks shows the entire stderr and stdout, including everything printed during training, which makes the real failure very hard to find.

  3. Third approach (explored only briefly):

    I also tried:

    model.save_weights(weights_path)
    

    but once again I was unable to reload them (this is the approach I tried the least; see the sketch just after this list for what a weights-only round trip typically requires).
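
For context on that third approach: save_weights stores only the parameters, so reloading requires first rebuilding the exact same architecture and then loading the weights into it. A minimal sketch of that round trip, where build_tablenet_model is a hypothetical placeholder for whatever function constructs the TableNet/VGG19 model:

    # Rebuild the same architecture, then load the saved parameters into it.
    # build_tablenet_model() is a placeholder, not a function from the question.
    model = build_tablenet_model()
    model.load_weights(weights_path)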


I also tried saving checkpoints by adding this:

model_checkpoint = tf.keras.callbacks.ModelCheckpoint(filepath, monitor = "val_table_mask_loss", verbose = 1, save_weights_only=True)

as a callback in the fit method (callbacks=[model_checkpoint]), but at the end of the first epoch it generates the following error (I show only the end of the traceback):

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/_objects.pyx in h5py._objects.with_phil.wrapper()
h5py/h5f.pyx in h5py.h5f.create()
OSError: Unable to create file (file signature not found)
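
A plausible explanation, assuming filepath points at a DBFS/FUSE-mounted location (see Update 1 below): writing HDF5 files requires random-access writes, which FUSE mounts on Databricks generally do not support, so h5py cannot create the file. A sketch of the common workaround, checkpointing to the driver's local disk and copying to DBFS afterwards; the paths here are illustrative:

    import tensorflow as tf

    # Checkpoint to the driver's local disk, which supports the random writes HDF5 needs
    local_ckpt = "/tmp/tablenet_ckpt.h5"
    model_checkpoint = tf.keras.callbacks.ModelCheckpoint(
        local_ckpt, monitor="val_table_mask_loss", verbose=1, save_weights_only=True
    )
    # ... train with callbacks=[model_checkpoint], then copy the checkpoint out.
    # dbutils is available in Databricks notebooks without an import.
    dbutils.fs.cp("file:" + local_ckpt, "dbfs:/mnt/<your-mount>/tablenet_ckpt.h5")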

When I use the second approach on a platform other than Databricks, it works fine, but when I then try to load the model I get an error similar to the loading error from the first approach.


Update 1

My filepath variable that I try to save to is a DBFS reference, and that DBFS path is mapped to the Data Lake storage.
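
That mapping is likely relevant: saving directly to a DBFS/FUSE path can fail for formats that need random-access writes (such as .h5), which would fit both the segmentation fault and the h5py error above. A sketch of the workaround suggested in the comments, saving to local disk first and then copying; the destination path is illustrative, and dbutils is the utility object available in Databricks notebooks:

    # Save the full model to the driver's local filesystem first...
    model.save("/tmp/model-full.h5")
    # ...then copy it to the DBFS location mapped to the data lake
    dbutils.fs.cp("file:/tmp/model-full.h5", "dbfs:/mnt/<your-mount>/model-full.h5")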


Update 2

When trying what was suggested in the comments (the linked answer), I get the following error:

----> 3 model2 = keras.models.load_model("/tmp/model-full2.h5")
...
ValueError: Unknown layer: table_mask. Please ensure this object is passed to the `custom_objects` argument. See https://www.tensorflow.org/guide/keras/save_and_serialize#registering_the_custom_object for details.

Update 3:

So, following the error message plus the linked answer, I tried:

model2 = keras.models.load_model("/tmp/model-full2.h5", custom_objects={'table_mask': table_mask})

but then I get the following error:

TypeError: 'KerasTensor' object is not callable

Solution

  • Try making the following changes to your custom object(s), so they can be properly serialized and deserialized:

    Add the keyword arguments to your constructor and forward them to the base class:

    def __init__(self, **kwargs):
        super(TableMask, self).__init__(**kwargs)
    

    Rename the table_mask class to TableMask to avoid the naming conflict: the TypeError above ('KerasTensor' object is not callable) indicates the name table_mask was already bound to a tensor in your code, so the tensor, not the layer class, was being passed to custom_objects. After the rename, loading your model will look something like this:

    model = keras.models.load_model("/tmp/path", custom_objects={'TableMask': TableMask, 'CustomObj2': CustomObj2, 'CustomMetric': CustomMetric})
    

    Update from question author:

    We found a few errors in my code, which the points above address.

    I also used the answer that @AloneTogether suggested in the comments (that answer is the way I chose to save and load the model, combined with the fixes described above).

    After all that, saving, loading, and predicting worked great.
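
    Putting the pieces together, the whole round trip on Databricks might look roughly like this (paths, mount point, and custom-object names as assumed above):

    from tensorflow import keras

    # Save locally, then push to the lake-mapped DBFS path
    model.save("/tmp/model-full.h5")
    dbutils.fs.cp("file:/tmp/model-full.h5", "dbfs:/mnt/<your-mount>/model-full.h5")

    # Later: pull the file back and load it, registering every custom object
    dbutils.fs.cp("dbfs:/mnt/<your-mount>/model-full.h5", "file:/tmp/model-full.h5")
    model = keras.models.load_model(
        "/tmp/model-full.h5",
        custom_objects={"TableMask": TableMask, "CustomMetric": CustomMetric},
    )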