python · tensorflow · deeplab · tensorflow-model-garden

Deeplabv3 re-train result is skewed for non-square images


I'm having issues fine-tuning the pretrained deeplabv3_mnv2_pascal_train_aug model in Google Colab.

When I run the visualization with vis.py, the results appear displaced toward the upper/left side of the image whenever its height/width is the larger dimension, i.e. whenever the image is not square.

The dataset used for fine-tuning is Look Into Person (LIP). The steps I followed are:

  1. Register the dataset in deeplab/datasets/data_generator.py:
_LIP_INFORMATION = DatasetDescriptor(
    splits_to_sizes={
        'train': 30462,
        'train_aug': 10582,
        'trainval': 40462,
        'val': 10000,
    },
    num_classes=19,
    ignore_label=255,
)

_DATASETS_INFORMATION = {
    'cityscapes': _CITYSCAPES_INFORMATION,
    'pascal_voc_seg': _PASCAL_VOC_SEG_INFORMATION,
    'ade20k': _ADE20K_INFORMATION,
    'cihp': _CIHP_INFORMATION,
    'lip': _LIP_INFORMATION,
}
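To sanity-check the registration, the values in splits_to_sizes can be compared against the image lists passed to the conversion script below. A minimal sketch, assuming the lists are named train.txt and val.txt as in the standard PASCAL layout:

import os

# 'train.txt'/'val.txt' are assumed names following the usual PASCAL VOC
# list layout; the expected sizes are copied from _LIP_INFORMATION above.
list_folder = '/content/drive/MyDrive/TFM/lip_trainval_images'
expected = {'train': 30462, 'val': 10000}

for split, size in expected.items():
    with open(os.path.join(list_folder, split + '.txt')) as f:
        count = sum(1 for line in f if line.strip())
    print(split, count, 'listed vs', size, 'registered:',
          'OK' if count == size else 'MISMATCH')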
  2. Convert the dataset to TFRecord:
!python models/research/deeplab/datasets/build_voc2012_data.py \
  --image_folder="/content/drive/MyDrive/TFM/lip_trainval_images/TrainVal_images/train_images" \
  --semantic_segmentation_folder="/content/drive/MyDrive/TFM/lip_trainval_segmentations/TrainVal_parsing_annotations/train_segmentations" \
  --list_folder="/content/drive/MyDrive/TFM/lip_trainval_images" \
  --image_format="jpg" \
  --output_dir="train_lip_tfrecord/"
!python models/research/deeplab/datasets/build_voc2012_data.py \
  --image_folder="/content/drive/MyDrive/TFM/lip_trainval_images/TrainVal_images/val_images" \
  --semantic_segmentation_folder="/content/drive/MyDrive/TFM/lip_trainval_segmentations/TrainVal_parsing_annotations/val_segmentations" \
  --list_folder="/content/drive/MyDrive/TFM/lip_trainval_images" \
  --image_format="jpg" \
  --output_dir="val_lip_tfrecord/"
  3. Training:
!python deeplab/train.py --logtostderr \
  --training_number_of_steps=40000 \
  --train_split="train" \
  --model_variant="mobilenet_v2" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --train_batch_size=1 \
  --dataset="lip" \
  --train_logdir="/content/drive/MyDrive/TFM/checkpoint_lip_mobilenet" \
  --dataset_dir="/content/drive/MyDrive/TFM/trainval_lip_tfrecord/" \
  --fine_tune_batch_norm=false \
  --initialize_last_layer=false \
  --last_layers_contain_logits_only=false
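To keep an eye on the loss curve while this runs, TensorBoard can be pointed at the train_logdir directly from Colab:

%load_ext tensorboard
%tensorboard --logdir "/content/drive/MyDrive/TFM/checkpoint_lip_mobilenet"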
  4. Visualization:
!python deeplab/vis.py --logtostderr \
  --vis_split="val" \
  --model_variant="mobilenet_v2" \
  --atrous_rates=6 \
  --atrous_rates=12 \
  --atrous_rates=18 \
  --output_stride=16 \
  --decoder_output_stride=4 \
  --dataset="lip" \
  --checkpoint_dir="/content/drive/MyDrive/TFM/checkpoint_lip_mobilenet" \
  --vis_logdir="/content/drive/My Drive/TFM/eval_results_lip" \
  --dataset_dir="/content/drive/My Drive/TFM/trainval_lip_tfrecord" \
  --max_number_of_iterations=1 \
  --eval_interval_secs=0

With the steps above, here is an example of the problem I'm facing:

Original image

Deeplabv3 result

I don't know whether I'm missing something important or whether the model simply needs more training. However, more training does not seem to be the solution, since the loss is currently oscillating between roughly 0.5 and 1.5.

Thanks in advance.


Solution

  • After some time, I found a solution to this problem. An important thing to know is that, by default, both train_crop_size and vis_crop_size are 513x513.

    The issue was that vis_crop_size was smaller than the input images, so vis_crop_size needs to be larger than the largest dimension of the biggest image (see the sketch below for picking a value).

    If you want to use export_model.py, you must apply the same logic as vis.py, so that your masks are not cropped to 513 by default.
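    To pick a safe value, here is a minimal sketch that scans the val images for the largest dimension (the glob path matches the folders used in the conversion step; PIL only reads the image header, so this is fast):

import glob
from PIL import Image

# Find the largest width/height across the val images so vis_crop_size
# can be set above it.
max_dim = 0
for path in glob.glob('/content/drive/MyDrive/TFM/lip_trainval_images/'
                      'TrainVal_images/val_images/*.jpg'):
    w, h = Image.open(path).size  # reads the header only, no full decode
    max_dim = max(max_dim, w, h)
print('largest image dimension:', max_dim)

    The stock DeepLab configs use crop sizes of the form k * output_stride + 1 (513, 1025, 2049), so rounding up to such a value is the safe choice. Assuming the largest dimension comes out below 641, the flags would look roughly like this (other flags as above; depending on the repo version, vis_crop_size is passed either as a comma-separated pair or as a repeated flag, and export_model.py expects crop_size twice):

!python deeplab/vis.py ... --vis_crop_size=641,641 ...
!python deeplab/export_model.py ... --crop_size=641 --crop_size=641 ...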