What does DeepLab's --train_crop_size actually do?
Following the instructions included with the model, --train_crop_size is set to a value much smaller than the size of the training images. For instance:

python deeplab/train.py \
    --logtostderr \
    --training_number_of_steps=90000 \
    --train_split="train" \
    --model_variant="xception_65" \
    --atrous_rates=6 \
    --atrous_rates=12 \
    --atrous_rates=18 \
    --output_stride=16 \
    --decoder_output_stride=4 \
    --train_crop_size="769,769" \
    --train_batch_size=1 \
    --dataset="cityscapes" \
    --tf_initial_checkpoint=${PATH_TO_INITIAL_CHECKPOINT} \
    --train_logdir=${PATH_TO_TRAIN_DIR} \
    --dataset_dir=${PATH_TO_DATASET}

But what does this option actually do? Does it take a random crop of each training image? If so, the network's input dimensions would be 769x769 (WxH) as in the example. Yet the instructions set the eval crop size to 2049x1025. How can a network with 769x769 input dimensions take 2049x1025 input when nothing suggests the images are resized? Wouldn't a shape mismatch arise?

Are the instructions conflicting?

Frauenfeld answered 12/5, 2019 at 4:0

Yes, it seems that in your case the images are cropped during training. Cropping enables a larger batch size within the computational limits of your system. A larger batch size means each optimization (= training) step is based on multiple instances instead of only one (or very few), which often leads to better results. A random crop is normally used to make sure that the network is trained on all parts of the image.
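A minimal sketch of that kind of paired random crop in TensorFlow (illustrative only; this is not DeepLab's own preprocessing code, and the function name is made up):

import tensorflow as tf

def random_crop_pair(image, label, crop_height=769, crop_width=769):
    # Concatenate the image (HxWx3) and the label (HxWx1) along the
    # channel axis so both are cropped at the same random location.
    label = tf.cast(label, image.dtype)
    combined = tf.concat([image, label], axis=-1)
    combined = tf.image.random_crop(
        combined, size=[crop_height, crop_width, 4])
    return combined[..., :3], combined[..., 3:]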

Training or deploying a "fully convolutional" CNN does not require a fixed input size. Thanks to padding at the input edges, the dimensionality reduction is typically a factor of 2^n (caused by striding or pooling). Example: your encoder reduces each spatial dimension by a factor of 2^4 before the decoder upsamples it again. So you only have to make sure that your input dimensions are compatible with that factor; the exact input size does not matter, it just defines the spatial dimensions of the hidden layers of your network during training. In the case of DeepLab, the framework automatically adapts the given input dimensions to the required grid to make it even easier for you. (Note that the crop sizes in the official configs, such as 769, 1025 and 2049, are of the form k * 16 + 1 rather than exact multiples of 16.)
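To make the "no fixed input size" point concrete, here is a toy fully convolutional model in TensorFlow/Keras. It is a stand-in for illustration, not DeepLab's Xception backbone; because it contains no Dense or Flatten layers, any input whose height and width survive the stride-2 downsampling cleanly produces a matching output:

import tensorflow as tf

def tiny_fcn(num_classes=19):
    # H and W are left unspecified, so the model accepts variable sizes.
    inputs = tf.keras.Input(shape=(None, None, 3))
    x = tf.keras.layers.Conv2D(32, 3, strides=2, padding="same",
                               activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(64, 3, strides=2, padding="same",
                               activation="relu")(x)
    # Upsample by the total downsampling factor (2 * 2 = 4).
    x = tf.keras.layers.Conv2DTranspose(num_classes, 3, strides=4,
                                        padding="same")(x)
    return tf.keras.Model(inputs, x)

model = tiny_fcn()
print(model(tf.zeros([1, 768, 768, 3])).shape)    # (1, 768, 768, 19)
print(model(tf.zeros([1, 1024, 2048, 3])).shape)  # (1, 1024, 2048, 19)

The same weights handle both input sizes, which is why a model trained on small crops can later be run on much larger images.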

The evaluation instances should never be randomly cropped, since only a deterministic evaluation process guarantees meaningful evaluation results. During evaluation there is no optimization, so a batch size of one is fine.

Ropable answered 23/5, 2019 at 14:13

It seems that they use the full image at evaluation time. This is typically done by averaging over a larger tensor in the last convolutional layer. They also mention that, because of full-image evaluation, the crop size has to be set to the maximum image size available in the dataset.

source, see Q8
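As an illustration of that sizing rule, here is a hypothetical helper (made up for this answer, not part of DeepLab) that rounds the largest image dimensions up to the k * output_stride + 1 grid that the official crop sizes (769, 1025, 2049) appear to follow:

def eval_crop_size(max_height, max_width, output_stride=16):
    # Round each dimension up to the next k * output_stride + 1 so the
    # eval crop covers the largest image in the dataset.
    def round_up(dim):
        k = -(-(dim - 1) // output_stride)  # ceil((dim - 1) / output_stride)
        return k * output_stride + 1
    return round_up(max_height), round_up(max_width)

print(eval_crop_size(1024, 2048))  # (1025, 2049), the Cityscapes values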

Planksheer answered 13/5, 2019 at 6:15 Comment(1)
Yes, but are the images cropped to 769x769 (randomly?) during training when the full images are much larger? Is it effectively image augmentation? (Frauenfeld)
