Tensorflow object detection API killed - OOM. How to reduce shuffle buffer size?

System information

  • OS Platform and Distribution: CentOS 7.5.1804
  • TensorFlow installed from: pip install tensorflow-gpu
  • TensorFlow version: tensorflow-gpu 1.8.0
  • CUDA/cuDNN version: 9.0/7.1.2
  • GPU model and memory: GeForce GTX 1080 Ti, 11264MB
  • Exact command to reproduce:

    python train.py --logtostderr --train_dir=./models/train --pipeline_config_path=mask_rcnn_inception_v2_coco.config

Describe the problem

I am attempting to train a Mask-RCNN model on my own dataset (fine tuning from a model trained on COCO), but the process is killed as soon as the shuffle buffer is filled.

Before this happens, nvidia-smi shows memory usage of around 10669MB/11175MB but only 1% GPU utilisation.

I have tried adjusting the following train_config settings:

batch_size: 1    
batch_queue_capacity: 10    
num_batch_queue_threads: 4    
prefetch_queue_capacity: 5

And for train_input_reader:

num_readers: 1
queue_capacity: 10
min_after_dequeue: 5
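
For context, these settings sit in the train_config and train_input_reader blocks of the pipeline config; a rough sketch (paths are placeholders, other fields omitted):

train_config {
  batch_size: 1
  batch_queue_capacity: 10
  num_batch_queue_threads: 4
  prefetch_queue_capacity: 5
  # optimizer, fine_tune_checkpoint, etc. omitted
}

train_input_reader {
  num_readers: 1
  queue_capacity: 10
  min_after_dequeue: 5
  label_map_path: "path/to/label_map.pbtxt"    # placeholder
  tf_record_input_reader {
    input_path: "path/to/train.tfrecord"       # placeholder
  }
}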

I believe my problem is similar to TensorFlow Object Detection API - Out of Memory, but I am using a GPU rather than running CPU-only.

The images I am training on are comparatively large (2048 x 2048), but I would like to avoid downsizing since the objects to be detected are quite small. My training set consists of 400 images (in a .tfrecord file).

Is there a way to reduce the size of the shuffle buffer to see if this reduces the memory requirement?

Traceback

INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Restoring parameters from ./models/train/model.ckpt-0
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Saving checkpoint to path ./models/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
2018-06-19 12:21:33.487840: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 97 of 2048
2018-06-19 12:21:43.547326: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 231 of 2048
2018-06-19 12:21:53.470634: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:94] Filling up shuffle buffer (this may take a while): 381 of 2048
2018-06-19 12:21:57.030494: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:129] Shuffle buffer filled.
Killed
Birdsall answered 19/6, 2018 at 13:6 Comment(4)
The shuffle buffer lives on the CPU and is bounded by RAM (though it can spill into swap). Your inputs are simply too big for training. The maximum size a decent GPU can handle during training is somewhere around 1300 x 1300. GPU utilization is not your issue.Counterpoint
Thanks. Reducing the max_dimension parameter in the config from 1365 to 900 solved the OOM issue. However, GPU utilisation is still showing as 0% (or single-digit). Surely this isn't the expected behaviour?Birdsall
Do you really expect an answer for that without showing any relevant code?Counterpoint
I am using the same command as above to run the train.py file to train my model, which is no longer killed as soon as the buffer is filled. However, during training, nvidia-smi shows GPU-Util of 0% while memory usage is 11680MiB / 12212MiB.Birdsall
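
For reference, the setting mentioned in the comments is the image_resizer block of the model config; a sketch with the reduced value (the min_dimension of 800 is the stock value and is an assumption here):

model {
  faster_rcnn {
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 800    # stock value (assumption)
        max_dimension: 900    # reduced from 1365, per the comment above
      }
    }
    # other model fields omitted
  }
}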

You can try the following steps:

1. Set batch_size: 1 (or try your own value).

2. Change the default value to something smaller, e.g. optional uint32 shuffle_buffer_size = 11 [default = 256] (or try your own value). The line to edit is here:

models/research/object_detection/protos/input_reader.proto, line 40 (at commit ce03903):

optional uint32 shuffle_buffer_size = 11 [default = 2048];

The original default is 2048, which is too big for batch_size = 1 and consumes a lot of RAM, so it should be reduced accordingly (the sketch after these steps shows what this buffer actually holds).

3. Recompile the Protobuf libraries so the change takes effect. From tensorflow/models/research/:

protoc object_detection/protos/*.proto --python_out=.
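
For intuition (this is a minimal sketch, not the actual object_detection input pipeline), shuffle_buffer_size corresponds to the buffer_size of a tf.data shuffle applied to the serialized examples, so the buffer holds that many full examples in host RAM:

import tensorflow as tf

# shuffle() keeps up to buffer_size records in host RAM before yielding one.
# With serialized examples that each contain a 2048 x 2048 image (plus masks),
# a buffer of 2048 records can exhaust memory on its own.
dataset = tf.data.TFRecordDataset("tfrecords/train.record")  # placeholder path
dataset = dataset.shuffle(buffer_size=256)  # smaller buffer = less host RAM
dataset = dataset.batch(1)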
Tarsometatarsus answered 13/6, 2019 at 3:33 Comment(2)
I am also experiencing this issue. What do you suggest I change optional uint32 shuffle_buffer_size = 11 [default = 2048] to?Zaidazailer
Change it from optional uint32 shuffle_buffer_size = 11 [default = 2048] to optional uint32 shuffle_buffer_size = 11 [default = 256]; if 256 does not meet your requirements, you can adjust it further.Tarsometatarsus

In your pipeline.config, add

shuffle_buffer_size: 200

or whatever value suits your system:

train_input_reader {
  shuffle_buffer_size: 200
  label_map_path: "tfrecords/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "tfrecords/train.record"
  }
}

This works for me; tested on both TF1 and TF2.

Offutt answered 18/3, 2021 at 3:49 Comment(0)

I changed from flow_from_directory to the flow_from_dataframe function, because it doesn't load the pixel arrays of all images into memory at once.
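
A minimal sketch of that switch using tf.keras's ImageDataGenerator (the CSV name, column names, and sizes are assumptions):

import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# flow_from_dataframe reads and decodes images from disk batch by batch,
# rather than requiring every image array to be in memory at once.
df = pd.read_csv("train_labels.csv")  # columns: "filename", "class"
datagen = ImageDataGenerator(rescale=1.0 / 255)
train_gen = datagen.flow_from_dataframe(
    dataframe=df,
    directory="images/",
    x_col="filename",
    y_col="class",
    target_size=(512, 512),
    batch_size=8,
    class_mode="categorical",
)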

Gorgonian answered 17/1, 2020 at 5:0 Comment(0)
