I am setting up a TensorFlow pipeline for reading large HDF5 files as input for my deep learning models. Each HDF5 file contains 100 videos of variable length, stored as a collection of compressed JPG images (to keep the size on disk manageable). Using tf.data.Dataset and a map to tf.py_func, reading examples from the HDF5 file with custom Python logic is quite easy. For example:
import glob
import os

import h5py
import tensorflow as tf

def read_examples_hdf5(filename, label):
    with h5py.File(filename, 'r') as hf:
        # read frames from HDF5 and decode them from JPG
        return frames, label

filenames = glob.glob(os.path.join(hdf5_data_path, "*.h5"))
labels = [0]*len(filenames) # ... can we do this more elegantly?
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.map(
    lambda filename, label: tuple(tf.py_func(
        read_examples_hdf5, [filename, label], [tf.uint8, tf.int64])))
dataset = dataset.shuffle(1000 + 3 * BATCH_SIZE)
dataset = dataset.batch(BATCH_SIZE)
iterator = dataset.make_one_shot_iterator()
next_batch = iterator.get_next()
This example works, but tf.py_func seems to handle only one example at a time. Since each HDF5 container stores 100 examples, this limitation causes significant overhead, because the files constantly need to be opened, read, closed and reopened. It would be much more efficient to read all 100 video examples into the dataset object and then move on to the next HDF5 file (preferably in multiple threads, each thread dealing with its own collection of HDF5 files).
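To make the "open once, emit everything" idea concrete, here is a minimal sketch in plain Python. It is not a TensorFlow solution by itself: iter_examples and read_container are hypothetical names, read_container stands in for the h5py reading and JPG decoding logic, and the FAKE_FILES dict replaces real HDF5 files so the snippet runs on its own.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Example = Tuple[str, int]  # (placeholder for decoded frames, label)


def iter_examples(
    filenames: Iterable[str],
    read_container: Callable[[str], List[Example]],
) -> Iterator[Example]:
    """Open each container once and yield every example stored in it."""
    for filename in filenames:
        # One open/read/close cycle per file instead of one per example.
        for example in read_container(filename):
            yield example


# Hypothetical stand-in for HDF5 files: each "file" holds several examples.
FAKE_FILES = {
    "a.h5": [("a-video-0", 0), ("a-video-1", 0)],
    "b.h5": [("b-video-0", 1)],
}
open_counts = {name: 0 for name in FAKE_FILES}


def fake_read_container(filename: str) -> List[Example]:
    """Pretend to open a file, read all of its examples, and close it."""
    open_counts[filename] += 1
    return list(FAKE_FILES[filename])


examples = list(iter_examples(sorted(FAKE_FILES), fake_read_container))
```

A callable that returns a generator of this shape is the kind of input tf.data.Dataset.from_generator accepts, although that alone does not make the reading parallel.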
So, what I would like is a number of threads running in the background, reading video frames from the HDF5 files, decoding them from JPG and then feeding them into the dataset object. Prior to the introduction of the tf.data.Dataset pipeline, this was quite easy using the RandomShuffleQueue and enqueue_many ops, but it seems like there is currently no elegant way of doing this (or the documentation is lacking).
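The background-thread behaviour described above can be approximated outside TensorFlow with the standard queue and threading modules. The following is only a sketch under stated assumptions: threaded_examples, read_container, num_threads and max_buffered are hypothetical names, read_container again stands in for the HDF5 and JPG logic, and the sketch does not shuffle, so a shuffle step would still be needed downstream.

```python
import queue
import threading
from typing import Callable, Iterator, List, Sequence

_DONE = object()  # sentinel a worker puts on the queue when it has finished


def threaded_examples(
    filenames: Sequence[str],
    read_container: Callable[[str], List[object]],
    num_threads: int = 2,
    max_buffered: int = 64,
) -> Iterator[object]:
    """Yield examples produced by background threads, one file subset per thread."""
    buffer = queue.Queue(maxsize=max_buffered)

    def worker(own_files: Sequence[str]) -> None:
        try:
            for filename in own_files:
                # Roughly what enqueue_many did: push every example of one file.
                for example in read_container(filename):
                    buffer.put(example)  # blocks while the buffer is full
        finally:
            buffer.put(_DONE)

    # Give every thread its own disjoint subset of the files.
    shards = [filenames[i::num_threads] for i in range(num_threads)]
    for shard in shards:
        threading.Thread(target=worker, args=(shard,), daemon=True).start()

    finished = 0
    while finished < len(shards):
        item = buffer.get()
        if item is _DONE:
            finished += 1
        else:
            yield item


# Hypothetical stand-in for four HDF5 files holding three videos each.
FAKE_FILES = {
    "file{}.h5".format(i): ["file{}-video{}".format(i, j) for j in range(3)]
    for i in range(4)
}
results = sorted(
    threaded_examples(sorted(FAKE_FILES), FAKE_FILES.__getitem__, num_threads=2)
)
```

The bounded queue limits how many decoded examples wait in memory, but each worker in this sketch still holds one whole file's list while it drains it. The order in which examples arrive depends on thread scheduling, which is why the usage line sorts the results.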
Does anyone know what would be the best way of achieving my goal? I have also looked into (and implemented) the pipeline using tfrecord files, but taking a random sample of video frames stored in a tfrecord file seems quite impossible (see here). Additionally, I have looked at the from_generator() input for tf.data.Dataset, but it seems that is definitely not going to run in multiple threads. Any suggestions are more than welcome.
Does tf.data.Dataset.map(your_map_function, num_parallel_calls=N) do what you want? It will run N threads of your map function. The problem I see with this is that you now have 6 threads each reading 1 HDF5 file, meaning you had better have enough memory for all 6 full HDF5 files. I came to this question because of a related question I posted, trying to resolve the issue of limited memory and large HDF5 files. #48349909 – Melesa