RAM issues when creating a TensorFlow dataset pipeline that loads from multiple files and splits the data into training/validation
I'm trying to build a pipeline for a multi-output regression CNN. The input is quite large and won't fit into memory, so I have several tf.data.Datasets saved on disk that I load and use for training.
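For reference, a minimal sketch of how one such dataset shard might have been written to disk; tf.data.experimental.save, the sample count, and the file name are assumptions chosen to match the element_spec and the tf.data.experimental.load call further down:

import tensorflow as tf

# Hypothetical shard: 1000 samples whose structure matches the element_spec below.
features = tf.random.normal((1000, 32, 27))                  # float32 inputs
labels = tf.random.normal((1000, 20000), dtype=tf.float64)   # float64 targets
shard = tf.data.Dataset.from_tensor_slices((features, labels))

# "shard_0.dataset" is an illustrative path, not taken from the question.
tf.data.experimental.save(shard, "shard_0.dataset", compression="GZIP")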

When loading the data, I don't actually know the number of samples I have, so I tried implementing this solution suggested by @phemmer. However, RAM fills up before the first batch is even used for training, and I can't get it to run. Here is the code for the loading pipeline that causes the RAM issues:

import tensorflow as tf
from os import listdir
from os.path import join
from fractions import Fraction

element_spec = (
    tf.TensorSpec(shape=(32, 27), dtype=tf.float32),
    tf.TensorSpec(shape=(20000,), dtype=tf.float64)
)

files = [join(dir, f) for f in listdir(dir) if ".dataset" in f]

files_ds = tf.data.Dataset.from_tensor_slices(files)
dataset = files_ds.interleave(
    lambda path: tf.data.experimental.load(
        path=path,
        element_spec=element_spec,
        compression='GZIP'
    )
).shuffle(10000)
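# Note: from the element_spec above, each element is roughly
# 32*27*4 bytes + 20000*8 bytes ≈ 163 KB, so this shuffle buffer of
# 10000 elements alone keeps on the order of 1.6 GB resident in RAM.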

train_split = 0.8
frac = Fraction(train_split)
train_split = frac.numerator
val_split = frac.denominator - frac.numerator
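# window(size, shift) emits groups of `size` consecutive elements starting
# every `shift` elements, so the intent below is: training keeps the first
# train_split elements out of every (train_split + val_split), and
# validation keeps the remaining val_split.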

dataset_train = dataset.window(train_split, train_split + val_split)\
    .flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))\
    .batch(1024)\
    .prefetch(tf.data.experimental.AUTOTUNE)

dataset_validation = dataset.skip(train_split).window(val_split, train_split + val_split)\
    .flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds)) \
    .batch(1024)\
    .prefetch(tf.data.experimental.AUTOTUNE)

However, when I replace the last part of the code (the split into train/validation) with the code below, it runs on the same machine without any RAM issues at all. This code works fine:

train_size = int(0.8 * dataset_size)
val_size = int(0.2 * dataset_size)

dataset_train = dataset.take(train_size).batch(1024).prefetch(tf.data.experimental.AUTOTUNE)

dataset_validation = dataset.skip(val_size).batch(1024).prefetch(tf.data.experimental.AUTOTUNE)

I'm assuming I made a mistake in the order of the operations I used to set up my pipeline.

Additional info that might be helpful:

  1. In addition to the RAM problems, I also encounter the following error message:
E tensorflow/core/framework/dataset.cc:825] Unimplemented: Cannot merge options for dataset of type LoadDataset, because the dataset does not implement InputDatasets.
  2. I'm training the model on Colab Pro (don't judge, I don't have any other options) on a GPU instance with the High-RAM setting.

If there is any further information or code necessary to answer this question, let me know.

Cardiograph answered 6/7, 2021 at 16:38 Comment(1)
I got significant memory reduction by not using GZIP compression. Not sure if it needs to load the whole compressed file into memory to decompress or what, but for whatever reason it does. Also, have you tried reducing that shuffle buffer size? – Pasture
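A minimal sketch of the two changes suggested above (re-saving the shards without GZIP and using a smaller shuffle buffer); tf.data.experimental.save and the file names are illustrative assumptions:

# Re-save one shard without compression so that loading no longer has to
# decompress GZIP data (illustrative paths).
shard = tf.data.experimental.load("shard_0.dataset",
                                  element_spec=element_spec,
                                  compression='GZIP')
tf.data.experimental.save(shard, "shard_0_uncompressed.dataset")

# Rebuild the pipeline with uncompressed loads and a smaller shuffle buffer.
dataset = files_ds.interleave(
    lambda path: tf.data.experimental.load(path, element_spec=element_spec)
).shuffle(1000)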
