I'm trying to build a pipeline for a multi-output regression CNN. The input data is too large to fit into memory, so I saved it as several tf.data.Datasets on disk and load them back for training.
Since I don't actually know the number of samples when loading the data, I tried implementing the solution suggested by @phemmer. However, RAM fills up before the first batch is even used for training, and I can't get it to run. Here is the loading pipeline that causes the RAM issues:
from fractions import Fraction
from os import listdir
from os.path import join

import tensorflow as tf

element_spec = (
    tf.TensorSpec(shape=(32, 27), dtype=tf.float32),
    tf.TensorSpec(shape=(20000,), dtype=tf.float64)
)

files = [join(dir, f) for f in listdir(dir) if ".dataset" in f]
files_ds = tf.data.Dataset.from_tensor_slices(files)
dataset = files_ds.interleave(
    lambda path: tf.data.experimental.load(
        path=path,
        element_spec=element_spec,
        compression='GZIP'
    )
).shuffle(10000)
train_split = 0.8
# limit_denominator() is needed because Fraction(0.8) on a float
# would otherwise produce an enormous numerator/denominator
frac = Fraction(train_split).limit_denominator()
train_split = frac.numerator
val_split = frac.denominator - frac.numerator

dataset_train = dataset.window(train_split, train_split + val_split)\
    .flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))\
    .batch(1024)\
    .prefetch(tf.data.experimental.AUTOTUNE)

dataset_validation = dataset.skip(train_split).window(val_split, train_split + val_split)\
    .flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))\
    .batch(1024)\
    .prefetch(tf.data.experimental.AUTOTUNE)
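For reference, here is a minimal standalone check (pure Python, no TensorFlow) of the Fraction-based split arithmetic used above. Note that constructing a Fraction directly from a binary float needs limit_denominator() to recover sensible window sizes:

```python
from fractions import Fraction

# A Fraction built directly from the float 0.8 represents the exact
# binary value, giving a huge numerator/denominator that would make
# the window sizes useless:
raw = Fraction(0.8)

# limit_denominator() recovers the intended 4/5 ratio:
frac = Fraction(0.8).limit_denominator()
train_split = frac.numerator                    # 4
val_split = frac.denominator - frac.numerator   # 1
print(raw.numerator > 10**6, train_split, val_split)
```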
However, when I replace the last part of the code (the train/validation split) with the code below, it runs on the same machine without any RAM issues at all:
train_size = int(0.8 * dataset_size)
val_size = int(0.2 * dataset_size)

dataset_train = dataset.take(train_size).batch(1024).prefetch(tf.data.experimental.AUTOTUNE)
# skip() must use train_size (not val_size), otherwise the splits overlap
dataset_validation = dataset.skip(train_size).batch(1024).prefetch(tf.data.experimental.AUTOTUNE)
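To make sure I understand the take/skip semantics, here is a plain-Python stand-in (no TensorFlow) for what the split above is meant to do; range(10) is just a toy dataset:

```python
from itertools import islice

dataset = list(range(10))             # stand-in for the shuffled dataset
dataset_size = len(dataset)
train_size = int(0.8 * dataset_size)  # 8

# take(train_size) keeps the first 80% of the elements ...
train = list(islice(dataset, train_size))
# ... and skip must use the SAME count so the two splits are disjoint
validation = list(islice(dataset, train_size, None))

assert not set(train) & set(validation)
print(train, validation)
```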
I'm assuming I made some mistake in the order of operations when setting up my pipeline.
Additional info that might be helpful:
- In addition to the RAM problems, I also encounter the following error message:
E tensorflow/core/framework/dataset.cc:825] Unimplemented: Cannot merge options for dataset of type LoadDataset, because the dataset does not implement InputDatasets.
- I'm training the model on Colab Pro (don't judge, I don't have any other options) on a GPU instance with the High-RAM setting.
If there is any further information or code necessary to answer this question, let me know.