I'm trying to build a pipeline for a multi-output regression CNN. The input is pretty large and won't fit into memory, so I got different tf.data.Datasets on disk that I'm loading to use them for training.
When I'm loading the data, I don't acutally know the amount of samples that I have, so I tried implementing this solution suggested by @phemmer. However, the RAM runs full before the first batch is even used for training and I can't get it to run. Here is the code to the loading pipline that causes RAM issues:
element_spec = (
tf.TensorSpec(shape=(32, 27), dtype=float32),
tf.TensorSpec(shape=(20000), dtype=float64)
)
files = [join(dir, f) for f in listdir(dir) if ".dataset" in f]
files_ds = tf.data.Dataset.from_tensor_slices(files)
dataset = files_ds.interleave(
lambda path:
tf.data.experimental.load(
path=path,
element_spec=element_spec,
compression='GZIP'
)
).shuffle(10000)
train_split = 0.8
frac = Fraction(train_split)
train_split = frac.numerator
val_split = frac.denominator - frac.numerator
dataset_train = dataset.window(train_split, train_split + val_split)\
.flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds))\
.batch(1024)\
.prefetch(tf.data.experimental.AUTOTUNE)
dataset_validation = dataset.skip(train_split).window(val_split, train_split + val_split)\
.flat_map(lambda *ds: ds[0] if len(ds) == 1 else tf.data.Dataset.zip(ds)) \
.batch(1024)\
.prefetch(tf.data.experimental.AUTOTUNE)
When I however replace the last part of the code (splitting in to train/validation) with the code below, it runs on the same maschine without RAM issues at all. This code works without any issues:
train_size = int(0.8 * dataset_size)
val_size = int(0.2 * dataset_size)
dataset_train = dataset.take(train_size).batch(1024).prefetch(tf.data.experimental.AUTOTUNE)
dataset_validation = dataset.skip(val_size).batch(1024).prefetch(tf.data.experimental.AUTOTUNE)
I'm assuming I made some mistake in the order that I used to set up my pipeline.
Addtional Info that might be helpful:
- In additon to the RAM problems I also encounter the following error message:
E tensorflow/core/framework/dataset.cc:825] Unimplemented: Cannot merge options for dataset of type LoadDataset, because the dataset does not implement InputDatasets.
- I'm training the model on Colab Pro (don't judge, I don't have any other options) with a GPU instance with the High Ram setting.
If there is any further information or code necessary to answer this question, let me know.