Best way to process terabytes of data on gcloud ml-engine with keras

I want to train a model on about 2 TB of image data stored in Google Cloud Storage. I saved the images as separate TFRecord files and tried to use the tf.data API, following this example:

https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36

But it seems that Keras' model.fit(...) doesn't support validation for TFRecord datasets, based on this pull request:

https://github.com/keras-team/keras/pull/8388

Is there a better approach for processing large amounts of data with Keras on ml-engine that I'm missing?

Thanks a lot!

Mislay answered 4/2, 2019 at 20:24 Comment(0)

If you are willing to use tf.keras instead of standalone Keras, you can instantiate a TFRecordDataset with the tf.data API and pass it directly to model.fit(). Bonus: you can stream directly from Google Cloud Storage, with no need to download the data first:

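# Construct a TFRecordDataset
ds_train = tf.data.TFRecordDataset('gs://')  # path to TFRecords on GCS
# Each record is a serialized example; decode it into (features, label)
# with ds_train.map(...) and your own parse function before batching.
ds_train = ds_train.shuffle(1000).batch(32)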

model.fit(ds_train)

To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
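For what it's worth, here is a rough sketch of how the training and validation datasets could be wired together. It is not taken from the original post: the helper names (`parse_example`, `make_dataset`, `fit_with_validation`), the feature keys `"image"` and `"label"`, and the raw-float32 image encoding are all assumptions, and the shapes are borrowed from the ones mentioned in the comments below. It also uses the `tf.io` namespace, which needs a newer TensorFlow than 1.9.

```python
import tensorflow as tf

# Hypothetical record layout: adjust keys, dtypes, and shapes to match
# however your own TFRecords were written.
IMAGE_SHAPE = (224, 224, 1)
NUM_CLASSES = 2
FEATURES = {
    "image": tf.io.FixedLenFeature([], tf.string),  # raw float32 bytes (assumed)
    "label": tf.io.FixedLenFeature([NUM_CLASSES], tf.float32),  # one-hot label
}


def parse_example(serialized):
    """Decode one serialized tf.train.Example into an (image, label) pair."""
    parsed = tf.io.parse_single_example(serialized, FEATURES)
    image = tf.io.decode_raw(parsed["image"], tf.float32)
    image = tf.reshape(image, IMAGE_SHAPE)
    return image, parsed["label"]


def make_dataset(filenames, batch_size, training):
    """Build a batched (image, label) dataset from a list of TFRecord paths."""
    ds = tf.data.TFRecordDataset(filenames)
    ds = ds.map(parse_example)
    if training:
        # Shuffle and repeat only the training stream; steps_per_epoch
        # then decides where each epoch ends.
        ds = ds.shuffle(1000).repeat()
    return ds.batch(batch_size)


def fit_with_validation(model, train_files, val_files, n_train, batch_size, epochs):
    """Train a compiled tf.keras model with a separate validation dataset."""
    ds_train = make_dataset(train_files, batch_size, training=True)
    ds_val = make_dataset(val_files, batch_size, training=False)
    return model.fit(
        ds_train,
        validation_data=ds_val,
        steps_per_epoch=n_train // batch_size,
        epochs=epochs,
    )
```

The validation dataset is left unrepeated so that it is consumed once per epoch; older TensorFlow 1.x releases may additionally require an explicit validation_steps argument.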

Final note: you'll need to specify the steps_per_epoch argument. A hack that I use to get the total number of examples across all TFRecord files is to simply iterate over the files and count:

import tensorflow as tf

def n_records(record_list):
    """Get the total number of records in a collection of TFRecords.
    Since a TFRecord file is intended to act as a stream of data,
    this needs to be done naively by iterating over the file and counting.
    See https://stackoverflow.com/questions/40472139

    Args:
        record_list (list): list of GCS paths to TFRecords files
    """
    counter = 0
    for f in record_list:
        # TF 1.x API; in TF 2.x it lives at tf.compat.v1.io.tf_record_iterator
        counter += sum(1 for _ in tf.python_io.tf_record_iterator(f))
    return counter

Which you can use to compute steps_per_epoch:

batch_size = 32  # same value as passed to .batch(32) above

n_train = n_records(['gs://path-to-tfrecords/record1',
                     'gs://path-to-tfrecords/record2'])

steps_per_epoch = n_train // batch_size
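
One thing to keep in mind: floor division leaves out the final partial batch whenever the number of examples is not a multiple of the batch size. A minimal sketch of both options, where the helper name `steps_for` and the example count are made up for illustration:

```python
import math


def steps_for(n_examples, batch_size, drop_remainder=True):
    """Number of batches per epoch for a dataset of n_examples."""
    if drop_remainder:
        # Ignore the last, smaller batch.
        return n_examples // batch_size
    # Count the last, smaller batch as one extra step.
    return math.ceil(n_examples / batch_size)


n_train = 1000  # illustrative value, not from the original post
batch_size = 32

print(steps_for(n_train, batch_size))         # 31
print(steps_for(n_train, batch_size, False))  # 32
```

Which variant is appropriate depends on whether the dataset is repeated and whether partial batches are acceptable for your model.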
Trident answered 4/2, 2019 at 20:28 Comment(6)
That sounds good. Is there a way to include validation data in model.fit? - Mislay
Another question: I ran my model, but there's an error with the input shape of my model. The dataset has an image with shape (?, 224, 224, 1) and a one-hot label with shape (?, 2), so the shape of the dataset seems to be [(?, 224, 224, 1), (?, 2)]. What do I put as the input shape in my Keras model? x = Input( (224,224,1) ) - Mislay
Oh, the message is: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 2 arrays: [<tf.Tensor 'IteratorGetNext_6:0' shape=(?, 224, 224, 1) dtype=float32>, <tf.Tensor 'IteratorGetNext_6:1' shape=(?, 2) dtype=float32>]... - Mislay
Sorry, that happened because I was passing an iterator instead of a dataset. - Mislay
It worked! I had to upgrade my TensorFlow version to 1.12. - Mislay
Yes, these integrations with tf.keras arrived in the most recent TensorFlow versions. Happy to hear it works! - Trident
