Best way to process terabytes of data on gcloud ml-engine with keras

I want to train a model on about 2 TB of image data stored on Google Cloud Storage. I saved the image data as separate TFRecords and tried to use the TensorFlow Data API, following this example:

https://medium.com/@moritzkrger/speeding-up-keras-with-tfrecord-datasets-5464f9836c36

But it seems that Keras' model.fit(...) doesn't support validation for TFRecord datasets, based on

https://github.com/keras-team/keras/pull/8388

Is there a better approach for processing large amounts of data with Keras on ml-engine that I'm missing?

Thanks a lot!

Mislay answered 4/2, 2019 at 20:24 Comment(0)

If you are willing to use tf.keras instead of standalone Keras, you can instantiate a TFRecordDataset with the tf.data API and pass it directly to model.fit(). Bonus: you get to stream directly from Google Cloud Storage, with no need to download the data first:

# Construct a TFRecordDataset
ds_train = tf.data.TFRecordDataset('gs://') # path to TFRecords on GCS
ds_train = ds_train.shuffle(1000).batch(32)

model.fit(ds_train)
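Note that a raw TFRecordDataset yields serialized tf.Example protos, so in practice you'll map a parse function over it before shuffling and batching. A minimal sketch of what that could look like (the feature keys, raw-byte encoding, image shape, and class count below are assumptions; match them to however you wrote your records):

import tensorflow as tf

def parse_example(serialized):
    # Hypothetical schema: adjust the feature keys and types to your writer
    features = tf.parse_single_example(
        serialized,
        features={
            'image': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.int64),
        })
    # Decode the raw bytes and restore the image shape (assumed 224x224x1)
    image = tf.decode_raw(features['image'], tf.uint8)
    image = tf.reshape(image, [224, 224, 1])
    image = tf.cast(image, tf.float32) / 255.0
    # One-hot encode the label (assumed 2 classes)
    label = tf.one_hot(features['label'], depth=2)
    return image, label

ds_train = tf.data.TFRecordDataset('gs://') # path to TFRecords on GCS
ds_train = ds_train.map(parse_example).shuffle(1000).batch(32)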

To include validation data, create a TFRecordDataset with your validation TFRecords and pass that one to the validation_data argument of model.fit(). Note: this is possible as of TensorFlow 1.9.
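In code, that looks something like this (a sketch reusing the hypothetical parse_example from above; depending on your TF 1.x version you may also need to pass validation_steps):

ds_val = tf.data.TFRecordDataset('gs://') # path to validation TFRecords on GCS
ds_val = ds_val.map(parse_example).batch(32)

model.fit(ds_train, validation_data=ds_val)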

Final note: you'll need to specify the steps_per_epoch argument. A hack I use to get the total number of examples across all TFRecord files is to simply iterate over the files and count:

import tensorflow as tf

def n_records(record_list):
    """Get the total number of records in a collection of TFRecords.

    Since a TFRecord file is intended to act as a stream of data,
    this needs to be done naively by iterating over the files and counting.
    See https://stackoverflow.com/questions/40472139

    Args:
        record_list (list): list of GCS paths to TFRecord files

    Returns:
        int: total record count across all files
    """
    counter = 0
    for f in record_list:
        counter += sum(1 for _ in tf.python_io.tf_record_iterator(f))
    return counter

Which you can use to compute steps_per_epoch:

n_train = n_records(['gs://path-to-tfrecords/record1',
                     'gs://path-to-tfrecords/record2'])

steps_per_epoch = n_train // batch_size
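
Putting it all together, the final fit call could look like the sketch below (the epochs value is a placeholder; calling repeat() ensures the dataset isn't exhausted before the last epoch once steps_per_epoch is fixed):

ds_train = ds_train.repeat()

model.fit(ds_train,
          epochs=10,
          steps_per_epoch=steps_per_epoch)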
Trident answered 4/2, 2019 at 20:28 Comment(6)
That sounds good, is there a way to include validation data in model.fit? – Mislay
Another question: I ran my model, but there's an error with the input shape of my model. The dataset has an image with shape (?, 224, 224, 1) and a one-hot label (?, 2), so the shape of the dataset seems to be [(?, 224, 224, 1), (?, 2)]. What do I put as the input shape in my Keras model? x = Input((224, 224, 1)) – Mislay
Oh, the message is: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 2 arrays: [<tf.Tensor 'IteratorGetNext_6:0' shape=(?, 224, 224, 1) dtype=float32>, <tf.Tensor 'IteratorGetNext_6:1' shape=(?, 2) dtype=float32>]... – Mislay
Sorry, that happened because I was passing an iterator instead of a dataset. – Mislay
It worked! I had to upgrade my TensorFlow version to 1.12. – Mislay
Yes, these integrations with tf.keras have come in the most recent TF versions. Happy to hear it works! – Trident
