Google Storage (gs) wrapper file input/out for Cloud ML?

Google recently announced Cloud ML, https://cloud.google.com/ml/, and it's very useful. However, one limitation is that the input/output of a TensorFlow program has to support gs:// paths.

If we use TensorFlow's own APIs to read/write files, everything should be OK, since those APIs support gs://.

However, if we use native file IO APIs such as open, it does not work, because they don't understand gs://.

For example:

with open(vocab_file, 'wb') as f:
    cPickle.dump(self.words, f)

This code won't work in Google Cloud ML.

However, replacing every native file IO call with TensorFlow APIs or the Google Storage Python APIs is really tedious. Is there any simple way to do this? Are there any wrappers that support Google Storage (gs://) on top of the native file IO?

As suggested here Pickled scipy sparse matrix as input data?, perhaps we can use file_io.read_file_to_string('gs://...'), but this still requires significant code modification.
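For reference, using file_io.read_file_to_string to load a pickled object would look something like this untested sketch (the bucket path is a made-up placeholder):

from tensorflow.python.lib.io import file_io
import cPickle

# read_file_to_string fetches the whole object into memory;
# 'gs://my-bucket/vocab.pickled' is just an example path.
data = file_io.read_file_to_string('gs://my-bucket/vocab.pickled')
words = cPickle.loads(data)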

Monkhmer answered 3/11, 2016 at 8:8 Comment(0)

One solution is to copy all of the data to local disk when the program starts up. You can do that using gsutil inside the Python script that gets run, something like:

import os
import subprocess
import cPickle

vocab_file = 'vocab.pickled'
subprocess.check_call(['gsutil', '-m', 'cp', '-r',
                       os.path.join('gs://path/to/', vocab_file), '/tmp'])

# Now read the local copy with ordinary file IO.
with open(os.path.join('/tmp', vocab_file), 'rb') as f:
  self.words = cPickle.load(f)

And if you have any outputs, you can write them to local disk and gsutil rsync them back. (But be careful to handle restarts correctly, because you may be put on a different machine.)
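A minimal sketch of that output pattern (untested; gs://path/to/output is a placeholder bucket path):

import os
import subprocess
import cPickle

words = ['hello', 'world']  # stand-in for whatever needs to be persisted

# Write the output to local disk first...
local_out = os.path.join('/tmp', 'vocab.pickled')
with open(local_out, 'wb') as f:
  cPickle.dump(words, f)

# ...then mirror the local directory back up to GCS.
subprocess.check_call(['gsutil', '-m', 'rsync', '-r',
                       '/tmp', 'gs://path/to/output'])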

The other solution is to monkey patch open (Note: untested):

import __builtin__

from tensorflow.python.lib.io import file_io

# NB: not all modes are compatible; this should be handled more carefully.
# Probably should be reported on
# https://github.com/tensorflow/tensorflow/issues/4357
def new_open(name, mode='r', buffering=-1):
  return file_io.FileIO(name, mode)

__builtin__.open = new_open

Just be sure to do that before any module actually tries to read from GCS.
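With the patch installed, code like the snippet from the question should then work unchanged against GCS. Equally untested, and the bucket path is a placeholder; given the mode caveat above, plain 'w' is used here:

import cPickle

words = ['hello', 'world']

# Plain open() now routes through file_io.FileIO, which understands gs://.
with open('gs://my-bucket/vocab.pickled', 'w') as f:
  cPickle.dump(words, f)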

Goldengoldenberg answered 3/11, 2016 at 8:42 Comment(0)

Do it like this:

from tensorflow.python.lib.io import file_io

with file_io.FileIO('gs://.....', mode='w+') as f:
    cPickle.dump(self.words, f)

Or you can read a pickle file in like this:

file_stream = file_io.FileIO(train_file, mode='r')
x_train, y_train, x_test, y_test  = pickle.load(file_stream)
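file_io also provides gs://-aware helpers such as file_exists and copy, which can stand in for os.path.exists and shutil.copy in the same spirit. A quick sketch (paths are placeholders):

from tensorflow.python.lib.io import file_io

# Both helpers accept local and gs:// paths alike.
if file_io.file_exists('gs://my-bucket/vocab.pickled'):
  file_io.copy('gs://my-bucket/vocab.pickled', '/tmp/vocab.pickled',
               overwrite=True)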
Stephainestephan answered 5/4, 2017 at 21:34 Comment(0)

apache_beam has a gcsio module that returns a standard Python file object for reading/writing GCS objects. You can use this object with any method that works with Python file objects. For example:

import logging
import time

import cPickle
from apache_beam.io import gcsio  # module path may differ across Beam versions


def open_local_or_gcs(path, mode):
  """Opens the given path, dispatching to GcsIO for gs:// paths."""
  if path.startswith('gs://'):
    try:
      return gcsio.GcsIO().open(path, mode)
    except Exception as e:  # pylint: disable=broad-except
      # Currently we retry exactly once, to work around flaky gcs calls.
      logging.error('Retrying after exception reading gcs file: %s', e)
      time.sleep(10)
      return gcsio.GcsIO().open(path, mode)
  else:
    return open(path, mode)

with open_local_or_gcs(vocab_file, 'wb') as f:
  cPickle.dump(self.words, f)
Melonie answered 3/11, 2016 at 14:23 Comment(2)
Thanks! This looks very good. I think TensorFlow's file_io might be a solution as well: with file_io.FileIO(file_path, mode="w") as f. Do you think that is also OK? I haven't fully tested it yet. (Monkhmer)
I interpreted your question as wanting to avoid having to replace all open() function calls with specialized functions. If that's not the case, i.e., you're willing to replace calls to open(), then gcsio.open_local_or_gcs and file_io.FileIO are pretty similar; it just affects which dependencies you bring in, with file_io already being part of TF. But FileIO uses some non-standard modes, so that might affect your decision as well. (Goldengoldenberg)
