Pickled scipy sparse matrix as input data?

I am working on a multiclass classification problem: classifying resumes.

I used sklearn and its TfidfVectorizer to get a big scipy sparse matrix that I feed into a TensorFlow model after pickling it. On my local machine I load it, convert a small batch to dense numpy arrays, and fill a feed dictionary. Everything works great.
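
For reference, my local pipeline looks roughly like this (the corpus and batch size below are toy stand-ins for my real data):

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

resume_texts = ["python machine learning", "java backend developer", "data analyst sql"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(resume_texts)  # scipy.sparse CSR matrix

# Pickle the sparse matrix; only the nonzero entries are stored.
with open('features.pkl', 'wb') as f:
    pickle.dump(X, f)

# At training time: load it back and densify one small batch at a time.
with open('features.pkl', 'rb') as f:
    X = pickle.load(f)
batch = X[:2].toarray()  # small dense numpy array for the feed dict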

Now I would like to do the same thing on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle, but when I run my trainer the pickle file can't be found at this URI (IOError: [Errno 2] No such file or directory). I am using pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to extract my data. I suspect this is not the right way to open a file on GCS, but I'm totally new to Google Cloud and I can't find the proper way to do so.

Also, I read that one must use TFRecords or CSV as the input data format, but I don't understand why my method couldn't work. CSV is excluded since the dense representation of the matrix would be too big to fit in memory. Can TFRecords efficiently encode sparse data like that? And is it possible to read data from a pickle file?

Baird answered 19/10, 2016 at 13:46

You are correct that Python's open won't work with GCS out of the box. Given that you're using TensorFlow, you can use the file_io library instead, which works with local files as well as files on GCS.

import pickle
from tensorflow.python.lib.io import file_io

data = pickle.loads(file_io.read_file_to_string('gs://my-bucket/path/to/pickle'))

NB: pickle.load(file_io.FileIO('gs://..', 'r')) does not appear to work.
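
(One caveat, if you are on Python 3: pickle.loads expects bytes rather than text, and newer TensorFlow builds expose a binary_mode=True argument on read_file_to_string for that case.)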

You are welcome to use whatever data format works for you and are not limited to CSV or TFRecord (do you mind pointing to the place in the documentation that makes that claim?). If the data fits in memory, then your approach is sensible.

If the data doesn't fit in memory, you will likely want to use TensorFlow's reader framework, the most convenient of which tend to be CSV or TFRecords. A TFRecord file is simply a container of byte strings. Most commonly, it contains serialized tf.Example data, which does support sparse data (it is essentially a map). See tf.parse_example for more information on parsing tf.Example data.
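
For instance, one row of a sparse matrix can be stored as two parallel features: the nonzero column indices and the matching values. A minimal sketch (the feature names 'indices' and 'values' are arbitrary choices, not a fixed schema):

import tensorflow as tf

# One sparse row: nonzero column indices plus their values.
example = tf.train.Example(features=tf.train.Features(feature={
    'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=[3, 17, 42])),
    'values': tf.train.Feature(float_list=tf.train.FloatList(value=[0.5, 1.2, 0.7])),
}))
serialized = example.SerializeToString()

# VarLenFeature parses ragged-length features into SparseTensors,
# so the zeros are never materialized.
parsed = tf.parse_example([serialized], {
    'indices': tf.VarLenFeature(tf.int64),
    'values': tf.VarLenFeature(tf.float32),
})

From the two resulting SparseTensors you can reassemble each row (e.g. with tf.sparse_merge) or densify batches only when feeding them to the model.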

Grosmark answered 19/10, 2016 at 15:27
Thanks for the detailed answer, trying your solution ASAP! I have investigated TFRecords but I'm not sure how to use them for sparse data. I understand that for dense arrays like MNIST, you can encode the 784-dimensional array into one feature per example. I suppose that for sparse data I need to encode each feature separately and set a default value (0) when data is missing. Am I right? – Baird
There are various ways to encode sparse data. What are you trying to encode? – Grosmark
