I am working on a multiclass classification problem that consists of classifying resumes.
I used sklearn and its TfidfVectorizer to get a big scipy sparse matrix, which I pickle and then feed into a TensorFlow model. On my local machine, I load the pickle, convert a small batch to dense numpy arrays and fill a feed dictionary. Everything works great.
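For context, here is roughly what I do locally (a simplified sketch with a made-up file name and a toy corpus, not my actual code):

    import pickle
    from sklearn.feature_extraction.text import TfidfVectorizer

    # toy stand-ins for my real resume corpus
    resume_texts = ["python developer resume ...", "accountant resume ...", "nurse resume ..."]

    vectorizer = TfidfVectorizer()
    X_sparse = vectorizer.fit_transform(resume_texts)  # scipy sparse CSR matrix

    with open('tfidf_matrix.pkl', 'wb') as f:
        pickle.dump(X_sparse, f)

    # in the trainer: reload the pickle, densify a small batch and feed it
    with open('tfidf_matrix.pkl', 'rb') as f:
        X_sparse = pickle.load(f)
    batch = X_sparse[0:2].toarray()  # small dense numpy array for one batch
    # feed_dict = {x_placeholder: batch, ...}  # x_placeholder comes from my model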
Now I would like to do the same thing on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle, but when I run my trainer, the pickle file can't be found at this URI (IOError: [Errno 2] No such file or directory).
I am using pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to extract my data. I suspect that this is not the right way to open a file on GCS, but I'm totally new to Google Cloud and I can't find the proper way to do so.
Also, I read that one must use TFRecords or a CSV format for input data, but I don't understand why my method could not work. CSV is excluded, since the dense representation of the matrix would be too big to fit in memory. Can TFRecords efficiently encode sparse data like that? And is it possible to read data from a pickle file at all?
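In case it helps, what I had in mind for TFRecords is storing only the non-zero indices and values of each row, something like the sketch below (the file name, toy matrix and labels are made up, and I'm not sure this is the intended approach):

    import numpy as np
    import tensorflow as tf
    from scipy.sparse import csr_matrix

    # toy sparse matrix and labels standing in for my TF-IDF features
    X_sparse = csr_matrix(np.array([[0.0, 0.3, 0.0], [0.5, 0.0, 0.7]]))
    labels = [1, 0]

    with tf.python_io.TFRecordWriter('features.tfrecords') as writer:
        for i in range(X_sparse.shape[0]):
            row = X_sparse.getrow(i).tocoo()  # non-zero columns and values of one row
            example = tf.train.Example(features=tf.train.Features(feature={
                'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=row.col.astype(np.int64))),
                'values': tf.train.Feature(float_list=tf.train.FloatList(value=row.data)),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[labels[i]])),
            }))
            writer.write(example.SerializeToString())

But I don't know whether this is reasonable, or how I would read such records back as sparse input on the model side.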