TensorFlow fails with "Unable to get element from the feed as bytes." when attempting to restore a checkpoint

I am using TensorFlow r0.12.

I use google-cloud-ml locally to run two different training jobs. In the first job, I find good initial values for my variables and store them in a V2 checkpoint.

When I try to restore my variables in order to use them in the second job:

import tensorflow as tf

sess = tf.Session()
# Rebuild the graph from the exported MetaGraphDef; clear_devices drops the
# device placements recorded during the Cloud ML run.
new_saver = tf.train.import_meta_graph('../variables_pred/model.ckpt-10151.meta', clear_devices=True)
# latest_checkpoint() resolves the most recent checkpoint listed in the
# "checkpoint" state file of the directory.
new_saver.restore(sess, tf.train.latest_checkpoint('../variables_pred/'))
all_vars = tf.trainable_variables()
for v in all_vars:
    print(v.name)

I get the following error message:

tensorflow.python.framework.errors_impl.InternalError: Unable to get element from the feed as bytes.

The checkpoint is created with these lines in the first job:

saver = tf.train.Saver()
# The meta graph is exported once on its own, so the save below skips it.
saver.export_meta_graph(filename=os.path.join(output_dir, 'export.meta'))
saver.save(sess, os.path.join(output_dir, 'export'), write_meta_graph=False)

According to this answer, the error could come from the absence of a metadata file, but I am loading the metadata file.

PS: I use the argument clear_devices=True because the device specifications generated by a run on google-cloud-ml are quite intricate and I don't necessarily need the same device placement.

Seizing answered 28/12, 2016 at 20:12 Comment(5)
Is your checkpoint stored as binary? You may need to specify the format when saving; I think the default is text. Also, bear in mind that there might be a bug if you are not using a released version (e.g. the bleeding edge of the repository). – Dish
Where does the model you are trying to load reside? Does it reside on GCS? It's not clear to me whether this is a single training job or multiple training jobs. My interpretation of your question was that you run one Cloud ML training job to produce an initial checkpoint. So presumably you saved this to GCS to make it available after the job finishes. In your second job you are trying to read that checkpoint. You are using relative paths. Did you copy the model from GCS to the local filesystem? – Jackinthepulpit
@EricPlaton I am using TensorFlow r0.12. Also, I save my checkpoint in the default manner, which implies the argument as_text=False. – Seizing
@JeremyLewi Your interpretation is right, I have two different training jobs. Also, since my project is not yet ready to produce satisfying results, I am working locally on a mini dataset. I will clarify my question according to your suggestions, thanks. – Seizing
Added a potential answer. – Jackinthepulpit

The error message was due to the file named "checkpoint" being inadvertently absent.

After putting this file back in the appropriate folder, the checkpoint loads correctly.

Sorry for having omitted this key point.
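
For reference, a minimal sketch of how the missing file manifests itself and how it can be regenerated, assuming the data files still use the model.ckpt-10151 prefix from the question (the directory path is the one from the question as well):

import tensorflow as tf

ckpt_dir = '../variables_pred/'  # directory taken from the question

# tf.train.latest_checkpoint() only reads the small text file named
# "checkpoint" inside ckpt_dir. If that file is missing it returns None,
# and Saver.restore(sess, None) then fails with an opaque internal error
# instead of a clear "file not found".
ckpt_path = tf.train.latest_checkpoint(ckpt_dir)
print('latest checkpoint:', ckpt_path)

if ckpt_path is None:
    # If the data/index files are still present, the state file can be
    # rewritten; the prefix below is assumed from the question's paths.
    tf.train.update_checkpoint_state(
        ckpt_dir, model_checkpoint_path='model.ckpt-10151')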

Seizing answered 4/1, 2017 at 14:36 Comment(0)

I think the problem could be that when you save the model you set write_meta_graph=False. As a result, I don't think you are actually saving the graph, so when you try to restore there is no graph to restore. Try setting write_meta_graph=True.
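
Not the author's exact code, just a minimal sketch of that suggestion under made-up names: a throwaway variable and a hypothetical local output_dir, saved with the default write_meta_graph=True so a single save() call writes the .meta file, the variable data, and the "checkpoint" state file.

import os
import tensorflow as tf

output_dir = '/tmp/variables_pred'  # hypothetical local directory
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

x = tf.Variable(tf.zeros([10]), name='x')  # stand-in variable for the sketch
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # write_meta_graph defaults to True: this writes the graph (.meta),
    # the step-10151 variable data, and updates the "checkpoint" file.
    saver.save(sess, os.path.join(output_dir, 'export'), global_step=10151)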

Jackinthepulpit answered 29/12, 2016 at 15:34 Comment(1)
It turns out that I finally found the solution to my problem. I thought that transferring only the checkpoint data files was sufficient, but it appears that it was not. Thanks anyway! – Seizing

The error message can also be caused by inadvertent mistakes inside the file named "checkpoint".

For example, if the folder which contains the models has been moved, the value of "model_checkpoint_path:" in "checkpoint" may still be the old path.
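
A hedged sketch of one way to inspect and repair that, assuming the folder was moved to a hypothetical new location and the data files keep the model.ckpt-10151 prefix from the question:

import os
import tensorflow as tf

ckpt_dir = '/new/location/variables_pred'  # hypothetical: the moved folder

# get_checkpoint_state() parses the text file named "checkpoint"; after a
# move, the model_checkpoint_path it records may still be the old path.
state = tf.train.get_checkpoint_state(ckpt_dir)
if state is not None:
    print('recorded path:', state.model_checkpoint_path)

# Point the state file at the checkpoint's current location.
tf.train.update_checkpoint_state(
    ckpt_dir,
    model_checkpoint_path=os.path.join(ckpt_dir, 'model.ckpt-10151'))

print(tf.train.latest_checkpoint(ckpt_dir))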

Nonlegal answered 24/10, 2017 at 8:17 Comment(0)
