Failed copying input tensor from CPU to GPU in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]

    from random import sample
    # Randomly pick 80% of the indices for the training split; the rest form the test split
    index=sample(range(0, len(result)), len(result)//5*4)
    description_train=[child[0] for i, child in enumerate(result) if i in index]
    ipc_train=[child[1] for i, child in enumerate(result) if i in index]
    description_test=[child[0] for i, child in enumerate(result) if i not in index]
    ipc_test=[child[1] for i, child in enumerate(result) if i not in index]
    
    import numpy as np
    
    def to_onehot(li):
        # Multi-hot encode the IPC section letters A-H into an 8-dimensional vector
        result=np.zeros(8)
        for i, letter in enumerate('ABCDEFGH'):
            if letter in li:
                result[i]=1
        return result
            
            
    
    from tensorflow.python.keras.preprocessing.text import Tokenizer
    
    
    max_words=100000
    num_classes=8
    
    t=Tokenizer(num_words=max_words)
    t.fit_on_texts(description_train)
    # Binary bag-of-words matrices of shape (num_samples, max_words); these get very large
    X_train=t.texts_to_matrix(description_train, mode='binary')
    X_test=t.texts_to_matrix(description_test, mode='binary')
    Y_train=np.array([to_onehot(child) for child in ipc_train], dtype=np.int32)
    Y_test=np.array([to_onehot(child) for child in ipc_test], dtype=np.int32)
    
    
    from tensorflow.python.keras.models import Sequential
    from tensorflow.python.keras.layers import Dense, Dropout
    
    
    model = Sequential()
    model.add(Dense(1024, input_shape=(max_words,), activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(num_classes, activation='sigmoid'))
    
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(X_train, Y_train, batch_size=128, epochs=5, validation_split=0.1)

The last line (model.fit) results in the following error:

InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run GatherV2: Dst tensor is not initialized. [Op:GatherV2]

How can I fix it? Thank you in advance.

Cyclosis answered 15/7, 2020 at 14:11

I had this error very often, even with high-RAM EC2 instances. The only solution for me was to use generators:

from tensorflow.keras.utils import Sequence
import numpy as np   

class DataGenerator(Sequence):
    def __init__(self, x_set, y_set, batch_size):
        self.x, self.y = x_set, y_set
        self.batch_size = batch_size

    def __len__(self):
        # Number of batches per epoch
        return int(np.ceil(len(self.x) / float(self.batch_size)))

    def __getitem__(self, idx):
        # Return a single batch, so only one batch at a time is sent to the GPU
        batch_x = self.x[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.y[idx * self.batch_size:(idx + 1) * self.batch_size]
        return batch_x, batch_y

train_gen = DataGenerator(X_train, y_train, 32)
test_gen = DataGenerator(X_test, y_test, 32)


history = model.fit(train_gen,
                    epochs=6,
                    validation_data=test_gen)

In the above example, we assume that X and y are numpy arrays.

My guess at what's happening: even though I was using a high-RAM instance, I suspect the real limitation is GPU memory. Even though training happens in batches, when you don't use generators TensorFlow appears to try to load the full array into GPU memory.

Elamitic answered 23/3, 2022 at 19:13
Does this have an influence on model accuracy? If your answer is no, can you add a short description of why it will not influence the model accuracy? - Hover
I don't see why it would influence the model's accuracy. We are just forcing TF to read and send one batch at a time to the GPU. - Elamitic

It may be because of a RAM shortage. You can do one of the following to solve the problem:

  1. Decreasing the batch_size helps most of the time, and it is the best option when training speed does not matter to you (as you know, with a smaller batch_size the model takes longer to train); see the sketch after this list.
  2. Run your code on a system with a larger amount of RAM, or run it on a VM or on Google Colab for free (Google Colab gives you 16 GB of RAM for free, with Tesla K80 GPUs and TPUs).
  3. Reduce the number of samples, or reduce the data dimensionality with methods such as PCA or feature selection.
  4. If your model's hidden layers are very large, you can shrink them; with smaller hidden layers the model has fewer parameters and less complexity, so it occupies less memory.
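As a rough illustration of options 1, 3 and 4 applied to the model from the question (the sizes below are example values, not tuned recommendations; if max_words is reduced, the Tokenizer and the texts_to_matrix calls must be re-run with the new value so that X_train matches the input shape):

from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense

max_words = 20000    # example: smaller vocabulary -> smaller input vectors
num_classes = 8

model = Sequential()
model.add(Dense(256, input_shape=(max_words,), activation='relu'))  # was 1024
model.add(Dense(64, activation='relu'))                             # was 128
model.add(Dense(num_classes, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# A smaller batch_size means less data is copied to the GPU at once
model.fit(X_train, Y_train, batch_size=32, epochs=5, validation_split=0.1)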
Animosity answered 9/1, 2022 at 14:15

One memory-costly step is the internal conversion from (usually) NumPy arrays to tf.Tensor. If you are sure your GPU should be able to handle the batches themselves, manually convert the data to TF tensors in CPU RAM, and only then pass it to your model (which can then run on the GPU).

import numpy as np
import tensorflow as tf

with tf.device('/cpu:0'):   # do the conversion in host (CPU) memory
    x = tf.convert_to_tensor(x, np.float32)
    y = tf.convert_to_tensor(y, np.float32)

and then, outside of the with statement:

model.fit(x=x, y=y)
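
If you also want shuffling, batching, and prefetching, the CPU-created tensors can then be wrapped in a tf.data.Dataset. A minimal sketch, assuming x and y are the tensors from the with block above and model is already compiled:

import tensorflow as tf

dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(10000)              # example buffer size
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))

model.fit(dataset, epochs=5)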
Valenevalenka answered 3/5, 2023 at 23:10
This is the best solution I found. Because of this I can use the tf.data.Dataset API to shuffle, batch and prefetch my millions of rows of data and still train the model on the GPU. - Methodical

One solution is to reduce the size of the input images to fit the capacity of the GPU. In my case, I reduced them from (224, 224, 10) to (128, 128, 10).
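
A minimal sketch of that kind of downsizing with tf.image.resize, assuming images is an array of shape (N, 224, 224, 10):

import tensorflow as tf

# Resize the spatial dimensions from 224x224 to 128x128; the 10 channels are kept
small_images = tf.image.resize(images, [128, 128])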

Laporte answered 24/2, 2022 at 17:10

Reducing the sample size is not always an option (why would you have that many samples in the first place?), so I would recommend a few alternatives:

  1. Use a cloud VM (AWS, Azure or GCP) with higher specs, pay hourly, and be done and dusted with this one.

  2. If you don't want to pay and are OK with writing extra code, create your own custom generator or use flow_from_directory to load the dataset in batches (see the sketch below). Refer to these: https://www.askpython.com/python/examples/handling-large-datasets-machine-learning https://www.analyticsvidhya.com/blog/2020/08/image-augmentation-on-the-fly-using-keras-imagedatagenerator/
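
A rough sketch of option 2 for image data, using Keras' ImageDataGenerator; the directory path, image size, and batch size below are placeholders:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Streams images from disk in batches instead of loading everything into memory
datagen = ImageDataGenerator(rescale=1. / 255)
train_gen = datagen.flow_from_directory(
    'data/train',              # placeholder path: one sub-folder per class
    target_size=(128, 128),
    batch_size=32,
    class_mode='categorical')

model.fit(train_gen, epochs=5)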

Citrange answered 28/10, 2021 at 5:13

I specifically logged in to my StackOverflow account for this, as most of the answers seem off the mark.

Background: This issue occurs mainly when you try to copy a large dataset to the GPU for processing, particularly when the dataset's total size is larger than the GPU VRAM. Increasing RAM/swap size does not seem to help.

Solutions:

1. Generators

Use generators as suggested above. However, in my experience the drawback is that your epoch duration may become longer due to the batching operations (don't ask me why, it's just what happens, but at least it no longer crashes your setup).

or...

2. Use the 'CPU' trick

Add with tf.device('/cpu:0') right before you start training your neural network:

with tf.device('/cpu:0'):
    trained_neural_network = model.fit(X_train, y_train)

Now, as counter-intuitive as this might sound ("Why would you train your network on the CPU?"), what appears to happen is that TensorFlow still proceeds to train the model on your GPU right away. What's more, I have not noticed any performance degradation with this method either (whereas when I used generators to batch my dataset, training actually took twice as long because of the batch splitting - my epochs were twice as long!).


Overall, with both method 1 and method 2 the model's training accuracy seems to be the same, with method 2 being faster than method 1 (both work).

However, converting the datasets to tensors in advance (as some people have suggested) never worked in my case.

BONUS: If, like me, you encounter crashes when later running model.evaluate and model.predict with large datasets, adding with tf.device('/cpu:0'): again solves it. I can confirm the operation is still performed on the GPU regardless.
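
A minimal sketch of that, using the variable names from the question:

import tensorflow as tf

with tf.device('/cpu:0'):
    loss, acc = model.evaluate(X_test, Y_test, batch_size=128)
    predictions = model.predict(X_test, batch_size=128)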

Perhaps TensorFlow just needs to fix their code (tested on TensorFlow 2.11).

Determine answered 19/5 at 10:10

I found a solution. I reduced the number of samples:

model.fit(X_train[0:3000], Y_train[0:3000], batch_size=128, epochs=5, validation_split=0.1)

Then, the error disappeared.

Good luck to everyone.

Cyclosis answered 15/7, 2020 at 14:23
But what about the rest of the samples? You won't fit the model to them? - Desman
If he wishes to do it this way, he might create a custom training loop where he calls .fit for one epoch on this number of samples, then calls it on the remaining samples, and repeats using a for loop. But yep, it is quite barbaric, and using tf.data.Dataset or something similar is more elegant. - Girth
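
A rough sketch of the chunked training loop that comment describes; the chunk size and epoch count are arbitrary example values:

chunk_size = 3000
num_epochs = 5

for epoch in range(num_epochs):
    # Call .fit() for one epoch on each chunk in turn, so only one chunk
    # of the data is handed to the GPU at a time
    for start in range(0, len(X_train), chunk_size):
        end = start + chunk_size
        model.fit(X_train[start:end], Y_train[start:end],
                  batch_size=128, epochs=1)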

This might be a silly answer, but I wasn't able to load 200 MB of training data into an RTX 3090 (24 GB of VRAM), and after fighting with this problem for a bit, the solution for me was:

sudo reboot now

Afterwards, everything started working again.

Redfield answered 12/4, 2023 at 21:54
