Pytorch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows"

I have sentences that I vectorize using the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/). For some groups of sentences I have no problem, but for some others I get the following error message:

File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 133, in sentence_vector encoded_layers = self.eval_fwdprop_biobert(tokenized_text) File "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", line 82, in eval_fwdprop_biobert encoded_layers, _ = self.model(tokens_tensor, segments_tensors) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 730, in forward embedding_output = self.embeddings(input_ids, token_type_ids) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py", line 268, in forward position_embeddings = self.position_embeddings(position_ids) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__ result = self.forward(*input, **kwargs) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 114, in forward self.norm_type, self.scale_grad_by_freq, self.sparse) File "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py", line 1467, in embedding return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

I discovered that for some groups of sentences the problem was related to tags like <tb>, for instance. But for others, even when the tags are removed, the error message is still there.
(Unfortunately I can't share the code for confidentiality reasons)

Do you have any idea what the problem could be?

Thank you in advance.

EDIT: You are right, cronoik, it will be better with an example.

Example:

sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."

biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')

vectors = [biobert.sentence_vector(doc) for doc in sentences]

This last line of code is what causes the error message, in my opinion.

Terrilyn answered 26/6, 2020 at 15:36 Comment(1)
Please give us a minimal reproducible example which allows us to reproduce the error. – Ferryboat

The problem is that the biobert-embedding module isn't taking care of the maximum sequence length of 512 (tokens, not words!). This is the relevant source code. Have a look at the example below to force the error you received:

from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#doesn't work (raises the RuntimeError above)
biobert.sentence_vector(longersentence)

Output:

sentence has 512 tokens
longersentence has 513 tokens
#your error message....

What you should do is implement a sliding-window approach to process these texts:

import torch
from biobert_embedding.embedding import BiobertEmbedding

maxtokens = 512
startOffset = 0
docStride = 200

sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()

#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)

    # `encoded_layers` has shape [12 x 1 x seq_len x 768]
    # `token_vecs` is a tensor with shape [seq_len x 768]
    token_vecs = encoded_layers[11][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding


for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)
    
    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)

        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
                        docTokens[startOffset:startOffset+length]
                        , biobert)
                      )
        #stop when the whole document is processed (the document has fewer than 512 tokens,
        #or the last slice was processed)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    startOffset = 0
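
If you need a single vector per document rather than one vector per slice, a simple option is to pool the per-slice embeddings, for example by averaging them. The sketch below is not part of the biobert-embedding module; it reuses the names defined above (biobert, sentences, maxtokens, docStride, sentence_vector), and doc_vectors is a name introduced here only for illustration:

doc_vectors = []
for doc in sentences:
    docTokens = biobert.process_text(doc)
    slice_vectors = []
    startOffset = 0
    while startOffset < len(docTokens):
        length = min(len(docTokens) - startOffset, maxtokens)
        #one 768-dimensional vector per document slice
        slice_vectors.append(sentence_vector(docTokens[startOffset:startOffset+length], biobert))
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    #average the slice embeddings into a single 768-dimensional document embedding
    doc_vectors.append(torch.mean(torch.stack(slice_vectors), dim=0))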

P.S.: Your partial success with removing <tb> is explained by the fact that removing <tb> removes 4 tokens ('<', 't', '##b', '>').
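
To see where those four tokens come from, you can tokenize the tag on its own, reusing the biobert object from the snippets above. This is only a quick check; the exact WordPiece split depends on the vocabulary of the model you load:

#tokenize only the tag to see how many tokens it contributes
print(biobert.process_text('<tb>'))       #expected: ['<', 't', '##b', '>']
print(len(biobert.process_text('<tb>')))  #4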

Ferryboat answered 27/6, 2020 at 22:18 Comment(6)
Thank you very much, this is very helpful. If I understand the code you've posted properly, it vectorizes the first part of the too-long sentence and then the second part, so in the end we have two tensors of dimension 768 for the too-long sentence? Tell me if I'm wrong. I ask because it could be problematic for my use case. Anyway, thanks a lot. – Terrilyn
Yes, that is correct, but the sentence isn't just split in half. Each window always keeps some tokens of the previous part to provide context for the new part (with maxtokens=512 and docStride=200 as above, the window advances by 200 tokens, so up to 312 tokens overlap). In case this is problematic, you might want to use Longformer, which can process 4096 tokens instead of 512. – Ferryboat
Using Longformer could be interesting. Could you give me an example of how to use it in the above example? – Terrilyn
Because when I look at the Longformer documentation, I don't really see how I could use it. – Terrilyn
Sure, but please open a new question for that. SO is built to collect good questions and answers which are helpful not just for yourself but also for others. Mixing Longformer and BioBERT is not helpful for others, in my opinion. Please keep in mind that a good question contains your use case, example data, expected output and research effort. – Ferryboat
I'm sure it could be helpful to other people. If I have this problem, others certainly have the same or will. I've opened a new SO question here hoping for help. Thanks. – Terrilyn

Since the original BERT has a positional encoding table of size 512 (indices 0 to 511) and BioBERT derives from BERT, it is no surprise to get an index error for index 512. However, as you mentioned, it is a little strange that only some of your sentences end up accessing index 512.
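
If losing everything past the 512-token limit is acceptable for your use case, a simple alternative to the sliding-window answer above is to truncate the tokenized text before the forward pass. The sketch below reuses the lower-level process_text / eval_fwdprop_biobert calls shown in the other answer rather than sentence_vector; truncated_sentence_vector is a name introduced here only for illustration:

import torch
from biobert_embedding.embedding import BiobertEmbedding

biobert = BiobertEmbedding()

def truncated_sentence_vector(text, biobert, max_tokens=512):
    #keep at most the first 512 tokens so the positional embedding table is never exceeded
    tokens = biobert.process_text(text)[:max_tokens]
    encoded_layers = biobert.eval_fwdprop_biobert(tokens)
    #last encoder layer, shape [len(tokens) x 768]; average into a single 768-dim vector
    token_vecs = encoded_layers[11][0]
    return torch.mean(token_vecs, dim=0)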

Comber answered 26/6, 2020 at 16:22 Comment(0)
