How to slice string depending on length of tokens
When I use (with a long test_text and short question):

from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

input_ids = tokenizer.encode(question, test_text)
print('Query has {:,} tokens.\n'.format(len(input_ids)))

# Segment ids: 0 for the question (up to and including the first [SEP]),
# 1 for the text.
sep_index = input_ids.index(tokenizer.sep_token_id)
num_seg_a = sep_index + 1
num_seg_b = len(input_ids) - num_seg_a
segment_ids = [0]*num_seg_a + [1]*num_seg_b

start_scores, end_scores = model(torch.tensor([input_ids]),
                                 token_type_ids=torch.tensor([segment_ids]))

I get this error along with the output:

Token indices sequence length is longer than the specified maximum sequence length for this model (3 > 512). Running this sequence through the model will result in indexing errors

Query has 1,244 tokens.

How can I split test_text into chunks of maximal length, knowing that each chunk must not exceed 512 tokens? I would then ask the same question of each chunk and take the best-scoring answer across all of them, also going through the text a second time with shifted slice points, in case the answer is cut in two by a slice.
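What I have so far is a sketch of the token-level splitting I'm after. It works on raw token-id lists so it doesn't depend on any particular tokenizer; `sliding_windows`, `budget`, and `stride` are my own names, not anything from `transformers`. The idea is that each chunk is `[CLS] question [SEP] window [SEP]`, and consecutive windows overlap by `stride` tokens so an answer cut at one boundary appears whole in the next window:

```python
def sliding_windows(question_ids, text_ids, sep_id, max_len=512, stride=128):
    """Split text_ids into overlapping windows so that each chunk, with the
    question prepended, stays within max_len tokens.

    question_ids: ids for '[CLS] question [SEP]' (e.g. tokenizer.encode(question))
    text_ids:     ids for the passage, without special tokens
    sep_id:       tokenizer.sep_token_id
    """
    budget = max_len - len(question_ids) - 1  # reserve room for the trailing [SEP]
    if budget <= 0:
        raise ValueError("question alone exceeds max_len")
    chunks, start = [], 0
    while True:
        window = text_ids[start:start + budget]
        chunks.append(question_ids + window + [sep_id])
        if start + budget >= len(text_ids):
            break
        start += budget - stride  # overlap windows by `stride` tokens
    return chunks
```

Each chunk could then be fed to the model as in the code above, keeping the answer span with the highest `start_scores + end_scores` over all chunks. I haven't verified this is how run_squad does it.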

Pecker answered 21/6, 2020 at 18:20 Comment(12)
Pre-process each input text as text = text[:512], if that helps. – Dogma
The length of the string does not correspond linearly to the number of tokens; the function that maps one to the other is the tokenizer itself. – Pecker
I see, have you tried max_length here? – Dogma
I'm not sure what you mean, but if you figure out something that works, let me know. – Pecker
Check the run_squad script. It basically does what you are looking for. – Suburbicarian
@Suburbicarian do you know where in the script it does this? How to get that script to run would be another question. – Pecker
Here. You can simply create a json file with your questions in the same format as the SQuAD data and execute the script with it. What you basically want is a sliding-window approach. I have posted a small example here. – Suburbicarian
@Suburbicarian OK, that links to squad.py rather than run_squad.py. Yes, "sliding window" seems to be the label for what I want; I'm not clear on how to apply it here. Ultimately, I'm trying to find a script where I can put in text of any length and a question and get the best answer out of it. – Pecker
The script you are looking for is the already-mentioned run_squad.py. That script calls squad.py to prepare the data. – Suburbicarian
@Suburbicarian OK, maybe someone will want to put this in an answer applying it to the case here, for points. – Pecker
@Suburbicarian Is there a link on how to create a json file with such a format? I'm also not sure what you mean by "squad format"; the word "format" is not used on that page. And in your example, is biobert.process_text(doc) to be replaced by tokenizer.encode(question, test_text), and biobert.eval_fwdprop_biobert(tokenized_text) by model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids])), or something like that? – Pecker
Download the training/validation json file from SQuAD and open it with an editor of your choice. It is self-explanatory. – Suburbicarian
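For later readers: the nesting of the SQuAD v1.1 json (version / data / paragraphs / qas, with field names as in the public SQuAD dataset) can be produced like this. The helper name to_squad_json and the title/id values are mine, purely for illustration:

```python
import json

def to_squad_json(context, question, qid="q1", path="my_questions.json"):
    """Wrap one passage and one question in the SQuAD v1.1 structure."""
    data = {
        "version": "1.1",
        "data": [{
            "title": "my_document",
            "paragraphs": [{
                "context": context,
                "qas": [{
                    "id": qid,
                    "question": question,
                    "answers": []  # left empty when only predicting
                }]
            }]
        }]
    }
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

The resulting file is what run_squad.py's --predict_file option expects, as far as I can tell from the downloaded dev set.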
