How to slice string depending on length of tokens
When I use (with a long test_text and short question):

from transformers import BertTokenizer, BertForQuestionAnswering
import torch

tokenizer = BertTokenizer.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')

input_ids = tokenizer.encode(question, test_text)
print('Query has {:,} tokens.\n'.format(len(input_ids)))

# Segment ids: 0 for the question (up to and including the first [SEP]),
# 1 for the text.
sep_index = input_ids.index(tokenizer.sep_token_id)
num_seg_a = sep_index + 1
num_seg_b = len(input_ids) - num_seg_a
segment_ids = [0]*num_seg_a + [1]*num_seg_b

start_scores, end_scores = model(torch.tensor([input_ids]),
                                 token_type_ids=torch.tensor([segment_ids]))

I get this error along with the output:

Token indices sequence length is longer than the specified maximum sequence length for this model (3 > 512). Running this sequence through the model will result in indexing errors

Query has 1,244 tokens.

How can I split test_text into chunks of maximal length, knowing that each chunk must not exceed 512 tokens? I would then ask the same question of each chunk and take the best-scoring answer across all of them, also going through the text a second time with shifted slice points, in case the answer is cut in two by a slice.
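What I have so far is a sketch of the token-level splitting I'm after. It works on raw token-id lists so it doesn't depend on any particular tokenizer; `sliding_windows`, `budget`, and `stride` are my own names, not anything from `transformers`. The idea is that each chunk is `[CLS] question [SEP] window [SEP]`, and consecutive windows overlap by `stride` tokens so an answer cut at one boundary appears whole in the next window:

```python
def sliding_windows(question_ids, text_ids, sep_id, max_len=512, stride=128):
    """Split text_ids into overlapping windows so that each chunk, with the
    question prepended, stays within max_len tokens.

    question_ids: ids for '[CLS] question [SEP]' (e.g. tokenizer.encode(question))
    text_ids:     ids for the passage, without special tokens
    sep_id:       tokenizer.sep_token_id
    """
    budget = max_len - len(question_ids) - 1  # reserve room for the trailing [SEP]
    if budget <= 0:
        raise ValueError("question alone exceeds max_len")
    chunks, start = [], 0
    while True:
        window = text_ids[start:start + budget]
        chunks.append(question_ids + window + [sep_id])
        if start + budget >= len(text_ids):
            break
        start += budget - stride  # overlap windows by `stride` tokens
    return chunks
```

Each chunk could then be fed to the model as in the code above, keeping the answer span with the highest `start_scores + end_scores` over all chunks. I haven't verified this is how run_squad does it.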

Pecker answered 21/6, 2020 at 18:20 Comment(12)
Pre-process each input text as text = text[:512], if that helps. – Dogma
The length of the string does not correspond linearly to the number of tokens; the function that maps one to the other is the tokenizer itself. – Pecker
I see, have you tried max_length here? – Dogma
I'm not sure what you mean, but if you figure out something that works, let me know. – Pecker
Check the run_squad script. It basically does what you are looking for. – Suburbicarian
@Suburbicarian do you know where in the script it does this? How to get that script to run would be another question. – Pecker
Here. You can simply create a json file with your questions in the same format as the SQuAD data and execute the script with it. What you basically want is a sliding-window approach. I have posted a small example here. – Suburbicarian
@Suburbicarian OK, that links to squad.py rather than run_squad.py. Yes, "sliding window" seems to be the label for what I want; I'm not clear on how to apply it here. Ultimately, I'm trying to find a script where I can put in text of any length and a question and get the best answer out of it. – Pecker
The script you are looking for is the already-mentioned run_squad.py. That script calls squad.py to prepare the data. – Suburbicarian
@Suburbicarian OK, maybe someone will want to put this in an answer applying it to the case here, for points. – Pecker
@Suburbicarian Is there a link on how to create a json file with such a format? I'm also not sure what you mean by "squad format"; the word "format" is not used on that page. And in your example, is biobert.process_text(doc) to be replaced by tokenizer.encode(question, test_text), and biobert.eval_fwdprop_biobert(tokenized_text) by model(torch.tensor([input_ids]), token_type_ids=torch.tensor([segment_ids])), or something like that? – Pecker
Download the training/validation json file from SQuAD and open it with an editor of your choice. It is self-explanatory. – Suburbicarian
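For later readers: the nesting of the SQuAD v1.1 json (version / data / paragraphs / qas, with field names as in the public SQuAD dataset) can be produced like this. The helper name to_squad_json and the title/id values are mine, purely for illustration:

```python
import json

def to_squad_json(context, question, qid="q1", path="my_questions.json"):
    """Wrap one passage and one question in the SQuAD v1.1 structure."""
    data = {
        "version": "1.1",
        "data": [{
            "title": "my_document",
            "paragraphs": [{
                "context": context,
                "qas": [{
                    "id": qid,
                    "question": question,
                    "answers": []  # left empty when only predicting
                }]
            }]
        }]
    }
    with open(path, "w") as f:
        json.dump(data, f)
    return data
```

The resulting file is what run_squad.py's --predict_file option expects, as far as I can tell from the downloaded dev set.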
