tokenize Questions
17
Solved
How do I convert a comma separated string to an array?
I have the input '1,2,3', and I need to convert it into an array.
37
Solved
Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
2
With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", "...
Steiger asked 5/8, 2020 at 11:7
2
I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and in the testing phase, I know that BERT can only take up to 512 tokens, so I wrote an if condit...
Gismo asked 12/10, 2020 at 15:34
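One common way to handle the 512-token limit is to let the tokenizer truncate for you rather than writing the length check by hand. A minimal sketch, assuming the Hugging Face transformers tokenizer (the checkpoint name and text are illustrative):

from transformers import BertTokenizer

# "bert-base-uncased" is just an illustrative checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "some very long document " * 200

# Truncate (and pad) to the model's maximum length instead of an if-condition.
encoded = tokenizer(text, truncation=True, max_length=512, padding="max_length")
print(len(encoded["input_ids"]))  # 512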
12
Solved
I need to write a procedure to normalize a record that has multiple tokens concatenated by one char. I need to obtain these tokens by splitting the string and insert each one as a new record in a tab...
1
Solved
What exactly is the difference between a "token" and a "special token"?
I understand the following:
what is a typical token
what is a typical special token: MASK, UNK, SEP, etc
w...
Acuity asked 30/3, 2022 at 14:58
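One way to see the distinction in practice, assuming a Hugging Face tokenizer (the checkpoint name is illustrative): ordinary tokens come from the input text itself, while special tokens such as [CLS], [SEP], [MASK], [UNK] are reserved markers with structural meaning that the tokenizer inserts or substitutes for you.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Ordinary tokens produced from the input text.
print(tokenizer.tokenize("Hello world"))    # ['hello', 'world']

# Reserved special tokens that carry structural meaning, not text content.
print(tokenizer.all_special_tokens)         # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

# encode() wraps the ordinary tokens in [CLS] ... [SEP] automatically.
ids = tokenizer.encode("Hello world")
print(tokenizer.convert_ids_to_tokens(ids)) # ['[CLS]', 'hello', 'world', '[SEP]']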
5
Solved
I want to split a char *string based on a multiple-character delimiter. I know that strtok() is used to split a string, but it works with a single-character delimiter.
I want to split a char *string bas...
2
Solved
I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some nlp (clustering, classification) on 5 text columns (multiple sentences of tex...
Declination asked 27/10, 2017 at 18:12
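A minimal sketch of one way to run the same preprocessing over several text columns of a DataFrame before clustering/classification; the column names, cleaning step, and vectorizer choice are illustrative assumptions, not the asker's setup:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data; in practice df would be the real dataset.
df = pd.DataFrame({
    "col_a": ["First text cell.", "Another row of text."],
    "col_b": ["More free text here.", "And some more."],
})
text_cols = ["col_a", "col_b"]  # assumed names for the text columns

# Shared cleaning step applied to every text column.
for col in text_cols:
    df[col] = df[col].str.lower().str.replace(r"[^a-z\s]", " ", regex=True)

# One common choice: concatenate the text columns and vectorize once.
combined = df[text_cols].agg(" ".join, axis=1)
X = TfidfVectorizer().fit_transform(combined)
print(X.shape)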
7
I need a tokenizer that, given a string with arbitrary white-space between words, will create an array of words without empty sub-strings.
For example, given a string:
" I dont know what you mean by ...
Vagal asked 22/2, 2012 at 19:50
2
Solved
Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function?
For example:
from transformers.tokenization_roberta import RobertaTokenizer
token...
Idolla asked 11/6, 2020 at 5:33
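A minimal sketch of one way to recover that mapping, assuming the fast Roberta tokenizer: return_offsets_mapping gives the character span in the original string for every token, so each token can be traced back to the word it came from (the sentence is illustrative).

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
text = "Tokenization is tricky"

# (start, end) character spans in the original text, one per token.
enc = tokenizer(text, return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])

for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    # Special tokens like <s> and </s> map to the empty span (0, 0).
    print(token, "->", repr(text[start:end]))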
2
Can somebody help explain the basic concept behind the BPE model? Apart from the paper, there are not many explanations of it yet.
What I know so far is that it enables the NMT model to transl...
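The core of BPE is small enough to sketch: spell each word out as characters, then repeatedly count adjacent symbol pairs over the corpus and merge the most frequent pair into a new symbol. This follows the toy algorithm in the Sennrich et al. paper; the corpus below is illustrative.

import re
from collections import Counter

# Toy corpus: word -> frequency, words spelled as space-separated symbols.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def get_pair_counts(vocab):
    # Count how often each adjacent pair of symbols occurs.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with the merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merge", step + 1, ":", best)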
1
I was quite disappointed to discover that function calls were not highlighted using Pygments.
See it online (I tested it with all available styles)
Builtin functions are highlighted but not mi...
Inspire asked 19/9, 2017 at 9:30
3
Solved
I am trying to read a csv file using pandas
df1 = pd.read_csv('panda_error.csv', header=None, sep=',')
But I am getting this error:
ParserError: Error tokenizing data. C error: Expected 7 fiel...
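That ParserError usually means some rows have more separators than the first row. Two standard pandas options, sketched under the assumption that the stray rows can be skipped or that a known upper bound of columns exists (the filename is from the question):

import pandas as pd

# Option 1: skip the malformed rows (pandas >= 1.3 spells this on_bad_lines).
df1 = pd.read_csv("panda_error.csv", header=None, sep=",", on_bad_lines="skip")

# Option 2: name enough columns up front so longer rows still fit.
df2 = pd.read_csv("panda_error.csv", header=None, sep=",",
                  names=[f"col{i}" for i in range(10)])  # 10 is an assumed upper bound

print(df1.shape, df2.shape)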
9
Solved
I'm trying to parse a sentence (or line of text) where you have a sentence optionally followed by some key/value pairs on the same line. Not only are the key/value pairs optional, they are dynamic. ...
Friedcake asked 22/7, 2013 at 18:50
3
I have a method that takes in a String parameter, and uses NLTK to break the String down into sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dicti...
Clustered asked 28/1, 2017 at 16:26
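That pipeline (sentences, then lowercase words, then a count dictionary) is short enough to sketch with NLTK; this assumes the punkt tokenizer data has been downloaded and is not the asker's exact code:

from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize

# Requires the punkt tokenizer data: nltk.download("punkt")
def word_counts(text):
    counts = Counter()
    for sentence in sent_tokenize(text):       # break the string into sentences
        for word in word_tokenize(sentence):   # break each sentence into words
            counts[word.lower()] += 1          # lowercase, then count
    return dict(counts)

print(word_counts("The quick brown fox. The fox jumps!"))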
4
Solved
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_i...
Reconstitute asked 13/9, 2017 at 16:24
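The usual surprise here is that word_index always keeps the full vocabulary; num_words is only applied when texts are converted to sequences or matrices. A minimal sketch, assuming the (legacy) Keras preprocessing Tokenizer:

from tensorflow.keras.preprocessing.text import Tokenizer

l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
t = Tokenizer(num_words=3)
t.fit_on_texts(l)

# word_index holds every word seen, not just the top num_words.
print(len(t.word_index))

# The cap is applied here: only indices below num_words (i.e. 1 and 2,
# since 0 is reserved) survive in the output sequences.
print(t.texts_to_sequences(l))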
3
I am using a pre-trained BERT model to tokenize a text into meaningful tokens. However, the text has many specific words, and I don't want the BERT model to break them into word-pieces. Is there any sol...
Arbour asked 29/5, 2020 at 9:37
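One common workaround, assuming the Hugging Face tokenizer: add the domain-specific words to the vocabulary as whole tokens and resize the model's embedding matrix, so they are never split into word-pieces (the words below are illustrative):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

domain_words = ["electroencephalogram", "myocarditis"]  # illustrative domain terms
print(tokenizer.tokenize("electroencephalogram"))       # split into several word-pieces

# Register the words as whole tokens and make room for their embeddings.
tokenizer.add_tokens(domain_words)
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("electroencephalogram"))       # now a single token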
2
Solved
I would like to use spacy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the u...
Grigson asked 9/1, 2018 at 13:43
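One way to do this is to build the Doc from the pre-split tokens yourself and apply the pipeline components to it, so spaCy's own tokenizer never runs. A sketch in the spaCy 2 style of iterating the pipeline (newer versions can also accept a pre-built Doc directly in nlp(...)); the tokens are illustrative:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

# Tokens already produced elsewhere; spaCy should respect this segmentation.
words = ["This", "sentence", "is", "pre", "-", "tokenized", "."]
doc = Doc(nlp.vocab, words=words)

# Run the remaining components (tagger, parser, NER) on the hand-built Doc.
for name, component in nlp.pipeline:
    doc = component(doc)

for token in doc:
    print(token.text, token.pos_, token.dep_, token.ent_type_)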
2
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("...
Clermontferrand asked 16/2, 2021 at 22:14
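A minimal sketch of one way to do this with a fast tokenizer's offset mapping: find the tokens whose character span overlaps the target word, widen the window by N tokens on each side, and slice the original text by those character offsets. The sentence, target word, and N are illustrative assumptions:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "The experimental vaccine produced a strong immune response in trials"
target = "immune"
n = 3  # number of BERT tokens of context on each side

enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
offsets = enc["offset_mapping"]

# Character span of the target word in the original text.
start_char = text.index(target)
end_char = start_char + len(target)

# Indices of the tokens that overlap the target word.
hit = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]

left = max(hit[0] - n, 0)
right = min(hit[-1] + n, len(offsets) - 1)

# Slice the original text using the character offsets of the token window.
print(text[offsets[left][0]:offsets[right][1]])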
4
Solved
The below code breaks the sentence into individual tokens and the output is as below
"cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies"
import en_core_web_sm
nlp = en_c...
Anderson asked 3/12, 2018 at 16:50
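If the goal is to keep a multi-word phrase such as "cloud computing" together (an assumption about the question's intent), one option in spaCy 3.x is to merge matched spans with the retokenizer:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("cloud computing is benefiting major manufacturing companies")

# Phrases to keep as single tokens (illustrative list).
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PHRASES", [nlp.make_doc("cloud computing")])

with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([token.text for token in doc])
# ['cloud computing', 'is', 'benefiting', 'major', 'manufacturing', 'companies']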
3
Solved
I would like to know if there is a method using boost::split to split a string using whole strings as a delimiter. For example:
str = "xxaxxxxabcxxxxbxxxcxxx"
Is there a method to split this str...
2
Does it make sense to change the tokenization paradigm in the BERT model to something else? Maybe just simple word tokenization or character-level tokenization?
Adopted asked 31/3, 2020 at 2:30
3
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:
public String removeStopWords(String string) throws IOException {
Set<String> stopWords = new HashSet...
Blackcap asked 12/7, 2013 at 23:17
14
I know this has been answered to some degree with PHP and MYSQL, but I was wondering if someone could teach me the simplest approach to splitting a string (comma delimited) into multiple rows in Or...
3
I was recently working through a bag-of-words introduction on Kaggle, and I want to clear up a few things:
using vectorizer.fit_transform( " * on the list of *cleaned* reviews* " )
Now when we were prep...
Emogeneemollient asked 1/8, 2016 at 6:46
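For reference, a typical fit_transform flow on a list of cleaned review strings looks like this (the reviews are illustrative): fit_transform both learns the vocabulary from the cleaned texts and returns the document-term count matrix in one step.

from sklearn.feature_extraction.text import CountVectorizer

clean_train_reviews = [
    "stuff going moment mj started listening music",
    "classic war worlds timothy hines entertaining film",
]  # illustrative cleaned reviews

vectorizer = CountVectorizer(max_features=5000)

# Learns the vocabulary from the cleaned reviews AND returns the sparse
# document-term matrix for those same reviews.
train_features = vectorizer.fit_transform(clean_train_reviews)

print(train_features.shape)                      # (n_reviews, n_vocabulary_terms)
print(vectorizer.get_feature_names_out()[:10])   # scikit-learn >= 1.0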