tokenize Questions

17

Solved

How do I convert a comma-separated string to an array? I have the input '1,2,3', and I need to convert it into an array.
Preoccupy asked 29/9, 2010 at 6:49
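The question does not name a language, so the following is only a minimal sketch in Python of one common approach: split on the comma, then convert each piece. The helper name `split_csv_line` is made up for illustration.

```python
def split_csv_line(text, sep=","):
    """Split text on sep and strip surrounding whitespace from each piece."""
    return [piece.strip() for piece in text.split(sep)]


# The input from the question.
values = split_csv_line("1,2,3")      # list of strings
numbers = [int(v) for v in values]    # list of integers, if numbers are wanted
```

This simple split does not handle quoted fields that contain commas; for real CSV data a dedicated parser is usually the safer choice.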

37

Solved

Java has a convenient split method: String str = "The quick brown fox"; String[] results = str.split(" "); Is there an easy way to do this in C++?
Methylal asked 10/9, 2008 at 12:10

2

With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", &...
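To make the ambiguity concrete, here is a small sketch that lists every way a word can be split into tokens from a vocabulary. The vocabulary `toy_vocab` and the word "today" are assumptions chosen for illustration, not taken from the question, and the function is a brute-force enumeration rather than what any real tokenizer does.

```python
def segmentations(word, vocab):
    """Return every way to split word into a sequence of tokens from vocab."""
    if not word:
        return [[]]  # one way to segment the empty string: no tokens
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in vocab:
            # Keep this prefix as a token and segment the remainder recursively.
            for rest in segmentations(word[end:], vocab):
                results.append([prefix] + rest)
    return results


# Hypothetical vocabulary: the single letters of "today" plus two merged symbols.
toy_vocab = set("today") | {"to", "day"}
all_splits = segmentations("today", toy_vocab)
```

With this toy vocabulary the word has several valid segmentations. In practice a BPE tokenizer applies its learned merges in a fixed order, and WordPiece typically uses greedy longest-match-first, so each normally produces just one of them.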

2

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and in the testing phase, I know that BERT can only take up to 512 tokens, so I wrote if condit...
Gismo asked 12/10, 2020 at 15:34

12

Solved

I need to write a procedure to normalize a record that has multiple tokens concatenated by one char. I need to obtain these tokens by splitting the string and insert each one as a new record in a tab...
Isley asked 14/9, 2010 at 15:55

1

Solved

What exactly is the difference between a "token" and a "special token"? I understand the following: what is a typical token what is a typical special token: MASK, UNK, SEP, etc w...

5

Solved

I want to split a char *string based on a multiple-character delimiter. I know that strtok() is used to split a string, but it works with a single-character delimiter. I want to split char *string bas...
Savell asked 22/4, 2015 at 6:2
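The question is about C, so the following Python sketch only illustrates the idea: search for the next occurrence of the whole delimiter, cut there, and continue after it. The function name `split_on_delimiter` is hypothetical. In C the same loop can be written around `strstr()` instead of `str.find()`.

```python
def split_on_delimiter(text, delimiter):
    """Split text on every occurrence of a (possibly multi-character) delimiter."""
    if not delimiter:
        raise ValueError("delimiter must not be empty")
    pieces = []
    start = 0
    while True:
        index = text.find(delimiter, start)  # position of the next delimiter
        if index == -1:
            pieces.append(text[start:])      # no more delimiters: keep the tail
            return pieces
        pieces.append(text[start:index])
        start = index + len(delimiter)       # resume after the whole delimiter


parts = split_on_delimiter("alpha--beta--gamma", "--")
```

Python's built-in `str.split("--")` gives the same result; the explicit loop is shown only because it maps directly onto a C implementation.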

2

Solved

I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some NLP (clustering, classification) on 5 text columns (multiple sentences of tex...
Declination asked 27/10, 2017 at 18:12

7

I need a tokenizer that, given a string with arbitrary whitespace among words, will create an array of words without empty sub-strings. For example, given a string: " I dont know what you mean by ...
Vagal asked 22/2, 2012 at 19:50
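The question does not show which language it targets, so this is only a sketch of the behaviour in Python. Calling `str.split()` with no arguments splits on runs of whitespace and drops empty strings. The sample input below is an assumption, since the example in the question is cut off.

```python
def tokenize_words(text):
    """Split text on runs of whitespace, discarding empty sub-strings."""
    return text.split()


# Hypothetical input with leading, trailing, and repeated whitespace.
words = tokenize_words("  I dont   know\twhat you mean  ")
```

Passing an explicit separator, as in `text.split(" ")`, behaves differently: it keeps an empty string for every extra space.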

2

Solved

Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function? For example: from transformers.tokenization_roberta import RobertaTokenizer token...
Idolla asked 11/6, 2020 at 5:33

2

Can somebody help explain the basic concept behind the BPE model? Apart from this paper, there are not many explanations of it yet. What I know so far is that it enables NMT model transl...
Lauritz asked 29/5, 2018 at 11:28

1

I was quite disappointed to discover that function calls were not highlighted using Pygments. See it online (I tested it with all available styles) Builtin functions are highlighted but not mi...
Inspire asked 19/9, 2017 at 9:30

3

Solved

I am trying to read a CSV file using pandas: df1 = pd.read_csv('panda_error.csv', header=None, sep=',') But I am getting this error: ParserError: Error tokenizing data. C error: Expected 7 fiel...
Weevil asked 20/12, 2019 at 4:42

9

Solved

I'm trying to parse a sentence (or line of text) where a sentence is optionally followed by some key/val pairs on the same line. Not only are the key/value pairs optional, they are dynamic. ...
Friedcake asked 22/7, 2013 at 18:50

3

I have a method that takes in a String parameter, and uses NLTK to break the String down to sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dicti...
Clustered asked 28/1, 2017 at 16:26

4

Solved

>>> t = Tokenizer(num_words=3) >>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"] >>> t.fit_on_texts(l) >>> t.word_i...
Reconstitute asked 13/9, 2017 at 16:24

3

I am using a pre-trained BERT model to tokenize a text into meaningful tokens. However, the text has many specific words and I don't want the BERT model to break them into word-pieces. Is there any sol...
Arbour asked 29/5, 2020 at 9:37

2

Solved

I would like to use spaCy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the u...
Grigson asked 9/1, 2018 at 13:43

2

I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word. from transformers import BertTokenizer tz = BertTokenizer.from_pretrained(&q...

4

Solved

The code below breaks the sentence into individual tokens, and the output is as follows: "cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies" import en_core_web_sm nlp = en_c...
Anderson asked 3/12, 2018 at 16:50

3

Solved

I would like to know if there is a method using boost::split to split a string using whole strings as a delimiter. For example: str = "xxaxxxxabcxxxxbxxxcxxx" is there a method to split this str...
Abound asked 15/9, 2011 at 20:17

2

Does it make sense to change the tokenization paradigm in the BERT model, to something else? Maybe just a simple word tokenization or character level tokenization?
Adopted asked 31/3, 2020 at 2:30

3

I am trying to tokenize and remove stop words from a txt file with Lucene. I have this: public String removeStopWords(String string) throws IOException { Set<String> stopWords = new HashSet...
Blackcap asked 12/7, 2013 at 23:17

14

I know this has been answered to some degree with PHP and MySQL, but I was wondering if someone could teach me the simplest approach to splitting a string (comma delimited) into multiple rows in Or...
Proconsul asked 14/1, 2013 at 23:20

3

I was recently practicing the bag of words introduction on Kaggle, and I want to clear up a few things: using vectorizer.fit_transform( " * on the list of *cleaned* reviews* " ) Now when we were prep...
Emogeneemollient asked 1/8, 2016 at 6:46
