tokenize Questions
17
Solved
How do I convert a comma separated string to an array?
I have the input '1,2,3', and I need to convert it into an array.
37
Solved
Java has a convenient split method:
String str = "The quick brown fox";
String[] results = str.split(" ");
Is there an easy way to do this in C++?
2
With BPE or WordPiece there might be multiple ways to encode a word. For instance, assume (for simplicity) the token vocabulary contains all letters as well as the merged symbols ("to", "...
Steiger asked 5/8, 2020 at 11:7
2
I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and in the testing phase, I know that BERT can only take up to 512 tokens, so I wrote an if condit...
Gismo asked 12/10, 2020 at 15:34
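One common way to handle the 512-token limit is to let the tokenizer truncate for you rather than writing the length check by hand. A minimal sketch, assuming the Hugging Face transformers tokenizer (the checkpoint name and text are illustrative):

from transformers import BertTokenizer

# "bert-base-uncased" is just an illustrative checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text = "some very long document " * 200

# Truncate (and pad) to the model's maximum length instead of an if-condition.
encoded = tokenizer(text, truncation=True, max_length=512, padding="max_length")
print(len(encoded["input_ids"]))  # 512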
12
Solved
I need to write a procedure to normalize a record that has multiple tokens concatenated by one char. I need to obtain these tokens by splitting the string and insert each one as a new record in a tab...
1
Solved
What exactly is the difference between a "token" and a "special token"?
I understand the following:
what is a typical token
what is a typical special token: MASK, UNK, SEP, etc
w...
Acuity asked 30/3, 2022 at 14:58
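One way to see the distinction in practice, assuming a Hugging Face tokenizer (the checkpoint name is illustrative): ordinary tokens come from the input text itself, while special tokens such as [CLS], [SEP], [MASK], [UNK] are reserved markers with structural meaning that the tokenizer inserts or substitutes for you.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Ordinary tokens produced from the input text.
print(tokenizer.tokenize("Hello world"))    # ['hello', 'world']

# Reserved special tokens that carry structural meaning, not text content.
print(tokenizer.all_special_tokens)         # e.g. ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]']

# encode() wraps the ordinary tokens in [CLS] ... [SEP] automatically.
ids = tokenizer.encode("Hello world")
print(tokenizer.convert_ids_to_tokens(ids)) # ['[CLS]', 'hello', 'world', '[SEP]']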
5
Solved
I want to split a char *string based on a multiple-character delimiter. I know that strtok() is used to split a string, but it works with a single-character delimiter.
I want to split a char *string bas...
2
Solved
I'm working on my first Python project and have a reasonably large dataset (tens of thousands of rows). I need to do some nlp (clustering, classification) on 5 text columns (multiple sentences of tex...
Declination asked 27/10, 2017 at 18:12
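A minimal sketch of one way to run the same preprocessing over several text columns of a DataFrame before clustering/classification; the column names, cleaning step, and vectorizer choice are illustrative assumptions, not the asker's setup:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative data; in practice df would be the real dataset.
df = pd.DataFrame({
    "col_a": ["First text cell.", "Another row of text."],
    "col_b": ["More free text here.", "And some more."],
})
text_cols = ["col_a", "col_b"]  # assumed names for the text columns

# Shared cleaning step applied to every text column.
for col in text_cols:
    df[col] = df[col].str.lower().str.replace(r"[^a-z\s]", " ", regex=True)

# One common choice: concatenate the text columns and vectorize once.
combined = df[text_cols].agg(" ".join, axis=1)
X = TfidfVectorizer().fit_transform(combined)
print(X.shape)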
7
I need a tokenizer that, given a string with arbitrary white-space between words, will create an array of words without empty sub-strings.
For example, given a string:
" I dont know what you mean by ...
Vagal asked 22/2, 2012 at 19:50
2
Solved
Is there a way to know the mapping from the tokens back to the original words in the tokenizer.decode() function?
For example:
from transformers.tokenization_roberta import RobertaTokenizer
token...
Idolla asked 11/6, 2020 at 5:33
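A minimal sketch of one way to recover that mapping, assuming the fast Roberta tokenizer: return_offsets_mapping gives the character span in the original string for every token, so each token can be traced back to the word it came from (the sentence is illustrative).

from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
text = "Tokenization is tricky"

# (start, end) character spans in the original text, one per token.
enc = tokenizer(text, return_offsets_mapping=True)
tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"])

for token, (start, end) in zip(tokens, enc["offset_mapping"]):
    # Special tokens like <s> and </s> map to the empty span (0, 0).
    print(token, "->", repr(text[start:end]))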
2
Can somebody help explain the basic concept behind the BPE model? Apart from the paper, there are not many explanations of it yet.
What I know so far is that it enables the NMT model to transl...
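The core of BPE is small enough to sketch: spell each word out as characters, then repeatedly count adjacent symbol pairs over the corpus and merge the most frequent pair into a new symbol. This follows the toy algorithm in the Sennrich et al. paper; the corpus below is illustrative.

import re
from collections import Counter

# Toy corpus: word -> frequency, words spelled as space-separated symbols.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

def get_pair_counts(vocab):
    # Count how often each adjacent pair of symbols occurs.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Replace every occurrence of the pair with the merged symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

for step in range(5):
    pairs = get_pair_counts(vocab)
    best = max(pairs, key=pairs.get)   # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print("merge", step + 1, ":", best)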
1
I was quite disappointed to discover that function calls were not highlighted using Pygments.
See it online (I tested it with all available styles)
Builtin functions are highlighted but not mi...
Inspire asked 19/9, 2017 at 9:30
3
Solved
I am trying to read a csv file using pandas
df1 = pd.read_csv('panda_error.csv', header=None, sep=',')
But I am getting this error:
ParserError: Error tokenizing data. C error: Expected 7 fiel...
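That ParserError usually means some rows have more separators than the first row. Two standard pandas options, sketched under the assumption that the stray rows can be skipped or that a known upper bound of columns exists (the filename is from the question):

import pandas as pd

# Option 1: skip the malformed rows (pandas >= 1.3 spells this on_bad_lines).
df1 = pd.read_csv("panda_error.csv", header=None, sep=",", on_bad_lines="skip")

# Option 2: name enough columns up front so longer rows still fit.
df2 = pd.read_csv("panda_error.csv", header=None, sep=",",
                  names=[f"col{i}" for i in range(10)])  # 10 is an assumed upper bound

print(df1.shape, df2.shape)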
9
Solved
I'm trying to parse a sentence (or line of text) where you have a sentence optionally followed by some key/value pairs on the same line. Not only are the key/value pairs optional, they are dynamic. ...
Friedcake asked 22/7, 2013 at 18:50
3
I have a method that takes in a String parameter, and uses NLTK to break the String down into sentences, then into words. Afterwards, it converts each word into lowercase, and finally creates a dicti...
Clustered asked 28/1, 2017 at 16:26
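That pipeline (sentences, then lowercase words, then a count dictionary) is short enough to sketch with NLTK; this assumes the punkt tokenizer data has been downloaded and is not the asker's exact code:

from collections import Counter
from nltk.tokenize import sent_tokenize, word_tokenize

# Requires the punkt tokenizer data: nltk.download("punkt")
def word_counts(text):
    counts = Counter()
    for sentence in sent_tokenize(text):       # break the string into sentences
        for word in word_tokenize(sentence):   # break each sentence into words
            counts[word.lower()] += 1          # lowercase, then count
    return dict(counts)

print(word_counts("The quick brown fox. The fox jumps!"))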
4
Solved
>>> t = Tokenizer(num_words=3)
>>> l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
>>> t.fit_on_texts(l)
>>> t.word_i...
Reconstitute asked 13/9, 2017 at 16:24
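The usual surprise here is that word_index always keeps the full vocabulary; num_words is only applied when texts are converted to sequences or matrices. A minimal sketch, assuming the (legacy) Keras preprocessing Tokenizer:

from tensorflow.keras.preprocessing.text import Tokenizer

l = ["Hello, World! This is so&#$ fantastic!", "There is no other world like this one"]
t = Tokenizer(num_words=3)
t.fit_on_texts(l)

# word_index holds every word seen, not just the top num_words.
print(len(t.word_index))

# The cap is applied here: only indices below num_words (i.e. 1 and 2,
# since 0 is reserved) survive in the output sequences.
print(t.texts_to_sequences(l))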
3
I am using a pre-trained BERT model to tokenize a text into meaningful tokens. However, the text has many specific words, and I don't want the BERT model to break them into word-pieces. Is there any sol...
Arbour asked 29/5, 2020 at 9:37
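One common workaround, assuming the Hugging Face tokenizer: add the domain-specific words to the vocabulary as whole tokens and resize the model's embedding matrix, so they are never split into word-pieces (the words below are illustrative):

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

domain_words = ["electroencephalogram", "myocarditis"]  # illustrative domain terms
print(tokenizer.tokenize("electroencephalogram"))       # split into several word-pieces

# Register the words as whole tokens and make room for their embeddings.
tokenizer.add_tokens(domain_words)
model.resize_token_embeddings(len(tokenizer))

print(tokenizer.tokenize("electroencephalogram"))       # now a single token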
2
Solved
I would like to use spacy's POS tagging, NER, and dependency parsing without using word tokenization. Indeed, my input is a list of tokens representing a sentence, and I would like to respect the u...
Grigson asked 9/1, 2018 at 13:43
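One way to do this is to build the Doc from the pre-split tokens yourself and apply the pipeline components to it, so spaCy's own tokenizer never runs. A sketch in the spaCy 2 style of iterating the pipeline (newer versions can also accept a pre-built Doc directly in nlp(...)); the tokens are illustrative:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

# Tokens already produced elsewhere; spaCy should respect this segmentation.
words = ["This", "sentence", "is", "pre", "-", "tokenized", "."]
doc = Doc(nlp.vocab, words=words)

# Run the remaining components (tagger, parser, NER) on the hand-built Doc.
for name, component in nlp.pipeline:
    doc = component(doc)

for token in doc:
    print(token.text, token.pos_, token.dep_, token.ent_type_)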
2
I have a sentence and I need to return the text corresponding to N BERT tokens to the left and right of a specific word.
from transformers import BertTokenizer
tz = BertTokenizer.from_pretrained("...
Clermontferrand asked 16/2, 2021 at 22:14
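A minimal sketch of one way to do this with a fast tokenizer's offset mapping: find the tokens whose character span overlaps the target word, widen the window by N tokens on each side, and slice the original text by those character offsets. The sentence, target word, and N are illustrative assumptions:

from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text = "The experimental vaccine produced a strong immune response in trials"
target = "immune"
n = 3  # number of BERT tokens of context on each side

enc = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
offsets = enc["offset_mapping"]

# Character span of the target word in the original text.
start_char = text.index(target)
end_char = start_char + len(target)

# Indices of the tokens that overlap the target word.
hit = [i for i, (s, e) in enumerate(offsets) if s < end_char and e > start_char]

left = max(hit[0] - n, 0)
right = min(hit[-1] + n, len(offsets) - 1)

# Slice the original text using the character offsets of the token window.
print(text[offsets[left][0]:offsets[right][1]])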
4
Solved
The below code breaks the sentence into individual tokens and the output is as below
"cloud" "computing" "is" "benefiting" " major" "manufacturing" "companies"
import en_core_web_sm
nlp = en_c...
Anderson asked 3/12, 2018 at 16:50
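If the goal is to keep a multi-word phrase such as "cloud computing" together (an assumption about the question's intent), one option in spaCy 3.x is to merge matched spans with the retokenizer:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("cloud computing is benefiting major manufacturing companies")

# Phrases to keep as single tokens (illustrative list).
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PHRASES", [nlp.make_doc("cloud computing")])

with doc.retokenize() as retokenizer:
    for _, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([token.text for token in doc])
# ['cloud computing', 'is', 'benefiting', 'major', 'manufacturing', 'companies']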
3
Solved
I would like to know if there is a method using boost::split to split a string using whole strings as a delimiter. For example:
str = "xxaxxxxabcxxxxbxxxcxxx"
Is there a method to split this str...
2
Does it make sense to change the tokenization paradigm in the BERT model to something else? Maybe just simple word tokenization or character-level tokenization?
Adopted asked 31/3, 2020 at 2:30
3
I am trying to tokenize and remove stop words from a txt file with Lucene. I have this:
public String removeStopWords(String string) throws IOException {
Set<String> stopWords = new HashSet...
Blackcap asked 12/7, 2013 at 23:17
14
I know this has been answered to some degree with PHP and MYSQL, but I was wondering if someone could teach me the simplest approach to splitting a string (comma delimited) into multiple rows in Or...
3
I was recently working through a bag-of-words introduction on Kaggle, and I want to clear up a few things:
using vectorizer.fit_transform( " * on the list of *cleaned* reviews* " )
Now when we were prep...
Emogeneemollient asked 1/8, 2016 at 6:46
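For reference, a typical fit_transform flow on a list of cleaned review strings looks like this (the reviews are illustrative): fit_transform both learns the vocabulary from the cleaned texts and returns the document-term count matrix in one step.

from sklearn.feature_extraction.text import CountVectorizer

clean_train_reviews = [
    "stuff going moment mj started listening music",
    "classic war worlds timothy hines entertaining film",
]  # illustrative cleaned reviews

vectorizer = CountVectorizer(max_features=5000)

# Learns the vocabulary from the cleaned reviews AND returns the sparse
# document-term matrix for those same reviews.
train_features = vectorizer.fit_transform(clean_train_reviews)

print(train_features.shape)                      # (n_reviews, n_vocabulary_terms)
print(vectorizer.get_feature_names_out()[:10])   # scikit-learn >= 1.0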