tokenize Questions

4

Solved

Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String? Something like: String to_be_parsed = "car window seven"; Analyzer analyzer = new StandardAnalyzer(...
Semela asked 13/6, 2011 at 18:38

4

Solved

NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on sentences. It does a pretty good job out of the box. >>>...
Pavlodar asked 23/2, 2015 at 16:22
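
A minimal sketch of the chained behavior described above (assuming NLTK's punkt models are available; the downloadable resource name varies across NLTK versions):

    import nltk
    nltk.download('punkt', quiet=True)  # sentence-tokenizer models

    text = "Good muffins cost $3.88 in New York. Please buy me two."
    # word_tokenize first runs the sentence tokenizer, then a Treebank-style
    # word tokenizer on each sentence.
    print(nltk.word_tokenize(text))
    # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', ...]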

3

Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings the way the shell would process them. For example, if I'd send: one two "three four" and perform a split, I...
Lacustrine asked 4/7, 2009 at 20:52
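
The question asks for Java, but for reference, the Python shlex behavior being described looks like this:

    import shlex

    line = 'one two "three four"'
    # shlex.split honors shell-style quoting when splitting
    print(shlex.split(line))  # ['one', 'two', 'three four']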

0

When I use (with a long test_text and short question): from transformers import BertTokenizer import torch from transformers import BertForQuestionAnswering tokenizer = BertTokenizer.from_pretrain...
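
The excerpt is cut off, but a common failure with a long test_text is exceeding BERT's 512-token limit. A hedged sketch with explicit truncation; the checkpoint name is an assumption, since the one in the question is truncated above:

    import torch
    from transformers import BertTokenizer, BertForQuestionAnswering

    # Checkpoint name is an assumption; the question's actual model is cut off.
    name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
    tokenizer = BertTokenizer.from_pretrained(name)
    model = BertForQuestionAnswering.from_pretrained(name)

    question = 'What is tokenization?'
    test_text = 'Tokenization splits text into units called tokens. ...'  # long passage

    # BERT accepts at most 512 wordpieces, so a long passage must be
    # truncated (or processed in overlapping windows).
    inputs = tokenizer(question, test_text, truncation=True,
                       max_length=512, return_tensors='pt')
    with torch.no_grad():
        out = model(**inputs)
    start = int(torch.argmax(out.start_logits))
    end = int(torch.argmax(out.end_logits)) + 1
    print(tokenizer.decode(inputs['input_ids'][0][start:end]))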

3

Solved

I am tokenizing a text using nltk.word_tokenize and I would like to also get the index, in the original raw text, of the first character of every token, i.e. import nltk x = 'hello world' tokens = n...
Gismo asked 28/7, 2015 at 6:5
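
One simple approach is to scan forward through the raw string with str.find; a sketch:

    import nltk

    x = 'hello world'
    tokens = nltk.word_tokenize(x)

    # Caveat: word_tokenize rewrites a few tokens (e.g. double quotes become
    # `` and ''), which this simple scan will not find in the raw text.
    offsets, pos = [], 0
    for tok in tokens:
        pos = x.find(tok, pos)
        offsets.append((tok, pos))
        pos += len(tok)
    print(offsets)  # [('hello', 0), ('world', 6)]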

6

Solved

I am working on creating text classification code but I am having problems encoding documents using the tokenizer. 1) I started by fitting a tokenizer on my document, as in here: vocabulary_size...
Bazaar asked 5/8, 2018 at 23:28
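
A minimal sketch of the Keras fit-then-encode flow referenced above (the vocabulary_size value is illustrative):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    docs = ['the cat sat on the mat', 'the dog ran away']
    vocabulary_size = 5000  # name borrowed from the excerpt

    tokenizer = Tokenizer(num_words=vocabulary_size)
    tokenizer.fit_on_texts(docs)                    # build the word -> index map
    sequences = tokenizer.texts_to_sequences(docs)  # docs as lists of indices
    padded = pad_sequences(sequences, maxlen=10)    # uniform length for the model
    print(padded.shape)  # (2, 10)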

6

Solved

I am using a tab (\t) as delimiter and I know there are some empty fields in my data, e.g.: one->two->->three Where -> equals the tab. As you can see, an empty field is still correctly su...
Bobbiebobbin asked 10/7, 2012 at 8:22
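
The question's language is not visible in the excerpt; as a Python illustration of the desired behavior, str.split preserves empty fields between adjacent delimiters, unlike strtok-style tokenizers that collapse them:

    line = 'one\ttwo\t\tthree'
    print(line.split('\t'))  # ['one', 'two', '', 'three'] (the empty field survives)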

1

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library. I am using data from this Kaggle competition. Given a q...
Creation asked 20/4, 2020 at 12:12
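
A hedged sketch of encode_plus on a question-answer pair; errors with long pairs usually come from exceeding max_length without an explicit truncation strategy (the strings here are placeholders, not the Kaggle data):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    question, answer = 'How do I tokenize?', 'Use a pretrained tokenizer.'

    # Pairs are encoded as [CLS] question [SEP] answer [SEP]; a strategy
    # such as 'longest_first' trims the longer member of the pair first.
    enc = tokenizer.encode_plus(question, answer,
                                max_length=512,
                                truncation='longest_first',
                                padding='max_length',
                                return_tensors='pt')
    print(enc['input_ids'].shape)  # torch.Size([1, 512])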

9

Solved

I made a program which converts infix to postfix in Python. The problem is when I introduce the arguments. If I introduce something like this (this will be a string): ( ( 73 + ( ( 34 - 72 ) / ( 33 ...
Apophasis asked 13/4, 2017 at 10:22
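
A compact shunting-yard sketch for space-separated input like the excerpt's; the test expression is completed arbitrarily, since the original is truncated:

    def infix_to_postfix(expr):
        """Shunting-yard over a space-separated infix expression."""
        prec = {'+': 1, '-': 1, '*': 2, '/': 2}
        out, ops = [], []
        for tok in expr.split():          # space-separated input tokenizes trivially
            if tok == '(':
                ops.append(tok)
            elif tok == ')':
                while ops[-1] != '(':
                    out.append(ops.pop())
                ops.pop()                 # discard the '('
            elif tok in prec:
                while ops and ops[-1] != '(' and prec[ops[-1]] >= prec[tok]:
                    out.append(ops.pop())
                ops.append(tok)
            else:                         # operand
                out.append(tok)
        while ops:
            out.append(ops.pop())
        return ' '.join(out)

    print(infix_to_postfix('( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) )'))
    # 73 34 72 - 33 3 - / +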

2

Solved

I'm trying to implement autocomplete using Elasticsearch, thinking that I understand how to do it... I'm trying to build multi-word (phrase) suggestions by using ES's edge_n_grams while indexing cr...
Eleanor asked 11/5, 2016 at 16:47
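
A sketch of the usual index-time edge_ngram setup, written as a Python dict (Elasticsearch takes the same structure as JSON); the analyzer names are illustrative:

    autocomplete_settings = {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "autocomplete": {
                        "type": "edge_ngram",   # index-time prefix tokens
                        "min_gram": 1,
                        "max_gram": 20,
                        "token_chars": ["letter", "digit"],
                    }
                },
                "analyzer": {
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "autocomplete",
                        "filter": ["lowercase"],
                    },
                    # the query side should NOT n-gram the user's input
                    "autocomplete_search": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    },
                },
            }
        }
    }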

3

Solved

I'm pretty sure this is a simple question to answer and I've seen it asked before, just with no solid answers. I have several properties files that are used for different environments, i.e. xxxx-dev, xxxx...
Sloop asked 22/12, 2010 at 10:16

4

Solved

I want to split each line of a pipe on spaces, and then print each token on its own line. I realise that I can get this result using: (cat someFileInsteadOfAPipe).split(" ") But I want more fle...
Benitabenites asked 5/7, 2012 at 16:20
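
The question is about PowerShell; as a rough Python analog of the desired pipeline behavior (assuming the pipe arrives on stdin):

    import sys

    # Split every incoming line on whitespace and print one token per line.
    for line in sys.stdin:
        for token in line.split():
            print(token)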

2

When tokenizing texts that contain both Chinese and English, the result will split English words into letters, which is not what I want. Consider the following code: from nltk.tokenize.stanford_se...
Knit asked 29/8, 2017 at 13:59

3

Solved

I have always used the spacy library with English or German. To load the library I used this code: import spacy nlp = spacy.load('en') I would like to use the Spanish tokeniser, but I do not know how t...
Grogram asked 22/3, 2017 at 9:40
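
A sketch of loading the Spanish pipeline; the exact loading convention depends on the spaCy version:

    import spacy

    # Current spaCy releases ship named model packages; install first with:
    #   python -m spacy download es_core_news_sm
    # (older v1/v2 releases accepted spacy.load('es') after a download)
    nlp = spacy.load('es_core_news_sm')
    doc = nlp('Esto es una frase en español.')
    print([t.text for t in doc])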

4

Solved

I’m having difficulty eliminating and tokenizing a .text file using nltk. I keep getting the following AttributeError: 'list' object has no attribute 'lower'. I just can’t figure out what I’m doin...
Bulgar asked 30/6, 2013 at 12:24
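
This AttributeError usually means .lower() was called on the token list rather than on the string; a sketch of the fix (the file name is a placeholder):

    import nltk

    raw = open('corpus.txt').read()  # 'corpus.txt' is a placeholder name

    # Wrong: word_tokenize returns a list, and lists have no .lower()
    # tokens = nltk.word_tokenize(raw).lower()       # AttributeError
    tokens = nltk.word_tokenize(raw.lower())         # lowercase the string first
    # or: tokens = [t.lower() for t in nltk.word_tokenize(raw)]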

5

Solved

In a sentence containing hashtags, such as a tweet, spacy's tokenizer splits hashtags into two tokens: import spacy nlp = spacy.load('en') doc = nlp(u'This is a #sentence.') [t for t in doc] out...
Costermansville asked 13/4, 2017 at 9:28
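
One way to keep hashtags whole is to merge the '#' token back into its word with the retokenizer (spaCy v2+ API); a sketch:

    import re
    import spacy

    nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut in older versions
    doc = nlp('This is a #sentence.')

    # Merge each '#word' span back into a single token.
    with doc.retokenize() as retokenizer:
        for match in re.finditer(r'#\w+', doc.text):
            span = doc.char_span(match.start(), match.end())
            if span is not None:
                retokenizer.merge(span)

    print([t.text for t in doc])  # ['This', 'is', 'a', '#sentence', '.']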

3

Solved

I am using this excellent article to learn machine learning. https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/ The author has tokenized the X and y data after spli...
Patton asked 28/8, 2019 at 13:15
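
The usual concern with tokenizing after the split is vocabulary leakage; a sketch of fitting the tokenizer on the training split only (the toy corpus and num_words are placeholders):

    from sklearn.model_selection import train_test_split
    from tensorflow.keras.preprocessing.text import Tokenizer

    X = ['first document ...', 'second document ...', 'third one', 'fourth one']
    y = [0, 1, 0, 1]  # toy labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    # Fitting only on the training split keeps test vocabulary out of training;
    # words unseen at fit time are simply dropped at encode time.
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(X_train)
    X_train_seq = tokenizer.texts_to_sequences(X_train)
    X_test_seq = tokenizer.texts_to_sequences(X_test)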

1

Solved

The (German) spacy tokenizer does not split on slashes, underscores, or asterisks by default, which is just what I need (so "der/die" results in a single token). However, it does split on parenthes...
Basra asked 31/7, 2019 at 17:16
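
The general mechanism is to rebuild the tokenizer's infix rules without the offending pattern; which default pattern matches parentheses depends on the spaCy version, so the filter below is an assumption, not the canonical rule:

    import spacy
    from spacy.util import compile_infix_regex

    nlp = spacy.load('de_core_news_sm')

    # Drop default infix patterns that mention a literal '(' so parentheses
    # inside words stop triggering a split (version-dependent assumption).
    infixes = [p for p in nlp.Defaults.infixes if '\\(' not in p]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

    print([t.text for t in nlp('der/die Mitarbeiter(in)')])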

3

Solved

This is probably really simple once I see an example, but how do I generalize boost::tokenizer or boost::split to deal with separators consisting of more than one character? For example, with "__"...
Upstretched asked 3/4, 2013 at 13:58
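
The question is about C++ (boost), but the desired behavior, splitting on a multi-character separator, looks like this as a Python analog:

    import re

    text = 'one__two__three'
    print(text.split('__'))  # ['one', 'two', 'three']

    # re.split generalizes to several multi-character separators at once:
    print(re.split(r'__|\*\*', 'one__two**three'))  # ['one', 'two', 'three']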

1

Solved

I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do ...
Clemmer asked 16/7, 2019 at 13:7
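
A minimal sketch of pulling context-sensitive vectors out of BERT with the transformers library:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    inputs = tokenizer('The bank raised the river bank.', return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dim vector per WordPiece; the two 'bank' tokens come out with
    # different vectors because each attends to its own context.
    embeddings = outputs.last_hidden_state  # shape (1, seq_len, 768)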

1

Solved

Let's suppose that I have this in Python: orig_string = 'I am a string in python' and suppose that I want to split this string every 10 characters, but without splitting a word; then I...
Treatment asked 18/6, 2019 at 16:59
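
The standard library already does word-preserving wrapping; a sketch with textwrap:

    import textwrap

    orig_string = 'I am a string in python'
    # wrap() breaks at whitespace, never inside a short word
    print(textwrap.wrap(orig_string, width=10))
    # ['I am a', 'string in', 'python']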

1

Solved

Does the Keras tokenizer provide functions such as stemming and lemmatization? If it does, then how is it done? I need an intuitive understanding. Also, what does texts_to_sequences do in that?
Flotage asked 12/6, 2019 at 7:33
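
The short answer is no; a sketch showing what the Keras Tokenizer actually does, with stemming done beforehand via NLTK:

    from nltk.stem import PorterStemmer
    from tensorflow.keras.preprocessing.text import Tokenizer

    docs = ['running runs ran', 'the runner was running']

    # Keras' Tokenizer does no stemming or lemmatization: it lowercases,
    # strips filter characters, splits on whitespace, and maps each word
    # to an integer index. Stemming must happen beforehand:
    stemmer = PorterStemmer()
    stemmed = [' '.join(stemmer.stem(w) for w in d.split()) for d in docs]

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(stemmed)
    # texts_to_sequences just replaces each word with its learned index:
    print(tokenizer.texts_to_sequences(stemmed))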

3

Solved

I am currently using uni-grams in my word2vec model as follows. def review_to_sentences( review, tokenizer, remove_stopwords=False ): #Returns a list of sentences, where each sentence is a list o...
Putput asked 9/9, 2017 at 9:49
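
Moving from uni-grams to bigrams is usually done with gensim's Phrases before training; a sketch on a toy corpus (the Phraser alias is kept in recent gensim versions):

    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser

    sentences = [['machine', 'learning', 'is', 'fun'],
                 ['machine', 'learning', 'models', 'learn']]  # toy corpus

    # Phrases merges statistically frequent pairs into single tokens such as
    # 'machine_learning', turning uni-gram input into bigram input.
    bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
    model = Word2Vec([bigram[s] for s in sentences], min_count=1)
    print(bigram[sentences[0]])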

2

Solved

I have a problem in text matching when tokenizing text splits specific words, dates and numbers. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from spl...
Vidette asked 10/4, 2019 at 18:39
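
NLTK's MWETokenizer can protect known multi-word expressions after ordinary tokenization; a sketch using the phrases from the excerpt:

    from nltk.tokenize import MWETokenizer, word_tokenize

    # Re-join listed multi-word expressions after word_tokenize splits them.
    mwe = MWETokenizer([('run', 'in', 'my', 'family'),
                        ('30', 'minute', 'walk'),
                        ('4x', 'a', 'day')], separator=' ')

    text = 'Heart problems run in my family, so I take a 30 minute walk 4x a day.'
    print(mwe.tokenize(word_tokenize(text)))
    # [..., 'run in my family', ',', ..., '30 minute walk', '4x a day', '.']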

6

Solved

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene? More detailed version: I want to index a number of tweets in Lucene and keep the terms like @user ...
Anemometer asked 31/3, 2010 at 17:26
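
A Lucene answer would be Java; for consistency with the other sketches here, the desired behavior, keeping terms like @user and #hashtag intact, is what NLTK's TweetTokenizer does in Python:

    from nltk.tokenize import TweetTokenizer

    tknzr = TweetTokenizer()
    print(tknzr.tokenize('@user check out #lucene for search :)'))
    # ['@user', 'check', 'out', '#lucene', 'for', 'search', ':)']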
