tokenize Questions

4

Solved

Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String? Something like: String to_be_parsed = "car window seven"; Analyzer analyzer = new StandardAnalyzer(...
Semela asked 13/6, 2011 at 18:38

4

Solved

NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on sentences. It does a pretty good job out of the box. >>>...
Pavlodar asked 23/2, 2015 at 16:22
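
A minimal sketch of the chained behavior described above (assuming NLTK's punkt models are available; the downloadable resource name varies across NLTK versions):

    import nltk
    nltk.download('punkt', quiet=True)  # sentence-tokenizer models

    text = "Good muffins cost $3.88 in New York. Please buy me two."
    # word_tokenize first runs the sentence tokenizer, then a Treebank-style
    # word tokenizer on each sentence.
    print(nltk.word_tokenize(text))
    # ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', ...]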

3

Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings the way the shell would process them. For example, if I'd send: one two "three four" and perform a split, I...
Lacustrine asked 4/7, 2009 at 20:52
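
The question asks for Java, but for reference, the Python shlex behavior being described looks like this:

    import shlex

    line = 'one two "three four"'
    # shlex.split honors shell-style quoting when splitting
    print(shlex.split(line))  # ['one', 'two', 'three four']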

0

When I use (with a long test_text and short question): from transformers import BertTokenizer import torch from transformers import BertForQuestionAnswering tokenizer = BertTokenizer.from_pretrain...
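
The excerpt is cut off, but a common failure with a long test_text is exceeding BERT's 512-token limit. A hedged sketch with explicit truncation; the checkpoint name is an assumption, since the one in the question is truncated above:

    import torch
    from transformers import BertTokenizer, BertForQuestionAnswering

    # Checkpoint name is an assumption; the question's actual model is cut off.
    name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
    tokenizer = BertTokenizer.from_pretrained(name)
    model = BertForQuestionAnswering.from_pretrained(name)

    question = 'What is tokenization?'
    test_text = 'Tokenization splits text into units called tokens. ...'  # long passage

    # BERT accepts at most 512 wordpieces, so a long passage must be
    # truncated (or processed in overlapping windows).
    inputs = tokenizer(question, test_text, truncation=True,
                       max_length=512, return_tensors='pt')
    with torch.no_grad():
        out = model(**inputs)
    start = int(torch.argmax(out.start_logits))
    end = int(torch.argmax(out.end_logits)) + 1
    print(tokenizer.decode(inputs['input_ids'][0][start:end]))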

3

Solved

I am tokenizing a text using nltk.word_tokenize and I would like to also get the index, in the original raw text, of the first character of every token, i.e. import nltk x = 'hello world' tokens = n...
Gismo asked 28/7, 2015 at 6:5
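
One simple approach is to scan forward through the raw string with str.find; a sketch:

    import nltk

    x = 'hello world'
    tokens = nltk.word_tokenize(x)

    # Caveat: word_tokenize rewrites a few tokens (e.g. double quotes become
    # `` and ''), which this simple scan will not find in the raw text.
    offsets, pos = [], 0
    for tok in tokens:
        pos = x.find(tok, pos)
        offsets.append((tok, pos))
        pos += len(tok)
    print(offsets)  # [('hello', 0), ('world', 6)]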

6

Solved

I am working on creating text classification code but I am having problems encoding documents using the tokenizer. 1) I started by fitting a tokenizer on my document, as in here: vocabulary_size...
Bazaar asked 5/8, 2018 at 23:28
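
A minimal sketch of the Keras fit-then-encode flow referenced above (the vocabulary_size value is illustrative):

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    docs = ['the cat sat on the mat', 'the dog ran away']
    vocabulary_size = 5000  # name borrowed from the excerpt

    tokenizer = Tokenizer(num_words=vocabulary_size)
    tokenizer.fit_on_texts(docs)                    # build the word -> index map
    sequences = tokenizer.texts_to_sequences(docs)  # docs as lists of indices
    padded = pad_sequences(sequences, maxlen=10)    # uniform length for the model
    print(padded.shape)  # (2, 10)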

6

Solved

I am using a tab (\t) as delimiter and I know there are some empty fields in my data, e.g.: one->two->->three Where -> equals the tab. As you can see, an empty field is still correctly su...
Bobbiebobbin asked 10/7, 2012 at 8:22
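
The question's language is not visible in the excerpt; as a Python illustration of the desired behavior, str.split preserves empty fields between adjacent delimiters, unlike strtok-style tokenizers that collapse them:

    line = 'one\ttwo\t\tthree'
    print(line.split('\t'))  # ['one', 'two', '', 'three'] (the empty field survives)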

1

I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library. I am using data from this Kaggle competition. Given a q...
Creation asked 20/4, 2020 at 12:12
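
A hedged sketch of encode_plus on a question-answer pair; errors with long pairs usually come from exceeding max_length without an explicit truncation strategy (the strings here are placeholders, not the Kaggle data):

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    question, answer = 'How do I tokenize?', 'Use a pretrained tokenizer.'

    # Pairs are encoded as [CLS] question [SEP] answer [SEP]; a strategy
    # such as 'longest_first' trims the longer member of the pair first.
    enc = tokenizer.encode_plus(question, answer,
                                max_length=512,
                                truncation='longest_first',
                                padding='max_length',
                                return_tensors='pt')
    print(enc['input_ids'].shape)  # torch.Size([1, 512])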

9

Solved

I made a program which converts infix to postfix in Python. The problem is when I introduce the arguments. If I introduce something like this (this will be a string): ( ( 73 + ( ( 34 - 72 ) / ( 33 ...
Apophasis asked 13/4, 2017 at 10:22
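
A compact shunting-yard sketch for space-separated input like the excerpt's; the test expression is completed arbitrarily, since the original is truncated:

    def infix_to_postfix(expr):
        """Shunting-yard over a space-separated infix expression."""
        prec = {'+': 1, '-': 1, '*': 2, '/': 2}
        out, ops = [], []
        for tok in expr.split():          # space-separated input tokenizes trivially
            if tok == '(':
                ops.append(tok)
            elif tok == ')':
                while ops[-1] != '(':
                    out.append(ops.pop())
                ops.pop()                 # discard the '('
            elif tok in prec:
                while ops and ops[-1] != '(' and prec[ops[-1]] >= prec[tok]:
                    out.append(ops.pop())
                ops.append(tok)
            else:                         # operand
                out.append(tok)
        while ops:
            out.append(ops.pop())
        return ' '.join(out)

    print(infix_to_postfix('( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) )'))
    # 73 34 72 - 33 3 - / +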

2

Solved

I'm trying to implement autocomplete using Elasticsearch, thinking that I understand how to do it... I'm trying to build multi-word (phrase) suggestions by using ES's edge_n_grams while indexing cr...
Eleanor asked 11/5, 2016 at 16:47
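
A sketch of the usual index-time edge_ngram setup, written as a Python dict (Elasticsearch takes the same structure as JSON); the analyzer names are illustrative:

    autocomplete_settings = {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "autocomplete": {
                        "type": "edge_ngram",   # index-time prefix tokens
                        "min_gram": 1,
                        "max_gram": 20,
                        "token_chars": ["letter", "digit"],
                    }
                },
                "analyzer": {
                    "autocomplete": {
                        "type": "custom",
                        "tokenizer": "autocomplete",
                        "filter": ["lowercase"],
                    },
                    # the query side should NOT n-gram the user's input
                    "autocomplete_search": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    },
                },
            }
        }
    }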

3

Solved

I'm pretty sure this is a simple question to answer and I've seen it asked before, just with no solid answers. I have several properties files that are used for different environments, i.e. xxxx-dev, xxxx...
Sloop asked 22/12, 2010 at 10:16

4

Solved

I want to split each line of a pipe on spaces, and then print each token on its own line. I realise that I can get this result using: (cat someFileInsteadOfAPipe).split(" ") But I want more fle...
Benitabenites asked 5/7, 2012 at 16:20
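
The question is about PowerShell; as a rough Python analog of the desired pipeline behavior (assuming the pipe arrives on stdin):

    import sys

    # Split every incoming line on whitespace and print one token per line.
    for line in sys.stdin:
        for token in line.split():
            print(token)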

2

When tokenizing texts that contain both Chinese and English, the result will split English words into letters, which is not what I want. Consider the following code: from nltk.tokenize.stanford_se...
Knit asked 29/8, 2017 at 13:59

3

Solved

I have always used the spacy library with English or German. To load the library I used this code: import spacy nlp = spacy.load('en') I would like to use the Spanish tokeniser, but I do not know how t...
Grogram asked 22/3, 2017 at 9:40
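
A sketch of loading the Spanish pipeline; the exact loading convention depends on the spaCy version:

    import spacy

    # Current spaCy releases ship named model packages; install first with:
    #   python -m spacy download es_core_news_sm
    # (older v1/v2 releases accepted spacy.load('es') after a download)
    nlp = spacy.load('es_core_news_sm')
    doc = nlp('Esto es una frase en español.')
    print([t.text for t in doc])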

4

Solved

I’m having difficulty eliminating and tokenizing a .text file using nltk. I keep getting the following AttributeError: 'list' object has no attribute 'lower'. I just can’t figure out what I’m doin...
Bulgar asked 30/6, 2013 at 12:24
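
This AttributeError usually means .lower() was called on the token list rather than on the string; a sketch of the fix (the file name is a placeholder):

    import nltk

    raw = open('corpus.txt').read()  # 'corpus.txt' is a placeholder name

    # Wrong: word_tokenize returns a list, and lists have no .lower()
    # tokens = nltk.word_tokenize(raw).lower()       # AttributeError
    tokens = nltk.word_tokenize(raw.lower())         # lowercase the string first
    # or: tokens = [t.lower() for t in nltk.word_tokenize(raw)]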

5

Solved

In a sentence containing hashtags, such as a tweet, spacy's tokenizer splits hashtags into two tokens: import spacy nlp = spacy.load('en') doc = nlp(u'This is a #sentence.') [t for t in doc] out...
Costermansville asked 13/4, 2017 at 9:28
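
One way to keep hashtags whole is to merge the '#' token back into its word with the retokenizer (spaCy v2+ API); a sketch:

    import re
    import spacy

    nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut in older versions
    doc = nlp('This is a #sentence.')

    # Merge each '#word' span back into a single token.
    with doc.retokenize() as retokenizer:
        for match in re.finditer(r'#\w+', doc.text):
            span = doc.char_span(match.start(), match.end())
            if span is not None:
                retokenizer.merge(span)

    print([t.text for t in doc])  # ['This', 'is', 'a', '#sentence', '.']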

3

Solved

I am using this excellent article to learn machine learning. https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/ The author has tokenized the X and y data after spli...
Patton asked 28/8, 2019 at 13:15
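
The usual concern with tokenizing after the split is vocabulary leakage; a sketch of fitting the tokenizer on the training split only (the toy corpus and num_words are placeholders):

    from sklearn.model_selection import train_test_split
    from tensorflow.keras.preprocessing.text import Tokenizer

    X = ['first document ...', 'second document ...', 'third one', 'fourth one']
    y = [0, 1, 0, 1]  # toy labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

    # Fitting only on the training split keeps test vocabulary out of training;
    # words unseen at fit time are simply dropped at encode time.
    tokenizer = Tokenizer(num_words=5000)
    tokenizer.fit_on_texts(X_train)
    X_train_seq = tokenizer.texts_to_sequences(X_train)
    X_test_seq = tokenizer.texts_to_sequences(X_test)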

1

Solved

The (German) spacy tokenizer does not split on slashes, underscores, or asterisks by default, which is just what I need (so "der/die" results in a single token). However, it does split on parenthes...
Basra asked 31/7, 2019 at 17:16
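
The general mechanism is to rebuild the tokenizer's infix rules without the offending pattern; which default pattern matches parentheses depends on the spaCy version, so the filter below is an assumption, not the canonical rule:

    import spacy
    from spacy.util import compile_infix_regex

    nlp = spacy.load('de_core_news_sm')

    # Drop default infix patterns that mention a literal '(' so parentheses
    # inside words stop triggering a split (version-dependent assumption).
    infixes = [p for p in nlp.Defaults.infixes if '\\(' not in p]
    nlp.tokenizer.infix_finditer = compile_infix_regex(infixes).finditer

    print([t.text for t in nlp('der/die Mitarbeiter(in)')])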

3

Solved

This is probably really simple once I see an example, but how do I generalize boost::tokenizer or boost::split to deal with separators consisting of more than one character? For example, with "__"...
Upstretched asked 3/4, 2013 at 13:58
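
The question is about C++ (boost), but the desired behavior, splitting on a multi-character separator, looks like this as a Python analog:

    import re

    text = 'one__two__three'
    print(text.split('__'))  # ['one', 'two', 'three']

    # re.split generalizes to several multi-character separators at once:
    print(re.split(r'__|\*\*', 'one__two**three'))  # ['one', 'two', 'three']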

1

Solved

I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do ...
Clemmer asked 16/7, 2019 at 13:7
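
A minimal sketch of pulling context-sensitive vectors out of BERT with the transformers library:

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')

    inputs = tokenizer('The bank raised the river bank.', return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dim vector per WordPiece; the two 'bank' tokens come out with
    # different vectors because each attends to its own context.
    embeddings = outputs.last_hidden_state  # shape (1, seq_len, 768)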

1

Solved

Let's suppose that I have this in Python: orig_string = 'I am a string in python' and suppose that I want to split this string every 10 characters, but without splitting a word; then I...
Treatment asked 18/6, 2019 at 16:59
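
The standard library already does word-preserving wrapping; a sketch with textwrap:

    import textwrap

    orig_string = 'I am a string in python'
    # wrap() breaks at whitespace, never inside a short word
    print(textwrap.wrap(orig_string, width=10))
    # ['I am a', 'string in', 'python']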

1

Solved

Does the Keras tokenizer provide functions such as stemming and lemmatization? If it does, then how is it done? I need an intuitive understanding. Also, what does texts_to_sequences do in that?
Flotage asked 12/6, 2019 at 7:33
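
The short answer is no; a sketch showing what the Keras Tokenizer actually does, with stemming done beforehand via NLTK:

    from nltk.stem import PorterStemmer
    from tensorflow.keras.preprocessing.text import Tokenizer

    docs = ['running runs ran', 'the runner was running']

    # Keras' Tokenizer does no stemming or lemmatization: it lowercases,
    # strips filter characters, splits on whitespace, and maps each word
    # to an integer index. Stemming must happen beforehand:
    stemmer = PorterStemmer()
    stemmed = [' '.join(stemmer.stem(w) for w in d.split()) for d in docs]

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(stemmed)
    # texts_to_sequences just replaces each word with its learned index:
    print(tokenizer.texts_to_sequences(stemmed))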

3

Solved

I am currently using uni-grams in my word2vec model as follows. def review_to_sentences( review, tokenizer, remove_stopwords=False ): #Returns a list of sentences, where each sentence is a list o...
Putput asked 9/9, 2017 at 9:49
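
Moving from uni-grams to bigrams is usually done with gensim's Phrases before training; a sketch on a toy corpus (the Phraser alias is kept in recent gensim versions):

    from gensim.models import Word2Vec
    from gensim.models.phrases import Phrases, Phraser

    sentences = [['machine', 'learning', 'is', 'fun'],
                 ['machine', 'learning', 'models', 'learn']]  # toy corpus

    # Phrases merges statistically frequent pairs into single tokens such as
    # 'machine_learning', turning uni-gram input into bigram input.
    bigram = Phraser(Phrases(sentences, min_count=1, threshold=0.1))
    model = Word2Vec([bigram[s] for s in sentences], min_count=1)
    print(bigram[sentences[0]])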

2

Solved

I have a problem in text matching when tokenizing text splits specific words, dates and numbers. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from spl...
Vidette asked 10/4, 2019 at 18:39
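
NLTK's MWETokenizer can protect known multi-word expressions after ordinary tokenization; a sketch using the phrases from the excerpt:

    from nltk.tokenize import MWETokenizer, word_tokenize

    # Re-join listed multi-word expressions after word_tokenize splits them.
    mwe = MWETokenizer([('run', 'in', 'my', 'family'),
                        ('30', 'minute', 'walk'),
                        ('4x', 'a', 'day')], separator=' ')

    text = 'Heart problems run in my family, so I take a 30 minute walk 4x a day.'
    print(mwe.tokenize(word_tokenize(text)))
    # [..., 'run in my family', ',', ..., '30 minute walk', '4x a day', '.']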

6

Solved

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene? More detailed version: I want to index a number of tweets in Lucene and keep the terms like @user ...
Anemometer asked 31/3, 2010 at 17:26
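
A Lucene answer would be Java; for consistency with the other sketches here, the desired behavior, keeping terms like @user and #hashtag intact, is what NLTK's TweetTokenizer does in Python:

    from nltk.tokenize import TweetTokenizer

    tknzr = TweetTokenizer()
    print(tknzr.tokenize('@user check out #lucene for search :)'))
    # ['@user', 'check', 'out', '#lucene', 'for', 'search', ':)']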
