tokenize Questions
4
Solved
Is there a simple way I could use any subclass of Lucene's Analyzer to parse/tokenize a String?
Something like:
String to_be_parsed = "car window seven";
Analyzer analyzer = new StandardAnalyzer(...
4
Solved
NLTK's default tokenizer, nltk.word_tokenize, chains two tokenizers: a sentence tokenizer and then a word tokenizer that operates on each sentence. It does a pretty good job out of the box.
>>>...
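A minimal sketch of that out-of-the-box behaviour, assuming NLTK and its Punkt sentence-tokenizer data are installed:
# a minimal sketch, assuming nltk and its 'punkt' tokenizer data are available
import nltk
nltk.download('punkt')                     # one-time download of the Punkt models
from nltk.tokenize import word_tokenize
print(word_tokenize("Good muffins cost $3.88 in New York."))
# expected: ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']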
3
Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings the way the shell would process them. For example, if I send: one two "three four" and perform a split, I...
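For reference, the splitting behaviour being asked for, shown with Python's shlex:
# Python's shlex shows the desired shell-style splitting behaviour
import shlex
print(shlex.split('one two "three four"'))   # ['one', 'two', 'three four']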
0
When I use (with a long test_text and short question):
from transformers import BertTokenizer
import torch
from transformers import BertForQuestionAnswering
tokenizer = BertTokenizer.from_pretrain...
Pecker asked 21/6, 2020 at 18:20
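The excerpt cuts off before the error, but the usual shape of such a call is to encode the question and the long test_text as a pair with truncation; a hedged sketch (the question and context strings below are made up, not the asker's data):
# a sketch, not the asker's code: encode a (question, context) pair with truncation
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
question = "What colour is the car?"            # hypothetical short question
test_text = "The car parked by the window ..."  # hypothetical long context
inputs = tokenizer(question, test_text,
                   max_length=512, truncation=True, return_tensors='pt')
print(inputs['input_ids'].shape)                # at most (1, 512) after truncation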
3
Solved
I am tokenizing a text using nltk.word_tokenize and I would also like to get the index in the original raw text of the first character of every token, i.e.
import nltk
x = 'hello world'
tokens = n...
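One hedged approach is to scan forward in the raw text for each token with str.find; this assumes the tokens appear verbatim in the input (word_tokenize rewrites quote characters, so those would need special handling):
# a sketch: recover each token's start index by scanning forward in the raw text
import nltk
x = 'hello world'
tokens = nltk.word_tokenize(x)
offsets, pos = [], 0
for tok in tokens:
    start = x.find(tok, pos)     # first occurrence at or after the previous match
    offsets.append((tok, start))
    pos = start + len(tok)
print(offsets)                   # [('hello', 0), ('world', 6)]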
6
Solved
I am working on a text classification script, but I am having problems encoding documents using the tokenizer.
1) I started by fitting a tokenizer on my document as shown here:
vocabulary_size...
Bazaar asked 5/8, 2018 at 23:28
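For orientation, a hedged sketch of the usual fit-then-encode flow with the Keras Tokenizer; the documents and vocabulary_size below are made up:
# a sketch of the usual fit/encode flow, with made-up documents and vocabulary_size
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocabulary_size = 1000
docs = ['the cat sat on the mat', 'the dog sat on the log']
tokenizer = Tokenizer(num_words=vocabulary_size)
tokenizer.fit_on_texts(docs)                       # build the word index
sequences = tokenizer.texts_to_sequences(docs)     # words -> integer ids
padded = pad_sequences(sequences, maxlen=10)       # equal-length input matrix
print(tokenizer.word_index, sequences, padded.shape, sep='\n')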
6
Solved
I am using a tab (\t) as delimiter and I know there are some empty fields in my data, e.g.:
one->two->->three
Where -> equals the tab. As you can see an empty field is still correctly su...
1
I got a strange error when trying to encode question-answer pairs for BERT using the encode_plus method provided in the Transformers library.
I am using data from this Kaggle competition. Given a q...
Creation asked 20/4, 2020 at 12:12
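The excerpt cuts off before the error message; for orientation, a hedged sketch of an encode_plus call on a question/answer pair with explicit truncation and padding (placeholder strings, not the Kaggle data):
# a sketch with placeholder strings; the Kaggle data itself is not shown here
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
question = "Why is the sky blue?"                # hypothetical
answer = "Because of Rayleigh scattering."       # hypothetical
enc = tokenizer.encode_plus(question, answer,
                            max_length=256, truncation=True,
                            padding='max_length', return_tensors='pt')
print(enc['input_ids'].shape, enc['token_type_ids'].shape)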
9
Solved
I made a program which converts infix to postfix in Python. The problem is when I enter the arguments.
If I enter something like this (it will be a string):
( ( 73 + ( ( 34 - 72 ) / ( 33 ...
Apophasis asked 13/4, 2017 at 10:22
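For orientation, a compact shunting-yard sketch that handles a space-separated string like the one shown above; it assumes only binary + - * / and parentheses, not the asker's exact program:
# a shunting-yard sketch for space-separated infix expressions (assumed operators: + - * /)
def infix_to_postfix(expr):
    prec = {'+': 1, '-': 1, '*': 2, '/': 2}
    out, stack = [], []
    for tok in expr.split():
        if tok == '(':
            stack.append(tok)
        elif tok == ')':
            while stack and stack[-1] != '(':
                out.append(stack.pop())
            stack.pop()                      # discard the matching '('
        elif tok in prec:
            while stack and stack[-1] != '(' and prec[stack[-1]] >= prec[tok]:
                out.append(stack.pop())
            stack.append(tok)
        else:                                # operand
            out.append(tok)
    while stack:
        out.append(stack.pop())
    return ' '.join(out)

print(infix_to_postfix('( ( 73 + ( ( 34 - 72 ) / ( 33 - 3 ) ) ) )'))
# expected: 73 34 72 - 33 3 - / +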
2
Solved
I'm trying to implement autocomplete using Elasticsearch thinking that I understand how to do it...
I'm trying to build multi-word (phrase) suggestions by using ES's edge_n_grams while indexing cr...
Eleanor asked 11/5, 2016 at 16:47
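For orientation, a hedged sketch of index settings for edge-n-gram autocomplete, written here as a Python dict; the field names and gram sizes are assumptions, not the asker's mapping:
# a sketch of edge-n-gram autocomplete settings as a Python dict (assumed field names)
autocomplete_settings = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {"type": "edge_ngram", "min_gram": 1, "max_gram": 20}
            },
            "analyzer": {
                "autocomplete": {                      # used at index time only
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "autocomplete",
                      "search_analyzer": "standard"}
        }
    }
}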
3
Solved
I'm pretty sure this is a simple question to answer, and I've seen it asked before, just with no solid answers.
I have several properties files that are used for different environments, i.e. xxxx-dev, xxxx...
4
Solved
I want to split each line of a pipe on spaces, and then print each token on its own line.
I realise that I can get this result using:
(cat someFileInsteadOfAPipe).split(" ")
But I want more fle...
Benitabenites asked 5/7, 2012 at 16:20
2
When I tokenize text that contains both Chinese and English, the tokenizer splits the English words into individual letters, which is not what I want. Consider the following code:
from nltk.tokenize.stanford_se...
Knit asked 29/8, 2017 at 13:59
3
Solved
I have always used the spaCy library with English or German.
To load the library I used this code:
import spacy
nlp = spacy.load('en')
I would like to use the Spanish tokeniser, but I do not know how t...
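A minimal sketch, assuming the Spanish model package es_core_news_sm has been installed:
# a sketch, assuming the Spanish model has been installed first:
#   python -m spacy download es_core_news_sm
import spacy
nlp = spacy.load('es_core_news_sm')          # Spanish pipeline (spaCy v2+ naming)
doc = nlp('Hola, ¿cómo estás?')
print([t.text for t in doc])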
4
Solved
I’m having difficulty eliminating and tokenizing a .text file using nltk. I keep getting the following AttributeError: 'list' object has no attribute 'lower'.
I just can’t figure out what I’m doin...
Bulgar asked 30/6, 2013 at 12:24
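That AttributeError usually means .lower() was called on the token list rather than on each string; a hedged sketch of the typical cause and fix (the file name is hypothetical):
# a sketch of the usual cause and fix: .lower() belongs on each string, not on the list
import nltk
raw = open('example.text').read()            # hypothetical file name
tokens = nltk.word_tokenize(raw)
# tokens.lower()                             # AttributeError: 'list' object has no attribute 'lower'
tokens = [t.lower() for t in tokens]         # lowercase token by token instead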
5
Solved
In a sentence containing hashtags, such as a tweet, spacy's tokenizer splits hashtags into two tokens:
import spacy
nlp = spacy.load('en')
doc = nlp(u'This is a #sentence.')
[t for t in doc]
out...
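One common workaround is to merge a '#' token with the word that follows it using the retokenizer; a hedged sketch (the model name is an assumption):
# a sketch: merge '#' + following token with the retokenizer (spaCy v2+ API)
import spacy
nlp = spacy.load('en_core_web_sm')           # assumed model name
doc = nlp('This is a #sentence.')
with doc.retokenize() as retokenizer:
    for i, tok in enumerate(doc[:-1]):
        if tok.text == '#':
            retokenizer.merge(doc[i:i + 2])  # '#' + 'sentence' -> '#sentence'
print([t.text for t in doc])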
3
Solved
I am using this excellent article to learn machine learning.
https://stackabuse.com/python-for-nlp-multi-label-text-classification-with-keras/
The author has tokenized the X and y data after spli...
Patton asked 28/8, 2019 at 13:15
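For orientation, a hedged sketch of the usual pattern in that setting: fit the tokenizer on the training split only and reuse it for the test split (the data below is made up):
# a sketch of fitting the tokenizer on the training split only (made-up data)
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

X = ['first toxic comment', 'a perfectly fine comment', 'another rude comment', 'ok text']
y = [[1, 0], [0, 0], [1, 1], [0, 0]]                  # hypothetical multi-label targets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(X_train)                       # fit on training text only
X_train = pad_sequences(tokenizer.texts_to_sequences(X_train), maxlen=20)
X_test = pad_sequences(tokenizer.texts_to_sequences(X_test), maxlen=20)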
1
Solved
The (German) spacy tokenizer does not split on slashes, underscores, or asterisks by default, which is just what I need (so "der/die" results in a single token).
However it does split on parenthes...
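For orientation, a hedged sketch of the general mechanism only: rebuild the tokenizer's infix_finditer from an edited copy of the default rules. Which default rule actually causes the split on parentheses is not identified here, and the model name is an assumption:
# a sketch of the general knob: rebuild infix_finditer from an edited rule list
import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load('de_core_news_sm')                   # assumed German model name
custom_infixes = [p for p in nlp.Defaults.infixes]    # start from the defaults, then edit
nlp.tokenizer.infix_finditer = compile_infix_regex(custom_infixes).finditer
print([t.text for t in nlp('der/die Regelung (neu)')])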
3
Solved
This is probably really simple once I see an example, but how do I generalize boost::tokenizer or boost::split to deal with separators consisting of more than one character?
For example, with "__"...
1
Solved
I'm looking at NLP preprocessing. At some point I want to implement a context-sensitive word embedding, as a way of discerning word sense, and I was thinking about using the output from BERT to do ...
Clemmer asked 16/7, 2019 at 13:7
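A minimal sketch of pulling per-token contextual vectors out of a BERT encoder, assuming the bert-base-uncased checkpoint:
# a sketch: per-token contextual vectors from BERT's last hidden layer
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("The bank raised interest rates.", return_tensors='pt')
with torch.no_grad():
    outputs = model(**inputs)
token_vectors = outputs.last_hidden_state[0]     # shape: (num_wordpieces, 768)
print(token_vectors.shape)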
1
Solved
Let's suppose that I have this in python:
orig_string = 'I am a string in python'
and suppose that I want to split this string every 10 characters, but without splitting a word; then I...
Treatment asked 18/6, 2019 at 16:59
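The standard library's textwrap already does word-preserving splits of this kind; a minimal sketch:
# a minimal sketch using the standard library's textwrap
import textwrap
orig_string = 'I am a string in python'
print(textwrap.wrap(orig_string, width=10))
# expected: ['I am a', 'string in', 'python']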
1
Solved
Does the Keras tokenizer provide functions such as stemming and lemmatization? If it does, how is it done? I need an intuitive understanding. Also, what does texts_to_sequences do in that?
Flotage asked 12/6, 2019 at 7:33
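For reference, the Keras Tokenizer only lowercases, strips its filter characters, and maps words to integer ids; it performs no stemming or lemmatization. A minimal sketch of texts_to_sequences:
# a minimal sketch of what texts_to_sequences does: word -> integer id lookup
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(['running runs run', 'run faster'])
print(tokenizer.word_index)                       # e.g. {'run': 1, 'running': 2, ...}
print(tokenizer.texts_to_sequences(['running runs run']))
# 'running', 'runs' and 'run' keep separate ids: no stemming or lemmatization happens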
3
Solved
I am currently using uni-grams in my word2vec model as follows.
def review_to_sentences( review, tokenizer, remove_stopwords=False ):
#Returns a list of sentences, where each sentence is a list o...
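One common way to move beyond uni-grams is gensim's Phrases/Phraser, which merges frequent word pairs into single tokens before training; a hedged sketch with made-up sentences:
# a sketch of detecting bi-grams with gensim before training word2vec (made-up data)
from gensim.models.phrases import Phrases, Phraser

sentences = [['new', 'york', 'is', 'big'],
             ['i', 'love', 'new', 'york'],
             ['new', 'york', 'never', 'sleeps']]
bigram = Phraser(Phrases(sentences, min_count=1, threshold=1))
bigram_sentences = [bigram[s] for s in sentences]  # 'new', 'york' -> 'new_york'
print(bigram_sentences)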
2
Solved
I have a problem with text matching when I tokenize text in a way that splits specific words, dates and numbers. How can I prevent phrases like "run in my family", "30 minute walk" or "4x a day" from spl...
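One hedged option is NLTK's MWETokenizer, which re-joins listed multi-word expressions after ordinary tokenization:
# a sketch using NLTK's multi-word-expression tokenizer to keep listed phrases together
from nltk.tokenize import MWETokenizer, word_tokenize

mwe = MWETokenizer([('run', 'in', 'my', 'family'), ('30', 'minute', 'walk')],
                   separator=' ')
text = "Heart disease seems to run in my family, so I take a 30 minute walk daily."
print(mwe.tokenize(word_tokenize(text)))
# the listed phrases come back as single tokens, e.g. 'run in my family'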
6
Solved
My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?
More detailed version:
I want to index a number of tweets in Lucene and keep the terms like @user ...