tokenize Questions
4
Solved
I am trying to tokenize strings into n-grams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual n-grams that were tokenized. In fact I only see ...
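The tokenizer library in question isn't shown in the excerpt, but the underlying operation is a sliding window over characters or words. A library-free Python sketch (the function names are mine):

```python
# A minimal sketch of what an n-gram tokenizer emits, independent of
# any particular NGramTokenizer API.
def char_ngrams(text, n):
    # Slide a window of width n over the characters.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_ngrams(text, n):
    # Align n shifted copies of the token list and zip them together.
    tokens = text.split()
    return list(zip(*[tokens[i:] for i in range(n)]))

print(char_ngrams("token", 3))         # ['tok', 'oke', 'ken']
print(word_ngrams("to be or not", 2))  # [('to', 'be'), ('be', 'or'), ('or', 'not')]
```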
2
Solved
tokenize() in nltk.TweetTokenizer is returning 32-bit integers broken up into their individual digits. It only happens to certain numbers, and I can't see any reason why.
>>> from nltk.t...
3
I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese.
Can you please let me know of any good sen...
Parulis asked 12/12, 2014 at 10:4
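Independent of Stanford's tools, Chinese sentences can often be split on sentence-final punctuation. A minimal regex sketch (the punctuation set here is illustrative, not exhaustive):

```python
import re

# Split after Chinese sentence-final punctuation; the zero-width
# lookbehind keeps each delimiter attached to its own sentence.
def split_sentences(text):
    return [s for s in re.split(r"(?<=[。！？])", text) if s]

print(split_sentences("今天天气很好。你吃饭了吗？走吧！"))
# ['今天天气很好。', '你吃饭了吗？', '走吧！']
```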
2
I have a value like this:
Suppose I have a string:
s = "server ('m1.labs.teradata.com') username ('u\'se)r_*5') password('uer 5') dbname ('default')";
I need to extract
token1 : 'm1.labs.ter...
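The excerpt is truncated, so as a sketch: if each value sits between `('` and `')`, a lazy regex that stops at the next `')` recovers them, even with an embedded quote or parenthesis inside a value (the string below mirrors the question with the escape removed):

```python
import re

# Each value is delimited by (' ... '); the lazy .*? extends only as far
# as the next ') pair, so a lone ' or ) inside a value is tolerated.
s = "server ('m1.labs.teradata.com') username ('u'se)r_*5') password('uer 5') dbname ('default')"
values = re.findall(r"\('(.*?)'\)", s)
print(values)  # ['m1.labs.teradata.com', "u'se)r_*5", 'uer 5', 'default']
```

This breaks down if a value contains the two-character sequence `')` itself; a stricter anchor for that case is sketched under the next question.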
0
I have the following string:
s = "server ('m1.labs.terada')ta.com') username ('user5') password('use r5') dbname ('default')";
I have defined a regex for extracting the values between the paran...
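A lazy match up to the first `')` stops too early here, because one value contains `')` itself. One way around that, sketched below, is to accept a closing `')` only when it is followed by whitespace or the end of the string:

```python
import re

# The lookahead (?=\s|$) rejects a ') that sits in the middle of a
# value, so the match extends to the ') that really closes it.
s = "server ('m1.labs.terada')ta.com') username ('user5') password('use r5') dbname ('default')"
values = re.findall(r"\('(.*?)'\)(?=\s|$)", s)
print(values)  # ["m1.labs.terada')ta.com", 'user5', 'use r5', 'default']
```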
1
Solved
I followed the tutorial here: (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html)
However, I modified the code to be able to save the generated model through h5py. Thus...
Dunkle asked 26/6, 2017 at 13:31
1
Solved
I have 2 sentences in my dataset:
w1 = I am Pusheen the cat.I am so cute. # no space after period
w2 = I am Pusheen the cat. I am so cute. # with space after period
When I use the NLTK tokenizer (bo...
Ozzy asked 1/7, 2017 at 8:4
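One common workaround, regardless of which NLTK tokenizer is used, is to repair the missing space before tokenizing. A small pre-processing sketch:

```python
import re

# Insert the missing space after a sentence-final period; the lookahead
# targets a letter, so decimals like 3.14 are left alone.
def fix_spacing(text):
    return re.sub(r"\.(?=[A-Za-z])", ". ", text)

w1 = "I am Pusheen the cat.I am so cute."
print(fix_spacing(w1))  # I am Pusheen the cat. I am so cute.
```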
6
Solved
I have these lines of text, where the number of quotes could change, like:
Here just one "comillas"
But I also could have more "mas" values in "comillas" and that "is" the "trick"
I was thinking in a meth...
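The excerpt cuts off before the asker's own approach, but extracting every double-quoted substring is a one-line regex job:

```python
import re

# Capture everything between pairs of double quotes.
line = 'But I also could have more "mas" values in "comillas" and that "is" the "trick"'
quoted = re.findall(r'"([^"]*)"', line)
print(quoted)  # ['mas', 'comillas', 'is', 'trick']
```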
1
Solved
This is the code that I am using for semantic analysis of Twitter:
import pandas as pd
import datetime
import numpy as np
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import...
2
Solved
I have the following text:
I don't like to eat Cici's food (it is true)
I need to tokenize it to
['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(', 'it', 'is', 'true', ')']
I have fo...
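Standard NLTK tokenizers split contractions, so this output needs a custom pattern. A regex approximation that keeps apostrophe contractions whole and emits parentheses as their own tokens:

```python
import re

# \w+(?:'\w+)? keeps "don't" and "Cici's" as single tokens;
# [()] makes each parenthesis its own token.
def tokenize(text):
    return re.findall(r"\w+(?:'\w+)?|[()]", text)

print(tokenize("I don't like to eat Cici's food (it is true)"))
# ['I', "don't", 'like', 'to', 'eat', "Cici's", 'food', '(', 'it', 'is', 'true', ')']
```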
1
Solved
I am trying to make 2 document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix is currently just identical to the unigram matrix, and I'm not sure why...
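The excerpt doesn't show which toolkit is in use, but a quick standard-library check of what distinct bigram counts should look like can help confirm whether the n-gram generation step is actually running:

```python
from collections import Counter

doc = "the cat sat on the mat"
tokens = doc.split()

unigrams = Counter(tokens)
# Pair each token with its successor to form bigrams.
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams["the"])            # 2
print(bigrams[("the", "cat")])    # 1
print(len(bigrams))               # 5 distinct bigrams, vs 5 unigrams here
```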
2
Solved
I've got the following code:
std::string str = "abc def,ghi";
std::stringstream ss(str);
std::string token;
while (ss >> token)
{
printf("%s\n", token.c_str());
}
The...
Undercroft asked 30/7, 2012 at 10:21
4
I'm learning how to write tokenizers and parsers, and as an exercise I'm writing a calculator in JavaScript.
I'm using a parse tree approach (I hope I got the term right) to build my calculator. I'm ...
Press asked 1/7, 2014 at 23:45
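The usual shape of such a calculator is a tokenizer followed by a recursive-descent evaluator, one function per precedence level. A sketch in Python for brevity; the structure translates directly to JavaScript:

```python
import re

# Tokenizer: numbers, operators, parentheses.
def tokenize(src):
    return re.findall(r"\d+(?:\.\d+)?|[-+*/()]", src)

# Recursive descent with the usual grammar:
#   expr   := term (('+'|'-') term)*
#   term   := factor (('*'|'/') factor)*
#   factor := NUMBER | '(' expr ')'
# Note: unary minus is not handled in this sketch.
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "(":
            take()           # consume '('
            val = expr()
            take()           # consume ')'
            return val
        return float(take())

    def term():
        val = factor()
        while peek() in ("*", "/"):
            if take() == "*":
                val *= factor()
            else:
                val /= factor()
        return val

    def expr():
        val = term()
        while peek() in ("+", "-"):
            if take() == "+":
                val += term()
            else:
                val -= term()
        return val

    return expr()

print(parse(tokenize("2*(3+4)-1")))  # 13.0
```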
4
Solved
I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a pr...
2
I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately ...
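Which XML library the asker has in mind isn't shown; as one concrete possibility, Python's expat binding exposes the parser's current position while an event handler runs:

```python
import xml.parsers.expat

# Record the parser's reported position for each start tag.
positions = []

def start(name, attrs):
    # CurrentLineNumber is 1-based; CurrentColumnNumber is 0-based.
    positions.append((name, parser.CurrentLineNumber, parser.CurrentColumnNumber))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start
parser.Parse("<root>\n  <child attr='1'/>\n</root>", True)
print(positions)  # root on line 1, child on line 2
```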
5
Solved
For argument's sake, let's assume an HTML parser.
I've read that it tokenizes everything first, and then parses it.
What does tokenize mean?
Does the parser read every character, building up a...
Lizethlizette asked 30/6, 2010 at 14:36
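A toy illustration of what "tokenize" means here: before any tree is built, the character stream is chopped into a flat list of tag and text tokens. A deliberately simplified regex sketch (real HTML tokenizers are state machines, not a single regex):

```python
import re

# Split the stream into tag tokens (<...>) and text tokens (runs
# of characters up to the next '<').
html = "<p>Hello <b>world</b></p>"
tokens = re.findall(r"<[^>]+>|[^<]+", html)
print(tokens)  # ['<p>', 'Hello ', '<b>', 'world', '</b>', '</p>']
```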
11
Solved
I'm doing some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar with pipes to split a text file into words and save them into ...
Microfiche asked 19/3, 2013 at 14:3
4
Solved
I need to split a text using the separator ". ". For example I want this string :
Washington is the U.S Capital. Barack is living there.
To be cut into two parts:
Washington is the U.S Capital....
Pardon asked 4/6, 2010 at 7:23
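For this exact separator a plain split works, because "U.S" contains a period but never the two-character sequence ". ". A sketch, with the caveat that it would break on abbreviations written as "U.S. Capital":

```python
# Split only on period-followed-by-space; the final period stays
# attached to the last sentence.
text = "Washington is the U.S Capital. Barack is living there."
parts = text.split(". ")
print(parts)  # ['Washington is the U.S Capital', 'Barack is living there.']
```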
1
Solved
How can I prevent spacy's tokenizer from splitting a specific substring when tokenizing a string?
More specifically, I have this sentence:
Once unregistered, the folder went away from the shell...
3
Solved
Hello, I've been trying to get a tokenizer to work using the Boost library's tokenizer class.
I found this tutorial in the Boost documentation:
http://www.boost.org/doc/libs/1_36_0/libs/tokenizer/escap...
11
Solved
I want to tokenize a string like this
String line = "a=b c='123 456' d=777 e='uij yyy'";
I cannot split based like this
String [] words = line.split(" ");
Any idea how I can split so that I ...
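The question is about Java, where this needs a regex or a small state machine, but the idea is shell-style splitting that honors quotes. Python's shlex shows the intended behavior in two lines:

```python
import shlex

# shlex.split keeps quoted runs together and strips the quotes.
line = "a=b c='123 456' d=777 e='uij yyy'"
print(shlex.split(line))  # ['a=b', 'c=123 456', 'd=777', 'e=uij yyy']
```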
3
Solved
I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to b...
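One way to keep particular phrases intact is to list them ahead of the generic word pattern in a regex alternation, since the regex engine prefers earlier alternatives at each position. A sketch (the phrase list is a made-up example):

```python
import re

# Phrases come first in the alternation so they win over single words.
phrases = ["New York", "machine learning"]
pattern = "|".join(map(re.escape, phrases)) + r"|\w+"

tokens = re.findall(pattern, "I study machine learning in New York")
print(tokens)  # ['I', 'study', 'machine learning', 'in', 'New York']
```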
2
Solved
I am using elasticsearch version 1.2.1.
I have a use case in which I would like to create a custom tokenizer that will break the tokens by their length up to a certain minimum length. For example, ...
Trackless asked 8/2, 2015 at 16:55
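The full use case is cut off, but in Elasticsearch 1.x the built-in edgeNGram tokenizer (spelled edge_ngram in later versions) emits prefixes between a minimum and maximum length, which is the usual way to get length-bounded tokens without writing a plugin. A sketch of the index settings, with names like length_prefix being mine:

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "length_prefix": {
          "type": "edgeNGram",
          "min_gram": 3,
          "max_gram": 10
        }
      },
      "analyzer": {
        "length_prefix_analyzer": {
          "type": "custom",
          "tokenizer": "length_prefix"
        }
      }
    }
  }
}
```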
1
Solved
Background Information:
I want to make a programming language. I know the tools to do so, but I don't have any good examples of how to use them. I really do not want to use Flex or Biso...
3
I have a database of URLs that I would like to search. Because URLs are not always written the same (may or may not have www), I am looking for the correct way to Index and Query urls.
I've tried a...
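Whatever the indexing layer is, the common pattern is to normalize URLs the same way at index time and query time. A standard-library sketch, where the scheme default and the www-stripping are illustrative choices:

```python
from urllib.parse import urlparse

# Canonicalize a URL to host + path: lowercase the host, drop a
# leading "www.", tolerate a missing scheme, trim a trailing slash.
def normalize_url(url):
    parsed = urlparse(url if "://" in url else "//" + url, scheme="http")
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")

print(normalize_url("http://www.Example.com/path/"))  # example.com/path
print(normalize_url("example.com/path"))              # example.com/path
```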