tokenize Questions

4

Solved

I am trying to tokenize strings into ngrams. Strangely, in the documentation for the NGramTokenizer I do not see a method that will return the individual ngrams that were tokenized. In fact I only see ...
Deppy asked 17/11, 2012 at 18:50

2

Solved

tokenize() in nltk.TweetTokenizer is returning 32-bit integers divided into digits. It only happens with certain numbers, and I don't see any reason why. >>> from nltk.t...
Morbidity asked 31/7, 2017 at 21:59

3

I have to split Chinese text into multiple sentences. I tried the Stanford DocumentPreProcessor. It worked quite well for English but not for Chinese. Please can you let me know any good sen...
Parulis asked 12/12, 2014 at 10:4

2

I have a value like this. Suppose I have a string: s = "server ('m1.labs.teradata.com') username ('u\'se)r_*5') password('uer 5') dbname ('default')"; I need to extract token1 : 'm1.labs.ter...
Defalcate asked 19/7, 2017 at 13:37

0

I have the following string: s = "server ('m1.labs.terada')ta.com') username ('user5') password('use r5') dbname ('default')"; I have defined a regex for extracting the values between the paran...
Scum asked 19/7, 2017 at 4:16

1

Solved

I followed the tutorial here: (https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) However, I modified the code to be able to save the generated model through h5py. Thus...

1

Solved

I have 2 sentences in my dataset: w1 = I am Pusheen the cat.I am so cute. # no space after period w2 = I am Pusheen the cat. I am so cute. # with space after period When I use NLTK tokenizer (bo...
Ozzy asked 1/7, 2017 at 8:4

6

Solved

I have these lines of text, where the number of quotes can vary, like: Here just one "comillas" But I also could have more "mas" values in "comillas" and that "is" the "trick" I was thinking of a meth...
Marco asked 24/9, 2009 at 17:43

1

Solved

This is the code that I am using for semantic analysis of Twitter: import pandas as pd import datetime import numpy as np import re from nltk.tokenize import word_tokenize from nltk.corpus import...
Sibella asked 25/5, 2017 at 6:21

2

Solved

I have the following text: I don't like to eat Cici's food (it is true) I need to tokenize it to ['i', 'don't', 'like', 'to', 'eat', 'Cici's', 'food', '(', 'it', 'is', 'true', ')'] I have fo...
Fetus asked 29/3, 2017 at 12:2

1

Solved

I am trying to make 2 document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix is currently just identical to the unigram matrix, and I'm not sure why...
Trichromatic asked 5/3, 2017 at 4:11

2

Solved

I've got the following code: std::string str = "abc def,ghi"; std::stringstream ss(str); string token; while (ss >> token) { printf("%s\n", token.c_str()); } The...
Undercroft asked 30/7, 2012 at 10:21

4

I'm learning how to write tokenizers and parsers, and as an exercise I'm writing a calculator in JavaScript. I'm using a parse tree approach (I hope I got this term right) to build my calculator. I'm ...
Press asked 1/7, 2014 at 23:45

4

Solved

I am looking for a clear definition of what a "tokenizer", "parser" and "lexer" are and how they are related to each other (e.g., does a parser use a tokenizer or vice versa)? I need to create a pr...
Solvency asked 19/12, 2008 at 9:14

2

I’m trying to find a way to precisely determine the line number and character position of both tags and attributes whilst parsing an XML document. I want to do this so that I can report accurately ...
Hunger asked 31/1, 2017 at 22:2

5

Solved

For argument's sake, let's assume an HTML parser. I've read that it tokenizes everything first, and then parses it. What does tokenize mean? Does the parser read each character, building up a...
Lizethlizette asked 30/6, 2010 at 14:36

11

Solved

I'm doing some quick tests for a naive boolean information retrieval system, and I would like to use awk, grep, egrep, sed or something similar, plus pipes, to split a text file into words and save them into ...
Microfiche asked 19/3, 2013 at 14:3

4

Solved

I need to split a text using the separator ". ". For example, I want this string: Washington is the U.S Capital. Barack is living there. To be cut into two parts: Washington is the U.S Capital....
Pardon asked 4/6, 2010 at 7:23
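The excerpt above does not say which language the asker is using, so the following is only a minimal Python sketch of the idea. It assumes the goal is to split on a space that follows a period while keeping the period with the preceding sentence. The function name `split_sentences` is hypothetical.

```python
import re

def split_sentences(text):
    # Split on a single space that is immediately preceded by a period.
    # The lookbehind keeps the period attached to the sentence before it.
    # "U.S Capital" is not split, because the space there follows "S".
    return re.split(r"(?<=\.) ", text)

text = "Washington is the U.S Capital. Barack is living there."
for part in split_sentences(text):
    print(part)
```

A plain `text.split(". ")` would also separate the two parts, but it drops the period from the first one. Neither approach handles abbreviations that end in a period followed by a space.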

1

Solved

How can I prevent spaCy's tokenizer from splitting a specific substring when tokenizing a string? More specifically, I have this sentence: Once unregistered, the folder went away from the shell...
Confessional asked 26/1, 2017 at 3:26

3

Solved

Hello, I have been trying to get a tokenizer to work using the Boost library tokenizer class. I found this tutorial in the Boost documentation: http://www.boost.org/doc/libs/1_36_0/libs/tokenizer/escap...
Sardanapalus asked 12/2, 2009 at 14:44

11

Solved

I want to tokenize a string like this: String line = "a=b c='123 456' d=777 e='uij yyy'"; I cannot simply split like this: String [] words = line.split(" "); Any idea how I can split so that I ...
Managerial asked 1/10, 2009 at 0:21
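The question above is written in Java and its desired output is cut off. As a rough illustration only, here is a Python sketch that assumes the aim is to keep each `key=value` pair together, including values wrapped in single quotes. The names `PAIR` and `split_pairs` are hypothetical.

```python
import re

# A key, an equals sign, then either a single-quoted value (which may
# contain spaces) or a run of non-whitespace characters.
PAIR = re.compile(r"\w+=(?:'[^']*'|\S+)")

def split_pairs(line):
    # findall returns every non-overlapping match, left to right.
    return PAIR.findall(line)

line = "a=b c='123 456' d=777 e='uij yyy'"
for pair in split_pairs(line):
    print(pair)
```

This matches the tokens rather than splitting on a separator, which avoids having to decide whether a given space is inside quotes. It does not handle escaped quotes inside a value.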

3

Solved

I have blocks of text I want to tokenize, but I don't want to tokenize on whitespace and punctuation, as seems to be the standard with tools like NLTK. There are particular phrases that I want to b...
Tugboat asked 3/4, 2011 at 20:42

2

Solved

I am using Elasticsearch version 1.2.1. I have a use case in which I would like to create a custom tokenizer that will break the tokens by their length up to a certain minimum length. For example, ...
Trackless asked 8/2, 2015 at 16:55

1

Solved

Background information: I want to make a programming language. I know the tools for doing so, but I don't have any good examples of how to use them. I really do not want to use Flex or Biso...
Infrasonic asked 11/11, 2016 at 0:33

3

I have a database of URLs that I would like to search. Because URLs are not always written the same way (they may or may not have www), I am looking for the correct way to index and query URLs. I've tried a...
Sociable asked 13/1, 2011 at 18:59
