tokenize Questions

1

Solved

I have a dataframe with text columns. I separated them into x_train and x_test. My question is if it's better to do Keras's Tokenizer.fit_on_texts() on the entire x data set or just x_train? Like t...
Hewitt asked 26/2, 2019 at 17:56
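The usual answer to the question above is to fit on x_train only, so the vocabulary cannot leak information from the test set; test-set words unseen in training then become out-of-vocabulary. A minimal pure-Python sketch of that principle (no Keras required; the `<OOV>` token and toy sentences are illustrative):

```python
# Build a word index from the training texts only, mimicking what
# Keras's Tokenizer.fit_on_texts() does; test-set words missing from
# the training vocabulary map to a shared out-of-vocabulary id.
def fit_word_index(texts):
    index = {"<OOV>": 1}          # reserve id 1 for unknown words
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def texts_to_sequences(texts, index):
    return [[index.get(w, index["<OOV>"]) for w in t.lower().split()]
            for t in texts]

x_train = ["the cat sat", "the dog ran"]
x_test = ["the bird flew"]        # "bird" and "flew" are unseen

index = fit_word_index(x_train)
train_seqs = texts_to_sequences(x_train, index)
test_seqs = texts_to_sequences(x_test, index)
```

Keras's own `oov_token` argument to `Tokenizer()` plays the role of `<OOV>` here.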

4

Solved

My current web-app project calls for a little NLP: Tokenizing text into sentences, via Punkt and similar; Breaking down the longer sentences by subordinate clause (often it’s on commas exce...
Conclave asked 15/3, 2012 at 13:54

3

Solved

I am using Python's NLTK library to tokenize my sentences. If my code is text = "C# billion dollars; we don't own an ounce C++" print nltk.word_tokenize(text) I get this as my output ['C', '...
Margie asked 27/2, 2016 at 19:3
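nltk.word_tokenize has no notion of "C#" or "C++" as single tokens. One common workaround is a custom regex tokenizer that matches such names before falling back to ordinary words; a sketch (the token pattern is an assumption, extend it for your vocabulary):

```python
import re

# Match "C++"/"C#"-style names first, then words (with optional
# apostrophes, so "don't" stays whole), then any single
# non-space, non-word character.
TOKEN_RE = re.compile(r"[A-Za-z]\+\+|[A-Za-z]#|\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

tokens = tokenize("C# billion dollars; we don't own an ounce C++")
```

NLTK's `RegexpTokenizer` accepts the same kind of pattern if you want to stay inside NLTK.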

1

Solved

So if I were to not pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it is used to tokenize the training dataset? Why this way? I don't want to limit ...
Gehlbach asked 28/11, 2018 at 18:37
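In Keras, `Tokenizer.word_index` maps every word seen during fitting to an integer starting at 1, so the vocabulary size (e.g. for an Embedding layer) is `len(tokenizer.word_index) + 1`, the `+ 1` covering index 0, which Keras reserves for padding. The same count, sketched without Keras:

```python
# Count distinct whitespace-delimited words across the training
# texts; +1 accounts for the reserved padding index 0.
def vocabulary_size(texts):
    vocab = set()
    for text in texts:
        vocab.update(text.lower().split())
    return len(vocab) + 1

size = vocabulary_size(["the cat sat", "the dog ran fast"])
```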

3

Solved

I'm building an iOS app that has a view whose content will come from Markdown. My idea is to be able to parse Markdown stored in MongoDB into a JSON object that looks something like: { ...
Abiotic asked 26/2, 2014 at 12:43

3

I have a text classification problem where I have two types of features: features which are n-grams (extracted by CountVectorizer); other textual features (e.g. presence of a word from a given lex...
Cyna asked 8/3, 2016 at 12:32

1

Solved

I am new to Spacy and NLP. I'm facing the below issue while doing sentence segmentation using Spacy. The text I am trying to tokenise into sentences contains numbered lists (with space between numb...
Queenhood asked 6/9, 2018 at 13:43

10

Solved

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals ...
Reinstate asked 9/9, 2014 at 1:55
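One NLTK-free sketch for the question above: split after sentence-ending punctuation followed by whitespace. A decimal point like the one in "3.14" is followed by a digit, not whitespace, so it never triggers a split.

```python
import re

# Split after ., ! or ? only when whitespace follows; the period
# inside a decimal number is not followed by whitespace, so
# numbers like 3.14 survive intact.
SENT_RE = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    return [s for s in SENT_RE.split(text.strip()) if s]

sents = split_sentences("Pi is about 3.14. It is irrational! Did you know?")
```

Abbreviations like "Dr." will still cause false splits; handling those robustly needs an exception list, which is exactly the extra work tools like NLTK's Punkt do for you.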

4

I'm looking for a way to parse / tokenize SQL statements within a Node.js application, in order to: tokenize all the "basic" SQL keywords defined in the ISO/IEC 9075 standard or here. Valid...
Sick asked 6/8, 2014 at 9:16

1

I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized. For example: >>> tokenize.word_tokenize...
Juju asked 10/8, 2017 at 16:3

6

Solved

I have a filename in a format like: system-source-yyyymmdd.dat I'd like to be able to parse out the different bits of the filename using the "-" as a delimiter.
Barn asked 8/9, 2008 at 10:7

1

Solved

I want to include hyphenated words for example: long-term, self-esteem, etc. as a single token in Spacy. After looking at some similar posts on StackOverflow, Github, its documentation and elsewher...
Snuff asked 24/6, 2018 at 17:45

3

Is there a way to tokenize a string in C++ with multiple separators? In C# I would have done: string[] tokens = "adsl, dkks; dk".Split(new [] { ",", " ", ";" }, StringSplitOptions.RemoveEmptyEntries);
Guaco asked 16/4, 2010 at 15:19

4

Solved

I'm trying to use Apache Lucene for tokenizing, and I am baffled by the process of obtaining Tokens from a TokenStream. The worst part is that I'm looking at the comments in the JavaDocs that address...
Marroquin asked 14/4, 2010 at 14:30

3

I will be getting documents written in Chinese, which I have to tokenize and store in a database table. I was trying the CJKBigramFilter of Lucene, but all it does is unite the 2 characte...
Cubit asked 18/9, 2012 at 19:52
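That uniting behaviour is by design: CJKBigramFilter emits overlapping character bigrams, the standard trick for indexing languages written without word boundaries. The sliding window itself is trivial to sketch:

```python
# Overlapping character bigrams, as a CJK bigram filter produces
# them: each adjacent pair of characters becomes one token.
def cjk_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

tokens = cjk_bigrams("我爱北京")
```

If actual word segmentation is wanted instead of bigrams, a dictionary-based segmenter (e.g. Lucene's SmartChineseAnalyzer) is the usual alternative.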

2

Solved

Why is the French tokenizer that comes with NLTK not working for me? Am I doing something wrong? I'm doing import nltk content_french = ["Les astronomes amateurs jouent également un rôle import...
Eonian asked 23/2, 2017 at 23:54

3

Solved

Is there such a thing in Oracle as a listunagg function? For example, if I have data like: user_id degree_fi degree_en degree_sv 3601464 3700 1600 2200 1020 100 0 0 3600520 100,3200,400...
Hymie asked 2/11, 2012 at 5:2
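Oracle has no built-in "listunagg"; exploding a comma-separated column into rows is typically done with REGEXP_SUBSTR plus a CONNECT BY level generator. The transformation itself, the inverse of LISTAGG, can be sketched in Python (column names are illustrative):

```python
# Explode a comma-separated column into one row per value,
# copying the other columns into each new row.
def unagg(rows, column):
    out = []
    for row in rows:
        for value in str(row[column]).split(","):
            out.append({**row, column: value})
    return out

rows = [{"user_id": 3600520, "degree_fi": "100,3200,400"}]
exploded = unagg(rows, "degree_fi")
```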

2

Is it possible to use n-grams in Keras? E.g., sentences are contained in an X_train dataframe with a "sentences" column. I use the tokenizer from Keras in the following manner: tokenizer = Tokenizer(lower=True...
Intermission asked 12/9, 2017 at 10:2
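Keras's Tokenizer works on single words, so the usual route is to generate the n-grams yourself and feed them in as ordinary tokens (sklearn's CountVectorizer with `ngram_range` is the other common option). A pure-Python bigram sketch:

```python
# Turn each sentence into word bigrams joined with "_", which a
# word-level tokenizer then treats as single ordinary tokens.
def word_ngrams(sentence, n=2):
    words = sentence.lower().split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

grams = word_ngrams("the quick brown fox")
```

Appending these bigrams to the original word list before calling fit_on_texts gives the model both unigram and bigram features.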

9

Solved

I am used to the C-style getchar(), but it seems there is nothing comparable in Java. I am building a lexical analyzer, and I need to read the input character by character. I know I can u...
Piperonal asked 1/5, 2009 at 15:22

1

Solved

I have a pandas dataframe raw_df with 2 columns, ID and sentences. I need to convert each sentence to a string. The code below produces no errors and reports the datatype as "object." raw_df['se...
Goatsucker asked 21/1, 2018 at 3:40
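An "object" dtype column can hold values of any type, so checking the dtype proves nothing about individual elements. Converting explicitly with `astype(str)` (or `.apply(str)`) guarantees every element is a Python string; a minimal sketch with hypothetical data:

```python
import pandas as pd

raw_df = pd.DataFrame({"ID": [1, 2],
                       "sentences": ["first sentence", 42]})

# astype(str) coerces every element, including the stray int 42,
# to an actual Python string; the dtype stays "object".
raw_df["sentences"] = raw_df["sentences"].astype(str)
```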

4

I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided...
Unhallowed asked 3/4, 2011 at 19:35
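The sturdier alternative to a pile of ad-hoc regexes is a single scanner pattern with named groups, applied left to right so no character is silently skipped. A sketch with made-up `{{ }}` delimiters (the actual template syntax is an assumption):

```python
import re

# One alternation with a named group per token kind; scanning
# with finditer walks the source left to right.
TOKEN_RE = re.compile(
    r"(?P<EXPR>\{\{.*?\}\})"     # template expression, e.g. {{ name }}
    r"|(?P<TEXT>[^{]+|\{)"       # literal text (a lone '{' is literal)
)

def tokenize(source):
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

tokens = tokenize("Hello {{ name }}!")
```

The token list then feeds a conventional recursive-descent parser, which is much easier to keep correct than regexes that try to parse nesting on their own.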

6

Solved

Suppose I've a long string containing newlines and tabs as: var x = "This is a long string.\n\t This is another one on next line."; So how can we split this string into tokens, using regular exp...
Emersonemery asked 9/12, 2011 at 6:32
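In JavaScript the direct answer is `x.split(/\s+/)`, a regex split on any whitespace run. The same idea, sketched in Python on the question's string:

```python
import re

x = "This is a long string.\n\t This is another one on next line."

# Split on any run of whitespace: spaces, tabs, and newlines all
# count as a single separator between tokens.
tokens = re.split(r"\s+", x)
```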

1

Solved

I'm exploring some of NLTK's corpora and came across the following behaviour: word_tokenize() and words() produce different sets of words. Here is an example using webtext: from nltk.corpus impor...
Superbomb asked 27/10, 2017 at 0:5

1

Solved

Given the paragraph from Wikipedia: An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwe...
Bleareyed asked 13/11, 2017 at 22:21

1

Solved

When the input string is blank, boost::split returns a vector with one empty string in it. Is it possible to have boost::split return an empty vector instead? MCVE: #include <string> #incl...
Guess asked 3/10, 2017 at 8:27

© 2022 - 2024 — McMap. All rights reserved.