tokenize Questions

1

Solved

I have a dataframe with text columns. I separated them into x_train and x_test. My question is if it's better to do Keras's Tokenizer.fit_on_texts() on the entire x data set or just x_train? Like t...
Hewitt asked 26/2, 2019 at 17:56
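The usual answer to the question above is to fit on x_train only, so the vocabulary cannot leak information from the test set; test-set words unseen in training then become out-of-vocabulary. A minimal pure-Python sketch of that principle (no Keras required; the `<OOV>` token and toy sentences are illustrative):

```python
# Build a word index from the training texts only, mimicking what
# Keras's Tokenizer.fit_on_texts() does; test-set words missing from
# the training vocabulary map to a shared out-of-vocabulary id.
def fit_word_index(texts):
    index = {"<OOV>": 1}          # reserve id 1 for unknown words
    for text in texts:
        for word in text.lower().split():
            if word not in index:
                index[word] = len(index) + 1
    return index

def texts_to_sequences(texts, index):
    return [[index.get(w, index["<OOV>"]) for w in t.lower().split()]
            for t in texts]

x_train = ["the cat sat", "the dog ran"]
x_test = ["the bird flew"]        # "bird" and "flew" are unseen

index = fit_word_index(x_train)
train_seqs = texts_to_sequences(x_train, index)
test_seqs = texts_to_sequences(x_test, index)
```

Keras's own `oov_token` argument to `Tokenizer()` plays the role of `<OOV>` here.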

4

Solved

My current web-app project calls for a little NLP: Tokenizing text into sentences, via Punkt and similar; Breaking down the longer sentences by subordinate clause (often it’s on commas exce...
Conclave asked 15/3, 2012 at 13:54

3

Solved

I am using Python's NLTK library to tokenize my sentences. If my code is text = "C# billion dollars; we don't own an ounce C++" print nltk.word_tokenize(text) I get this as my output ['C', '...
Margie asked 27/2, 2016 at 19:3
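nltk.word_tokenize has no notion of "C#" or "C++" as single tokens. One common workaround is a custom regex tokenizer that matches such names before falling back to ordinary words; a sketch (the token pattern is an assumption, extend it for your vocabulary):

```python
import re

# Match "C++"/"C#"-style names first, then words (with optional
# apostrophes, so "don't" stays whole), then any single
# non-space, non-word character.
TOKEN_RE = re.compile(r"[A-Za-z]\+\+|[A-Za-z]#|\w+(?:'\w+)?|[^\w\s]")

def tokenize(text):
    return TOKEN_RE.findall(text)

tokens = tokenize("C# billion dollars; we don't own an ounce C++")
```

NLTK's `RegexpTokenizer` accepts the same kind of pattern if you want to stay inside NLTK.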

1

Solved

So if I were to not pass the num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it is used to tokenize the training dataset? Why this way? I don't want to limit ...
Gehlbach asked 28/11, 2018 at 18:37
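In Keras, `Tokenizer.word_index` maps every word seen during fitting to an integer starting at 1, so the vocabulary size (e.g. for an Embedding layer) is `len(tokenizer.word_index) + 1`, the `+ 1` covering index 0, which Keras reserves for padding. The same count, sketched without Keras:

```python
# Count distinct whitespace-delimited words across the training
# texts; +1 accounts for the reserved padding index 0.
def vocabulary_size(texts):
    vocab = set()
    for text in texts:
        vocab.update(text.lower().split())
    return len(vocab) + 1

size = vocabulary_size(["the cat sat", "the dog ran fast"])
```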

3

Solved

I'm building an iOS app that has a view whose content will come from Markdown. My idea is to be able to parse Markdown stored in MongoDB into a JSON object that looks something like: { ...
Abiotic asked 26/2, 2014 at 12:43

3

I have a text classification problem where I have two types of features: features which are n-grams (extracted by CountVectorizer); other textual features (e.g. presence of a word from a given lex...
Cyna asked 8/3, 2016 at 12:32

1

Solved

I am new to Spacy and NLP. I'm facing the below issue while doing sentence segmentation using Spacy. The text I am trying to tokenise into sentences contains numbered lists (with space between numb...
Queenhood asked 6/9, 2018 at 13:43

10

Solved

I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals ...
Reinstate asked 9/9, 2014 at 1:55
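One NLTK-free sketch for the question above: split after sentence-ending punctuation followed by whitespace. A decimal point like the one in "3.14" is followed by a digit, not whitespace, so it never triggers a split.

```python
import re

# Split after ., ! or ? only when whitespace follows; the period
# inside a decimal number is not followed by whitespace, so
# numbers like 3.14 survive intact.
SENT_RE = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    return [s for s in SENT_RE.split(text.strip()) if s]

sents = split_sentences("Pi is about 3.14. It is irrational! Did you know?")
```

Abbreviations like "Dr." will still cause false splits; handling those robustly needs an exception list, which is exactly the extra work tools like NLTK's Punkt do for you.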

4

I'm looking for a way to parse / tokenize SQL statements within a Node.js application, in order to: tokenize all the "basic" SQL keywords defined in the ISO/IEC 9075 standard or here. Valid...
Sick asked 6/8, 2014 at 9:16

1

I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized. For example: >>> tokenize.word_tokenize...
Juju asked 10/8, 2017 at 16:3

6

Solved

I have a filename in a format like: system-source-yyyymmdd.dat I'd like to be able to parse out the different bits of the filename using the "-" as a delimiter.
Barn asked 8/9, 2008 at 10:7

1

Solved

I want to include hyphenated words for example: long-term, self-esteem, etc. as a single token in Spacy. After looking at some similar posts on StackOverflow, Github, its documentation and elsewher...
Snuff asked 24/6, 2018 at 17:45

3

Is there a way to tokenize a string in C++ with multiple separators? In C# I would have done: string[] tokens = "adsl, dkks; dk".Split(new [] { ",", " ", ";" }, StringSplitOptions.RemoveEmptyEntries);
Guaco asked 16/4, 2010 at 15:19

4

Solved

I'm trying to use Apache Lucene for tokenizing, and I am baffled by the process of obtaining Tokens from a TokenStream. The worst part is that I'm looking at the comments in the JavaDocs that address...
Marroquin asked 14/4, 2010 at 14:30

3

I will be getting documents written in Chinese, which I have to tokenize and store in a database table. I was trying the CJKBigramFilter of Lucene, but all it does is unite the 2 characte...
Cubit asked 18/9, 2012 at 19:52
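That uniting behaviour is by design: CJKBigramFilter emits overlapping character bigrams, the standard trick for indexing languages written without word boundaries. The sliding window itself is trivial to sketch:

```python
# Overlapping character bigrams, as a CJK bigram filter produces
# them: each adjacent pair of characters becomes one token.
def cjk_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

tokens = cjk_bigrams("我爱北京")
```

If actual word segmentation is wanted instead of bigrams, a dictionary-based segmenter (e.g. Lucene's SmartChineseAnalyzer) is the usual alternative.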

2

Solved

Why is the French tokenizer that comes with NLTK not working for me? Am I doing something wrong? I'm doing import nltk content_french = ["Les astronomes amateurs jouent également un rôle import...
Eonian asked 23/2, 2017 at 23:54

3

Solved

Is there such a thing in Oracle as a listunagg function? For example, if I have data like: user_id degree_fi degree_en degree_sv 3601464 3700 1600 2200 1020 100 0 0 3600520 100,3200,400...
Hymie asked 2/11, 2012 at 5:2
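Oracle has no built-in "listunagg"; exploding a comma-separated column into rows is typically done with REGEXP_SUBSTR plus a CONNECT BY level generator. The transformation itself, the inverse of LISTAGG, can be sketched in Python (column names are illustrative):

```python
# Explode a comma-separated column into one row per value,
# copying the other columns into each new row.
def unagg(rows, column):
    out = []
    for row in rows:
        for value in str(row[column]).split(","):
            out.append({**row, column: value})
    return out

rows = [{"user_id": 3600520, "degree_fi": "100,3200,400"}]
exploded = unagg(rows, "degree_fi")
```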

2

Is it possible to use n-grams in Keras? E.g., sentences are contained in an X_train dataframe with a "sentences" column. I use the tokenizer from Keras in the following manner: tokenizer = Tokenizer(lower=True...
Intermission asked 12/9, 2017 at 10:2
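Keras's Tokenizer works on single words, so the usual route is to generate the n-grams yourself and feed them in as ordinary tokens (sklearn's CountVectorizer with `ngram_range` is the other common option). A pure-Python bigram sketch:

```python
# Turn each sentence into word bigrams joined with "_", which a
# word-level tokenizer then treats as single ordinary tokens.
def word_ngrams(sentence, n=2):
    words = sentence.lower().split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

grams = word_ngrams("the quick brown fox")
```

Appending these bigrams to the original word list before calling fit_on_texts gives the model both unigram and bigram features.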

9

Solved

I am used to the C-style getchar(), but it seems there is nothing comparable in Java. I am building a lexical analyzer, and I need to read the input character by character. I know I can u...
Piperonal asked 1/5, 2009 at 15:22

1

Solved

I have a pandas dataframe raw_df with 2 columns, ID and sentences. I need to convert each sentence to a string. The code below produces no errors and reports the datatype as "object." raw_df['se...
Goatsucker asked 21/1, 2018 at 3:40
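An "object" dtype column can hold values of any type, so checking the dtype proves nothing about individual elements. Converting explicitly with `astype(str)` (or `.apply(str)`) guarantees every element is a Python string; a minimal sketch with hypothetical data:

```python
import pandas as pd

raw_df = pd.DataFrame({"ID": [1, 2],
                       "sentences": ["first sentence", 42]})

# astype(str) coerces every element, including the stray int 42,
# to an actual Python string; the dtype stays "object".
raw_df["sentences"] = raw_df["sentences"].astype(str)
```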

4

I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided...
Unhallowed asked 3/4, 2011 at 19:35
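The sturdier alternative to a pile of ad-hoc regexes is a single scanner pattern with named groups, applied left to right so no character is silently skipped. A sketch with made-up `{{ }}` delimiters (the actual template syntax is an assumption):

```python
import re

# One alternation with a named group per token kind; scanning
# with finditer walks the source left to right.
TOKEN_RE = re.compile(
    r"(?P<EXPR>\{\{.*?\}\})"     # template expression, e.g. {{ name }}
    r"|(?P<TEXT>[^{]+|\{)"       # literal text (a lone '{' is literal)
)

def tokenize(source):
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(source)]

tokens = tokenize("Hello {{ name }}!")
```

The token list then feeds a conventional recursive-descent parser, which is much easier to keep correct than regexes that try to parse nesting on their own.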

6

Solved

Suppose I've a long string containing newlines and tabs as: var x = "This is a long string.\n\t This is another one on next line."; So how can we split this string into tokens, using regular exp...
Emersonemery asked 9/12, 2011 at 6:32
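In JavaScript the direct answer is `x.split(/\s+/)`, a regex split on any whitespace run. The same idea, sketched in Python on the question's string:

```python
import re

x = "This is a long string.\n\t This is another one on next line."

# Split on any run of whitespace: spaces, tabs, and newlines all
# count as a single separator between tokens.
tokens = re.split(r"\s+", x)
```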

1

Solved

I'm exploring some of NLTK's corpora and came across the following behaviour: word_tokenize() and words() produce different sets of words. Here is an example using webtext: from nltk.corpus impor...
Superbomb asked 27/10, 2017 at 0:5

1

Solved

Given the paragraph from Wikipedia: An ambitious campus expansion plan was proposed by Fr. Vernon F. Gallagher in 1952. Assumption Hall, the first student dormitory, was opened in 1954, and Rockwe...
Bleareyed asked 13/11, 2017 at 22:21

1

Solved

When the input string is blank, boost::split returns a vector with one empty string in it. Is it possible to have boost::split return an empty vector instead? MCVE: #include <string> #incl...
Guess asked 3/10, 2017 at 8:27

© 2022 - 2024 — McMap. All rights reserved.