tokenize Questions
1
Solved
I have a dataframe with text columns. I separated them into x_train and x_test.
My question is if its better to do Keras's Tokenizer.fit_on_text() on the entire x data set or just x_train?
Like t...
4
Solved
My current web-app project calls for a little NLP:
Tokenizing text into sentences, via Punkt and similar;
Breaking down the longer sentences by subordinate clause (often it’s on commas exce...
Conclave asked 15/3, 2012 at 13:54
3
Solved
I am using Python's NLTK library to tokenize my sentences.
If my code is
text = "C# billion dollars; we don't own an ounce C++"
print nltk.word_tokenize(text)
I get this as my output
['C', '...
1
Solved
So if I were to not pass num_words argument when initializing Tokenizer(), how do I find the vocabulary size after it is used to tokenize the training dataset?
Why this way, I don't want to limit ...
Gehlbach asked 28/11, 2018 at 18:37
3
Solved
Im building an iOS app that have a view that is going to have its source from markdown.
My idea is to be able to parse markdown stored in MongoDB into a JSON-object that looks something like:
{
...
Abiotic asked 26/2, 2014 at 12:43
3
I have a text classification problem where i have two types of features:
features which are n-grams (extracted by CountVectorizer)
other textual features (e.g. presence of a word from a given lex...
Cyna asked 8/3, 2016 at 12:32
1
Solved
I am new to Spacy and NLP. I'm facing the below issue while doing sentence segmentation using Spacy.
The text I am trying to tokenise into sentences contains numbered lists (with space between numb...
10
Solved
I want to make a list of sentences from a string and then print them out. I don't want to use NLTK to do this. So it needs to split on a period at the end of the sentence and not at decimals ...
4
I'm looking for a way to parse / tokenize SQL statement within a Node.js application, in order to:
Tokenize all the "basics" SQL keywords defined in the ISO/IEC 9075 standard or here.
Valid...
Sick asked 6/8, 2014 at 9:16
1
I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized.
For example:
>>> tokenize.word_tokenize...
6
Solved
1
Solved
I want to include hyphenated words for example: long-term, self-esteem, etc. as a single token in Spacy. After looking at some similar posts on StackOverflow, Github, its documentation and elsewher...
Snuff asked 24/6, 2018 at 17:45
3
Is there a way to tokenize a string in C++ with multiple separators? In C# I would have done:
string[] tokens = "adsl, dkks; dk".Split(new [] { ",", " ", ";" }, StringSplitOptions.RemoveEmpty);
4
Solved
I'm trying to use Apache Lucene for tokenizing, and I am baffled at the process to obtain Tokens from a TokenStream.
The worst part is that I'm looking at the comments in the JavaDocs that address...
Marroquin asked 14/4, 2010 at 14:30
3
I will be getting document written in Chinese language for which I have to tokenize and keep it in database table. I was trying the CJKBigramFilter of Lucene but all it does is unite the 2 characte...
2
Solved
Why is the french tokenizer that comes with python not working for me?
Am I doing something wrong?
I'm doing
import nltk
content_french = ["Les astronomes amateurs jouent également un rôle import...
3
Solved
2
Is it possible to use n-grams in Keras?
E.g., sentences contain in X_train dataframe with "sentences" column.
I use tokenizer from Keras in the following manner:
tokenizer = Tokenizer(lower=True...
Intermission asked 12/9, 2017 at 10:2
9
Solved
I am used to the c-style getchar(), but it seems like there is nothing comparable for java. I am building a lexical analyzer, and I need to read in the input character by character.
I know I can u...
1
Solved
I have a pandas dataframe raw_df with 2 columns, ID and sentences. I need to convert each sentence to a string. The code below produces no errors and says datatype of rule is "object."
raw_df['se...
4
I'm writing a parser for a templating language which compiles into JS (if that's relevant). I started out with a few simple regexes, which seemed to work, but regexes are very fragile, so I decided...
Unhallowed asked 3/4, 2011 at 19:35
6
Solved
Suppose I've a long string containing newlines and tabs as:
var x = "This is a long string.\n\t This is another one on next line.";
So how can we split this string into tokens, using regular exp...
Emersonemery asked 9/12, 2011 at 6:32
1
Solved
1
Solved
Given the paragraph from Wikipedia:
An ambitious campus expansion plan was proposed by Fr. Vernon F.
Gallagher in 1952. Assumption Hall, the first student dormitory, was
opened in 1954, and Rockwe...
Bleareyed asked 13/11, 2017 at 22:21
1
Solved
When the input string is blank, boost::split returns a vector with one empty string in it.
Is it possible to have boost::split return an empty vector instead?
MCVE:
#include <string>
#incl...
© 2022 - 2024 — McMap. All rights reserved.