n-gram Questions

5

Can someone help me find the most frequently used two- and three-word phrases in a text using R? My text is... text <- c("There is a difference between the common use of the term phrase an...
Conversable asked 18/5, 2016 at 6:38
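
For readers skimming this entry, the counting step can be sketched in a few lines. This is a minimal sketch in Python rather than R, not the asker's code: the function name `top_ngrams`, the regex tokenizer, and the sample sentence are all assumptions made for illustration.

```python
from collections import Counter
import re

def top_ngrams(text, n, k=3):
    """Return the k most common word n-grams in text as (ngram, count) pairs."""
    # Lowercase and keep runs of letters/apostrophes as tokens (a simplifying assumption).
    tokens = re.findall(r"[a-z']+", text.lower())
    # Slide a window of n tokens over the token list.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)

sample = "the cat sat on the mat and the cat sat on the rug"  # hypothetical sample text
print(top_ngrams(sample, 2))  # most frequent two-word phrases
print(top_ngrams(sample, 3))  # most frequent three-word phrases
```

Zipping shifted copies of the token list avoids index arithmetic; for very large texts a streaming tokenizer would probably be preferable.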

2

Solved

Using ngram in Python, my aim is to find verbs and their corresponding adverbs in an input text. What I have done: Input text: ""He is talking weirdly. A horse can run fast. A big tree is ther...
Snuffbox asked 27/1, 2016 at 6:10

3

Solved

Here's an appeal for a better way to do something that I can already do inefficiently: filter a series of n-gram tokens using "stop words" so that the occurrence of any stop word term in an n-gram ...
Rakes asked 12/10, 2015 at 0:9
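
The excerpt is cut off before it states the goal, so the following is only a sketch under the assumption that any n-gram containing a stop word should be discarded. It is written in Python, the function name `drop_stopword_ngrams` and the sample tokens are hypothetical, and it is not the asker's existing approach.

```python
def drop_stopword_ngrams(ngrams, stop_words):
    """Keep only the n-grams in which no token is a stop word."""
    stop = set(stop_words)  # a set keeps each membership test O(1)
    # isdisjoint() is True when the n-gram's tokens share nothing with the stop set.
    return [g for g in ngrams if stop.isdisjoint(g.split())]

ngrams = ["quick brown", "the quick", "brown fox", "fox of"]  # hypothetical n-gram tokens
print(drop_stopword_ngrams(ngrams, ["the", "of"]))
```

Splitting each n-gram once and testing against a set avoids building one regular expression per stop word, which is one common source of inefficiency in this kind of filter.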

5

Solved

I am writing an R script and am using library(ngram). Suppose I have a string, "good qualiti dog food bought sever vital can dog food product found good qualiti product look like stew process mea...
Webfoot asked 29/9, 2015 at 17:25

5

Solved

The winner of a recent Wikipedia vandalism detection competition suggests that detection could be improved by "detecting random keyboard hits considering QWERTY keyboard layout". Example: woijf qo...
Douglassdougy asked 27/9, 2010 at 8:41

1

Solved

I use sklearn.feature_extraction.text.CountVectorizer to compute n-grams. Example: import sklearn.feature_extraction.text # FYI http://scikit-learn.org/stable/install.html ngram_size = 4 string = ...
Territorialize asked 20/8, 2015 at 21:35

1

Solved

Does TfidfVectorizer identify n-grams using Python regular expressions? This question arose while reading the scikit-learn documentation for TfidfVectorizer: I see that the pattern to recognize n-gr...
Episiotomy asked 26/3, 2015 at 23:51

1

Solved

I am a beginner in Solr. In my project, NGramFilterFactory and EdgeNGramFilterFactory are both being used for a field. My understanding from the documentation is that EdgeNGramFilterFactory is used for "s...
Obligatory asked 18/5, 2015 at 9:14

3

Solved

I want to take a text file and create a bigram of all words not separated by a dot ".", removing any special characters. I'm trying to do this using Spark and Scala. This text: Hello my Friend. H...
Cupping asked 18/4, 2015 at 3:28

0

I am using Solr for spell checking/query correction. I have added solr.PhoneticFilterFactory and solr.NGramFilterFactory to the fieldType to perform spell checking. It is working fine but here the pro...
Milklivered asked 15/12, 2014 at 12:39

6

Solved

I'm trying to load a couple of files into memory. The files have one of the following 3 formats: string TAB int, string TAB float, int TAB float. Indeed, they are n-gram statistics files, i...
Splice asked 22/4, 2012 at 3:3

2

I am trying to do sentiment analysis on tweets using Python. To begin with, I've implemented an n-grams model. So, let's say our training data is I am a good kid He is a good kid, but he didn't g...
Alithea asked 9/11, 2014 at 3:45

3

I have a huge file of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.: This is a foo bar sentence . ...
Staphylorrhaphy asked 13/10, 2014 at 13:45

1

I've been trying to find an alternative for two straight days now, and couldn't find anything relevant. I'm basically trying to get a probabilistic score of a synthesized sentence (synthesized...
Whitening asked 18/10, 2014 at 18:24

4

I'm getting started with the tm package in R, so please bear with me and apologies for the big ol' wall of text. I have created a fairly large corpus of Socialist/Communist propaganda and would lik...
Bendix asked 27/10, 2013 at 6:8

1

To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than data preparation I chose to use the Brown corpus ...
Poock asked 12/5, 2013 at 16:40

2

Solved

We would like to run a query that returns two-word phrases that appear in more than one row. For example, take the string "Data Ninja". Since it appears in more than one row in our dataset, the quer...
Subassembly asked 10/9, 2013 at 1:46

1

Solved

I'm a little confused about how to use ngrams in the scikit-learn library in Python, specifically, how the ngram_range argument works in a CountVectorizer. Running this code: from sklearn.featu...
Bullnecked asked 3/6, 2014 at 1:27
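
The excerpt is cut off before its code, so the following is not the asker's example. As documented by scikit-learn, `ngram_range=(min_n, max_n)` means every n with min_n <= n <= max_n is used. A minimal plain-Python sketch of that idea follows; the function `word_ngrams` and the token list are hypothetical, and this is not scikit-learn's implementation.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams whose length n satisfies min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    out = []
    for n in range(min_n, max_n + 1):
        # Every window of n consecutive tokens becomes one n-gram.
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out

tokens = "hi bye run".split()  # hypothetical tokens
print(word_ngrams(tokens, (1, 2)))  # unigrams and bigrams
print(word_ngrams(tokens, (2, 2)))  # bigrams only
```

Under this reading, `(1, 1)` yields only single words, `(2, 2)` only bigrams, and `(1, 2)` both.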

5

Solved

I need to compare documents stored in a DB and come up with a similarity score between 0 and 1. The method I need to use has to be very simple. Implementing a vanilla version of n-grams (where it...
Laky asked 4/3, 2010 at 15:22
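
One simple way to get a score between 0 and 1 from n-grams is the Jaccard index of the two documents' n-gram sets. The sketch below is offered under that assumption: the excerpt is truncated, so this may not be the "vanilla version" the asker had in mind, and the names `char_ngrams` and `ngram_similarity` are hypothetical.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def ngram_similarity(a, b, n=3):
    """Jaccard similarity of the two n-gram sets: 0.0 (disjoint) to 1.0 (identical)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0  # assumption: texts too short to yield any n-gram count as identical
    # Shared n-grams divided by all distinct n-grams seen in either text.
    return len(ga & gb) / len(ga | gb)

print(ngram_similarity("foobar", "foobaz"))
```

Using sets ignores how often each n-gram occurs; a count-based measure such as cosine similarity over n-gram frequencies would be a natural next step if repetition matters.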

1

Solved

The following word2ngrams function extracts character 3grams from a word: >>> x = 'foobar' >>> n = 3 >>> [x[i:i+n] for i in range(len(x)-n+1)] ['foo', 'oob', 'oba', 'bar...
Status asked 15/3, 2014 at 18:32

3

Solved

At least 3 types of n-grams can be considered for representing text documents: byte-level n-grams, character-level n-grams, word-level n-grams. It's unclear to me which one should be used for a g...
Renovate asked 9/2, 2014 at 8:18

3

I am trying to code the Dissociated Press algorithm based on n-grams in Scala. How do I generate n-grams for a large file? For example, for the file containing "the bee is the bee of the bees". Fir...
Boice asked 24/11, 2011 at 14:55

2

Solved

I'm new to text categorization techniques. I want to know the difference between the N-gram approach to text categorization and text categorization based on other classifiers (decision tree, KNN, SVM)...

2

Solved

In this documentation, there is an example using nltk.collocations.BigramAssocMeasures(), BigramCollocationFinder, nltk.collocations.TrigramAssocMeasures(), and TrigramCollocationFinder. There is exam...
Shirtmaker asked 7/9, 2013 at 9:58

4

Drupal's core search module only searches for keywords, e.g. "sandwich". Can I make it search with a substring, e.g. "sandw", and return my sandwich results? Maybe there is a plugin that does that?...
Mossback asked 16/4, 2010 at 15:17
