Find the most frequently occurring words in a text in R

Can someone help me find the most frequently used two-word and three-word phrases in a text using R?

My text is...

text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")
Conversable answered 18/5, 2016 at 6:38 Comment(0)

The tidytext package makes this sort of thing pretty simple:

library(tidytext)
library(dplyr)

data_frame(text = text) %>% 
    unnest_tokens(word, text) %>%    # split words
    anti_join(stop_words) %>%    # take out "a", "an", "the", etc.
    count(word, sort = TRUE)    # count occurrences

# Source: local data frame [73 x 2]
# 
#           word     n
#          (chr) (int)
# 1       phrase     8
# 2     sentence     6
# 3        words     4
# 4       called     3
# 5       common     3
# 6  grammatical     3
# 7      meaning     3
# 8         alex     2
# 9         bird     2
# 10    complete     2
# ..         ...   ...

If the question is asking for counts of bigrams and trigrams, tokenizers::tokenize_ngrams is useful:

library(tokenizers)

tokenize_ngrams(text, n = 3L, n_min = 2L, simplify = TRUE) %>%    # tokenize bigrams and trigrams
    as_data_frame() %>%    # turn the vector of n-grams into a data frame
    count(value, sort = TRUE)    # count occurrences

# Source: local data frame [531 x 2]
# 
#           value     n
#          (fctr) (int)
# 1        of the     5
# 2      a phrase     4
# 3  the sentence     4
# 4          as a     3
# 5        in the     3
# 6        may be     3
# 7    a complete     2
# 8   a phrase is     2
# 9    a sentence     2
# 10      a white     2
# ..          ...   ...
Fluorspar answered 18/5, 2016 at 6:55 Comment(2)
Good one @Fluorspar, short and concise for calculating the frequency of occurrences.Forgather
The first approach (tidytext) had one %>% too many at the end. I also got an error: Error in count(., word, sort = TRUE) : unused argument (sort = TRUE). This was a clash with plyr's count(), which can be resolved by using dplyr::count(word, sort = TRUE). Otherwise the best option IMO.Hemphill
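
A minimal sketch of the namespace clash that comment describes, assuming a session where plyr happens to be attached after dplyr (so plyr::count() masks dplyr::count()); qualifying the call avoids the error:

library(dplyr)
library(tidytext)
library(plyr)    # attached last, so plyr::count() now masks dplyr::count()

data_frame(text = text) %>% 
    unnest_tokens(word, text) %>% 
    anti_join(stop_words) %>% 
    dplyr::count(word, sort = TRUE)    # explicit dplyr:: so plyr::count() is not picked up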

Your text is:

text <- c("There is a difference between the common use of the term phrase and its technical use in linguistics. In common usage, a phrase is usually a group of words with some special idiomatic meaning or other significance, such as \"all rights reserved\", \"economical with the truth\", \"kick the bucket\", and the like. It may be a euphemism, a saying or proverb, a fixed expression, a figure of speech, etc. In grammatical analysis, particularly in theories of syntax, a phrase is any group of words, or sometimes a single word, which plays a particular role within the grammatical structure of a sentence. It does not have to have any special meaning or significance, or even exist anywhere outside of the sentence being analyzed, but it must function there as a complete grammatical unit. For example, in the sentence Yesterday I saw an orange bird with a white neck, the words an orange bird with a white neck form what is called a noun phrase, or a determiner phrase in some theories, which functions as the object of the sentence. Theorists of syntax differ in exactly what they regard as a phrase; however, it is usually required to be a constituent of a sentence, in that it must include all the dependents of the units that it contains. This means that some expressions that may be called phrases in everyday language are not phrases in the technical sense. For example, in the sentence I can't put up with Alex, the words put up with (meaning \'tolerate\') may be referred to in common language as a phrase (English expressions like this are frequently called phrasal verbs\ but technically they do not form a complete phrase, since they do not include Alex, which is the complement of the preposition with.")

In Natural Language Processing, 2-word phrases are referred to as "bi-grams", 3-word phrases as "tri-grams", and so forth. Generally, a sequence of n words is called an "n-gram".

First, we install the ngram package (available on CRAN):

# Install package "ngram"
install.packages("ngram")

Then, we will find the most frequent two-word and three-word phrases:

library(ngram)

# To find all two-word phrases in the text object "text":
ng2 <- ngram(text, n = 2)

# To find all three-word phrases in the text object "text":
ng3 <- ngram(text, n = 3)

Finally, we can inspect the n-gram objects using the various methods below:

print(ng2, output = "truncated")    # truncated view of the bi-grams

print(ng3, output = "full")    # full listing of the tri-grams

get.phrasetable(ng2)    # data frame of n-grams with their frequencies and proportions

ngram::ngram_asweka(text, min = 2, max = 3)    # all 2- and 3-grams as a character vector

We can also use Markov Chains to babble new sequences:

# if we are using ng2 (bi-gram): babble a random 2-word sequence
lnth <- 2
babble(ng = ng2, genlen = lnth)

# if we are using ng3 (tri-gram): babble a random 3-word sequence
lnth <- 3
babble(ng = ng3, genlen = lnth)
Forgather answered 18/5, 2016 at 6:57 Comment(0)

We can split the words and use table to summarize the frequency:

# split on spaces, commas, periods, parentheses, and quotation marks
words <- strsplit(text, "[ ,.\\(\\)\"]")
# tabulate the words (excluding the empty strings left by adjacent delimiters) and sort
sort(table(words, exclude = ""), decreasing = TRUE)
Jarl answered 18/5, 2016 at 6:54 Comment(1)
I guess you should also drop words like "a", "the", "is" (as well as remove punctuation). Although not needed for this question, it would surely help other NLP learners/practitioners. Thanks.Forgather
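
A minimal sketch of that suggestion, extending the split-and-table approach above with a small hand-picked stop-word vector (the vector is only illustrative, not a standard stop-word list):

# drop a few stop words and the empty strings before tabulating
words <- unlist(strsplit(text, "[ ,.\\(\\)\"]"))
stopwords <- c("", "a", "an", "the", "is", "of", "in", "or", "it", "to", "be")
sort(table(words[!tolower(words) %in% stopwords]), decreasing = TRUE)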

Simplest?

require(quanteda)

# bi-grams
topfeatures(dfm(text, ngrams = 2, verbose = FALSE))
##      of_the     a_phrase the_sentence       may_be         as_a       in_the    in_common    phrase_is 
##           5            4            4            3            3            3            2            2 
##  is_usually     group_of 
##           2            2 

# for tri-grams
topfeatures(dfm(text, ngrams = 3, verbose = FALSE))
##     a_phrase_is   group_of_words    of_a_sentence  of_the_sentence   for_example_in   example_in_the 
##               2                2                2                2                2                2 
## in_the_sentence   an_orange_bird orange_bird_with      bird_with_a 
##               2                2                2                2 
Schuman answered 23/5, 2016 at 7:30 Comment(2)
Hi Ken, very good one: simple, easy, and only a few lines. One doubt: can it be used to predict the next word (just like the SwiftKey keyboard on Android phones)? And why is there an underscore ("_") between the two words?Conversable
It could be used to predict the next word if you used the ngrams in a predictive model. The "_" is the default for the concatenator argument to ngrams(), which can be passed in dfm(). See ?quanteda::tokenise or ?quanteda::ngrams.Schuman
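
As a rough illustration of the prediction idea mentioned above (not quanteda-specific, and only a sketch; a real predictor would add smoothing and backoff): count which word most often follows a given word in the bigrams and suggest it.

# crude next-word suggestion from bigram counts (illustration only)
w <- unlist(strsplit(tolower(gsub("[[:punct:]]", "", text)), "\\s+"))
bigrams <- paste(head(w, -1), tail(w, -1))
suggest_next <- function(word) {
  follows <- bigrams[startsWith(bigrams, paste0(word, " "))]
  if (length(follows) == 0) return(NA_character_)
  names(which.max(table(sub("^\\S+ ", "", follows))))    # most frequent follower
}
suggest_next("orange")    # suggests "bird" for this text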

Here's a simple base R approach for the 5 most frequent words:

head(sort(table(strsplit(gsub("[[:punct:]]", "", text), " ")), decreasing = TRUE), 5)

#     a    the     of     in phrase 
#    21     18     12     10      8 

It returns a table (a named integer vector) whose values are the frequency counts and whose names are the words that were counted.

  • gsub("[[:punct:]]", "", text) to remove punctuation since you don't want to count that, I guess
  • strsplit(gsub("[[:punct:]]", "", text), " ") to split the string on spaces
  • table() to count unique elements' frequency
  • sort(..., decreasing = TRUE) to sort them in decreasing order
  • head(..., 5) to select only the top 5 most frequent words
Invertebrate answered 18/5, 2016 at 6:44 Comment(4)
I am not sure if the user wants the frequency. Check @Manoj Kumar's answer.Uniliteral
@RonakShah You might be right, but in that case the title of the question is misleading. This answer provides the correct solution according to the title. Since it can be assumed that only a small part of programming experts are specialists in NLP, I believe the OP should have stated the expected output more clearly.Ashwin
Thanks @RonakShah and RHertel, but surely my question is not misleading. All of you answered what I needed. Thanks to all.Conversable
Although I like this answer as a solution to "find the most frequent words", I believe that more transformations could be helpful than just removing the punctuation. In particular, I would assume that transforming all entries to lower case could be a good idea. I was considering providing an alternative using the tm package, but the question already seems to be answered to the satisfaction of the OP.Ashwin
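
A minimal variant of the base R one-liner above with the lower-casing suggested in the last comment (just a sketch; the tm route would wrap the same cleaning steps in a corpus):

# lower-case first so "The" and "the" are counted as the same word
head(sort(table(strsplit(tolower(gsub("[[:punct:]]", "", text)), " ")), decreasing = TRUE), 5)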
