R: Finding the top 10 terms associated with the term 'fraud' across documents in a Document Term Matrix

I have a corpus of 39 text files named by year: 1945.txt, 1978.txt, ..., 2013.txt.

I've imported them into R and created a Document Term Matrix using the tm package. I'm trying to investigate how the words associated with the term 'fraud' have changed over the years from 1945 to 2013. The desired output would be a 39 x 10 (or 39 x 5) matrix with years as row names and the top 10 (or 5) associated terms as columns.

Any help would be greatly appreciated.

Thanks in advance.

Structure of my TDM:

> str(ytdm)
List of 6
 $ i       : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
 $ j       : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
 $ nrow    : int 193
 $ ncol    : int 39
 $ dimnames:List of 2
  ..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
  ..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

My ideal output is like this:


1947   account accur gao medicine fed ......
1948   access  .............
.
.
.
.
.
.
Spaceless answered 22/5, 2013 at 15:31 Comment(0)

Your example can't be reproduced, but findAssocs() is probably what you're looking for. Since you only want to look at associations on a yearly basis, you'll need a dtm for each year.

> library(tm)
> data(crude)
> # I don't have your data, so pretend this is a corpus of docs for each year
> names(crude) <- rep(c("1999","2000"),10)
> # create a dtm for each year
> dtm.list <- lapply(unique(names(crude)),function(x) TermDocumentMatrix(crude[names(crude)==x]))
> # get associations for each year
> assoc.list <- lapply(dtm.list,findAssocs,term="oil",corlimit=0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
 prices barrel. 
   0.79    0.70 

$`2000`
     15.8      opec       and      said   prices,      sell       the  analysts   clearly     fixed 
     0.94      0.94      0.92      0.92      0.91      0.91      0.88      0.85      0.85      0.85 
     late   meeting     never      that    trying       who    winter emergency     above       but 
     0.85      0.85      0.85      0.85      0.85      0.85      0.85      0.84      0.83      0.83 
    world      they       mln    market agreement    before       bpd    buyers    energy    prices 
     0.82      0.80      0.79      0.78      0.75      0.75      0.75      0.75      0.75      0.75 
      set   through     under      will       not       its 
     0.75      0.75      0.75      0.74      0.72      0.70 

> # or if you want the 5 top terms
> assoc.list <- lapply(dtm.list,function(x) names(findAssocs(x,"oil",0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices"   "barrel."  "said."    "minister" "arabian" 

$`2000`
[1] "15.8"    "opec"    "and"     "said"    "prices,"
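
The years-by-top-n matrix asked for in the question can be assembled from per-year results like these. What follows is only a minimal sketch under stated assumptions, not the tm API: `top_assoc()`, `mk()` and the toy counts are all hypothetical, and the helper works on a plain term-by-document matrix (for example the result of as.matrix() on one year's TermDocumentMatrix) rather than on findAssocs() output, because depending on the tm version findAssocs() may return a list keyed by term instead of a bare named vector. It ranks terms by Pearson correlation with the target term, which is, as far as I know, the statistic findAssocs() is based on.

```r
# Hypothetical helper (not part of tm): the n terms most correlated with
# `term` in a plain term-by-document count matrix, e.g. as.matrix(tdm).
top_assoc <- function(m, term, n = 5) {
  out <- rep(NA_character_, n)
  # Without the term, or with fewer than two documents, no correlation exists.
  if (!(term %in% rownames(m)) || ncol(m) < 2) return(out)
  others <- m[rownames(m) != term, , drop = FALSE]
  if (nrow(others) == 0) return(out)
  # Pearson correlation of every other term's counts with the target's counts;
  # constant rows give NA (with a warning) and are dropped below.
  r <- suppressWarnings(cor(t(others), m[term, ]))
  r <- setNames(as.vector(r), rownames(others))
  r <- sort(r[!is.na(r)], decreasing = TRUE)
  k <- min(n, length(r))
  if (k > 0) out[seq_len(k)] <- names(r)[seq_len(k)]
  out
}

# Toy stand-in for real data: one small matrix per year (made-up counts).
mk <- function(...) {
  m <- rbind(...)
  colnames(m) <- paste0("doc", seq_len(ncol(m)))
  m
}
by_year <- list(
  "1947" = mk(fraud = c(1, 2, 3, 4), audit = c(2, 4, 6, 8),
              claim = c(4, 3, 2, 1), bank  = c(1, 3, 2, 5)),
  "1948" = mk(fraud = c(3, 1, 4, 2), audit = c(1, 3, 2, 5),
              claim = c(2, 1, 3, 3), bank  = c(6, 2, 8, 4))
)

# One row per year, one column per rank; short rows are padded with NA.
res <- do.call(rbind, lapply(by_year, top_assoc, term = "fraud", n = 2))
colnames(res) <- paste0("top", seq_len(ncol(res)))
print(res)
```

Padding with NA keeps every row the same length, so rbind() yields a rectangular character matrix even when a year has fewer than n usable terms or lacks the target term altogether.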
Perimeter answered 22/5, 2013 at 15:40 Comment(6)
Hi David, I've used findAssocs; it returns the terms that are associated with a specific term. In my case, though, I need to find how those associated words have changed over time, so I thought of creating a matrix of years by top n associated terms to show this. Please feel free to suggest alternatives. - Spaceless
I see, I misunderstood. I'm not sure that's going to be possible using a bag-of-words approach like a dtm unless you have multiple documents for each year, because you're going to need some variance. If you don't, then I suppose you could discretize your documents, for example by grouping documents by decade, then creating dtms and running findAssocs on each one. - Perimeter
Actually, I do have multiple documents for each year. I've concatenated the text so that I could build a Document Term Matrix and try. - Spaceless
Try not concatenating them and creating a document term matrix for each year, then running findAssocs() on each of those document term matrices. I'll see if I can think of an easy way to show an example and edit my post, but I'm not going to go hunting for data to use. - Perimeter
David, this is fantastic! Exactly what I'm looking for. I'm getting the following error: > assoc.list = lapply(dtm.list, findAssocs, term = 'fraud', corlimit=0.8) Error in x[term, ] : subscript out of bounds. I suppose the matrix is too big for R to handle. How can I remove sparse terms from the DTMs while creating them, so that I can reduce the size of the matrix? - Spaceless
It's difficult to know what's going on without seeing more code and knowing what the actual data you're using looks like. If the issue is findAssocs returning a vector too large for R to store in a typical vector, then you can limit the number of terms it returns by modifying your corlimit parameter or using a hard index, like: lapply(dtm.list, function(x) findAssocs(x, term = 'fraud', corlimit=0.8)[1:10^5]). I'm almost positive that your error is something else, though. - Perimeter
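
One plausible cause of a "subscript out of bounds" error in x[term, ] (an assumption, since the actual data isn't shown) is that the term simply doesn't occur in some year's matrix, so indexing by its name fails. A minimal sketch of checking for that is below; `mk()` and the toy matrices are hypothetical stand-ins, and for real tm objects Terms(x) would play the role of rownames(m). To shrink the matrices themselves, tm provides removeSparseTerms().

```r
# Toy stand-ins for per-year term-by-document matrices (hypothetical data).
mk <- function(terms) {
  matrix(1, nrow = length(terms), ncol = 2,
         dimnames = list(Terms = terms, Docs = c("doc1", "doc2")))
}
by_year <- list("1947" = mk(c("fraud", "audit")),
                "1948" = mk(c("audit", "bank")))

# Flag the years in which the target term is present at all.
has_term <- vapply(by_year, function(m) "fraud" %in% rownames(m), logical(1))

# Years that would trigger the indexing error, and the ones safe to process.
missing_years <- names(has_term)[!has_term]
usable <- by_year[has_term]
print(missing_years)
```

Running the association step only over `usable` (or returning NA for the missing years) avoids the error without changing the results for years where the term does occur.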
