Combining Lists of Word Frequency Data
Asked Answered
B

5

11

This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results.

I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica. Each list appears as the following and has been generated using the Tally[ ] and Sort[ ] commands:

{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in",
16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

Here is an example of the second file:

{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and",
14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

I want to combine them so that the frequency data aggregates: i.e. if the second file has 30,419 occurrences of 'the' and is joined to the first file, it should return that there are 72,635 occurrences (and so on as I move through the entire collection).

Buschi answered 24/10, 2011 at 13:0 Comment(2)
A closely related question: https://mcmap.net/q/1013547/-aggregating-tally-countersSmoothtongued
Also somewhat related: #7750133Diomedes
I
10

It sounds like you need GatherBy.

Suppose your two lists are named data1 and data2, then use

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]

This easily generalizes to any number of lists, not just two.

Isotone answered 24/10, 2011 at 13:57 Comment(0)
H
8

Try using a hash table, like this. First set things up:

ClearAll[freq];
freq[_] = 0;

Now eg freq["safas"] returns 0. Next, if the lists are defined as

lst1 = {{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 
    16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 
    14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 
    5735}, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 
    1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 
    1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 
    1}, {"aaa", 1}};
lst2 = {{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 
    16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 
    11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 
    7748}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 
    1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 
    1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}};

you may run this

Scan[(freq[#[[1]]] += #[[2]]) &, lst1]

after which eg

freq["the"]
(*
42216
*)

and then the next list

Scan[(freq[#[[1]]] += #[[2]]) &, lst2]

after which eg

freq["the"]
72635

while still

freq["safas"]
(*
0
*)
Hinrichs answered 24/10, 2011 at 13:36 Comment(2)
This works really quickly! Is there a way, however, to output the list again with the final results - i.e. an aggregate list of all the terms (i.e. {{"the",72635},{"of",41165}...) [can be in any format]Buschi
@ian.milligan try eg this #7165669Hinrichs
S
8

Here is a direct Sow/Reap function:

Reap[#2~Sow~# & @@@ data1~Join~data2;, _, {#, Tr@#2} &][[2]]

Here is a concise form of acl's method:

Module[{c},
  c[_] = 0;

  c[#] += #2 & @@@ data1~Join~data2;

  {#[[1, 1]], #2} & @@@ Most@DownValues@c
]

This appears to be a bit faster than Szabolcs code on my system:

data1 ~Join~ data2 ~GatherBy~ First /.
  {{{x_, a_}, {x_, b_}} :> {x, a + b}, {x : {_, _}} :> x}
Smoothtongued answered 24/10, 2011 at 19:6 Comment(2)
Also, the compact form of acl's method is particularly clever. I do have one question, though, in the Reap implementation, why the semi-colon after the Sow statement?Tilford
@Tilford that's a good question. It's an old habit, but maybe not a good one. It visually reminds me that I am not using the direct result of that expression and I once thought it was more efficient, but that does not seem to be the case. It does have the advantage of suppressing a large output, which can be very slow to format even with the Skeleton display, if I forget the [[2]] after Reap. I should consider using Scan or Do instead.Smoothtongued
T
6

There's an old saying, "if all you have is a hammer, everything becomes a nail." So, here's my hammer: SelectEquivalents.

This can be done a little quicker using SelectEquivalents:

SelectEquivalents[data1~Join~data2, #[[1]]&, #[[2]]&, {#1, Total[#2]}&]

In order, the first param is obviously just the joined lists, the second one is what they're grouped by (in this case the first element), the third param strips off the string leaving just the count, and the fourth param puts it back together with the string as #1 and the counts in a list as #2.

Tilford answered 24/10, 2011 at 14:52 Comment(1)
@ian.milligan, also check out Faysal's variant of SelectEquivalents. I, personally, wouldn't make everything an Option, but his variant is extremely flexible. And, both versions are more than capable of running rings around GatherBy.Tilford
M
3

Try ReplaceRepeated.

Join the lists. Then use

//. {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}
Mulkey answered 24/10, 2011 at 13:24 Comment(2)
I'm sure there are faster ways. In my edit I placed {a,c1+c2} at the end of the rule's output, to save a bit of time.Mulkey
Though conceptually interesting, this is very slow.Smoothtongued

© 2022 - 2024 — McMap. All rights reserved.