How to count the frequency of a element in APL or J without loops
Asked Answered
P

7

10

Assume I have two lists, one is the text t, one is a list of characters c. I want to count how many times each character appears in the text.

This can be done easily with the following APL code.

+⌿t∘.=c

However it is slow. It take the outer product, then sum each column.

It is a O(nm) algorithm where n and m are the size of t and c.

Of course I can write a procedural program in APL that read t character by character and solve this problem in O(n+m) (assume perfect hashing).

Are there ways to do this faster in APL without loops(or conditional)? I also accept solutions in J.

Edit: Practically speaking, I'm doing this where the text is much shorter than the list of characters(the characters are non-ascii). I'm considering where text have length of 20 and character list have length in the thousands.

There is a simple optimization given n is smaller than m.

w  ← (∪t)∩c
f ←  +⌿t∘.=w
r ← (⍴c)⍴0
r[c⍳w] ← f
r

w contains only the characters in t, therefore the table size only depend on t and not c. This algorithm runs in O(n^2+m log m). Where m log m is the time for doing the intersection operation.

However, a sub-quadratic algorithm is still preferred just in case someone gave a huge text file.

Piling answered 10/8, 2011 at 8:47 Comment(7)
I can't see how it can be O(n+m)... Even if there was a J verb dedicated to do just that, wouldn't it be O(nm)?Gunman
Assume we have this operation addCount(k). This add 1 to the counter that record how many times character k appeared. If this operation can be done in constant time (it's possible if there is a perfect hash function), then we use O(n+m) time. Even if we don't have such hash function, we can still use a binary tree to get a O(n log m) algorithm.Piling
It's an interesting case, to say the least. Do you have figures on how much data you're using (size of t and c), and how long it takes?Gunman
Indeed, I added an update to reflect the practical problem I'm trying to solve.Piling
Thousands of characters in c? Are those Unicode characters?Gunman
Yes. Of course this question can be generalized to find the frequency of elements other than characters, and the only change is switch = to ≡.Piling
Is it possible that t contains characters not present in c? For example could it be that t = 'abcd1234' and c = 'abcdefgh'?Gerah
R
8

NB. Using "key" (/.) adverb w/tally (#) verb counts

   #/.~ 'abdaaa'
4 1 1

NB. the items counted are the nub of the string.

   ~. 'abdaaa'
abd

NB. So, if we count the target along with the string

   #/.~ 'abc','abdaaa'
5 2 1 1

NB. We get an extra one for each of the target items.

   countKey2=: 4 : '<:(#x){.#/.~ x,y'

NB. This subtracts 1 (<:) from each count of the xs.

   6!:2 '''1'' countKey2 10000000$''1234567890'''
0.0451088
   6!:2 '''1'' countKey2 1e7$''1234567890'''
0.0441849
   6!:2 '''1'' countKey2 1e8$''1234567890'''
0.466857

NB. A tacit version

   countKey=. [: <: ([: # [) {. [: #/.~ ,

NB. appears to be a little faster at first

   6!:2 '''1'' countKey 1e8$''1234567890'''
0.432938

NB. But repeating the timing 10 times shows they are the same.

   (10) 6!:2 '''1'' countKey 1e8$''1234567890'''
0.43914
   (10) 6!:2 '''1'' countKey2 1e8$''1234567890'''
0.43964
Revolutionize answered 12/10, 2011 at 19:10 Comment(0)
P
7

Dyalog v14 introduced the key operator ():

      {⍺,⍴⍵}⌸'abcracadabra'
a 5
b 2
c 2
r 2
d 1

The operand function takes a letter as and the occurrences of that letter (vector of indices) as .

Pricilla answered 17/8, 2014 at 8:56 Comment(0)
F
5

I think this example, written in J, fits your request. The character list is longer than the text (but both are kept short for convenience during development.) I have not examined timing but my intuition is that it will be fast. The tallying is done only with reference to characters that actually occur in the text, and the long character set is looked across only to correlate characters that occur in the text.

   c=: 80{.43}.a.
   t=: 'some {text} to examine'

   RawIndicies=: c i. ~.t
   Mask=: RawIndicies ~: #c
   Indicies=: Mask # RawIndicies
   Tallies=: Mask # #/.~ t
   Result=: Tallies Indicies} (#c)$0

   4 20 $ Result
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 4 0
0 0 1 0 0 0 2 1 2 0 0 0 1 3 0 0 0 2 0 0
   4 20 $ c
+,-./0123456789:;<=>
?@ABCDEFGHIJKLMNOPQR
STUVWXYZ[\]^_`abcdef
ghijklmnopqrstuvwxyz
Follmer answered 2/9, 2011 at 13:35 Comment(0)
R
1

As noted in other answers, the key operator does this directly. However the classic APL way of solving this problem is still worth knowing.

The classic solution is "sort, shift, and compare":

      c←'missippi'
      t←'abcdefghijklmnopqrstuvwxyz'
      g←⍋c
      g
1 4 7 0 5 6 2 3
      s←c[g]
      s
iiimppss
      b←s≠¯1⌽s
      b
1 0 0 1 1 0 1 0
      n←b/⍳⍴b
      n
0 3 4 6
      k←(1↓n,⍴b)-n
      k
3 1 2 2
      u←b/s
      u
imps

And for the final answer:

       z←(⍴t)⍴0
       z
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
       z[t⍳u]←k
       z
0 0 0 0 0 0 0 0 3 0 0 0 1 0 0 2 0 0 2 0 0 0 0 0 0 0

This code is off the top of my head, not ready for production. Have to look for empty cases - the boolean shift is probably not right for all cases....

Rutger answered 18/8, 2014 at 11:41 Comment(0)
P
0

"Brute force" in J:

count =: (i.~~.) ({,&0) (]+/"1@:=)

Usage:

   'abc' count 'abdaaa'
4 1 0

Not sure how it's implemented internally, but here are the timings for different input sizes:

   6!:2 '''abcdefg'' count 100000$''abdaaaerbfqeiurbouebjkvwek''' NB: run time for #t = 100000 
0.00803909
   6!:2 '''abcdefg'' count 1000000$''abdaaaerbfqeiurbouebjkvwek'''
0.0845451
   6!:2 '''abcdefg'' count 10000000$''abdaaaerbfqeiurbouebjkvwek''' NB: and for #t = 10^7
0.862423

We don't filter input date prior to 'self-classify' so:

   6!:2 '''1'' count 10000000$''1'''
0.244975
   6!:2 '''1'' count 10000000$''1234567890'''
0.673034
   6!:2 '''1234567890'' count 10000000$''1234567890'''
0.673864
Planula answered 9/9, 2011 at 15:49 Comment(0)
D
0

My implementation in APL (NARS2000):

(∪w),[0.5]∪⍦w←t∩c

Example:

      c←'abcdefg'
      t←'abdaaaerbfqeiurbouebjkvwek'
      (∪w),[0.5]∪⍦w←t∩c
a b d e f
4 4 1 4 1

Note: showing only those characters in c that exist in t

Derwood answered 30/4, 2013 at 15:48 Comment(0)
T
0

My initial thought was that this was a case for the Find operator:

T←'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
C←'MISSISSIPPI'
X←+/¨T⍷¨⊂C

The used characters are:

       (×X)/T
IMPS

Their respective frequencies are:

       X~0
4 1 2 4

I've only run toy cases so I have no idea what the performance is, but my intuition tells me it should be cheaper that the outer product. Any thoughts?

Tetrode answered 24/4, 2015 at 11:53 Comment(1)
I just saw that I flipped the variables C and T from the OP, but the logic is the same.Tetrode

© 2022 - 2024 — McMap. All rights reserved.