Function/instruction to count number of times a value has already been seen

D

4

7

I'm trying to identify if MATLAB or R has a function that resembles the following.

Say I have an input vector v.

v = [1, 3, 1, 2, 4, 2, 1, 3]

I want to generate a vector w of the same length as v. Each element w[i] should tell me the following: for the corresponding value v[i], how many times has this value already been encountered in v, i.e. in all elements of v up to, but not including, position i. In this example:

w = [0, 0, 1, 0, 0, 1, 2, 1]

I'm really looking to see if any statistical or domain-specific languages have a function/instruction like this and what it might be called.

Dongola answered 20/8, 2014 at 9:19 Comment(4)

Roland, if you don't mind I will keep the original tags. My objective isn't to use the function itself but rather identify some languages where it can be found. R and Matlab seemed like the best starting place. – Dongola 20/8, 2014 at 9:30

Efficient solution ( O(n) ) should include accumulator array. – Soluk 20/8, 2014 at 9:39

Please study the tag wikis. the instructions tag doesn't fit and the objective stated in your comment would make your question off-topic (you are asking for a tool recommendation). – Gerena 20/8, 2014 at 9:52

In R you could also use dplyr like this: library(dplyr); data.frame(v) %>% group_by(v) %>% mutate(count = row_number()-1) (the result would be a data.frame but you could easily extract the count column if you need it separate). – Acceptable 20/8, 2014 at 10:43

P

7

In R, you can try this:

 v <- c(1,3,1,2,4,2,1,3)
 ave(v, v, FUN=seq_along)-1
 #[1] 0 0 1 0 0 1 2 1

Explanation

 ave(seq_along(v), v, FUN=seq_along)  #It may be better to use `seq_along(v)` considering different classes i.e. `factor` also.
 #[1] 1 1 2 1 1 2 3 2

Here, we are grouping the sequence of elements by v. Within each group, seq_along generates 1, 2, 3, and so on. In the case of v, the elements of group 1 are in positions 1, 3, 7, so those positions receive 1, 2, 3. Subtracting 1 makes the counts start from 0.

To understand it better,

 lst1 <- split(v,v)
 lst2 <- lapply(lst1, seq_along)
 unsplit(lst2, v)
 #[1] 1 1 2 1 1 2 3 2

Using data.table

  library(data.table)
  DT <- data.table(v, ind=seq_along(v))
  DT[, n:=(1:.N)-1, by=v][,n[ind]]
  #[1] 0 0 1 0 0 1 2 1
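A comment on the question notes that an O(n) solution should use an accumulator array. A minimal base-R sketch of that idea follows. The function name count_seen is hypothetical, and the sketch assumes v contains only positive integers, so that each value can index the accumulator directly:

```r
# Hypothetical helper (not from any package): single-pass running count.
# Assumption: v holds positive integers, so v[i] can be used as an index.
count_seen <- function(v) {
  if (length(v) == 0L) return(integer(0))
  acc <- integer(max(v))          # acc[k] = times value k has been seen so far
  w <- integer(length(v))
  for (i in seq_along(v)) {
    w[i] <- acc[v[i]]             # occurrences before position i
    acc[v[i]] <- acc[v[i]] + 1L   # record the current occurrence
  }
  w
}

count_seen(c(1, 3, 1, 2, 4, 2, 1, 3))
#[1] 0 0 1 0 0 1 2 1
```

For values that are not small positive integers, mapping them to group indices first, e.g. with match(v, unique(v)), should make the same loop applicable.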

Pedroza answered 20/8, 2014 at 9:23 Comment(7)

And of course this could be done faster with data.table or dplyr. – Gerena 20/8, 2014 at 9:50

@Gerena Thanks, I updated with a data.table solution. Do you know if there is any better way in data.table to keep the initial order? – Pedroza 20/8, 2014 at 10:33

@Roland, I doubt that converting a numeric vector to a data.table object and then operating on it will make it faster. My guess it will be slower. But could be wrong though – Mizzen 20/8, 2014 at 10:44

@DavidArenburg With v <- sample(1:1e4, 1e6, TRUE) the data.table solution data.table(v)[, w:=seq_len(.N) - 1, by=v][["w"]] is faster by about a factor of 5 in my benchmark. – Gerena 20/8, 2014 at 11:14

@Roland, didn't benchmark, but I believe you. You are comparing with ave right? – Mizzen 20/8, 2014 at 11:27

Yep, ave is a valuable tool with an unfortunate name. It should be called "do_anything" :-) – Cicatrix 20/8, 2014 at 11:39

@Pedroza If you use := you don't need to worry about the order. – Gerena 20/8, 2014 at 11:44

K

7

In Matlab there is not a function for that (as far as I know), but you can achieve it this way:

w = sum(triu(bsxfun(@eq, v, v.'), 1));

Explanation: bsxfun(...) compares every element with every other element. Then triu(..., 1) keeps only the matches of an element with previous elements (i.e. values above the diagonal). Finally, sum(...) adds up, column by column, the matches with previous elements.
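For comparison with the R answers, the same pairwise-comparison idea can be sketched in base R, with outer in place of bsxfun and upper.tri in place of triu. This is only an illustrative translation, not part of the MATLAB solution, and like the original it builds an n-by-n matrix:

```r
v <- c(1, 3, 1, 2, 4, 2, 1, 3)
eq <- outer(v, v, "==")            # eq[i, j] is TRUE when v[i] == v[j]
w <- colSums(eq & upper.tri(eq))   # keep only pairs with i < j, count per column
w
#[1] 0 0 1 0 0 1 2 1
```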

A more explicit, but slower alternative (not recommended) is:

w = arrayfun(@(n) sum(v(1:n-1)==v(n)), 1:numel(v));

Explanation: for each index n (where n varies as 1:numel(v)), compare all previous elements v(1:n-1) to the current element v(n), and get the number of matches (sum(...)).

Kissiah answered 20/8, 2014 at 9:21 Comment(2)

Thank you for your solution. It's a pity that there isn't something slightly more primitive in Matlab. – Dongola 20/8, 2014 at 9:37

I agree. But you could encapsulate the bsxfun line into a function, so it will be "primitive" to you. As you see in The Minion's answer, it's pretty fast – Kissiah 20/8, 2014 at 11:47

G

4

R has a function called make.unique that can be used to obtain the required result. First use it to make all elements unique:

(v.u <- make.unique(as.character(v))) # it only works on character vectors so you must convert first
[1] "1"   "3"   "1.1" "2"   "4"   "2.1" "1.2" "3.1"

You can then take this vector, remove the original data, convert the blanks to 0, and convert back to integer to get the counts:

as.integer(sub("^$","0",sub("[0-9]+\\.?","",v.u)))
[1] 0 0 1 0 0 1 2 1

Goatish answered 20/8, 2014 at 10:2 Comment(2)

That's quite interesting. It's a pity that you have to convert to and from strings though. I think there could be scenarios where this kind of thing would be useful in integer form. – Dongola 20/8, 2014 at 11:16

Interesting approach +1 – Dimaggio 20/8, 2014 at 12:20

D

3

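If you want to use a for-loop in MATLAB you can get the result with:

res = zeros(size(v));                % preallocate the output
for c = 1:length(v)
    helper = find(v == v(c));        % positions of every occurrence of v(c)
    res(c) = find(helper == c) - 1;  % rank of position c among them, minus 1
end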

Not sure about the runtime compared to Luis Mendo's solution. Going to check that now.

Edit

Running the code 10,000 times results in:

My Solution: Elapsed time is 0.303828 seconds.
Luis Mendo's Solution (bsxfun): Elapsed time is 0.180215 seconds.
Luis Mendo's Solution (arrayfun): Elapsed time is 3.868467 seconds.

So the bsxfun solution is fastest, then the for-loop, followed by the arrayfun solution. Going to generate longer v-arrays now and see if something changes.

Edit 2

Changing v to

v = ceil(rand(100,1)*8);

resulted in a more obvious runtime ranking:

My Solution: Elapsed time is 4.020916 seconds.
Luis Mendo's Solution (bsxfun): Elapsed time is 0.808152 seconds.
Luis Mendo's Solution (arrayfun): Elapsed time is 22.126661 seconds.

Dupin answered 20/8, 2014 at 9:38 Comment(0)
