Function/instruction to count number of times a value has already been seen

I'm trying to identify if MATLAB or R has a function that resembles the following.

Say I have an input vector v.

v = [1, 3, 1, 2, 4, 2, 1, 3]

I want to generate a vector w of the same length as v. Each element w[i] should tell me the following: for the corresponding value v[i], how many times has this value already been encountered in v, i.e. in all elements of v up to, but not including, position i. In this example

w = [0, 0, 1, 0, 0, 1, 2, 1]

I'm really looking to see if any statistical or domain-specific languages have a function/instruction like this and what it might be called.

Dongola answered 20/8, 2014 at 9:19 Comment(4)
Roland, if you don't mind I will keep the original tags. My objective isn't to use the function itself but rather to identify some languages where it can be found. R and Matlab seemed like the best starting place. - Dongola
An efficient solution, O(n), should use an accumulator array. - Soluk
Please study the tag wikis. The instructions tag doesn't fit, and the objective stated in your comment would make your question off-topic (you are asking for a tool recommendation). - Gerena
In R you could also use dplyr like this: library(dplyr); data.frame(v) %>% group_by(v) %>% mutate(count = row_number()-1) (the result would be a data.frame, but you could easily extract the count column if you need it separately). - Acceptable
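
The accumulator idea from the comment above can be sketched in R roughly as follows. This is only an illustration, not something proposed in the thread: the function name count_seen_before is hypothetical, an environment stands in for the accumulator (a hash map from value to running count), and values are keyed by as.character, which assumes that distinct values have distinct string forms.

```r
# Hypothetical helper: one pass over v, keeping a running count per value.
count_seen_before <- function(v) {
  seen <- new.env(hash = TRUE)      # accumulator: value (as string) -> count so far
  w <- integer(length(v))
  for (i in seq_along(v)) {
    key <- as.character(v[i])
    n <- seen[[key]]                # NULL if this value has not been seen yet
    if (is.null(n)) n <- 0L
    w[i] <- n                       # occurrences strictly before position i
    seen[[key]] <- n + 1L
  }
  w
}

v <- c(1, 3, 1, 2, 4, 2, 1, 3)
count_seen_before(v)
#[1] 0 0 1 0 0 1 2 1
```

Each element is visited once, so the work grows linearly with length(v), although an explicit R loop may still be slower in practice than vectorised code on short inputs.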

In R, you can try this:

 v <- c(1,3,1,2,4,2,1,3)
 ave(v, v, FUN=seq_along)-1
 #[1] 0 0 1 0 0 1 2 1

Explanation

 # Passing seq_along(v) as the first argument may be safer, since it also works when v is a factor or another non-numeric class.
 ave(seq_along(v), v, FUN=seq_along)
 #[1] 1 1 2 1 1 2 3 2

Here, we are grouping the sequence of positions by the values of v. Within each group, seq_along numbers the elements 1, 2, 3, and so on. In the case of v, the elements with value 1 are at positions 1, 3 and 7, so those positions receive 1, 2 and 3. Subtracting 1 makes the counts start from 0.

To understand it better, the same steps can be written out explicitly:

 lst1 <- split(v,v)
 lst2 <- lapply(lst1, seq_along)
 unsplit(lst2, v)
 #[1] 1 1 2 1 1 2 3 2
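
As a small sketch of how the grouped seq_along idea could be wrapped for reuse (the name prior_count is made up here and is not part of R), using seq_along(x) as the first argument so that factor input also works:

```r
# Hypothetical wrapper around the ave() idiom shown above.
prior_count <- function(x) {
  # Number the positions within each group of equal values, then shift to start at 0.
  ave(seq_along(x), x, FUN = seq_along) - 1L
}

prior_count(c(1, 3, 1, 2, 4, 2, 1, 3))
#[1] 0 0 1 0 0 1 2 1
prior_count(factor(c("b", "a", "b", "b")))
#[1] 0 0 1 2
```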

Using data.table

  library(data.table)
  DT <- data.table(v, ind=seq_along(v))
  DT[, n:=(1:.N)-1, by=v][,n[ind]]
  #[1] 0 0 1 0 0 1 2 1
Pedroza answered 20/8, 2014 at 9:23 Comment(7)
And of course this could be done faster with data.table or dplyr. - Gerena
@Gerena Thanks, I updated with a data.table solution. Do you know if there is any better way in data.table to keep the initial order? - Pedroza
@Roland, I doubt that converting a numeric vector to a data.table object and then operating on it will make it faster. My guess is that it will be slower, but I could be wrong. - Mizzen
@DavidArenburg With v <- sample(1:1e4, 1e6, TRUE) the data.table solution data.table(v)[, w:=seq_len(.N) - 1, by=v][["w"]] is faster by about a factor of 5 in my benchmark. - Gerena
@Roland, I didn't benchmark, but I believe you. You are comparing with ave, right? - Mizzen
Yep, ave is a valuable tool with an unfortunate name. It should be called "do_anything" :-) - Cicatrix
@Pedroza If you use := you don't need to worry about the order. - Gerena

In MATLAB there is no built-in function for that (as far as I know), but you can achieve it this way:

w = sum(triu(bsxfun(@eq, v, v.'), 1));

Explanation: bsxfun(...) compares every element with every other element, giving a square logical matrix. Then triu(..., 1) keeps only the matches between an element and the elements before it (i.e. the entries above the diagonal). Finally, sum(...) counts, for each column, the matches with previous elements.
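
For readers following along in R, the same pairwise-comparison idea can be sketched with outer and upper.tri. This is an illustrative translation, not part of the original answer, and the variable names are arbitrary:

```r
v <- c(1, 3, 1, 2, 4, 2, 1, 3)
eq <- outer(v, v, "==")           # eq[i, j] is TRUE when v[i] == v[j]
w <- colSums(eq & upper.tri(eq))  # keep only pairs with i < j, then count per column j
w
#[1] 0 0 1 0 0 1 2 1
```

Like the bsxfun version, this builds a full n-by-n matrix, so memory use grows quadratically with the length of v.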


A more explicit, but slower alternative (not recommended) is:

w = arrayfun(@(n) sum(v(1:n-1)==v(n)), 1:numel(v));

Explanation: for each index n (where n varies as 1:numel(v)), compare all previous elements v(1:n-1) to the current element v(n), and get the number of matches (sum(...)).

Kissiah answered 20/8, 2014 at 9:21 Comment(2)
Thank you for your solution. It's a pity that there isn't something slightly more primitive in Matlab. - Dongola
I agree. But you could encapsulate the bsxfun line into a function, so it will be "primitive" to you. As you see in The Minion's answer, it's pretty fast. - Kissiah

R has a function called make.unique that can be used to obtain the required result. First use it to make all elements unique:

(v.u <- make.unique(as.character(v))) # it only works on character vectors so you must convert first
[1] "1"   "3"   "1.1" "2"   "4"   "2.1" "1.2" "3.1"

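You can then take this vector, strip the original value from each element, convert the resulting empty strings to 0, and convert back to integer to get the counts. Note that the pattern used here assumes v holds non-negative integers; decimal or negative values would need a different pattern.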

as.integer(sub("^$","0",sub("[0-9]+\\.?","",v.u)))
[1] 0 0 1 0 0 1 2 1
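
A possible variation, sketched under the assumption that v is numeric: make.unique accepts a sep argument, so choosing a separator that cannot occur in the text form of a number makes the suffix easy to pick out even for decimal or negative values. The variable names below are illustrative only.

```r
v <- c(1.5, 3, 1.5, -2, 3, 1.5)
# "_" never appears in the text form of a number, so it marks the suffix unambiguously.
u <- make.unique(as.character(v), sep = "_")
has_suffix <- grepl("_", u, fixed = TRUE)
w <- integer(length(u))                                      # first occurrences stay 0
w[has_suffix] <- as.integer(sub("^.*_", "", u[has_suffix]))  # text after the last "_"
w
#[1] 0 0 1 0 1 2
```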
Goatish answered 20/8, 2014 at 10:2 Comment(2)
That's quite interesting. It's a pity that you have to convert to and from strings though. I think there could be scenarios where this kind of thing would be useful in integer form. - Dongola
Interesting approach +1 - Dimaggio

If you want to use a for-loop in MATLAB, you can get the result with:

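res=zeros(size(v));            % preallocate the result
for c=1:length(v)
    helper=find(v==v(c));      % indices of all occurrences of v(c)
    res(c)=find(helper==c)-1;  % number of occurrences before position c
end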

I'm not sure about the runtime compared to Luis Mendo's solution. I'm going to check that now.

Edit

Running the code 10,000 times results in:

My Solution: Elapsed time is 0.303828 seconds.
Luis Mendo's Solution (bsxfun): Elapsed time is 0.180215 seconds.
Luis Mendo's Solution (arrayfun): Elapsed time is 3.868467 seconds.

So the bsxfun solution is the fastest, followed by the for-loop and then the arrayfun solution. I'm going to generate longer v arrays now and see if anything changes.

Edit 2

Changing v to

v = ceil(rand(100,1)*8);

resulted in a more obvious runtime ranking:

My Solution: Elapsed time is 4.020916 seconds.
Luis Mendo's Solution (bsxfun): Elapsed time is 0.808152 seconds.
Luis Mendo's Solution (arrayfun): Elapsed time is 22.126661 seconds.
Dupin answered 20/8, 2014 at 9:38 Comment(0)
