Count numbers that are lower by constant from current number
Asked Answered
C

4

6

Imagine that I have a list of numbers (i.e. numbers column in data.table/data.frame).

1
5
5
10
11
12

for each number in a list a want to count how many unique numbers are there which are lower than that particular number + 5.

The explanation for upper case, first number = 1, search range is 1+5 = 6, so three numbers are in range, less than or equal to: c(1,5,5), and then count unique is 2. This is all by assuming we've got the additional condition, that the number must not only be lower than current_number + 5, but also its index in the list must be >= that of current_number.

The result in this case would be:

2
2
2
3
2
1

Note: Is there a fast solution for huge dataset, in data.frame or data.table? My dataset is rather huge, 10+M rows.

Castroprauxel answered 14/2, 2020 at 14:27 Comment(8)
I don't get the logic. What does search range mean here? For the first element (1) the comparison value is 6. 3 unique values in the vector are higher than 6, but your result is 2. Where do I misunderstand you?Candelaria
Clarify how you are getting 3 on 4th row?Engelhardt
Shouldn't it supposed to be 2, 2, 2, 5, 5, 5?...Also I think you mean less than..? because 1, 5, 5 are less than 6..? I don't know. I m confusedInkster
@Candelaria I meant lower, sorry!Castroprauxel
After the "lower" edit, let's look at the last number, 12. 12 + 5 = 17. All the numbers are lower than 17, so the set is {1, 5, 5, 10, 11, 12}, and the number of unique numbers in that set is 5. Why is your desired result 2? Perhaps, the search range is [x, x+5] rather than (-Inf, x+5]? Also, please clarify by "lower than" if you mean < or <=.Wholesome
@Gregor-reinstateMonica, I didn't explained it well. :( when you calculate this for one number, it is dropped from the database. It's like you have an apply over data.table.Castroprauxel
Oh, so it's counting how many numbers after x in the input are < (or maybe <=) x + 5?Wholesome
@Gregor-reinstateMonica, Yes <=(x+5). Sorry for the confusion.Castroprauxel
H
7

The fastest way I can think of in base R (works if x is sorted):

findInterval(x + 5, unique(x)) - cumsum(!duplicated(x)) + 1L
#[1] 2 2 2 3 2 1

edit: no problem with the sorting because with data.table, sorting integers is trivial:

nr <- 1e7
nn <- nr/2
set.seed(0L)
DT <- data.table(X=sample(nn, nr, TRUE))
#DT <- data.table(X=c(1,5,5,10,11,12))

system.time(
    DT[order(X), 
        COUNT := findInterval(X + 5L, unique(X)) - cumsum(!duplicated(X)) + 1L
    ]
)
#   user  system elapsed 
#   1.73    0.17    1.53 

2s for 10million rows.

Hairworm answered 14/2, 2020 at 16:17 Comment(0)
E
6

Try this:

x <- c(1,5,5,10,11,12)

sapply(seq_along(x), function(i)
  sum(unique(x[i:length(x)]) <= (x[i] + 5)))
# [1] 2 2 2 3 2 1
Engelhardt answered 14/2, 2020 at 14:59 Comment(1)
With 10M rows, not sure this is the right solution. Any data.table experts?Engelhardt
F
2

One option is to use a sql self-join

library(sqldf)


df$r <- seq(nrow(df))

sqldf('
select    a.V1
          , count(distinct b.V1) as n
from      df a
          left join df b
            on  b.V1 <= a.V1 + 5
                and b.r >= a.r
group by  a.r
')

#   V1 n
# 1  1 2
# 2  5 2
# 3  5 2
# 4 10 3
# 5 11 2
# 6 12 1

Data used:

df <- structure(list(V1 = c(1L, 5L, 5L, 10L, 11L, 12L)), row.names = c(NA, 
-6L), class = "data.frame")
Fulminant answered 14/2, 2020 at 14:45 Comment(0)
C
0
sapply(yourVector + 5, function(x, y) sum(x > y), y = unique(x))
Candelaria answered 14/2, 2020 at 14:47 Comment(1)
As far as I get this is the logic you are describing. However, your expected result shows something else. Please, clarify.Candelaria

© 2022 - 2024 — McMap. All rights reserved.