I have a vector of numbers:
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
How can I have R count the number of times a value x appears in the vector?
I have a vector of numbers:
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
How can I have R count the number of times a value x appears in the vector?
You can just use table()
:
> a <- table(numbers)
> a
numbers
4 5 23 34 43 54 56 65 67 324 435 453 456 567 657
2 1 2 2 1 1 2 1 2 1 3 1 1 1 1
Then you can subset it:
> a[names(a)==435]
435
3
Or convert it into a data.frame if you're more comfortable working with that:
> as.data.frame(table(numbers))
numbers Freq
1 4 2
2 5 1
3 23 2
4 34 2
...
The most direct way is sum(numbers == x)
.
numbers == x
creates a logical vector which is TRUE at every location that x occurs, and when sum
ing, the logical vector is coerced to numeric which converts TRUE to 1 and FALSE to 0.
However, note that for floating point numbers it's better to use something like: sum(abs(numbers - x) < 1e-6)
.
I would probably do something like this
length(which(numbers==x))
But really, a better way is
table(numbers)
table(numbers)
is going to do a lot more work than the easiest solution, sum(numbers==x)
, because it's going to figure out the counts of all the other numbers in the list too. –
Manipulate There is also count(numbers)
from plyr
package. Much more convenient than table
in my opinion.
My preferred solution uses rle
, which will return a value (the label, x
in your example) and a length, which represents how many times that value appeared in sequence.
By combining rle
with sort
, you have an extremely fast way to count the number of times any value appeared. This can be helpful with more complex problems.
Example:
> numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,453,435,324,34,456,56,567,65,34,435)
> a <- rle(sort(numbers))
> a
Run Length Encoding
lengths: int [1:15] 2 1 2 2 1 1 2 1 2 1 ...
values : num [1:15] 4 5 23 34 43 54 56 65 67 324 ...
If the value you want doesn't show up, or you need to store that value for later, make a
a data.frame
.
> b <- data.frame(number=a$values, n=a$lengths)
> b
values n
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
I find it is rare that I want to know the frequency of one value and not all of the values, and rle seems to be the quickest way to get count and store them all.
There is a standard function in R for that
tabulate(numbers)
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435 453,435,324,34,456,56,567,65,34,435)
> length(grep(435, numbers))
[1] 3
> length(which(435 == numbers))
[1] 3
> require(plyr)
> df = count(numbers)
> df[df$x == 435, ]
x freq
11 435 3
> sum(435 == numbers)
[1] 3
> sum(grepl(435, numbers))
[1] 3
> sum(435 == numbers)
[1] 3
> tabulate(numbers)[435]
[1] 3
> table(numbers)['435']
435
3
> length(subset(numbers, numbers=='435'))
[1] 3
If you want to count the number of appearances subsequently, you can make use of the sapply
function:
index<-sapply(1:length(numbers),function(x)sum(numbers[1:x]==numbers[x]))
cbind(numbers, index)
Output:
numbers index
[1,] 4 1
[2,] 23 1
[3,] 4 2
[4,] 23 2
[5,] 5 1
[6,] 43 1
[7,] 54 1
[8,] 56 1
[9,] 657 1
[10,] 67 1
[11,] 67 2
[12,] 435 1
[13,] 453 1
[14,] 435 2
[15,] 324 1
[16,] 34 1
[17,] 456 1
[18,] 56 2
[19,] 567 1
[20,] 65 1
[21,] 34 2
[22,] 435 3
here's one fast and dirty way:
x <- 23
length(subset(numbers, numbers==x))
You can change the number to whatever you wish in following line
length(which(numbers == 4))
sum(numbers == 4)
will also do. –
Commutate One option could be to use vec_count()
function from the vctrs
library:
vec_count(numbers)
key count
1 435 3
2 67 2
3 4 2
4 34 2
5 56 2
6 23 2
7 456 1
8 43 1
9 453 1
10 5 1
11 657 1
12 324 1
13 54 1
14 567 1
15 65 1
The default ordering puts the most frequent values at top. If looking for sorting according keys (a table()
-like output):
vec_count(numbers, sort = "key")
key count
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
One more way i find convenient is:
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,453,435,324,34,456,56,567,65,34,435)
(s<-summary (as.factor(numbers)))
This converts the dataset to factor, and then summary() gives us the control totals (counts of the unique values).
Output is:
4 5 23 34 43 54 56 65 67 324 435 453 456 567 657
2 1 2 2 1 1 2 1 2 1 3 1 1 1 1
This can be stored as dataframe if preferred.
as.data.frame(cbind(Number = names(s),Freq = s), stringsAsFactors=F, row.names = 1:length(s))
here row.names has been used to rename row names. without using row.names, column names in s are used as row names in new dataframe
Output is:
Number Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
Using table but without comparing with names
:
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435)
x <- 67
numbertable <- table(numbers)
numbertable[as.character(x)]
#67
# 2
table
is useful when you are using the counts of different elements several times. If you need only one count, use sum(numbers == x)
Base r solution in 2021
aggregate(numbers, list(num=numbers), length)
num x
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
tapply(numbers, numbers, length)
4 5 23 34 43 54 56 65 67 324 435 453 456 567 657
2 1 2 2 1 1 2 1 2 1 3 1 1 1 1
by(numbers, list(num=numbers), length)
num: 4
[1] 2
--------------------------------------
num: 5
[1] 1
--------------------------------------
num: 23
[1] 2
--------------------------------------
num: 34
[1] 2
--------------------------------------
num: 43
[1] 1
--------------------------------------
num: 54
[1] 1
--------------------------------------
num: 56
[1] 2
--------------------------------------
num: 65
[1] 1
--------------------------------------
num: 67
[1] 2
--------------------------------------
num: 324
[1] 1
--------------------------------------
num: 435
[1] 3
--------------------------------------
num: 453
[1] 1
--------------------------------------
num: 456
[1] 1
--------------------------------------
num: 567
[1] 1
--------------------------------------
num: 657
[1] 1
There are different ways of counting a specific elements
library(plyr)
numbers =c(4,23,4,23,5,43,54,56,657,67,67,435,453,435,7,65,34,435)
print(length(which(numbers==435)))
#Sum counts number of TRUE's in a vector
print(sum(numbers==435))
print(sum(c(TRUE, FALSE, TRUE)))
#count is present in plyr library
#o/p of count is a DataFrame, freq is 1 of the columns of data frame
print(count(numbers[numbers==435]))
print(count(numbers[numbers==435])[['freq']])
This is a very fast solution for one-dimensional atomic vectors. It relies on match()
, so it is compatible with NA
:
x <- c("a", NA, "a", "c", "a", "b", NA, "c")
fn <- function(x) {
u <- unique.default(x)
out <- list(x = u, freq = .Internal(tabulate(match(x, u), length(u))))
class(out) <- "data.frame"
attr(out, "row.names") <- seq_along(u)
out
}
fn(x)
#> x freq
#> 1 a 3
#> 2 <NA> 2
#> 3 c 2
#> 4 b 1
You could also tweak the algorithm so that it doesn't run unique()
.
fn2 <- function(x) {
y <- match(x, x)
out <- list(x = x, freq = .Internal(tabulate(y, length(x)))[y])
class(out) <- "data.frame"
attr(out, "row.names") <- seq_along(x)
out
}
fn2(x)
#> x freq
#> 1 a 3
#> 2 <NA> 2
#> 3 a 3
#> 4 c 2
#> 5 a 3
#> 6 b 1
#> 7 <NA> 2
#> 8 c 2
In cases where that output is desirable, you probably don't even need it to re-return the original vector, and the second column is probably all you need. You can get that in one line with the pipe:
match(x, x) %>% `[`(tabulate(.), .)
#> [1] 3 2 3 2 3 1 2 2
This can be done with outer
to get a metrix of equalities followed by rowSums
, with an obvious meaning.
In order to have the counts and numbers
in the same dataset, a data.frame is first created. This step is not needed if you want separate input and output.
df <- data.frame(No = numbers)
df$count <- rowSums(outer(df$No, df$No, FUN = `==`))
A method that is relatively fast on long vectors and gives a convenient output is to use lengths(split(numbers, numbers))
(note the S at the end of lengths
):
# Make some integer vectors of different sizes
set.seed(123)
x <- sample.int(1e3, 1e4, replace = TRUE)
xl <- sample.int(1e3, 1e6, replace = TRUE)
xxl <-sample.int(1e3, 1e7, replace = TRUE)
# Number of times each value appears in x:
a <- lengths(split(x,x))
# Number of times the value 64 appears:
a["64"]
#~ 64
#~ 15
# Occurences of the first 10 values
a[1:10]
#~ 1 2 3 4 5 6 7 8 9 10
#~ 13 12 6 14 12 5 13 14 11 14
The output is simply a named vector.
The speed appears comparable to rle
proposed by JBecker and even a bit faster on very long vectors. Here is a microbenchmark in R 3.6.2 with some of the functions proposed:
library(microbenchmark)
f1 <- function(vec) lengths(split(vec,vec))
f2 <- function(vec) table(vec)
f3 <- function(vec) rle(sort(vec))
f4 <- function(vec) plyr::count(vec)
microbenchmark(split = f1(x),
table = f2(x),
rle = f3(x),
plyr = f4(x))
#~ Unit: microseconds
#~ expr min lq mean median uq max neval cld
#~ split 402.024 423.2445 492.3400 446.7695 484.3560 2970.107 100 b
#~ table 1234.888 1290.0150 1378.8902 1333.2445 1382.2005 3203.332 100 d
#~ rle 227.685 238.3845 264.2269 245.7935 279.5435 378.514 100 a
#~ plyr 758.866 793.0020 866.9325 843.2290 894.5620 2346.407 100 c
microbenchmark(split = f1(xl),
table = f2(xl),
rle = f3(xl),
plyr = f4(xl))
#~ Unit: milliseconds
#~ expr min lq mean median uq max neval cld
#~ split 21.96075 22.42355 26.39247 23.24847 24.60674 82.88853 100 ab
#~ table 100.30543 104.05397 111.62963 105.54308 110.28732 168.27695 100 c
#~ rle 19.07365 20.64686 23.71367 21.30467 23.22815 78.67523 100 a
#~ plyr 24.33968 25.21049 29.71205 26.50363 27.75960 92.02273 100 b
microbenchmark(split = f1(xxl),
table = f2(xxl),
rle = f3(xxl),
plyr = f4(xxl))
#~ Unit: milliseconds
#~ expr min lq mean median uq max neval cld
#~ split 296.4496 310.9702 342.6766 332.5098 374.6485 421.1348 100 a
#~ table 1151.4551 1239.9688 1283.8998 1288.0994 1323.1833 1385.3040 100 d
#~ rle 399.9442 430.8396 464.2605 471.4376 483.2439 555.9278 100 c
#~ plyr 350.0607 373.1603 414.3596 425.1436 437.8395 506.0169 100 b
Importantly, the only function that also counts the number of missing values NA
is plyr::count
. These can also be obtained separately using sum(is.na(vec))
Here is a way you could do it with dplyr:
library(tidyverse)
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
ord <- seq(1:(length(numbers)))
df <- data.frame(ord,numbers)
df <- df %>%
count(numbers)
numbers n
<dbl> <int>
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
You can make a function to give you results.
# your list
numbers <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
function1<-function(x){
if(x==value){return(1)}else{ return(0) }
}
# set your value here
value<-4
# make a vector which return 1 if it equal to your value, 0 else
vector<-sapply(numbers,function(x) function1(x))
sum(vector)
result: 2
Recent (fast) answers to an old question. Two collapse
options and one using tabulate
(a variation of this answer).
collapse::qtable
and ftabulate
(custom function) returns a named vector (à la table
), while collapse::fcount
returns a data.frame (à la dplyr::count
or vctrs::vec_count
). The three functions are optimized to be much faster, and both collapse
functions work with groups and weights.
collapse::qtab(numbers) #or collapse::qtable(numbers)
# numbers
# 4 5 23 34 43 54 56 65 67 324 435 453 456 567 657
# 2 1 2 2 1 1 2 1 2 1 3 1 1 1 1
ftabulate <- function(x){
u <- unique.default(x)
setNames(tabulate(match(x, u), length(u)), u)
}
ftabulate(numbers)
# 4 23 5 43 54 56 657 67 435 453 324 34 456 567 65
# 2 2 1 1 1 2 1 2 3 1 1 2 1 1 1
collapse::fcount(numbers)
# x N
# 1 4 2
# 2 23 2
# ...
Here's a panorama of existing solutions, grouped by whether they return a named vector or a data.frame, and a speed comparison over a vector of size 1,000,000 with 100 different values.
collapse
options are the fastest, whether one needs to return a named vector (qtab
) or a data.frame (fcount
). tabulate
solutions are also the fastest among base R
options.
Code:
library(microbenchmark)
library(ggplot2)
set.seed(1)
numbers <- sample(sample(1000, 100), size = 1e6, replace = TRUE)
fn <- function(x) {
u <- unique.default(x)
out <- list(x = u, freq = .Internal(tabulate(match(x, u), length(u))))
class(out) <- "data.frame"
attr(out, "row.names") <- seq_along(u)
out
}
mb <-
microbenchmark(
#Named vectors
"collapse::qtab" = qtab(numbers),
table = table(numbers),
tapply = tapply(numbers, numbers, length),
lengths = lengths(split(numbers,numbers)),
ftabulate = ftabulate(numbers),
#Data.frame
"collapse::fcount" = fcount(numbers),
"vctrs::vec_count" = vctrs::vec_count(numbers),
tabulate_df = fn(numbers),
aggregate = aggregate(numbers, list(num=numbers), length),
rle = with(rle(sort(numbers)), data.frame(number = values, n = lengths)),
times = 50L
)
mb$ntime <- microbenchmark:::convert_to_unit(mb$time, "t")
type <- setNames(levels(mb$expr), c(rep("Named vector", 5), rep("Data frame", 5)))
mb$type <- names(type)[match(mb$expr, levels(mb$expr))]
ggplot(mb, aes(x = expr, y = ntime)) +
geom_violin() +
scale_x_discrete(name = "", limits = rev) +
scale_y_log10(name = sprintf("Time [%s]", attr(mb$ntime, "unit"))) +
coord_flip() +
ggforce::facet_col(facets = vars(type), scales = "free_y", space = "free")
Unit: milliseconds
expr min lq mean median uq max neval
collapse::qtab 7.992801 12.122301 18.08057 15.03340 17.1635 71.8245 50
table 259.816201 370.759301 453.11712 447.25705 533.3613 613.8360 50
tapply 51.133300 69.950101 95.82087 91.15720 101.9496 248.4416 50
lengths 32.467401 51.007100 64.84489 60.97170 72.2331 179.6724 50
ftabulate 24.455901 37.864101 46.76249 43.06755 56.3162 92.3557 50
collapse::fcount 5.770500 7.499401 10.52078 10.53265 11.1268 32.4896 50
vctrs::vec_count 15.830001 27.466501 32.60882 29.31685 37.8101 87.4450 50
tabulate_df 24.235500 36.730401 45.35056 43.59385 51.3074 143.7773 50
aggregate 404.713901 510.090201 592.82286 606.85290 644.4701 922.6984 50
rle 52.259502 71.437701 92.71214 87.64515 99.4625 245.9526 50
© 2022 - 2024 — McMap. All rights reserved.