Pure R approaches
You have several excellent base R answers. I noticed you tagged stringr. I don't think there's any advantage to using stringr here. However, there may be in using stringi:
THE R package for fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding.
stringi tends to be extremely fast. stringr depends on stringi (and in fact many stringr functions are thin wrappers for stringi functions), so if you have stringr installed then you also have stringi.
Unlike stringr, stringi has a function to check for empty strings (equivalent to !base::nzchar()), which is likely faster than string comparison, and almost certainly faster than counting the characters of all strings (including non-empty ones).
library(stringi)
sum(stri_isempty(stri_trim_both(vector)))
# [1] 3
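As a minimal, self-contained sketch of the same idea: the input vector x below is an illustrative assumption (the question's own vector is not reproduced here), and the stringi call is guarded so the snippet still runs where the package is not installed.

```r
# Illustrative input (an assumption, not the vector from the question)
x <- c("Red", "", " ", "   ", "5", "a")

# Base R reference: trim whitespace, then test for zero-length strings
base_count <- sum(!nzchar(trimws(x)))

# stringi version, run only if the package is available
if (requireNamespace("stringi", quietly = TRUE)) {
  stringi_count <- sum(stringi::stri_isempty(stringi::stri_trim_both(x)))
  # Both approaches should agree on this input
  stopifnot(identical(stringi_count, base_count))
}

base_count
# [1] 3
```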
Rcpp approaches
As S. Baldur's answer now demonstrates, you can use Rcpp for this as well.
This is so much faster that I'm going to include it in a separate benchmarks section below, so that it's easier to see the differences in the pure R approaches.
Pure R Benchmarks
Just for fun, I ran some benchmarks with vectors of up to length 1e6 (one million). The second approach by S. Baldur is fastest for vectors of length 10 and 100. For vectors of length 1000 and upwards, the stringi approach is the fastest.
If RAM is a factor, the answer by Ben Bolker consistently uses the least memory. Here is the data in tabular form (note that the timings and memory figures are relative, so the fastest or lowest-memory approach at each vector length is always 1).
expression vec_length min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm>
1 bolker 10 1.95 1.65 4.91 NaN Inf 9999 1 161ms
2 rudolph 10 5.64 7.11 1 NaN Inf 6099 2 483ms
3 baldur 10 1.09 1 7.13 NaN Inf 9999 1 111ms
4 baldur2 10 1.05 1 6.91 NaN NaN 10000 0 115ms
5 thomas 10 1.55 1.92 4.11 NaN Inf 9999 1 193ms
6 rui 10 5.82 7.38 1.18 NaN Inf 7046 2 472ms
7 rui2 10 5.55 6.95 1.23 NaN Inf 7373 3 475ms
8 samr 10 1 1.27 6.72 NaN NaN 10000 0 118ms
9 bolker 100 1.56 1.57 4.30 1 Inf 9999 1 320ms
10 rudolph 100 6.83 6.53 1.10 5.79 Inf 3892 1 487ms
11 baldur 100 1.03 1.04 6.64 2.89 Inf 9999 1 207ms
12 baldur2 100 1 1 6.87 3.89 Inf 9999 1 200ms
13 thomas 100 1.64 1.58 4.19 2 NaN 10000 0 328ms
14 rui 100 7 7.07 1 5.79 Inf 3546 1 488ms
15 rui2 100 6.69 6.5 1.12 5.79 Inf 3904 2 482ms
16 samr 100 1.19 1.05 6.38 2.89 NaN 10000 0 216ms
17 bolker 1000 1.31 1.64 4.09 1 Inf 3025 1 487ms
18 rudolph 1000 5.43 5.97 1.10 5.98 NaN 830 0 499ms
19 baldur 1000 1.06 1.23 5.13 2.99 Inf 3109 1 399ms
20 baldur2 1000 1 1.18 5.57 3.99 NaN 4186 0 495ms
21 thomas 1000 1.17 1.41 4.89 2 NaN 3685 0 496ms
22 rui 1000 7.32 6.58 1 5.98 NaN 758 0 499ms
23 rui2 1000 7.17 5.85 1.12 5.98 NaN 853 0 500ms
24 samr 1000 1.05 1 6.01 2.99 Inf 4439 1 486ms
25 bolker 10000 1.77 1.61 4.39 1 NaN 355 0 501ms
26 rudolph 10000 8.56 6.18 1.13 6.00 NaN 92 0 502ms
27 baldur 10000 1.06 1.26 5.56 3.00 Inf 443 1 492ms
28 baldur2 10000 1 1.21 5.69 4.00 NaN 460 0 500ms
29 thomas 10000 1.57 1.44 4.81 2 Inf 388 1 499ms
30 rui 10000 8.28 6.89 1 6.00 NaN 81 0 501ms
31 rui2 10000 8.69 6.19 1.13 6.00 NaN 92 0 505ms
32 samr 10000 1.23 1 6.69 3.00 Inf 533 1 493ms
33 bolker 100000 1.92 1.58 4.21 1 NaN 36 0 510ms
34 rudolph 100000 7.83 6.31 1.07 6.00 NaN 9 0 504ms
35 baldur 100000 1.37 1.31 5.08 3.00 Inf 42 1 493ms
36 baldur2 100000 1.45 1.37 4.96 4.00 Inf 41 1 493ms
37 thomas 100000 1.52 1.37 4.89 2 NaN 41 0 500ms
38 rui 100000 7.46 6.60 1 6.00 NaN 9 0 537ms
39 rui2 100000 6.93 6.05 1.11 6.00 Inf 9 1 483ms
40 samr 100000 1 1 6.56 3.00 NaN 55 0 500ms
41 bolker 1000000 1.81 1.79 4.39 1 NaN 4 0 551ms
42 rudolph 1000000 7.76 7.13 1.09 6.00 NaN 1 0 553ms
43 baldur 1000000 1.48 1.44 5.42 3.00 Inf 4 1 447ms
44 baldur2 1000000 1.52 1.44 5.44 4.00 Inf 4 1 445ms
45 thomas 1000000 1.64 1.53 5.04 2 Inf 4 1 480ms
46 rui 1000000 8.50 7.80 1 6.00 NaN 1 0 605ms
47 rui2 1000000 7.59 6.97 1.12 6.00 NaN 1 0 541ms
48 samr 1000000 1 1 7.88 3.00 Inf 6 1 460ms
Benchmark code:
# Assumes library(stringi) has already been loaded (needed for the samr approach)
results <- bench::press(
  vec_length = 10^(1:6),
  {
    vals <- c("Red", "", " ", " ", " ", " ", "5", letters[1:3])
    v <- sample(vals, vec_length, replace = TRUE)
    bench::mark(
      relative = TRUE,
      bolker = sum(grepl("^ *$", v)),
      rudolph = sum(!nzchar(trimws(v))),
      baldur = sum(gsub(" ", "", v, fixed = TRUE) == ""),
      baldur2 = sum(!nzchar(gsub(" ", "", v, fixed = TRUE))),
      thomas = sum(!grepl("\\S", v)),
      rui = sum(nchar(trimws(v)) == 0),
      rui2 = sum(!nzchar(trimws(v))),
      samr = sum(stri_isempty(stri_trim_both(v)))
    )
  }
)
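bench::mark() checks by default that all expressions return the same result, so the approaches agree on the benchmark input. As a small sketch of where they can diverge, the example below uses an illustrative vector (an assumption, not the benchmark data) and then appends a tab-only string. The approaches that match only the space character ignore the tab, while trimws() and "\\S" treat it as whitespace.

```r
# Illustrative input (an assumption): three of the six strings are empty or spaces only
v <- c("Red", "", " ", "   ", "5", "a")

count_all <- function(v) {
  c(
    bolker  = sum(grepl("^ *$", v)),                          # spaces only
    rudolph = sum(!nzchar(trimws(v))),                        # space, tab, CR, LF
    baldur2 = sum(!nzchar(gsub(" ", "", v, fixed = TRUE))),   # spaces only
    thomas  = sum(!grepl("\\S", v))                           # any whitespace
  )
}

counts <- count_all(v)
counts
# bolker rudolph baldur2  thomas
#      3       3       3       3

# Add a string containing only a tab character
counts_tab <- count_all(c(v, "\t"))
counts_tab
# bolker rudolph baldur2  thomas
#      3       4       3       4
```

Which behaviour is correct depends on whether tabs and newlines should count as "empty" for your data.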
Rcpp benchmarks
I include separately a benchmark of S. Baldur's count_empty_cpp() function. This is much faster than my stringi approach, so I added another Rcpp function using the C++ standard library, based heavily on the answer to the C++ question, Efficient way to check if std::string has only spaces.
Rcpp::cppFunction("int count_empty_cpp2(CharacterVector x) {
  int count = 0;
  std::string str;
  for (int i = 0; i < x.size(); i++) {
    // Copy the element into a std::string
    str = Rcpp::as<std::string>(x[i]);
    // npos means no character other than a space was found
    if (str.find_first_not_of(' ') == std::string::npos) {
      count++;
    }
  }
  return count;
}")
I also added a third Rcpp function which looks at the underlying S-expression of each element of the character vector. This means we can avoid type casting in cases where the string is empty. Where we do need to look at the contents of the string, I use CHAR() to get a C-style pointer to the SEXP's null-terminated string (const char*), rather than converting to a C++ std::string. This means we copy only a pointer (probably 8 bytes per string) rather than the string data.
Rcpp::cppFunction("int count_empty_cpp3(CharacterVector x) {
  int count = 0;
  for (int i = 0; i < x.size(); i++) {
    SEXP elem = x[i];
    R_xlen_t len = Rf_length(elem);
    if (len == 0) {
      // Zero-length string: no need to look at its contents
      count++;
    } else {
      // Pointer to the existing character data (no copy)
      const char* str = CHAR(elem);
      bool is_empty = true;
      for (R_xlen_t j = 0; j < len; j++) {
        if (str[j] != ' ') {
          // Stop at the first character that is not a space
          is_empty = false;
          break;
        }
      }
      if (is_empty) count++;
    }
  }
  return count;
}")
I benchmarked these against the two fastest R answers. All Rcpp approaches are much faster than the fastest R approaches once vector lengths are >1e4.
Here is a table of results. There is very little difference between the first two Rcpp approaches. The approach avoiding std::string is slightly faster than the other two:
expression vec_length min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <bch:tm>
1 baldur2_baser 10 1.56 1.55 1.61 NaN Inf 9999 1 84.46ms
2 samr_stringi 10 2.44 2.18 1 NaN NaN 10000 0 135.71ms
3 baldercpp 10 1.11 1.05 2.22 Inf NaN 10000 0 61.11ms
4 samrcpp 10 1.11 1.09 1.95 Inf Inf 9999 1 69.62ms
5 samrcpp2 10 1 1 2.10 Inf Inf 9999 1 64.57ms
6 baldur2_baser 100 2.64 2.57 1 1.35 NaN 10000 0 200.12ms
7 samr_stringi 100 3 2.57 1.01 1 Inf 9999 1 198.57ms
8 baldercpp 100 1.14 1.03 2.33 1.97 NaN 10000 0 85.74ms
9 samrcpp 100 1.14 1.1 1.90 1.97 Inf 9999 1 105.58ms
10 samrcpp2 100 1 1 2.23 1.97 Inf 9999 1 89.85ms
11 baldur2_baser 1000 3.67 4.51 1 6.33 Inf 4408 2 481.84ms
12 samr_stringi 1000 3.64 3.93 1.03 4.74 Inf 4591 1 488.59ms
13 baldercpp 1000 1.45 1.41 2.77 1 Inf 9999 1 395.14ms
14 samrcpp 1000 1.55 1.49 2.58 1 Inf 9999 1 423.62ms
15 samrcpp2 1000 1 1 3.92 1 NaN 10000 0 278.97ms
16 baldur2_baser 10000 4.06 5.29 1 62.8 Inf 465 3 480.15ms
17 samr_stringi 10000 3.73 4.26 1.16 47.1 Inf 555 1 494.04ms
18 baldercpp 10000 1.52 1.56 2.99 1 NaN 1444 0 498.64ms
19 samrcpp 10000 1.58 1.64 3.05 1 Inf 1457 1 492.9ms
20 samrcpp2 10000 1 1 4.79 1 NaN 2305 0 496.85ms
21 baldur2_baser 100000 5.35 5.16 1 627. Inf 48 2 529.89ms
22 samr_stringi 100000 4.61 4.08 1.23 470. Inf 54 2 484.1ms
23 baldercpp 100000 1.51 1.53 3.34 1 NaN 152 0 501.8ms
24 samrcpp 100000 1.57 1.57 3.31 1 NaN 150 0 500.94ms
25 samrcpp2 100000 1 1 4.89 1 NaN 222 0 501.5ms
26 baldur2_baser 1000000 4.46 5.04 1 6270. Inf 27 23 2.89s
27 samr_stringi 1000000 3.94 3.89 1.27 4702. Inf 37 13 3.11s
28 baldercpp 1000000 1.30 1.41 3.50 1 NaN 50 0 1.53s
29 samrcpp 1000000 1.45 1.53 3.15 1 NaN 50 0 1.69s
30 samrcpp2 1000000 1 1 4.65 1 NaN 50 0 1.15s
These are relatively short strings. I'm not going to run more benchmarks but I suspect if the strings were longer we'd see more of a relative benefit to copying the pointer rather than the data.
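If you wanted to test that suspicion, one possible way to build longer inputs is sketched below. The helper make_long_strings() and its parameters are hypothetical (they are not part of the benchmarks above). Each non-blank string has its only non-space character at the very end, so a scan has to read the whole string before it can stop.

```r
# Hypothetical helper (not from the original benchmarks): builds n strings of
# `len` characters, each either all spaces or spaces followed by a single "x".
make_long_strings <- function(n, len) {
  blank    <- strrep(" ", len)
  nonblank <- paste0(strrep(" ", len - 1), "x")
  sample(c(blank, nonblank), n, replace = TRUE)
}

set.seed(42)  # arbitrary seed, only for reproducibility
v_long <- make_long_strings(n = 1000, len = 500)

# Any of the approaches above can be applied to v_long, for example:
n_blank <- sum(!nzchar(gsub(" ", "", v_long, fixed = TRUE)))
```

v_long could then replace v inside the bench::mark() call to compare the approaches on longer strings.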
A note on the benchmarks
These benchmarks are mostly for fun. The differences between all answers are relatively small. Unless you're repeating this many times with huge vectors, have extremely long strings, or have very limited memory, I would optimise for readable code rather than rolling my own Rcpp solution that is nanoseconds faster.