Count empty strings?
P

7

6

In R, suppose I have a vector like:

vector<-c("Red", "   ", "", "5", "")

I want to count how many elements of this vector are just empty strings that only consist of either spaces or no spaces at all. For this very short vector, it is just three. The second, third, and fifth elements are just spaces or no spaces at all. They don't have any characters like letters, numbers, symbols, etc.

Is there any function or method that will count this? I wanted something I could use on larger vectors rather than just looking at every element of the vector.

Phocomelia answered 30/6 at 19:21 Comment(1)
"I want to count how many elements of this vector are just empty strings that only consist of either spaces or no spaces at all. For this very short vector, it is just three." Actually it is 5. "Red" and "5" have no spaces at all, so they qualify. Maybe you mean "consist of either only spaces or of zero-length empty strings"?Metacarpus
P
6

Use sum(grepl()) plus an appropriate regular expression:

vector<-c("Red", "   ", "", "5", "")
sum(grepl("^ *$", vector))
  • ^: beginning of string
  • *: zero or more spaces
  • $: end of string

If you want to look for "white space" more generally (e.g. allowing tabs), use "^[[:space:]]*$", although as pointed out in ?grep, the definition of white space is locale-dependent ...
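As a rough illustration of the difference between the two patterns, here is a minimal sketch. The vector `x` is made up for this example (the original vector plus two tab-containing strings); it is not from the question.

```r
# Hypothetical test vector: the question's vector plus two strings with tabs
x <- c("Red", "   ", "", "5", "", "\t", " \t ")

# Spaces only: a tab is not a space, so the last two elements do not match
is_blank_space <- grepl("^ *$", x)

# Any POSIX whitespace (spaces, tabs, ...); locale-dependent, as noted above
is_blank_ws <- grepl("^[[:space:]]*$", x)

sum(is_blank_space)
# [1] 3
sum(is_blank_ws)
# [1] 5
```

grepl() returns a logical vector, and sum() counts its TRUE values, which is why the two calls combine into a count.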

length(grep(...)) would also work, or stringr::str_count(vector, "^ *$").

For what it's worth:

microbenchmark::microbenchmark(
    bolker = sum(grepl("^ *$", vector)),
    rudolph = sum(!nzchar(trimws(vector))),
    baldur = sum(gsub(" ", "", vector, fixed = TRUE) == ""),
    baldur2 = sum(!nzchar(gsub(" ", "", vector, fixed = TRUE)))
)

Unit: microseconds
    expr    min      lq     mean  median      uq    max neval cld
  bolker 10.499 10.8900 12.31869 11.8020 12.7990 40.976   100 a  
 rudolph 19.306 20.0125 22.01722 20.7990 22.9480 66.815   100  b 
  baldur  2.294  2.5700  2.76420  2.7455  2.8950  3.567   100   c
 baldur2  2.294  2.4740  2.66267  2.6450  2.7755  5.130   100   c

(@RuiBarradas's answer is not included because it is very similar to @KonradRudolph's.) I'm surprised that @s_baldur's answer is so fast, but it is probably also worth keeping in mind that this operation will be fast enough that you need not worry about efficiency unless it is a large part of your overall workflow.

Pantomimist answered 30/6 at 19:24 Comment(5)
fixed = TRUE dramatically boosts the speedOpinion
I agree, but I'm just surprised that gsub() is still faster than trimws(), or that modifying the string is faster than scanning it for regexps ...Pantomimist
when you read the source code of trimws, you will see that its key code consists of sub, BUT, paste0 is applied to generate the regex pattern, which slows down the performanceOpinion
sauce <- c(" Red", " ", " a", " 5", "", " fdsfd", " ff") ; vector <- sapply(1:100000, function (x) sauce[x %% length(sauce) + 1]) ## and bolker is the fastest... Highly dataset dependentLafayette
FYI I added some more benchmarks with vectors of varying length and all answers. The approach that is fastest with a short vector is not the fastest with longer ones (though still relatively fast). Your answer uses the least memory.Lucknow
O
9

If you are interested in code golf, you might like this:

> sum(!grepl("\\S", vector))
[1] 3
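Note that `\\S` matches any non-whitespace character, so this version also counts strings made of tabs, unlike a spaces-only pattern such as `"^ *$"`. The sketch below uses a made-up vector `v` with one extra tab-only element to show this, and checks that the default engine and `perl = TRUE` agree on it. It says nothing about speed, which would need a benchmark on your own data.

```r
# Hypothetical vector: the question's vector plus a tab-only string
v <- c("Red", "   ", "", "5", "", "\t")

blank_default <- !grepl("\\S", v)               # default regex engine
blank_perl    <- !grepl("\\S", v, perl = TRUE)  # PCRE engine

identical(blank_default, blank_perl)
# [1] TRUE
sum(blank_default)  # the tab-only string is counted as blank too
# [1] 4
```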
Opinion answered 30/6 at 21:0 Comment(1)
Interestingly, I am seeing that in longer vectors perl = TRUE is faster.Musk
L
8

Pure R approaches

You have several excellent base R answers. I noticed you tagged stringr. I don't think there's any advantage to using stringr here. However, there may be an advantage in using stringi, which describes itself as "THE R package for fast, portable, correct, consistent, and convenient string/text processing in any locale or character encoding".

stringi tends to be extremely fast. stringr depends on stringi (and in fact many stringr functions are thin wrappers for stringi functions), so if you have stringr installed then you also have stringi.

Unlike stringr, stringi has a function to check for empty strings (equivalent to !base::nzchar()) which is likely faster than string comparison, and almost certainly faster than counting the characters of all strings (including non-empty ones).

library(stringi)
sum(stri_isempty(stri_trim_both(vector)))
# [1] 3

Rcpp approaches

As S. Baldur's answer now demonstrates, you can use Rcpp for this as well.

This is so much faster that I'm going to include it in a separate benchmarks section below, so that it's easier to see the differences in the pure R approaches.

Pure R Benchmarks

Just for fun, I ran some benchmarks with vectors of up to length 1 million. The second approach by S. Baldur is fastest for vectors of length 10 and 100. For vectors of length 1000 and upwards, the stringi approach is the fastest.


If RAM is a factor, the answer by Ben Bolker consistently uses the least memory. Here is the data in tabular form (note the timings are relative and the fastest/lowest memory approach is always 1).

   expression vec_length   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
   <bch:expr>      <dbl> <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm>
 1 bolker             10  1.95   1.65      4.91    NaN         Inf  9999     1      161ms
 2 rudolph            10  5.64   7.11      1       NaN         Inf  6099     2      483ms
 3 baldur             10  1.09   1         7.13    NaN         Inf  9999     1      111ms
 4 baldur2            10  1.05   1         6.91    NaN         NaN 10000     0      115ms
 5 thomas             10  1.55   1.92      4.11    NaN         Inf  9999     1      193ms
 6 rui                10  5.82   7.38      1.18    NaN         Inf  7046     2      472ms
 7 rui2               10  5.55   6.95      1.23    NaN         Inf  7373     3      475ms
 8 samr               10  1      1.27      6.72    NaN         NaN 10000     0      118ms
 9 bolker            100  1.56   1.57      4.30      1         Inf  9999     1      320ms
10 rudolph           100  6.83   6.53      1.10      5.79      Inf  3892     1      487ms
11 baldur            100  1.03   1.04      6.64      2.89      Inf  9999     1      207ms
12 baldur2           100  1      1         6.87      3.89      Inf  9999     1      200ms
13 thomas            100  1.64   1.58      4.19      2         NaN 10000     0      328ms
14 rui               100  7      7.07      1         5.79      Inf  3546     1      488ms
15 rui2              100  6.69   6.5       1.12      5.79      Inf  3904     2      482ms
16 samr              100  1.19   1.05      6.38      2.89      NaN 10000     0      216ms
17 bolker           1000  1.31   1.64      4.09      1         Inf  3025     1      487ms
18 rudolph          1000  5.43   5.97      1.10      5.98      NaN   830     0      499ms
19 baldur           1000  1.06   1.23      5.13      2.99      Inf  3109     1      399ms
20 baldur2          1000  1      1.18      5.57      3.99      NaN  4186     0      495ms
21 thomas           1000  1.17   1.41      4.89      2         NaN  3685     0      496ms
22 rui              1000  7.32   6.58      1         5.98      NaN   758     0      499ms
23 rui2             1000  7.17   5.85      1.12      5.98      NaN   853     0      500ms
24 samr             1000  1.05   1         6.01      2.99      Inf  4439     1      486ms
25 bolker          10000  1.77   1.61      4.39      1         NaN   355     0      501ms
26 rudolph         10000  8.56   6.18      1.13      6.00      NaN    92     0      502ms
27 baldur          10000  1.06   1.26      5.56      3.00      Inf   443     1      492ms
28 baldur2         10000  1      1.21      5.69      4.00      NaN   460     0      500ms
29 thomas          10000  1.57   1.44      4.81      2         Inf   388     1      499ms
30 rui             10000  8.28   6.89      1         6.00      NaN    81     0      501ms
31 rui2            10000  8.69   6.19      1.13      6.00      NaN    92     0      505ms
32 samr            10000  1.23   1         6.69      3.00      Inf   533     1      493ms
33 bolker         100000  1.92   1.58      4.21      1         NaN    36     0      510ms
34 rudolph        100000  7.83   6.31      1.07      6.00      NaN     9     0      504ms
35 baldur         100000  1.37   1.31      5.08      3.00      Inf    42     1      493ms
36 baldur2        100000  1.45   1.37      4.96      4.00      Inf    41     1      493ms
37 thomas         100000  1.52   1.37      4.89      2         NaN    41     0      500ms
38 rui            100000  7.46   6.60      1         6.00      NaN     9     0      537ms
39 rui2           100000  6.93   6.05      1.11      6.00      Inf     9     1      483ms
40 samr           100000  1      1         6.56      3.00      NaN    55     0      500ms
41 bolker        1000000  1.81   1.79      4.39      1         NaN     4     0      551ms
42 rudolph       1000000  7.76   7.13      1.09      6.00      NaN     1     0      553ms
43 baldur        1000000  1.48   1.44      5.42      3.00      Inf     4     1      447ms
44 baldur2       1000000  1.52   1.44      5.44      4.00      Inf     4     1      445ms
45 thomas        1000000  1.64   1.53      5.04      2         Inf     4     1      480ms
46 rui           1000000  8.50   7.80      1         6.00      NaN     1     0      605ms
47 rui2          1000000  7.59   6.97      1.12      6.00      NaN     1     0      541ms
48 samr          1000000  1      1         7.88      3.00      Inf     6     1      460ms

Benchmark code:

library(stringi)  # provides stri_isempty() and stri_trim_both(), used by samr
results <- bench::press(
    vec_length = 10^(1:6),
    {
        vals <- c("Red", "", " ", " ", "   ", "   ", "5", letters[1:3])
        v <- sample(vals, vec_length, replace = TRUE)
        bench::mark(
            relative = TRUE,
            bolker = sum(grepl("^ *$", v)),
            rudolph = sum(!nzchar(trimws(v))),
            baldur = sum(gsub(" ", "", v, fixed = TRUE) == ""),
            baldur2 = sum(!nzchar(gsub(" ", "", v, fixed = TRUE))),
            thomas = sum(!grepl("\\S", v)),
            rui = sum(nchar(trimws(v)) == 0),
            rui2 = sum(!nzchar(trimws(v))),
            samr = sum(stri_isempty(stri_trim_both(v)))
        )
    }
)
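Before comparing timings, it can be worth confirming that the approaches return the same count. As far as I know, bench::mark() already checks that results are equal by default, so the following is only a standalone base R sketch of the same idea. The seed, the vector length of 1000, and the names `counts` and `expected` are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the sample reproducible
vals <- c("Red", "", " ", " ", "   ", "   ", "5", letters[1:3])
v <- sample(vals, 1000, replace = TRUE)

# Count from each base R approach in the benchmark
counts <- c(
  bolker  = sum(grepl("^ *$", v)),
  rudolph = sum(!nzchar(trimws(v))),
  baldur  = sum(gsub(" ", "", v, fixed = TRUE) == ""),
  baldur2 = sum(!nzchar(gsub(" ", "", v, fixed = TRUE))),
  thomas  = sum(!grepl("\\S", v)),
  rui     = sum(nchar(trimws(v)) == 0)
)

# Reference count: elements drawn from the blank entries of vals
expected <- sum(v %in% c("", " ", "   "))

# Stops with an error if any approach disagrees with the reference
stopifnot(all(counts == expected))
counts
```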

Rcpp benchmarks

I include separately a benchmark of S. Baldur's count_empty_cpp() function. This is much faster than my stringi approach, so I added another Rcpp function using the C++ standard library, based heavily on the answer to the C++ question, Efficient way to check if std::string has only spaces.

Rcpp::cppFunction("int count_empty_cpp2(CharacterVector x) {
  int count = 0;
  std::string str;
  for (int i = 0; i < x.size(); i++) {
    str = Rcpp::as<std::string>(x[i]);
    if(str.find_first_not_of(' ') == std::string::npos)
    {
        count++;
    }
  }
  return count;
}")

I also added a third Rcpp function which looks at the underlying S-expression of each element of the character vector. This means we can avoid type casting in cases where the string is empty. Also where we need to look at the contents of the string, I use CHAR() to cast the SEXP to a C-style pointer to a null-terminated string (const char*), rather than a C++ std::string. This means we copy the reference (8 bytes per string probably), rather than the data.

Rcpp::cppFunction("int count_empty_cpp3(CharacterVector x) {
  int count = 0;
  for (int i = 0; i < x.size(); i++) {
    SEXP elem = x[i];
    R_xlen_t len = Rf_length(elem);
    if (len == 0) {
      count++;
    } else {
      const char* str = CHAR(elem);
      bool is_empty = true;
      for (R_xlen_t j = 0; j < len; j++) {
        if (str[j] != ' ') {
          is_empty = false;
          break;
        }
      }
      if (is_empty) count++;
    }
  }
  return count;
}")

I benchmarked these against the two fastest R answers. All Rcpp approaches are much faster than the fastest R approaches once vector lengths are >1e4.


Here is a table of results. There's very little difference between the first two Rcpp approaches. The approach avoiding std::string is slightly faster than the other two:

   expression    vec_length   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
   <bch:expr>         <dbl> <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm>
 1 baldur2_baser         10  1.56   1.55      1.61    NaN         Inf  9999     1    84.46ms
 2 samr_stringi          10  2.44   2.18      1       NaN         NaN 10000     0   135.71ms
 3 baldercpp             10  1.11   1.05      2.22    Inf         NaN 10000     0    61.11ms
 4 samrcpp               10  1.11   1.09      1.95    Inf         Inf  9999     1    69.62ms
 5 samrcpp2              10  1      1         2.10    Inf         Inf  9999     1    64.57ms
 6 baldur2_baser        100  2.64   2.57      1         1.35      NaN 10000     0   200.12ms
 7 samr_stringi         100  3      2.57      1.01      1         Inf  9999     1   198.57ms
 8 baldercpp            100  1.14   1.03      2.33      1.97      NaN 10000     0    85.74ms
 9 samrcpp              100  1.14   1.1       1.90      1.97      Inf  9999     1   105.58ms
10 samrcpp2             100  1      1         2.23      1.97      Inf  9999     1    89.85ms
11 baldur2_baser       1000  3.67   4.51      1         6.33      Inf  4408     2   481.84ms
12 samr_stringi        1000  3.64   3.93      1.03      4.74      Inf  4591     1   488.59ms
13 baldercpp           1000  1.45   1.41      2.77      1         Inf  9999     1   395.14ms
14 samrcpp             1000  1.55   1.49      2.58      1         Inf  9999     1   423.62ms
15 samrcpp2            1000  1      1         3.92      1         NaN 10000     0   278.97ms
16 baldur2_baser      10000  4.06   5.29      1        62.8       Inf   465     3   480.15ms
17 samr_stringi       10000  3.73   4.26      1.16     47.1       Inf   555     1   494.04ms
18 baldercpp          10000  1.52   1.56      2.99      1         NaN  1444     0   498.64ms
19 samrcpp            10000  1.58   1.64      3.05      1         Inf  1457     1    492.9ms
20 samrcpp2           10000  1      1         4.79      1         NaN  2305     0   496.85ms
21 baldur2_baser     100000  5.35   5.16      1       627.        Inf    48     2   529.89ms
22 samr_stringi      100000  4.61   4.08      1.23    470.        Inf    54     2    484.1ms
23 baldercpp         100000  1.51   1.53      3.34      1         NaN   152     0    501.8ms
24 samrcpp           100000  1.57   1.57      3.31      1         NaN   150     0   500.94ms
25 samrcpp2          100000  1      1         4.89      1         NaN   222     0    501.5ms
26 baldur2_baser    1000000  4.46   5.04      1      6270.        Inf    27    23      2.89s
27 samr_stringi     1000000  3.94   3.89      1.27   4702.        Inf    37    13      3.11s
28 baldercpp        1000000  1.30   1.41      3.50      1         NaN    50     0      1.53s
29 samrcpp          1000000  1.45   1.53      3.15      1         NaN    50     0      1.69s
30 samrcpp2         1000000  1      1         4.65      1         NaN    50     0      1.15s

These are relatively short strings. I'm not going to run more benchmarks but I suspect if the strings were longer we'd see more of a relative benefit to copying the pointer rather than the data.

A note on the benchmarks

These benchmarks are mostly for fun. The differences between all answers are relatively small. Unless you're repeating this many times with huge vectors, have extremely long strings, or have very limited memory, I would optimise for readable code rather than rolling my own Rcpp solution that is nanoseconds faster.
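One way to optimise for readability is to wrap whichever approach you prefer in a small, named function. This is only a sketch: `count_blank` is a hypothetical name, not a function from base R or any package, and it uses the spaces-only regex from Ben Bolker's answer.

```r
# Hypothetical helper: counts elements that are empty or contain only spaces
count_blank <- function(x) {
  sum(grepl("^ *$", x))
}

vector <- c("Red", "   ", "", "5", "")
count_blank(vector)
# [1] 3
```

The call site then states the intent directly, and the implementation can be swapped for a faster one later without touching the rest of the code.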

Lucknow answered 30/6 at 20:55 Comment(7)
Thanks for the answer! I didn't know about stringi, so thanks for describing it here!Phocomelia
this benchmarking is impressiveOpinion
We've probably already exceeded OP's expectations but I've added another solution if you feel like updating the benchmark.Rn
@Lucknow Thanks. I'm still a complete Rcpp beginner. Would be very interesting to see if someone can still optimise the code further.Rn
@Rn I'm sure everyone is beyond caring but I added an Rcpp approach which uses the underlying SEXP, rather than type converting to std::string. It is a little bit faster than the other two approaches. Interestingly, this time your Rcpp method is a little faster than my first one, so last time I may have benefited a bit from the stochastic nature of these benchmarks.Lucknow
Amazing! Very useful. Thanks so much. Now it's time to explore parallelisation (joke).Rn
@Rn 😂 - although in reality doing this in parallel would almost certainly be slower as the overheads from copying the data would be far too highLucknow
R
7

One more option:

sum(gsub(" ", "", vector, fixed = TRUE) == "")

And a concise variation on a previous answer:

sum(trimws(vector) == "")
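One caveat if the vector may contain missing values: comparing with `== ""` yields NA for an NA element, so the sum becomes NA as well. The sketch below uses a made-up vector `v_na` to show this, alongside variants that treat NA as non-blank. Whether NA should count as blank is a choice for your data, not something the question specifies.

```r
# Hypothetical vector with a missing value
v_na <- c("Red", "   ", "", NA, "")

sum(trimws(v_na) == "")                # NA propagates through == and sum()
# [1] NA
sum(trimws(v_na) == "", na.rm = TRUE)  # drop the NA comparison
# [1] 3
sum(!nzchar(trimws(v_na)))             # nzchar() reports TRUE for NA by default
# [1] 3
sum(grepl("^ *$", v_na))               # grepl() returns FALSE for NA
# [1] 3
```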

Finally, since we are having fun with benchmarks: there is room for improvement over base R, as shown by SamR using stringi. Here is another example using Rcpp, where we avoid modifying the vector and inspect each string character by character until the first non-space character.

Rcpp::cppFunction("int count_empty_cpp(CharacterVector x) {
  int count = 0, j, n;
  std::string v;
  for (int i = 0; i < x.size(); i++) {
    v = Rcpp::as<std::string>(x[i]);
    n = v.length();
    j = 0;
    while (j < n && v[j] == ' ') j++;
    if (j == n) count++;
  }
  return count;
}")
Rn answered 30/6 at 20:28 Comment(0)
B
4

trimws removes leading and trailing whitespace, so a string consisting only of whitespace becomes empty. nchar then gives the number of characters; compare it to zero and count the matches. This is around 3 times slower than Ben Bolker's answer, though.

v <- c("Red", "   ", "", "5", "")
sum(nchar(trimws(v)) == 0)
#> [1] 3

Created on 2024-06-30 with reprex v2.1.0


Edit

Based on a comment now deleted,

sum(!nzchar(trimws(v)))
#> [1] 3


Berrios answered 30/6 at 19:33 Comment(1)
@BenBolker Good idea, will edit with a variation.Berrios
T
4

I’d use nzchar() in combination with trimws() (even though the double negation of !nzchar() makes this a bit awkward to read):

sum(! nzchar(trimws(vector)))
# [1] 3
Tartrazine answered 30/6 at 19:52 Comment(0)
M
0

Here are some stringr solutions:

library(stringr)

sum(word(vector) == "")
# [1] 3

sum(!str_count(vector, boundary("word")))
# [1] 3
Musk answered 2/7 at 17:59 Comment(0)
