Unexpected behaviour with str_replace "NA"

I'm trying to convert a character vector to numeric and have encountered some unexpected behaviour with str_replace. Here's a minimal working example:

library(stringr)
x <- c("0", "NULL", "0")

# This works, i.e. 0 NA 0
as.numeric(str_replace(x, "NULL", ""))

# This doesn't, i.e. NA NA NA
as.numeric(str_replace(x, "NULL", NA))

To my mind, the second example should work as it should only replace the second entry in the vector with NA (which is a valid value in a character vector). But it doesn't: the inner str_replace converts all three entries to NA.

What's going on here? I had a look through the documentation for str_replace and stri_replace_all but don't see an obvious explanation.

EDIT: To clarify, this is with stringr_1.0.0 and stringi_1.0-1 on R 3.1.3, Windows 7.

Hungarian answered 17/12, 2015 at 15:10 Comment(17)
Certainly an unexpected behaviour in the source code which needs correction. For now you need to pass NA as a string to make it work: as.numeric(str_replace(x, "NULL", "NA")) - Fran
Possible workaround? x <- c("0", "NULL", "0"); y <- x; y[y=="NULL"] <- NA; as.numeric(y) - Raiment
I must be missing something, the second example works for me: as.numeric(str_replace(x, "NULL", NA)) [1] 0 NA 0 - Emmalynne
@PierreLafortune stringr used to wrap base functions; now it wraps stringi functions. You have an old version of stringr, I guess. gsub behaves correctly here. - Stansbury
Like @PierreLafortune, I'm getting the correct/expected output (with both lines in my case). - Vinnie
@Stansbury CathG has the latest release and gets the correct output, so I'm not sure how the OP is getting this. - Emmalynne
I had several packages loaded, so I retried in a "fresh" session with just stringr (stringr_1.0.0) and it still works... - Vinnie
@CathG I am seeing the same behavior as the OP, also with stringr 1.0.0 in a clean session. Platform differences...? I'm on OSX. - Familiarity
Strange. I have stringr 1.0.0 and stringi 1.0-1 (which appear to be the latest versions) and can reproduce the OP's results. I'm on Ubuntu. OS dependent? - Stansbury
I also have stringr 1.0.0 & stringi 1.0-1 and get the same as the OP in a clean session (on OSX). - Lammastide
Same versions of stringr and stringi, R 3.2.1 on Windows 7. - Vinnie
I am also getting the OP's result with stringr 1.0.0, stringi 1.0-1, and R 3.2.3 on Windows 7. I'm trying to trace the source now. - Cavalierly
So it's more likely related to the R version? - Vinnie
Weird. Looking through the source code, first NA is converted to NA_character_, then it winds up here: x <- c("123", "NULL", "456"); stringi:::stri_replace_first_regex(x, "NULL", NA_character_). (I changed the numbers to remove any possible issues with 0.) After that it descends into C code... - Shelving
I was finally able to reproduce the error by updating both packages. Perhaps file a feature request. - Emmalynne
I recommend a feature request for stringi for that :) - Arthritis
Don't bother, I filed an issue already: github.com/Rexamine/stringi/issues/210 - Arthritis
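
The workaround suggested in the comments (overwrite whole elements by indexing instead of substituting a pattern) can be sketched in base R as follows. It does not touch stringr or stringi, so it should be unaffected by the package versions discussed above. The names y and result are only illustrative.

```r
x <- c("0", "NULL", "0")

# Copy the input, then overwrite only the elements that are exactly "NULL".
y <- x
y[y == "NULL"] <- NA_character_

# Convert to numeric; the NA element stays NA, the others are parsed as numbers.
result <- as.numeric(y)
print(result)
```

Because the comparison is on whole elements, a value such as "NULLABLE" would be left alone, which is usually what you want when "NULL" is a missing-value marker.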

Look at the source code of str_replace.

function (string, pattern, replacement) 
{
    replacement <- fix_replacement(replacement)
    switch(type(pattern), empty = , bound = stop("Not implemented", 
        call. = FALSE), fixed = stri_replace_first_fixed(string, 
        pattern, replacement, opts_fixed = attr(pattern, "options")), 
        coll = stri_replace_first_coll(string, pattern, replacement, 
            opts_collator = attr(pattern, "options")), regex = stri_replace_first_regex(string, 
            pattern, replacement, opts_regex = attr(pattern, 
                "options")), )
}
<environment: namespace:stringr>

This leads to fix_replacement, which is on GitHub; I've reproduced it below too. If you run it in your global environment, you will find that fix_replacement(NA) returns NA. You can see that it relies on stri_replace_all_regex, which is from the stringi package.

fix_replacement <- function(x) {
    stri_replace_all_regex(
        stri_replace_all_fixed(x, "$", "\\$"),
        "(?<!\\\\)\\\\(\\d)",
        "\\$$1")
}

The interesting thing is that stri_replace_first_fixed and stri_replace_first_regex both return c(NA,NA,NA) when run with your parameters (your string, pattern, and replacement). The problem is that stri_replace_first_fixed and stri_replace_first_regex are C++ code, so it gets a little trickier to figure out what's happening.

stri_replace_first_fixed can be found here.

stri_replace_first_regex can be found here.

As far as I can discern with limited time and my relatively rusty C++ knowledge, the function stri__replace_allfirstlast_fixed checks the replacement argument using stri_prepare_arg_string. According to the documentation for that, it will throw an error if it encounters an NA. I don't have time to fully trace it beyond this, but I would suspect that this error may be causing the odd return of all NAs.
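
To see which layer produces the NAs on a given installation, the two steps can be run separately. This is only a diagnostic sketch: it assumes stringi is installed and copies fix_replacement from the listing above. What the last call prints depends on the installed stringi version, so no expected result is shown for it.

```r
library(stringi)

# Copied from the stringr source quoted above.
fix_replacement <- function(x) {
    stri_replace_all_regex(
        stri_replace_all_fixed(x, "$", "\\$"),
        "(?<!\\\\)\\\\(\\d)",
        "\\$$1")
}

x <- c("0", "NULL", "0")

# Step 1: the replacement that str_replace hands on (NA stays NA).
replacement <- fix_replacement(NA)
print(replacement)

# Step 2: the stringi function that str_replace dispatches to for a regex pattern.
out <- stri_replace_first_regex(x, "NULL", replacement)
print(out)
```

If step 2 prints NA for every element, the behaviour comes from stringi rather than from stringr's wrapper.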

Cavalierly answered 17/12, 2015 at 16:47 Comment(0)

This was a bug in the stringi package, but it is now fixed (recall that stringr is built on stringi, so stringr was affected too).

With the most recent development version we get:

library(stringi)
stri_replace_all_fixed(c("1", "NULL"), "NULL", NA)
## [1] "1" NA
Arthritis answered 30/1, 2016 at 16:26 Comment(1)
I still get the issue using stringr 1.2.0, which calls stringi 1.1.5. I see the issue was closed on GitHub, though: github.com/tidyverse/stringr/issues/110. Any idea what is happening? Thanks! - Dispread

Here's a solution using dplyr's across() and the stringr package.

library(dplyr)
library(stringr)

df <- data.frame(x=c("a","b","null","e"),
                 y=c("g","null","h","k"))  

df2 <- df %>% 
  mutate(across(everything(),str_replace,"null",NA_character_))
Wicker answered 15/7, 2021 at 1:6 Comment(0)
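
A variant of the same idea passes an explicit function to across() instead of extra arguments. As far as I know, dplyr 1.1.0 and later deprecate the extra-arguments form, so this spelling may age better. Treat it as a sketch that assumes dplyr plus a stringr version that accepts NA_character_ as a replacement; the argument name col is arbitrary.

```r
library(dplyr)
library(stringr)

df <- data.frame(x = c("a", "b", "null", "e"),
                 y = c("g", "null", "h", "k"))

# Apply the replacement to every column through an explicit function,
# so nothing is passed via across()'s extra arguments.
df2 <- df %>%
  mutate(across(everything(),
                function(col) str_replace(col, "null", NA_character_)))

print(df2)
```

Note that this still replaces on a partial match: any element that merely contains "null" becomes NA as a whole.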

There is another way to solve this problem, as shown here, using NA_character_.

Short answer to the question:

library(stringr)
x <- c("0", "NULL", "0")
y <- as.numeric(str_replace(x, "NULL", NA_character_))

Produces:

> y
[1]  0 NA  0
> typeof(y)
[1] "double"

Going further

library(dplyr)
library(stringr)
# create a dummy dataset
ex = starwars %>% select(name, hair_color, homeworld) %>% head(6)
print(ex)
# let's say you want to replace all "Tatooine" with NA
# this produces the expected output
ex %>% mutate(homeworld = str_replace_all(homeworld, pattern = "Tatooine", NA_character_))

# HOWEVER,
# From Hadley's comment: "str_replace() has to replace parts of a string and replacing part of a string with NA doesn't make sense."
# so be careful when using this method, see the example below:
ex %>% mutate(hair_color = str_replace_all(hair_color, pattern = "brown", NA_character_))
# all hair colors containing "brown", including "brown, grey" (Owen Lars, line 6), are now NA

Outputs

> print(ex)
# A tibble: 6 x 3
   name               hair_color    homeworld
   <chr>              <chr>         <chr>    
 1 Luke Skywalker     blond         Tatooine 
 2 C-3PO              NA            Tatooine 
 3 R2-D2              NA            Naboo    
 4 Darth Vader        none          Tatooine 
 5 Leia Organa        brown         Alderaan 
 6 Owen Lars          brown, grey   Tatooine  

> ex %>% mutate(homeworld = str_replace_all(homeworld, pattern = "Tatooine", NA_character_))
# A tibble: 6 x 3
   name               hair_color    homeworld
   <chr>              <chr>         <chr>    
 1 Luke Skywalker     blond         NA       
 2 C-3PO              NA            NA       
 3 R2-D2              NA            Naboo    
 4 Darth Vader        none          NA       
 5 Leia Organa        brown         Alderaan 
 6 Owen Lars          brown, grey   NA         

 > ex %>% mutate(hair_color = str_replace_all(hair_color, pattern = "brown", NA_character_))
# A tibble: 6 x 3
   name               hair_color    homeworld
   <chr>              <chr>         <chr>    
 1 Luke Skywalker     blond         Tatooine 
 2 C-3PO              NA            Tatooine 
 3 R2-D2              NA            Naboo    
 4 Darth Vader        none          Tatooine 
 5 Leia Organa        NA            Alderaan 
 6 Owen Lars          NA            Tatooine 
Paramour answered 14/8, 2020 at 8:30 Comment(0)
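
To make the caveat concrete without relying on any package, the difference between replacing on a partial match and replacing only exact matches can be sketched in base R. The vector hair below is a hand-copied stand-in for the hair_color column shown above, not the starwars data itself, and the names partial and exact are illustrative.

```r
hair <- c("blond", NA, NA, "none", "brown", "brown, grey")

# Partial match: every element that contains "brown" becomes NA,
# which mirrors what the str_replace_all() call above does.
partial <- ifelse(grepl("brown", hair, fixed = TRUE), NA_character_, hair)

# Exact match: only elements equal to "brown" become NA;
# "brown, grey" is left untouched.
exact <- replace(hair, which(hair == "brown"), NA_character_)

print(partial)
print(exact)
```

If dplyr is already loaded, na_if(hair, "brown") should, as far as I know, behave like the exact-match version.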
