parsing quotes out of "NA" strings
Asked Answered
H

3

6

My dataframe has some variables that contain missing values as strings like "NA". What is the most efficient way to parse all columns in a dataframe that contain these and convert them into real NAs that are catched by functions like is.na()?

I am using sqldf to query the database.

Reproducible example:

vect1 <- c("NA", "NA", "BANANA", "HELLO")
vect2 <- c("NA", 1, 5, "NA")
vect3 <- c(NA, NA, "NA", "NA")


df = data.frame(vect1,vect2,vect3)
Hyades answered 2/1, 2016 at 14:30 Comment(10)
The best way would be to deal with these when you are reading in the data. If you are reading in with the standard tools have a look at the na.strings argument. Otherwise this should help, but replace the period with NAPolyhedron
what are considered the standard tools? I'm reading all data from a sql database which might make it hard to deal with these when reading the dataHyades
okay, what function are you using specifically.. pretty sure they will have an na.strings argumentPolyhedron
yes I don't think na.strings would work in my situation unfortunatelyHyades
I am using sqldf to query the databaseHyades
.hmmm, not so sure then. But the link in the comment should help to do this after reading inPolyhedron
thank you for adding that in. I can't seem to find an na.strings equivalent for sqldf. What's so wrong about parsing it after reading the data though?Hyades
For me, mainly it is just easier. But (i think) when you make these replacements that copies of the data are made which can be a problem for large data.Polyhedron
I agree, I could write a for loop that went through all rows and converted all "NA" strings to real NAs but that would take very long since my dataframe is about 5 million rows and about 250 variables, hence the "most efficient" in my questionHyades
Building on @user20650's suggestion, you can also use type.convert after having read the data in, as I did in my answer.Donatello
D
5

To add to the alternatives, you can also use replace instead of the typical blah[index] <- NA approach. replace would look like:

df <- replace(df, df == "NA", NA)

Another alternative to consider is type.convert. This is the function that R uses when reading data in to automatically convert column types. Thus, the result is different from your current approach in that, for instance, the second column gets converted to numeric.

df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
df

Here's a performance comparison. The sample data is from @roland's answer.

Here are the functions to test:

funop <- function() {
  df[df == "NA"] <- NA
  df
}

funr <- function() {
  ind <- which(vapply(df, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
  as.data.table(df)[, names(df)[ind] := lapply(.SD, function(x) {
    is.na(x) <- x == "NA"
    x
  }), .SDcols = ind][]
}

funam1 <- function() replace(df, df == "NA", NA)

funam2 <- function() {
  df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings = "NA"))
  df
}

Here's the benchmarking:

library(microbenchmark)
microbenchmark(funop(), funr(), funam1(), funam2(), times = 10)
# Unit: seconds
#      expr      min       lq     mean   median       uq      max neval
#   funop() 3.629832 3.750853 3.909333 3.855636 4.098086 4.248287    10
#    funr() 3.074825 3.212499 3.320430 3.279268 3.332304 3.685837    10
#  funam1() 3.714561 3.899456 4.238785 4.065496 4.280626 5.512706    10
#  funam2() 1.391315 1.455366 1.623267 1.566486 1.606694 2.253258    10

replace would be the same as @roland's approach, which is the same as @jgozal's. However, the type.convert approach would result in different column types.

all.equal(funop(), setDF(funr()))
all.equal(funop(), funam())

str(funop())
# 'data.frame': 10000000 obs. of  3 variables:
#  $ vect1: Factor w/ 3 levels "BANANA","HELLO",..: 2 2 NA 2 1 1 1 NA 1 1 ...
#  $ vect2: Factor w/ 3 levels "1","5","NA": NA 2 1 NA 1 NA NA 1 NA 2 ...
#  $ vect3: Factor w/ 1 level "NA": NA NA NA NA NA NA NA NA NA NA ...

str(funam2())
# 'data.frame': 10000000 obs. of  3 variables:
#  $ vect1: Factor w/ 2 levels "BANANA","HELLO": 2 2 NA 2 1 1 1 NA 1 1 ...
#  $ vect2: int  NA 5 1 NA 1 NA NA 1 NA 5 ...
#  $ vect3: logi  NA NA NA NA NA NA ...
Donatello answered 2/1, 2016 at 16:11 Comment(0)
H
4

I found this nice way of doing it from this question:

So for this particular situation it would just be:

df[df=="NA"]<-NA   

It only took about 30 seconds with 5 million rows and ~250 variables

Hyades answered 2/1, 2016 at 14:55 Comment(2)
You can do this much faster, e.g., using package data.table. But you don't provide a reproducible example of df, which precludes me from composing and testing/benchmarking an answer.Shwalb
reproducible example addedHyades
S
4

This is slightly faster:

set.seed(42)
df <- do.call(data.frame, lapply(df, sample, size = 1e7, replace = TRUE))
df2 <- df
system.time(df[df=="NA"]<-NA )
# user      system     elapsed 
#3.601       0.378       3.984 

library(data.table)
setDT(df2)
system.time({
  #find character and factor columns
  ind <- which(vapply(df2, function(x) class(x) %in% c("character", "factor"), FUN.VALUE = TRUE))
  #assign by reference
  df2[, names(df2)[ind] := lapply(.SD, function(x) {
  is.na(x) <- x == "NA"
  x
}), .SDcols = ind]
})
# user      system     elapsed 
#2.484       0.190       2.676 
all.equal(df, setDF(df2))
#[1] TRUE
Shwalb answered 2/1, 2016 at 15:31 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.