Converting Character to Numeric without NA Coercion in R
Asked Answered
B

3

13

I'm working in R and have a dataframe, dd_2006, with numeric vectors. When I first imported the data, I needed to remove $'s, decimal points, and some blank spaces from 3 of my variables: SumOfCost, SumOfCases, and SumOfUnits. To do that, I used str_replace_all. However, once I used str_replace_all, the vectors were converted to characters. So I used as.numeric(var) to convert the vectors to numeric, but NAs were introduced, even though when I ran the code below BEFORE I ran the as.numeric code, there were no NAs in the vectors.

sum(is.na(dd_2006$SumOfCost))
[1] 0
sum(is.na(dd_2006$SumOfCases))
[1] 0
sum(is.na(dd_2006$SumOfUnits))
[1] 0

Here is my code from after the import, beginning with removing the $ from the vector. In the str(dd_2006) output, I deleted some of the variables for the sake of space, so the column #s in the str_replace_all code below don't match the output I've posted here (but they do in the original code):

library("stringr")
dd_2006$SumOfCost <- str_sub(dd_2006$SumOfCost, 2, ) #2=the first # after the $

#Removes decimal pt, zero's after, and commas
dd_2006[ ,9] <- str_replace_all(dd_2006[ ,9], ".00", "")
dd_2006[,9] <- str_replace_all(dd_2006[,9], ",", "")

dd_2006[ ,10] <- str_replace_all(dd_2006[ ,10], ".00", "")
dd_2006[ ,10] <- str_replace_all(dd_2006[,10], ",", "")

dd_2006[ ,11] <- str_replace_all(dd_2006[ ,11], ".00", "")
dd_2006[,11] <- str_replace_all(dd_2006[,11], ",", "")

str(dd_2006)
'data.frame':   12604 obs. of  14 variables:
 $ CMHSP                     : Factor w/ 46 levels "Allegan","AuSable Valley",..: 1 1 1
 $ FY                        : Factor w/ 1 level "2006": 1 1 1 1 1 1 1 1 1 1 ...
 $ Population                : Factor w/ 1 level "DD": 1 1 1 1 1 1 1 1 1 1 ...
 $ SumOfCases                : chr  "0" "1" "0" "0" ...
 $ SumOfUnits                : chr  "0" "365" "0" "0" ...
 $ SumOfCost                 : chr  "0" "96416" "0" "0" ...

I found a response to a similar question to mine here, using the following code:

# create dummy data.frame
d <- data.frame(char = letters[1:5], 
                fake_char = as.character(1:5), 
                fac = factor(1:5), 
                char_fac = factor(letters[1:5]), 
                num = 1:5, stringsAsFactors = FALSE)

Let us have a glance at data.frame

> d
  char fake_char fac char_fac num
1    a         1   1        a   1
2    b         2   2        b   2
3    c         3   3        c   3
4    d         4   4        d   4
5    e         5   5        e   5

and let us run:

> sapply(d, mode)
       char   fake_char         fac    char_fac         num 
"character" "character"   "numeric"   "numeric"   "numeric" 
> sapply(d, class)
       char   fake_char         fac    char_fac         num 
"character" "character"    "factor"    "factor"   "integer" 

Now you probably ask yourself "Where's an anomaly?" Well, I've bumped into quite peculiar things in R, and this is not the most confounding thing, but it can confuse you, especially if you read this before rolling into bed.

Here goes: first two columns are character. I've deliberately called 2nd one fake_char. Spot the similarity of this character variable with one that Dirk created in his reply. It's actually a numerical vector converted to character. 3rd and 4th column are factor, and the last one is "purely" numeric.

If you utilize transform function, you can convert the fake_char into numeric, but not the char variable itself.

> transform(d, char = as.numeric(char))
  char fake_char fac char_fac num
1   NA         1   1        a   1
2   NA         2   2        b   2
3   NA         3   3        c   3
4   NA         4   4        d   4
5   NA         5   5        e   5
Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
but if you do same thing on fake_char and char_fac, you'll be lucky, and get away with no NA's:

transform(d, fake_char = as.numeric(fake_char), char_fac = as.numeric(char_fac))

  char fake_char fac char_fac num
1    a         1   1        1   1
2    b         2   2        2   2
3    c         3   3        3   3
4    d         4   4        4   4
5    e         5   5        5   5

So I tried the above code in my script, but still came up with NAs (without a warning message about coercion).

#changing sumofcases, cost, and units to numeric
dd_2006_1 <- transform(dd_2006, SumOfCases = as.numeric(SumOfCases), SumOfUnits = as.numeric(SumOfUnits), SumOfCost = as.numeric(SumOfCost))

> sum(is.na(dd_2006_1$SumOfCost))
[1] 12
> sum(is.na(dd_2006_1$SumOfCases))
[1] 7
> sum(is.na(dd_2006_1$SumOfUnits))
[1] 11

I've also used table(dd_2006$SumOfCases) etc. to look at the observations to see if there are any characters that I missed in the observations, but there weren't any. Any thoughts on why the NAs are popping up, and how to get rid of them?

Beverleybeverlie answered 11/7, 2013 at 15:52 Comment(2)
Sorry, but what's the question? The linked answer seems to sum everything up pretty well, and without a reproducible example of the problem you are actually facing, I'm not sure how others can help....Noggin
I'm guessing this data came from Excel or another spreadsheet. Next time, clear all formatting before you export.Ola
K
17

As Anando pointed out, the problem is somewhere in your data, and we can't really help you much without a reproducible example. That said, here's a code snippet to help you pin down the records in your data that are causing you problems:

test = as.character(c(1,2,3,4,'M'))
v = as.numeric(test) # NAs intorduced by coercion
ix.na = is.na(v)
which(ix.na) # row index of our problem = 5
test[ix.na]  # shows the problematic record, "M"

Instead of guessing as to why NAs are being introduced, pull out the records that are causing the problem and address them directly/individually until the NAs go away.

UPDATE: Looks like the problem is in your call to str_replace_all. I don't know the stringr library, but I think you can accomplish the same thing with gsub like this:

v2 = c("1.00","2.00","3.00")
gsub("\\.00", "", v2)

[1] "1" "2" "3"

I'm not entirely sure what this accomplishes though:

sum(as.numeric(v2)!=as.numeric(gsub("\\.00", "", v2))) # Illustrate that vectors are equivalent.

[1] 0

Unless this achieves some specific purpose for you, I'd suggest dropping this step from your preprocessing entirely, as it doesn't appear necessary and seems to be giving you problems.

Kaolin answered 11/7, 2013 at 16:24 Comment(2)
When I run that code for SumOfUnits, I get > which(ix.na) #row index of problem NAs [1] 1098 2297 4728 5559 5592 5702 6955 8191 10517 10881 10955 > test[ix.na] # [1] "" "" "" "" "" "" "" "" "" "" "" But when I look at those rows in the original dataset, it looks like this: SumOfUnits 800.00 0.00 100.00 100.00 100.00 300.00 400.00 200.00 200.00 600.00 100.00 So how is that a problem with the data instead something happening when I run the str_replace_all code?Beverleybeverlie
Looks like you pinned down the problem. I don't really know what pre-processing those strings accomplishes since you can cast those values to numbers just fine with or without the decimals, but I illustrate an alternative solution using gsub in my updated answer.Kaolin
C
5

If you want to convert the character to a numeric as well, then first convert it to a factor (using as.factor) and save/ overwrite existing variable. Next convert this factor variable to numeric (using as.numeric). You wouldn't be creating NAs this way and will be able to convert the data-set you have into numeric.

Centrist answered 10/2, 2016 at 22:53 Comment(1)
as.numeric(as.factor(df$x)) just replaces values with 1 and 2, indicesLozar
T
2

A simple solution is to let retype guess new data types for each column

library(dplyr)
library(hablar)

dd_2006 %>% retype()
Typify answered 1/11, 2018 at 15:3 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.