Any way to automatically correct all variable classes in a dataframe
I have a data frame with about 250 variables. Unfortunately, all of them were imported as character from a SQL database using sqldf. The problem: not all of them should be character. Some are numeric, some are integers, and some are dates. I'd like to build a model that runs over all the variables, and to do this I need to make sure each variable has the right class. Fixing them one by one is probably the most reliable approach, but it is very manual.

How could I automatically correct all classes? Perhaps a way to detect whether there are alphabet characters in the column or only number characters?

I don't expect an automatic approach to get every class right. But if it corrects most of them, I can fix the remaining ones manually.

I am adding a sqldf tag in case anybody knows of any way to correct this when importing the data, but I assume it's not sqldf's fault but rather the database's.

Adversary answered 4/1, 2016 at 20:18 Comment(0)

The closest thing to "automatic" type conversion on a data frame would probably be

df[] <- lapply(df, type.convert)

where df is your data set. The function type.convert()

Converts a character vector to logical, integer, numeric, complex or factor as appropriate.

Have a read of help(type.convert), it might be just what you want.

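In my experience, type.convert() is very reliable. It is also used internally by core R functions such as read.table(), so it is well tested. Use as.is = TRUE if you don't want character columns coerced to factors, or as.is = FALSE if you do. It is best to pass as.is explicitly: since R 4.0.0, calling type.convert() without it gives a warning and behaves as if as.is = TRUE.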
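
To make the effect of as.is concrete, here is a minimal sketch. The data frame df and its column names are made up for illustration; only type.convert() and lapply() come from the answer itself.

```r
# Hypothetical data frame where every column was imported as character
df <- data.frame(
  id    = c("1", "2", "3"),
  score = c("1.5", "2.25", "NA"),
  group = c("a", "b", "a"),
  flag  = c("TRUE", "FALSE", "TRUE"),
  stringsAsFactors = FALSE
)

# as.is = TRUE: columns that are neither numbers nor logicals stay character
df_chr <- df
df_chr[] <- lapply(df, type.convert, as.is = TRUE)
sapply(df_chr, class)
# id "integer", score "numeric", group "character", flag "logical"

# as.is = FALSE: those same columns become factors instead
df_fct <- df
df_fct[] <- lapply(df, type.convert, as.is = FALSE)
sapply(df_fct, class)
# id "integer", score "numeric", group "factor", flag "logical"
```

Note that the string "NA" in score is read as a missing value, because the default na.strings of type.convert() is "NA".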

Here's a quick example of it working on iris. First we'll change all columns to character, then run type.convert() on it.

## Original column classes in iris
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#    "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

## Change all columns to character
iris[] <- lapply(iris, as.character)
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#  "character"  "character"  "character"  "character"  "character" 

## Run type.convert(); as.is = FALSE turns text columns back into factors
iris[] <- lapply(iris, type.convert, as.is = FALSE)
sapply(iris, class)
# Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
#    "numeric"    "numeric"    "numeric"    "numeric"     "factor" 

We can see that the columns were returned to their original classes. This is because type.convert() coerces columns to the "most appropriate" type.
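
The question also mentions dates, which type.convert() does not detect: date strings simply stay character (or become factors). One possible follow-up pass is sketched below. The helper to_date_if_possible() is hypothetical, and the "%Y-%m-%d" format is an assumption about how the dates are stored in the database; adjust it to match the actual data.

```r
# Hypothetical helper: convert a character column to Date only when every
# non-missing value parses with the assumed format; otherwise return it as-is.
to_date_if_possible <- function(x, format = "%Y-%m-%d") {
  if (!is.character(x)) return(x)
  parsed <- as.Date(x, format = format)
  present <- !is.na(x)
  if (any(present) && !anyNA(parsed[present])) parsed else x
}

# Made-up example data, all imported as character
df <- data.frame(
  when  = c("2016-04-01", "2016-04-02", NA),
  value = c("1", "2", "3"),
  name  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# First let type.convert() handle numbers and logicals, then try dates
df[] <- lapply(df, type.convert, as.is = TRUE)
df[] <- lapply(df, to_date_if_possible)
sapply(df, class)
# when "Date", value "integer", name "character"
```

This only covers a single date format and is not a strict validator, so columns it leaves as character may still need a manual look.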

Rajkot answered 4/1, 2016 at 20:35 Comment(8)
Hello Richard, I recently used this on a different data frame and it gave this error: "Error in FUN(X[[i]], ...) : the first argument must be of mode character". Do you know why this is happening? - Adversary
It looks like type.convert() expects a character vector as its first argument. I tried converting my df with as.character(df), but then everything was converted to factor. - Adversary
@Adversary If you want characters to remain characters and not be coerced to factors, set as.is = TRUE in type.convert(). - Rajkot
Won't that still convert the other columns to characters, though? - Adversary
@Adversary - It will coerce them to their appropriate type. So if R decides they should be numeric, they will be numeric. Try it out: type.convert(as.character(1:5), as.is = TRUE) goes back to integer, type.convert(letters[1:5], as.is = FALSE) goes to factor, and type.convert(letters[1:5], as.is = TRUE) remains character. - Rajkot
So if I understand correctly: to solve this issue and still convert each column of my df to what R thinks it should be, I should do df[] <- lapply(as.character(df), type.convert)? I'm a bit confused. - Adversary
@Adversary - No, you would have to do df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE)). That way you will have no factor columns. If you want text columns to become factors, use as.is = FALSE instead. as.character() has to be applied to each column separately; calling it on the whole data frame does not convert it column by column. - Rajkot
I see. Just to clarify the process: the function above converts each column of the data frame to a character vector, and then type.convert() converts each one to the appropriate class. And I would only set as.is = TRUE if I didn't want type.convert() to turn character vectors into factors. Is this correct? - Adversary
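
The two-step process described in the last comment (coerce each column to character, then let type.convert() pick a class) can be checked on a small data frame with mixed column types. This is only a sketch; the data frame and its column names are made up for illustration.

```r
# Made-up data frame whose columns are not all character
df <- data.frame(
  n = 1:3,                         # already integer
  x = c("1.5", "2", "3"),          # numbers stored as text
  s = c("a", "b", "c"),            # genuine text
  f = factor(c("10", "20", "30")), # numbers hidden in a factor
  stringsAsFactors = FALSE
)

# Step 1: as.character() turns each column into a character vector
# (for a factor this yields its labels, not its internal codes).
# Step 2: type.convert() picks a class; as.is = TRUE keeps text as character.
df[] <- lapply(df, function(x) type.convert(as.character(x), as.is = TRUE))
sapply(df, class)
# n "integer", x "numeric", s "character", f "integer"
```

With as.is = FALSE in place of as.is = TRUE, column s would come back as a factor and the other columns would be unchanged.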
