Importing unquoted strings as factors using read_csv from the readr package in R

Asked 1/11, 2016 at 19:20 Answered 10/5, 2024 at 17:43

I have a .csv data file with many columns. Unfortunately, the string values are not quoted (i.e., apples instead of "apples"). When I use read_csv from the readr package, the string values are imported as characters:

library(readr)

mydat = data.frame(first = letters, numbers = 1:26, second = sample(letters, 26))
write.csv(mydat, "mydat.csv", quote = FALSE, row.names = FALSE)

read_csv("mydat.csv")

results in:

Parsed with column specification:
cols(
  first = col_character(),
  numbers = col_integer(),
  second = col_character()
)
# A tibble: 26 x 3
   first numbers second
   <chr>   <int>  <chr>
1      a       1      r
2      b       2      n
3      c       3      m
4      d       4      z
5      e       5      p
6      f       6      j
7      g       7      u
8      h       8      l
9      i       9      e
10     j      10      h
# ... with 16 more rows

Is there a way to force read_csv to import the string values as factors instead of characters?

Importantly, my datafile has so many columns (string and numeric variables) that, AFAIK, there is no way to make this work by providing column specifications with the col_types argument.

Alternative solutions (e.g. using read.csv to import the data, or dplyr code to change all character variables in a dataframe to factors) are appreciated too.

Update: I learned that whether the values in the csv file are quoted makes no difference for read.csv or read_csv. read.csv will import these values as factors; read_csv will import them as characters. I prefer to use read_csv because it's considerably faster than read.csv.

Ber answered 1/11, 2016 at 19:20 Comment(8)

Specify col_factor() within col_types. Or just use read.csv. – Caste 1/11, 2016 at 19:32

e.g. read_csv('mydat.csv', col_types = cols(first = col_factor(levels = letters))). I think your question might be misguided, though; R handles quotations automatically. – Caste 1/11, 2016 at 19:41

Not sure if I understand what you mean; the string values in my csv file lack quotation marks. – Ber 1/11, 2016 at 20:32

col_types doesn't work for my data; I have too many columns (some of which contain numerical values, others string values) to specify a column type for each. – Ber 1/11, 2016 at 20:35

If it's a whitespace-delimited file quotes might matter, but they don't whatsoever for a CSV, so I'm not sure what you mean. col_types can be used for a single column by name within cols; it will default to col_guess() for those you don't specify. – Caste 1/11, 2016 at 20:40

Got it: quotes or not, read_csv imports the string values as characters and read.csv imports them as factors. I would prefer to use read_csv because it's considerably faster than read.csv – Ber 1/11, 2016 at 21:0

About specifying column types: imagine a dataframe with 100 columns, half of which are numerical variables and the other half string variables (mixed, so not all numerical followed by all string). I'm not sure how I would use col_types so that each of the string variables will be imported as a factor (as opposed to read_csv's default character type). – Ber 1/11, 2016 at 21:4

Read in the first few rows and build a column specification programmatically, or use spec_csv, or use data.table::fread, which has a more normal stringsAsFactors parameter. – Caste 1/11, 2016 at 21:8
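Caste's last suggestion could be sketched roughly as follows. This is only a sketch under assumptions: it relies on the object returned by spec_csv() exposing its guessed collectors in a cols element with class collector_character for string columns, and the temporary file path is a stand-in for the real data file.

```r
library(readr)

# Stand-in for the real file: the example data from the question, written unquoted
path <- tempfile(fileext = ".csv")
mydat <- data.frame(first = letters, numbers = 1:26, second = sample(letters, 26))
write.csv(mydat, path, quote = FALSE, row.names = FALSE)

# Let readr guess a column specification without reading the full data
spec <- spec_csv(path)

# Swap every guessed character collector for a factor collector;
# numeric collectors are left untouched
spec$cols <- lapply(spec$cols, function(col) {
  if (inherits(col, "collector_character")) col_factor() else col
})

# Read the file with the modified specification
dat <- read_csv(path, col_types = spec)

# Alternative mentioned in the same comment (requires the data.table package):
# dat <- data.table::fread(path, stringsAsFactors = TRUE)
```

With col_factor() left at its defaults, the levels are taken from the data in order of appearance, so pass levels explicitly if the order matters.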

This function uses dplyr to convert all character columns in a tbl_df or data frame to factors:

char.to.factors <- function(df){
  # Takes a tbl_df or data frame and returns it with every character column converted to a factor.
  # Note: mutate_each_() and funs() were deprecated in later dplyr releases.

  require(dplyr)

  # Names of all character columns
  char.cols <- names(df)[sapply(df, is.character)]
  tmp <- mutate_each_(df, funs(as.factor), char.cols)
  return(tmp)
}

Acting answered 1/11, 2016 at 19:41 Comment(3)

Or just df %>% mutate_if(is.character, factor) – Caste 1/11, 2016 at 19:42

Looks like that's a new addition in dplyr 0.5.0, that simplifies this nicely. – Acting 1/11, 2016 at 19:50

mutate_if works nicely! Though it's slow when the dataframe is large. – Ber 1/11, 2016 at 20:28

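I like alistaire's mutate_if() solution in the comments above, but for completeness, there is another solution that should be mentioned. You can use unclass(), which strips the tibble class so that data.frame() rebuilds the columns and converts strings to factors. This relies on stringsAsFactors = TRUE, which was the data.frame() default before R 4.0.0, so it is passed explicitly here. You'll see this in a lot of code that uses readr.

df <- data.frame(unclass(df), stringsAsFactors = TRUE)

df <- df %>% unclass %>% data.frame(stringsAsFactors = TRUE)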

Aver answered 6/12, 2017 at 22:4 Comment(0)

Unfortunately, read_csv has no equivalent of stringsAsFactors = TRUE, and I think col_types = requires naming specific columns unless you resort to more trickery.

A straightforward solution is to convert strings to factors, using across in dplyr instead of the superseded mutate_if:

df %>% mutate(across(where(is.character), factor))

Unless you specify them, base R's factor() takes the levels from the unique values in the data and sorts them. where() can also handle more complicated predicates, and you can use tidyselect for a lot more control.
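As a small end-to-end sketch of this approach (the fruit data and the temporary file are made up for illustration and are not from the question):

```r
library(readr)
library(dplyr)

# Made-up example file with unquoted string values
path <- tempfile(fileext = ".csv")
writeLines(c("fruit,count,size",
             "apple,3,small",
             "pear,5,large",
             "apple,2,large"), path)

# Read with guessed column types, then convert every character column to a factor
dat <- read_csv(path, col_types = cols()) %>%
  mutate(across(where(is.character), factor))

# Levels are inferred and sorted, so "large" comes before "small"
levels(dat$size)

# To control the order, set the levels explicitly for that column
dat <- dat %>% mutate(size = factor(size, levels = c("small", "large")))
```

The numeric column is left alone because where(is.character) only selects character columns.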

Boycott answered 10/5, 2024 at 17:43 Comment(0)
