Get column types of excel sheet automatically

library(readxl) file= "myfile.xlsx" firstread<-read_excel(file, sheet = "mysheet", col_names = TRUE, na = "", skip = 0) #firstread has the problem of the a column with "0.492 ± 0.6", #being interpreted as number (returns NA) colna<-colnames(firstread) # read every column as character colnumt<-ncol(firstread) textcol<-rep("text", colnumt) secondreadchar<-read_excel(file, sheet = "mysheet", col_names = TRUE, col_types = textcol, na = "", skip = 0) # another column, with the number 0.532, is now 0.5319999999999999 # and several other similar cases. # read again with stringsAsFactors # critical step, in real life, I "cannot" write a csv file. write.csv(secondreadchar, "allcharac.txt", row.names = FALSE) stringsasfactor<-read.csv("allcharac.txt", stringsAsFactors = FALSE) colnames(stringsasfactor)<-colna # column with "0.492 ± 0.6" now is character, as desired, others numeric as desired as well

Here is a script that imports all the data in your excel file. It puts each sheet's data in a list called dfs:

library(readxl)

# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")

# Loop through the sheet names and get the data in each sheet
dfs <- lapply(all_sheets, function(x) {

  #Get the number of column in current sheet
  col_num <- NCOL(read_excel(path = "myfile.xlsx", sheet = x))

  # Get the dataframe with columns as text
  df <- read_excel(path = "myfile.xlsx", sheet = x, col_types = rep('text',col_num))

  # Convert to data.frame
  df <- as.data.frame(df, stringsAsFactors = FALSE)

  # Get numeric fields by trying to convert them into
  # numeric values. If it returns NA then not a numeric field.
  # Otherwise numeric.
  cond <- apply(df, 2, function(x) {
    x <- x[!is.na(x)]
    all(suppressWarnings(!is.na(as.numeric(x))))
  })
  numeric_cols <- names(df)[cond]
  df[,numeric_cols] <- sapply(df[,numeric_cols], as.numeric)

  # Return df in desired format
  df
})

# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets

The process goes as follows:

First, you get all the sheets in the file with excel_sheets and then loop through the sheet names to create dataframes. For each of these dataframes, you initially import the data as text by setting the col_types parameter to text. Once you have gotten the dataframe's columns as text, you can convert the structure from a tibble to a data.frame. After that, you then find columns that are actually numeric columns and convert them into numeric values.

Edit:

As of late April, a new version of readxl got released, and the read_excel function got two enhancements pertinent to this question. The first is that you can have the function guess the column types for you with the argument "guess" provided to the col_types parameter. The second enhancement (corollary to the first) is that guess_max parameter got added to the read_excel function. This new parameter allows you to set the number of rows used for guessing the column types. Essentially, what I wrote above could be shortened with the following:

library(readxl)

# Get all the sheets
all_sheets <- excel_sheets("myfile.xlsx")

dfs <- lapply(all_sheets, function(sheetname) {
    suppressWarnings(read_excel(path = "myfile.xlsx", 
                                sheet = sheetname, 
                                col_types = 'guess', 
                                guess_max = Inf))
})

# Just for convenience in order to remember
# which sheet is associated with which dataframe
names(dfs) <- all_sheets

I would recommend that you update readxl to the latest version to shorten your script and as a result avoid possible annoyances.

I hope this helps.

Edit:

Recommended topics

Hot tags