R Markdown struggling with read_xlsx, Warning: Expecting logical

Asked 5/5, 2022 at 8:57 Answered 23/8, 2024 at 8:54

When running read_xlsx() in my normal .R script, I'm able to read in the data. But when running the .R script with source() in R Markdown, it suddenly takes a long time (> 20+++ mins I always terminate before the end) and I keep getting these warning messages where it is evaluating every single column and expecting it to be a logical:

Warning: Expecting logical in DE5073 / R5073C109: got 'HOSPITAL/CLINIC'
Warning: Expecting logical in DG5073 / R5073C111: got 'YES'
Warning: Expecting logical in CQ5074 / R5074C95: got '0'
Warning: Expecting logical in CR5074 / R5074C96: got 'MARKET/GROCERY STORE'
Warning: Expecting logical in CT5074 / R5074C98: got 'NO'
Warning: Expecting logical in CU5074 / R5074C99: got 'YES'
Warning: Expecting logical in CV5074 / R5074C100: got 'Less than one week'
Warning: Expecting logical in CW5074 / R5074C101: got 'NEXT'
Warning: Expecting logical in CX5074 / R5074C102: got '0'
.. etc

I can't share the data here, but it is just a normal xlsx file (30k obs, 110 vars). The data has responses in all capitals like YES and NO. The raw data has filters applied, some additional sheets, and some mild formatting in Excel (no borders, white fill) but I don't think these are affecting it.

An example of my workflow setup is like this:

Dataprep.R:

setwd()
pacman::p_load()  # all my packages

df <- read_xlsx("./data/Data.xlsx") %>% type_convert()

## blabla more cleaning stuff

Report.Rmd:

setwd()
pacman::p_load() # all my packages again

source("Dataprep.R")

When I run Dataprep.R, everything works in < 1 min. But when I try to source("Dataprep.R") from Report.Rmd, then it starts being slow at read_xlsx() and giving me those warnings.

I've tried also taking df <- read_xlsx() from Dataprep.R and moving it to Report.Rmd, and it is still as slow as running source(). I've also removed type_convert() and tried other things like removing the extra sheets in the Excel. source() was also in the setup chunk in Report.Rmd, but I took it out and still the same thing.

So I think it is something to do with R Markdown and readxl/read_xlsx(). The exact same code and data is evaluating so differently in R vs Rmd and it's very puzzling.

Would appreciate any insight on this. Is there a fix? Or is this something I will just have to live with (i.e. convert to csv)?

> sessionInfo()
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8    LC_MONETARY=English_United Kingdom.utf8
[4] LC_NUMERIC=C                            LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] digest_0.6.29    R6_2.5.1         lifecycle_1.0.1  pacman_0.5.1     evaluate_0.15    scales_1.2.0     rlang_1.0.2      cli_3.3.0        rstudioapi_0.13 
[10] rmarkdown_2.14   tools_4.2.0      munsell_0.5.0    xfun_0.30        yaml_2.3.5       fastmap_1.1.0    compiler_4.2.0   colorspace_2.0-3 htmltools_0.5.2 
[19] knitr_1.39

UPDATE:

So in Markdown, I can use the more generic read_excel() and that works in my setup chunk. But I still get the same Warning messages if I try to source() it, even if the R script sourced is also using read_excel() instead of read_xlsx(). Very puzzling all around.

Headlong answered 5/5, 2022 at 8:57 Comment(0)

When you run that code on a .R (and probably other kinds of codes that generate warnings), you will get a summary of warnings. Something like "There were 50 or more warnings (use warning() to see the first 50)".

While if you run that same code on a standard Rmarkdown code chunk, you will actually get the whole 50+ warnings. That could mean you are printing thousands, millions, or more warnings.

If your question is WHY does that happen on Rmarkdown and not on R, I'm not sure.

But if your question is how to solve it, it's simple. Just make sure to add the options message=FALSE and warning=FALSE to your code chunk.

It should look something like this:

{r chunk_name, message=FALSE, warning=FALSE}
setwd()
pacman::p_load() # all my packages again

source("Dataprep.R")

Now, about the "setwd()", I would advise against using anything that changes the state of your system (avoid "side effect" functions). They can create problems if you are not very careful. But that is another topic for another day.

Filagree answered 13/7, 2022 at 17:25 Comment(10)

Thanks for your reply! I don't think it's the message or warning settings issue as I have them set in opts_chunk$set already for global settings. knitr::opts_chunk$set(message = FALSE, warning = FALSE, echo = FALSE ) I still haven't found the issue... I've since just resorted to manually saving as .csv each time I get the updated data. – Headlong 19/7, 2022 at 5:54

:o ok. Mmm then I don't know what could it be. I hope somebody else replies. – Sconce 19/7, 2022 at 5:57

d'oh! Your suggestion was spot on. I had the source() script in my setup chunk, together with the global options. However, the setup chunk itself didn't have message = FALSE and warning = FALSE settings, so it was still printing those as I guess the options were not set yet. Thanks! – Headlong 19/7, 2022 at 6:1

Hahaha nice! :D – Sconce 19/7, 2022 at 6:2

and also on setwd(), the non-minimal version of it that I use is setwd(dirname(rstudioapi::getActiveDocumentContext()$path)) so that the directory is set dynamically to the location of the script. Wish that it looked nicer than this, but welp, it works. Hope that this is acceptable and will absolve me from a separate conversation on a separate day :P – Headlong 19/7, 2022 at 6:5

Hahaha nah, at the end of the day, it's up to you, but in general is bad to change the directory because when you run a series of scripts, you are stucked with a specific order. You cannot test each one by it's own, because at some point you chabge the directory and if you don't change it back or use specific routes from, you will have errors. And you will be forced to run the scripts in order. It makes more difficult debbuging and modularizing your code. – Sconce 19/7, 2022 at 6:10

You can set the directory where you run rmarkdown on your file settings or global settings. I always set my global settings to run from project path. And I always work with paths relatives to my project. That ways, it won't matter if you move your files arround. The paths will still be all valid. And your Rmd will run anywhere. – Sconce 19/7, 2022 at 6:12

Some times I am forced to use the file path, but I try to avoid it like the plage – Sconce 19/7, 2022 at 6:13

Yup, I don't ever specify file paths explicitly. I only use setwd() once at the start of my markdown, then everything else is relative to the .Rmd so no issues there! It's just there as a safety because I admittedly have very bad session/project management in RStudio (but all is fine in my folders) and frequently have everything just open in a single R session x_x I added this to my question just in case that had some bearing on the issue I was running into! – Headlong 19/7, 2022 at 6:29

Ok :P nice :D good luck with your project :) – Sconce 19/7, 2022 at 6:33

I experienced the same issue, after investigating the reason behind the warning message I found out that when using read_excel function without specifying columns types read_excel try guessing column type and when there are some values mismatch (in terms of their types) within the range of guessing, the results will give you that warning message indicating that it was expecting this but got that...

To address this issue, I had to import all the data columns as "text". this is done like the following:

read_excel("file_to_read.xlsx", col_types = "text")

this had correctly solved my issue.

Cutlerr answered 6/4, 2023 at 9:59 Comment(1)

why it doesn't use Excel-defined column types? – Pascual 23/8, 2024 at 8:33

This is due to the fact that readxl::read_excel guesses column types based on the value of the first guess_max (an argument of the function) of its cells (reference)

You can use xlsx, openxlsx or openxlsx2 packages to read your Excel file, which doesn't seem to have this problem.

Pascual answered 23/8, 2024 at 8:54 Comment(1)

I agree, if you play with guess_max values it will check more lines and may return a different class. – Anabantid 5/9, 2024 at 13:39

Recommended topics

Hot tags