Behaviour of case_when with numeric(0)

I'm having trouble understanding how dplyr::case_when works. Take this simple call:

library(tidyverse)
case_when(TRUE ~ 50,
          FALSE ~ numeric(0))

I get numeric(0), even though the first condition is TRUE, so it should return 50. Besides, the second condition is FALSE, so it should never return numeric(0). I don't have this problem if I write:

case_when(TRUE ~ 50,
          FALSE ~ NaN)

Here I get 50, which is correct. What am I missing?

Beneficence answered 28/1, 2021 at 17:22

Comments (11)
I think the problem is that numeric(0) returns a vector of length 0. If you try numeric(1) (which is a vector of length 1 with a value of 0), then it works. I would say case_when should report an error here, but it doesn't. - Wavawave
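To make the length point concrete, here is a minimal base R sketch (no dplyr involved). It only shows that numeric(0) and numeric(1) differ in length, and that ordinary R arithmetic recycles a length-1 value down to length 0 when it meets a zero-length vector. That case_when applies a comparable rule is an assumption consistent with the output reported in the question, not something this sketch proves. The variable names are illustrative.

```r
# Base R only; the names below are illustrative.
zero_len <- numeric(0)  # a numeric vector with no elements
one_len  <- numeric(1)  # a numeric vector holding a single 0

len_zero <- length(zero_len)
len_one  <- length(one_len)

# A length-1 value combined with a length-0 vector gives a length-0 result:
# the single value is recycled to the other operand's length, which is 0.
recycled <- 50 + zero_len

print(len_zero)
print(len_one)
print(recycled)
```
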
For me this is unwanted behavior, and I wasn't aware of it. Maybe you could notify the dplyr team on GitHub. Generally, every outcome of case_when should have the same type and the same length. For example, case_when(TRUE ~ 1:3, FALSE ~ 1:2) throws an error. - Concettaconcettina
Huh, on rereading the question, I see I had assumed (and misread) that the first code block failed. It should, in my mind. I'm with @Cettt, this is unwanted behavior. - Problematic
Apparently the dplyr team sees this as a feature? - Gauntlett
It is complicated, though. My immediate reaction is that I don't want case_when evaluating things it doesn't need to. I'd forgo length checking for efficiency. case_when(TRUE ~ 1, FALSE ~ {Sys.sleep(10); 0}) takes 10 seconds to return, but it could be instant. - Gauntlett
if_else and case_when are not short-circuited, @GregorThomas; while I agree that it would be a great thing, I don't think it's in the cards to make it so. :-( - Problematic
Apparently not. I had assumed that was one of the things if_else did to improve performance over ifelse, but base::ifelse(TRUE, 1, {Sys.sleep(10); 0}) actually is short-circuited! - Gauntlett
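The short-circuiting of base::ifelse mentioned here can be checked without waiting 10 seconds. The sketch below swaps Sys.sleep for a side-effect flag (the `evaluated` variable is a hypothetical helper, not part of any API): because R arguments are lazy, the `no` expression is only run if ifelse actually needs it.

```r
# Base R only. `evaluated` is an illustrative flag standing in for Sys.sleep(10).
evaluated <- FALSE

# All tests are TRUE, so ifelse never touches the `no` argument.
res_all_true <- ifelse(TRUE, 1, { evaluated <- TRUE; 0 })
flag_after_all_true <- evaluated

# One test is FALSE, so the `no` argument has to be evaluated.
res_mixed <- ifelse(c(TRUE, FALSE), 1, { evaluated <- TRUE; 0 })
flag_after_mixed <- evaluated

print(flag_after_all_true)
print(flag_after_mixed)
```

This says nothing about case_when or if_else themselves; per the comments above, those evaluate every right-hand side up front.
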
I am opening a new issue, because the documentation seems murky at the very least. - Gauntlett
@GregorThomas, I disagree about optimizing out length-checking: R recycling, for as long as it's been around, has led to so many bugs when not recognized. When recycling is not desired but one vector's length just happens to be a multiple of the other's, recycling happens and likely corrupts the data. In my head, recycling should be same-length or length-1, nothing else unless explicitly allowed </rant>. Unlikely to change in base R, unfortunately. But dplyr makes an intentional effort on things similar to this (e.g., enforcing class when ifelse does not), so I'm surprised about this. - Problematic
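A small base R illustration of the recycling hazard described above. The vectors are made up for the example: when the shorter length divides the longer one, R recycles silently, and only a non-multiple length produces a warning.

```r
# Base R only; `ids` and `shift` are illustrative data.
ids   <- 1:6
shift <- 1:2  # suppose this was meant to have length 6 (or 1)

# 6 is a multiple of 2, so `shift` is silently reused three times.
silent <- ids + shift

# With a non-multiple length R still recycles, but at least warns.
warned <- FALSE
partial <- withCallingHandlers(
  ids + 1:4,
  warning = function(w) {
    warned <<- TRUE                 # record that a warning was raised
    invokeRestart("muffleWarning")  # keep the console quiet
  }
)

print(silent)
print(warned)
```
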
I agree with you 100% on recycling - I love data.table's approach there as well. But this seems more restrictive. Why does this throw warnings? x <- 1:-1; case_when(x > 0 ~ log(x), TRUE ~ as.numeric(x)). - Gauntlett
fcase warns, too ... and it does no recycling (a problem in my book), so TRUE would need to be rep(TRUE, 3) here (cf. github.com/Rdatatable/data.table/issues/4258, still open). - Problematic
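On the warning in the log(x) example: if every right-hand side is evaluated on the whole vector before any selection happens (as the comments above say for case_when), then log() also sees the non-positive values. The base R sketch below shows that log(x) on the full vector warns, and one possible workaround, assumed here for illustration rather than taken from the thread, which is to subset first and only take the log where x > 0.

```r
# Base R only; a sketch of the warning's source and one possible workaround.
x <- 1:-1  # c(1, 0, -1)

# Evaluating log() on the whole vector hits log(-1), which yields NaN and warns.
warned <- FALSE
all_logs <- withCallingHandlers(
  log(x),
  warning = function(w) {
    warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)

# Workaround: start from the fallback values, then overwrite only where x > 0,
# so log() never sees a non-positive input.
out <- as.numeric(x)
pos <- x > 0
out[pos] <- log(x[pos])

print(warned)
print(out)
```
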
