Remove unused factor levels from a ggplot bar plot
E

4

21

I want to do the opposite of this question, and sort of the opposite of this question, though that's about legends, not the plot itself.

The other SO questions seem to be asking about how to keep unused factor levels. I'd actually like mine removed. I have several name variables and several columns (wide format) of variable attributes that I'm using to create numerous bar plots. Here's a reproducible example:

library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")

I get this:

[Image: bar plot of var1 by name, with an empty space where B would be]

I'd like only the names that have a value for the plotted var column to show up in my bar plot (as in, there would be no empty space for B).

Reusing the base plot code will be quite easy if I can simply change my output file name and the y=var bit. I'd rather not have to subset my data frame just to use droplevels on the result for each plot, if possible!


Update based on the na.omit() suggestion

Consider a revised data set:

library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5), var3=c(NA,6,7))
ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")

I need to use na.omit() for plotting var1 because there's an NA present. But since na.omit makes sure values are present for all columns, the plot removes A as well since it has an NA in var3. This is more analogous to my data. I have 15 total responses with NAs peppered about. I only want to remove factor levels that don't have values for the current plotted y vector, not that have NAs in any vector in the whole data frame.

Eisenstark answered 9/7, 2012 at 21:3 Comment(6)
Why do you need to have a data frame to begin with, when you're actually just plotting one row of it? If you didn't have the data frame (and instead had just a list/vector), you could just drop the NA fields.Legume
@TiloWiklund: I'm an R novice, so feel free to suggest an alternative. I'm plotting a column of names against numerous different columns of data for a series of plots. Some columns are incomplete, some aren't. The incomplete ones are analogous to the above and leave gaps which I don't want since I only need to compare the variables that actually have data associated with them for that particular measured response. Does that make sense?Eisenstark
You can also simply drop rows by setting a condition on only that column: ggplot(df[!is.na(df$var1),], aes(x=name,y=var1)) + geom_bar().Ela
@joran: this seems quite similar to Tilo's solution below, though a bit simpler than passing two vector names. Regardless of the tweak on omitting na's, I guess the real lesson learned is that there's no way to do this automatically from ggplot.Eisenstark
@TiloWiklund ggplot() expects a data frame as its first argument. And anyway, isn't he plotting two rows of it, one a factor (name), the other a numeric (var1)? Hendy needs to pass both variables; otherwise how does ggplot() know to plot the values as two bars and not as a numeric vector of data?Metamathematics
@GavinSimpson true, bad wording on my part. I should have said he used a fixed and finite number of columns.Legume
M
22

One easy option is to use na.omit() on your data frame df to remove the rows that contain an NA:

ggplot(na.omit(df), aes(x=name,y=var1)) + geom_bar(stat="identity")

Given your update, the following

ggplot(df[!is.na(df$var1), ], aes(x=name,y=var1)) + geom_bar(stat="identity")

works OK and only considers NAs in var1. Alternatively, given that you are only plotting name and var1, apply na.omit() to a data frame containing only those two variables:

ggplot(na.omit(df[, c("name", "var1")]), aes(x=name,y=var1)) + geom_bar(stat="identity")
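Since the question mentions reusing the same plot code for many columns, the idea above could be wrapped in a small function. This is only a sketch: plot_var() is a made-up helper name, not something ggplot2 provides, and it assumes the names always live in a column called name.

```r
library(ggplot2)

df <- data.frame(name = c("A", "B", "C"),
                 var1 = c(1, NA, 2),
                 var2 = c(3, 4, 5),
                 var3 = c(NA, 6, 7))

# plot_var() is a hypothetical helper, not part of ggplot2.
# It builds a two-column data frame (name + the requested column) and
# drops rows that are NA in that column only, so NAs elsewhere are ignored.
plot_var <- function(data, ycol) {
  sub <- data.frame(name = data$name, value = data[[ycol]])
  sub <- sub[!is.na(sub$value), ]
  ggplot(sub, aes(x = name, y = value)) +
    geom_bar(stat = "identity") +
    ylab(ycol)
}

# One plot per measurement column
plots <- lapply(c("var1", "var2", "var3"), function(v) plot_var(df, v))
```

Each element of plots can then be printed or saved under its own file name.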
Metamathematics answered 9/7, 2012 at 21:7 Comment(8)
This is great and very easy. My actual data set must be missing one or more values from numerous columns as applying na.omit leaves me with a data frame with no rows... Any other suggestions?Eisenstark
It would have been helpful to know that initially. See my updated Answer.Metamathematics
I would have liked to specify that as well. Not knowing what solution would come up, I didn't realize that would be an issue. Honestly, I expected there to be a ggplot option to drop things. Given that people want to keep unused levels in the linked questions and are able to specify drop=FALSE, I kind of wondered why drop=T wouldn't do exactly what I wanted! Thanks for the updated answer.Eisenstark
@Eisenstark The level B is not unused in your example. It is very much used as it is present in the data. The NA is just as valid a data point as any other value as far as R is concerned. A truly unused level would be the B in A <- factor(c("a","c"), levels = c("a","b","c")). In A, the level b is not present in the data.Metamathematics
Sure, technically, I suppose. From my perspective, I have prototypes and numerous measured test results. There is no data at the intersection of prototype B and test method var1. My data frame is composed of a column of prototype names and columns of test data. Wide format. "Truly unused levels" are only possible in long, right?Eisenstark
You are missing the point that there is data at the intersection of B and var1; we just don't have the value of that data available to us. Re wide vs long, again that is not strictly correct; for example, if we do df[!is.na(df$var1), ] the variable name in the resulting data frame is a factor with the same levels as the full data set and hence it now does have a truly un-used level, that of B. Un-used levels can crop up in any factor.Metamathematics
Put another way, forget about var1 as that has nothing to do with the un-used level. If there are no elements of a factor that correspond to one or more levels of the factor then those levels are considered un-used. This is independent of the data structure or format.Metamathematics
I guess my point was that if my data were in long format with vectors name, var and value, I could create a "truly unused factor" without massaging data with !is.na. That's not possible in my current data arrangement, correct? (With the above, I have no choice but to have a "blank" at the intersection of name and var1. In long form I just wouldn't have a row in which name=B, var=1.)Eisenstark
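The distinction discussed in these comments, between a level whose value is NA and a level that is truly unused, can be checked in base R. A small sketch, where name is deliberately created as a factor (an assumption; in recent R versions data.frame() keeps strings as character by default):

```r
df <- data.frame(name = factor(c("A", "B", "C")),
                 var1 = c(1, NA, 2))

# B is a used level here: a row exists for it, even though var1 is NA
table(df$name)

# Subsetting on var1 removes the row for B, but the factor keeps the level,
# so B is now a truly unused level
sub <- df[!is.na(df$var1), ]
levels(sub$name)

# droplevels() discards levels that no longer occur in the data
levels(droplevels(sub)$name)
```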
L
6

Notice that, when plotting, you're using only two columns of your data frame. Rather than passing your whole data frame, you could take just the relevant columns, df[, c("name", "var1")], apply na.omit() to remove the unwanted rows (as Gavin Simpson suggests), na.omit(df[, c("name", "var1")]), and then plot this data.

My R/ggplot is quite rusty, and I realise that there are probably cleaner ways to achieve this.

Legume answered 9/7, 2012 at 21:30 Comment(1)
Didn't realize I could do that from within ggplot, but it now seems obvious. This definitely works. It would be great if there was an equivalent to drop=T or scale="free", as I'll have to tweak all of my plot functions this way. Shouldn't be too bad, and I'll just use dat[, c(1,n)] so that I can iterate through each without much hassle. Thanks!Eisenstark
R
2

A lot of time has passed since this question was originally asked. In 2021 if I was handling this I would use something like:

library(ggplot2)
library(tidyr)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))

df %>% 
  drop_na(var1) %>% 
  ggplot(aes(name, var1)) +
  geom_col()

Created on 2021-12-03 by the reprex package (v2.0.1)
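If all of the columns need plotting at once, the long format mentioned in the comments on the accepted answer is another option. The following is a sketch rather than a drop-in solution; the column names var and value are my own choice, and var3 is taken from the question's update. With scales = "free_x", each panel only shows the names that have a value for that variable.

```r
library(ggplot2)
library(tidyr)

df <- data.frame(name = c("A", "B", "C"),
                 var1 = c(1, NA, 2),
                 var2 = c(3, 4, 5),
                 var3 = c(NA, 6, 7))

# Reshape to one row per (name, var) pair and drop the NA cells,
# so a missing measurement simply has no row
long <- pivot_longer(df, cols = -name,
                     names_to = "var", values_to = "value",
                     values_drop_na = TRUE)

# Free x scales let each panel keep only the names present in its own data
ggplot(long, aes(name, value)) +
  geom_col() +
  facet_wrap(~ var, scales = "free_x")
```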

Razor answered 3/12, 2021 at 17:54 Comment(1)
This is awesome! I think there are a lot of questions on SO like this. I just did something similar to another question, where all of the answers now seemed fiddly and tedious vs. dplyr. Thanks for taking the time to modernize() :)Eisenstark
D
0

This would be easier and simpler:

library(ggplot2)
df <- data.frame(name=c("A","B","C"), var1=c(1,NA,2),var2=c(3,4,5))
a1 <- ggplot(df, aes(x=name,y=var1)) + geom_bar(stat="identity")
a1 + scale_x_discrete(limits=c("A", "C"))
Doy answered 11/4 at 0:4 Comment(1)
Reasonable with few variables, might be cleaner if generalized to something like limits = unique(df$name) vs. explicitly having to define the values to include.Eisenstark
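Following up on the generalization idea, the limits can be computed from the rows that actually have a value for the plotted column. A sketch, where present is just an illustrative variable name; note that unique(df$name) on the unfiltered data would still contain B.

```r
library(ggplot2)

df <- data.frame(name = c("A", "B", "C"), var1 = c(1, NA, 2), var2 = c(3, 4, 5))

# Names that have a var1 value; as.character() guards against name being a factor
present <- as.character(df$name[!is.na(df$var1)])

# Rows outside the limits are dropped from the plot (ggplot2 warns about them)
ggplot(df, aes(x = name, y = var1)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = present)
```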
