How can I replace a factor's levels with the top n levels (by some metric), plus [other]?
For a factor with more than a sensible number of levels to color in a graph, I want to replace any levels that are not in the 'top 10' with 'other'.

Alternate question: How do I reduce my factor levels to the number that RColorBrewer can plot as separate colors?

For example, if I want to plot number of homeruns per decade from the baseball data:

require(ggplot2)
require(plyr)  # provides the baseball data set
qplot(data=baseball,10*year%/%10,hr,
  stat="identity",geom="bar")

[Image: simple graph to set the scene]

Perhaps I'd like to see what teams contributed to this:

qplot(data=baseball,10*year%/%10,hr,
  fill=team,
  stat="identity",geom="bar")

[Image: too many teams to tell colors apart or plot on page]

This creates too many color levels! The colors are so similar you can't distinguish them, and there are so many they won't fit on the screen.

I'd really like to see the top X (7) teams (by total homerun count) and then the rest all lumped together in a single category/color called 'other'.

Let's imagine we have a function called hotfactor which knows how to do this:

hotfactor <- function(afactor, orderby, n) { ??? }

qplot(data=baseball,10*year%/%10,hr,
  fill=hotfactor(factor(team),hr,n=7),
  stat="identity",geom="bar") + 
  scale_fill_brewer("team",palette="Dark2")

[Image: sample image for solution]

So what can I use for 'hotfactor'?

Mervin answered 13/7, 2011 at 2:55

So after going through several iterations and searching the web, I have created this nice short one.

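hotfactor <- function(fac, by, n=10, o="other") {
   # Sum `by` per level, rank the totals (largest = 1), and rename every level
   # ranked below the top n to `o`; the duplicated names merge into one level.
   levels(fac)[rank(-xtabs(by~fac))[levels(fac)]>n] <- o
   fac
}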

It's handy for summarising data, and it lets you use the RColorBrewer color schemes, each of which has a limited number of carefully selected colors.
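As a rough illustration of what the function returns, here is a small sketch on toy data. The team names and home run counts are invented for this example and are not taken from the baseball data set; the function is repeated so the snippet runs on its own.

```r
# hotfactor as given in this answer, repeated so the example is self-contained
hotfactor <- function(fac, by, n = 10, o = "other") {
  levels(fac)[rank(-xtabs(by ~ fac))[levels(fac)] > n] <- o
  fac
}

# Hypothetical toy data: five teams with made-up home run counts
team <- factor(c("A", "B", "C", "D", "E", "A", "B", "C"))
hr   <- c(10, 20, 5, 1, 2, 30, 5, 4)

# Totals are A = 40, B = 25, C = 9, E = 2, D = 1, so A and B are the top two
top2 <- hotfactor(team, hr, n = 2)
levels(top2)  # "A" "B" "other"
table(top2)   # A: 2 rows, B: 2 rows, other: 4 rows
```

Note that ties are ranked with the default "average" method of rank(), so tied totals at the cut-off may all be kept or all be lumped.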


Usage notes:

fac should be a factor, and the function works best when it has no empty levels. You may want to run droplevels(as.factor(mydata)) first.
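One way an empty level can get in the way, sketched on made-up data: an unused level has a total of zero, so it is ranked last and can produce an "other" level that contains no observations. The variables f and by below are hypothetical.

```r
# hotfactor as given in this answer, repeated so the example is self-contained
hotfactor <- function(fac, by, n = 10, o = "other") {
  levels(fac)[rank(-xtabs(by ~ fac))[levels(fac)] > n] <- o
  fac
}

# Subsetting a factor keeps its original levels, so "D" is now empty
f  <- factor(c("A", "B", "C", "D"))[1:3]
by <- c(3, 2, 1)

with_empty    <- hotfactor(f, by, n = 3)
without_empty <- hotfactor(droplevels(f), by, n = 3)

levels(with_empty)     # "A" "B" "C" "other"  (the "other" level has no rows)
levels(without_empty)  # "A" "B" "C"
```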

It doesn't sort the factor levels. For best results in bar charts, run the following on the output factor so that the levels are ordered by descending total.

x <- hotfactor(f,val)
x <- reorder(x,-val,sum)
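A minimal sketch of the reordering step on invented data (the team names and counts are hypothetical). It shows that the merged "other" level can end up first after hotfactor, and that reorder with the negated values sorts the levels from largest to smallest total.

```r
# hotfactor as given in this answer, repeated so the example is self-contained
hotfactor <- function(fac, by, n = 10, o = "other") {
  levels(fac)[rank(-xtabs(by ~ fac))[levels(fac)] > n] <- o
  fac
}

# Hypothetical toy data: totals are A = 10, B = 40, C = 5, D = 30
team <- factor(c("A", "B", "C", "D", "B", "D"))
hr   <- c(10, 15, 5, 20, 25, 10)

x <- hotfactor(team, hr, n = 2)
before <- levels(x)        # "other" "B" "D": "other" takes the slot of level "A"

x <- reorder(x, -hr, sum)  # sort levels by descending total of hr
after <- levels(x)         # "B" "D" "other"
```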
Mervin answered 13/7, 2011 at 2:56

Comments (6):

Sorry, couldn't help editing. It's compact enough as it is ... I added a few spaces. - Ileneileo

Hah! I was already late leaving work so I didn't have time to think about neatness. Thanks. - Mervin

It might seem odd that I'm answering my own question, but it's a great way to get an answer that I can google when I forget it or move jobs. - Mervin

It's actually completely allowed/expected, I think. - Ileneileo

See the question for usage examples. - Mervin

Some of the parameters (not variables) to the hotfactor function have default values. This means the function will probably do something sensible if you don't supply them. This is fairly basic R. - Mervin
