Group by a column and sort by another column in R
I am examining the imdb movie dataset in kaggle with R.

Here is a minimal repro dataset:

> movies <- data.frame(
+   movie = as.factor(c("Movie 1", "Movie 2", "Movie 3", "Movie 4")),
+   director = as.factor(c("Dir 1", "Dir 2", "Dir 1", "Dir 3")),
+   director_rating = c(1000, 2000, 1000, 3000))

> movies
    movie director director_rating
1 Movie 1    Dir 1            1000
2 Movie 2    Dir 2            2000
3 Movie 3    Dir 1            1000
4 Movie 4    Dir 3            3000

Note that every row with the same director has the same value of director_rating.

I want to list the directors, sorted by rating, and one row per director. The following code works:

> library(dplyr)
> movies %>% 
  group_by(director) %>%
  summarize(director_rating = mean(director_rating)) %>%
  arrange(desc(director_rating))

    # A tibble: 3 x 2
  director director_rating
    <fctr>           <dbl>
1    Dir 3            3000
2    Dir 2            2000
3    Dir 1            1000

But it seems wasteful to compute the mean when I know that all the ratings for a single director are identical. What is a more idiomatic/efficient way to do this in R?

Ejector answered 13/9, 2016 at 12:59 Comment(7)
Just use first instead of mean to return the first rating? – Omar
@Omar – yes, that works; I forgot that first is available in R :-) – Ejector
"every row that has the same director has the same value of rating for the director" – FYI, this is a bad way to structure data. You should have separate movies and directors tables with the attributes split accordingly. You might want to read Hadley's thoughts on "tidy" data: jstatsoft.org/article/view/v059i10 – Rna
@Rna – I realize this is not ideal from a tidyverse or SQL-normalization viewpoint, but this is the form the data is often needed in during modeling. For example, if I wanted to predict a movie's rating using the director's rating and each actor's rating as predictors, I would need to denormalize it into something like the above anyway, no? – Ejector
Yes, I agree with that. In my own work I'd at least start with them separate, merging only when necessary, though that may be true in your case too (just as background to this question). – Rna
@Rna – here is my ML-noob view: if I get everything in one big table, I could, as Hadley suggests, split it up while tidying and then use dplyr's join functions to denormalize it again before modeling. But it seems redundant to perform two steps that cancel each other out, especially when the original data is loaded denormalized. Instead, I could extract temporary tables for the individual normalized entities (like directors) for EDA, as I do above, and then use the original denormalized dataset for modeling. – Ejector
Yeah, that may be true. Usually I start out with ~five tables (from various sources: the Census, the client, some third party... working on economics stuff), then do various merges to break them down into a different set of ~five tables based on "unit" (person, place, firm, etc.), and only in the final step merge things as the model needs them. I find it a lot cleaner for catching bugs and thinking about the model, and it takes up less memory. In your case, though, what you describe sounds reasonable. You just need to keep track of which variables are determined by which others. – Rna
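The first-based variant suggested in the comments would look something like this (a sketch, using the same movies data frame as in the question):

```r
library(dplyr)

movies %>%
  group_by(director) %>%
  # first() just picks the first value in each group, avoiding arithmetic
  summarize(director_rating = first(director_rating)) %>%
  arrange(desc(director_rating))
```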
There's actually no need to group and summarise, since you are just looking for distinct / unique entries. A dplyr option is therefore:

select(movies, -movie) %>% 
  distinct() %>% 
  arrange(desc(director_rating))
#  director director_rating
#1    Dir 3            3000
#2    Dir 2            2000
#3    Dir 1            1000

Or, in case you want to keep the other columns:

distinct(movies, director, .keep_all = TRUE) %>%   # for dplyr >= 0.5.0
  arrange(desc(director_rating))
#    movie director director_rating
#1 Movie 4    Dir 3            3000
#2 Movie 2    Dir 2            2000
#3 Movie 1    Dir 1            1000
Minivet answered 13/9, 2016 at 13:6 Comment(2)
In dplyr 0.5.0, your second way does not work without .keep_all: the default in distinct there is to drop all columns other than director. You have to do distinct(movies, director, .keep_all = TRUE) %>% arrange(desc(director_rating)). Maybe my version is behind the times or something... – Rna
@Rna thanks for that. I haven't upgraded to 0.5 yet. – Minivet
Here's a base R option:

unique(movies[,2:3])[order(-unique(movies[,2:3])[,2]),]
#  director director_rating
#4    Dir 3            3000
#2    Dir 2            2000
#1    Dir 1            1000
Apostolic answered 13/9, 2016 at 13:9 Comment(1)
Always good to show a base R alternative. In this case, you could consider storing unique(movies[,2:3]) in a new object so you don't need to compute it twice (in case of larger data sets). – Minivet
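Following the comment's suggestion, the base R one-liner can be rewritten to compute the unique rows only once (a sketch, same movies data frame as in the question):

```r
# Keep only the director and director_rating columns, deduplicate once,
# then reuse the stored object when ordering by rating (descending).
u <- unique(movies[, 2:3])
u[order(-u$director_rating), ]
```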
We can use data.table

library(data.table)
setDT(movies)[, .(director_rating = director_rating[1]), director][order(-director_rating)]
#    director director_rating
#1:    Dir 3            3000
#2:    Dir 2            2000
#3:    Dir 1            1000

Or with setorder/unique, where data.table's unique method also has a by option:

unique(setorder(setDT(movies), -director_rating), by = "director")
Sultanate answered 13/9, 2016 at 13:2 Comment(2)
Yes, that works, thanks. I would still like to know how to do this idiomatically the dplyr way. – Ejector
@Ejector After the group_by step, use summarise(director_rating = first(director_rating)) – Sultanate