I am examining the imdb movie dataset in kaggle with R.
Here is a minimal repro dataset:
> movies <- data.frame(movie = as.factor(c("Movie 1", "Movie 2", "Movie 3", "Movie 4")), director = as.factor(c("Dir 1", "Dir 2", "Dir 1", "Dir 3")), director_rating = c(1000, 2000, 1000, 3000))
> movies
movie director director_rating
1 Movie 1 Dir 1 1000
2 Movie 2 Dir 2 2000
3 Movie 3 Dir 1 1000
4 Movie 4 Dir 3 3000
Note that every row that has the same director has the same value of rating for the director.
I want to list the directors, sorted by rating, and one row per director. The following code works:
> library(dplyr)
> movies %>%
group_by(director) %>%
summarize(director_rating = mean(director_rating)) %>%
arrange(desc(director_rating))
# A tibble: 3 x 2
director director_rating
<fctr> <dbl>
1 Dir 3 3000
2 Dir 2 2000
3 Dir 1 1000
But it seems wasteful to compute the mean when I know that all the ratings for a single director are identical. What is a more idiomatic/efficient way to do this in R?
first
instead ofmean
to return the first rating? – Omarfirst
being available in R :-) – Ejectormovies
anddirectors
tables with attributes split accordingly. You might want to read Hadley's thoughts on "tidy" data: jstatsoft.org/article/view/v059i10 – Rna