Make a table showing the 10 largest values of a variable in R?

Asked 11/8, 2015 at 10:45 Answered 21/9, 2021 at 5:56

I want to make a simple table that showcases the largest 10 values for a given variable in my dataset, as well as 4 other variables for each observation, so basically a small subset of my data. It would look something like this:

Score  District  Age  Group  Gender
17     B         23   Red    1
12     A         61   Red    0
11.7   A         18   Blue   0
10     B         18   Red    0
.
.
etc.

where the rows are sorted by the Score variable in descending order. All of the data is contained in a single data frame.

Soluk asked 11/8, 2015 at 10:45 Comment(6)

Is it grouped by Var4? What is the expected output – Burne 11/8, 2015 at 10:46

hi @akrun, the expected output is pretty much what I've written in the box, just with 10 rows instead of 4. Var1 could be something like a test score, and var2-var5 would be demographic data, e.g. var2=district, var3=age, var4=class, var5=sex – Soluk 11/8, 2015 at 10:48

It is better to show that also because description can be confusing. Sorry, I didn't understand what you wanted. – Burne 11/8, 2015 at 10:49

Updated the variable names – Soluk 11/8, 2015 at 10:50

Do you want the 10 largest rows based on the Score, grouped by 'Group' – Burne 11/8, 2015 at 10:52

I basically just want the 10 largest values for score in the dataset, and include the other 4 variables for reference. No grouping variable. – Soluk 11/8, 2015 at 10:53

This should do it...

data <- data[with(data,order(-Score)),]

data <- data[1:10,]

Demolish answered 11/8, 2015 at 10:52 Comment(2)

Awesome, this worked! Thanks so much. It seems that brackets in R do much the same as the replace command does in Stata. – Soluk 11/8, 2015 at 11:0

Maybe you could just wrap it into head(data[order(-data$Score),], 10) – Babette 11/8, 2015 at 13:47
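As a self-contained illustration of the base R approach above combined with the head() suggestion, here is a minimal sketch. The data frame `scores`, its values, and the `Extra` column are made up for the example; only the column names Score, District, Age, Group and Gender come from the question. It also selects just the five columns wanted in the table, which the original answer does not do.

```r
# Hypothetical data: column names follow the question, values are invented.
scores <- data.frame(
  Score    = c(9, 17, 3, 12, 5, 11.7, 8, 10, 4, 7.2, 6, 9.5),
  District = rep(c("B", "A"), 6),
  Age      = c(23, 61, 18, 18, 30, 45, 52, 27, 33, 40, 19, 64),
  Group    = rep(c("Red", "Red", "Blue"), 4),
  Gender   = rep(c(1, 0, 0), 4),
  Extra    = seq_len(12)  # a column we do not want in the final table
)

# Columns to show in the table
keep <- c("Score", "District", "Age", "Group", "Gender")

# Sort by Score (largest first), keep the chosen columns, take the first 10 rows
top10 <- head(scores[order(scores$Score, decreasing = TRUE), keep], 10)
print(top10)
```

Assigning the result to a new object (`top10`) rather than overwriting the original data frame keeps the full data available for later steps.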

You can do this using arrange from dplyr. This should also work if there are grouping variables. Just add group_by before the arrange. We filter the first 10 observations using slice.

 library(dplyr)
 df1 %>%
    arrange(desc(Score)) %>%
    slice(1:10)

Another option is top_n from dplyr (see ?top_n; suggested by @docendodiscimus), a wrapper that uses filter and min_rank to select the top n (here 10) entries by 'Score'. Note that top_n has been superseded by slice_max since dplyr 1.0.0.

 top_n(df1, 10, Score)

Or we use filter by creating a logical condition with row_number which is equivalent to rank(ties.method='first') (contributed by @Steven Beaupre)

 filter(df1, row_number(desc(Score)) <= 10)

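Or a data.table option (by @David Arenburg). We convert the 'data.frame' to a 'data.table' by reference (setDT(df1)), order by 'Score' in decreasing order, and select the first 10 observations. .SD stands for "Subset of Data": it is a data.table holding the data for the current group, which here is the whole ordered table because there is no grouping.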

 library(data.table)
 setDT(df1)[order(-Score), .SD[1:10]]

Burne answered 11/8, 2015 at 10:53 Comment(3)

Or top_n(df1, 10, Score) – Tradelast 11/8, 2015 at 11:16

top_n uses min_rank and rank(ties.method = "min"). If you want to have the results with ties.method = "first" you could do: filter(df1, row_number(desc(Score)) <= 10) – Tops 11/8, 2015 at 11:19

I wonder if you also could add setDT(df1)[order(-Score), .SD[1:10]] or head(setDT(df1)[order(-Score)], 10) – Babette 11/8, 2015 at 13:46
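The difference between the two tie-handling rules mentioned in the comments can be seen on a tiny vector. This is a minimal sketch in base R that mimics the ranking step only; the vector `score` is made up, and it assumes, as the comment above states, that top_n ranks with ties.method = "min" while row_number corresponds to ties.method = "first".

```r
# Made-up scores with a tie for second place
score <- c(10, 9, 9, 8)

# Negate so that rank 1 is the largest value
rank_min   <- rank(-score, ties.method = "min")    # tied values share a rank
rank_first <- rank(-score, ties.method = "first")  # ties broken by position

# Ask for the "top 2" under each rule
top_min   <- score[rank_min <= 2]    # keeps both tied 9s, so 3 values
top_first <- score[rank_first <= 2]  # keeps exactly 2 values

print(top_min)
print(top_first)
```

So the "min" rule can return more than n rows when the n-th value is tied, while the "first" rule always returns exactly n.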

You can get the highest values of a vector using the code below:

my_vec <- c(1:100)
tail(sort(my_vec),10)

So if you want to use this method as a data frame filter you could do:

data(mtcars)
mtcars[mtcars$mpg %in% tail(sort(mtcars$mpg),4),]

which would produce the following (note that the rows keep their original order rather than being sorted by mpg):

> mtcars[mtcars$mpg %in% tail(sort(mtcars$mpg),4),]
                mpg cyl disp  hp drat    wt  qsec vs am gear carb
Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
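One thing to be aware of with the %in% filter is that it matches on values, so a tie at the cutoff can return more rows than requested. The sketch below compares it with sorting and taking the first n rows; the choice of n = 8 and the object names `by_value` and `by_order` are only for illustration.

```r
data(mtcars)
n <- 8

# Value-based filter: every row whose mpg equals one of the top n values
by_value <- mtcars[mtcars$mpg %in% tail(sort(mtcars$mpg), n), ]

# Position-based: sort by mpg (largest first), then take the first n rows
by_order <- head(mtcars[order(-mtcars$mpg), ], n)

nrow(by_value)  # 9, because two cars share the 8th-highest mpg (22.8)
nrow(by_order)  # 8
```

The second form also returns the rows already sorted from largest to smallest, which is what the question asks for.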

Frit answered 11/8, 2015 at 11:11 Comment(0)

As of dplyr 1.0.0, we can use the slice_max function.

library(dplyr)

mtcars %>% slice_max(mpg, n = 4)

#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
#Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
#Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
#Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

By default, all rows tied with the n-th value are kept, so more than n rows may be returned. To ignore ties and return exactly n rows, use with_ties = FALSE.
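A short sketch of the tie behaviour, assuming dplyr 1.0.0 or later is installed. It uses n = 3 because the third-highest mpg in mtcars (30.4) is shared by two cars; the object names are only for illustration.

```r
library(dplyr)

# Default: ties at the cutoff are kept, so both 30.4 rows come back
with_ties <- mtcars %>% slice_max(mpg, n = 3)

# with_ties = FALSE: exactly n rows are returned
without_ties <- mtcars %>% slice_max(mpg, n = 3, with_ties = FALSE)

nrow(with_ties)     # 4
nrow(without_ties)  # 3
```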

Kassa answered 21/9, 2021 at 5:56 Comment(0)

Using sqldf:

library(sqldf)
sqldf("SELECT * FROM mtcars 
      ORDER BY mpg DESC 
      LIMIT 10", row.names = TRUE)

Output:

               mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1

Proteinase answered 11/8, 2015 at 16:0 Comment(0)
