Preserve order of columns when going from wide to long format

Asked 27/10, 2017 at 18:15 Answered 28/1, 2023 at 15:6

I'm trying to preserve the order of columns when I gather them from wide to long format. The problem I'm having is after I gather and summarize the order is lost. The number of columns is huge so I don't want to manually type the order.

Here's an example:

library(tidyr)
library(dplyr)

N <- 4
df <- data.frame(sample = c(1,1,2,2),
                 y1.1 = rnorm(N), y2.1 = rnorm(N), y10.1 = rnorm(N))
> df
  sample      y1.1      y2.1      y10.1
1      1  1.040938 0.8851727 -0.3617224
2      1  1.175879 1.0009824 -1.1352406
3      2 -1.501832 0.3446469 -1.8687008
4      2 -1.326817 0.4434628 -0.8795962

What I want is to preserve the order of the columns. After I do some manipulation, the order is lost. Seen here:

dfg <- df %>% 
  gather(key="key", value="value", -sample) %>%
  group_by(sample, key) %>%
  summarize(mean = mean(value))

> filter(dfg, sample == 1)
  sample   key       mean
   <dbl> <chr>      <dbl>
1      1  y1.1  0.2936335
2      1 y10.1  0.6170505
3      1  y2.1 -0.2250543

You can see how it puts y10.1 ahead of y2.1 which I don't want. What I want is to preserve that order, seen here:

dfg <- df %>% 
  gather(key="key", value="value", -sample)

> filter(dfg, sample == 1)
  sample   key       value
1      1  y1.1  0.60171521
2      1  y1.1 -0.01444823
3      1  y2.1  0.81566726
4      1  y2.1 -1.26577581
5      1 y10.1  0.41686388
6      1 y10.1  0.81723707

For some reason the group_by and summarize operations change the order. I'm not sure why. I tried the ungroup command but that doesn't do anything. As I said earlier, my actual data frame has many columns and I need to preserve the order. The reason to preserve order is so I can plot the data in the correct order.

Any ideas?

Unhandy answered 27/10, 2017 at 18:15 Comment(0)

Or you can convert the key column to a factor with levels reflecting the original column names' order:

df %>% 
    gather(key="key", value="value", -sample) %>%
    mutate(key=factor(key, levels=names(df)[-1])) %>% # add this line to convert the key to a factor
    group_by(sample, key) %>%
    summarize(mean = mean(value)) %>%
    filter(sample == 1)

# A tibble: 3 x 3
# Groups:   sample [1]
#  sample    key       mean
#   <dbl> <fctr>      <dbl>
#1      1   y1.1  0.8310786
#2      1   y2.1 -1.2596933
#3      1  y10.1  0.8208812

Mulley answered 27/10, 2017 at 18:23 Comment(1)

I selected this as the answer because it is the most general solution. However the solution by @Moody_Mudskipper provides a unique option that it sorts numerically which may be desired in same cases (where the columns aren't in the desired order). – Unhandy 30/10, 2017 at 13:14

The tidyverse packages allow the elegant solution now:

    library(tidyverse)
    N <- 4
    df <- data.frame(sample = c(1,1,2,2),
                    y1.1 = rnorm(N), y2.1 = rnorm(N), y10.1 = rnorm(N))
    df %>% 
        gather("key", "value", -sample, factor_key = T) %>% 
        group_by(sample, key) %>%
        summarise(mean = mean(value))

which results in

    # A tibble: 6 x 3
    # Groups:   sample [2]
    sample key      mean
    <dbl> <fct>   <dbl>
    1      1 y1.1   0.0894
    2      1 y2.1   0.551 
    3      1 y10.1  0.254 
    4      2 y1.1  -0.555 
    5      2 y2.1  -1.36  
    6      2 y10.1 -0.794

Pisa answered 3/10, 2019 at 9:56 Comment(2)

This is a really nice solution. gather is now depreciated in favor of pivot_longer, but it doesn't appear at a first glance at the help files that pivot_longer has this functionality. – Unhandy 3/10, 2019 at 13:46

The next release of tidyr (after 1.0.2) introduces the names_transform argument for pivot_longer(). This argument applies a function to the names created by pivot_longer(). The factor_key = TRUE argument is equivalent to names_transform = list(key = forcats::fct_inorder). – Chuu 19/5, 2020 at 1:21

I found a workable solution by using a lookup table. It seems to work for me because I can extract the column names and assign an ordered number to the column name and then pair with my data.frame.

Here's the solution:

lookup <- tibble(key = c("y1.1", "y2.1", "y10.1"),
                 index = c(1,2,3))

> left_join(dfg, lookup, by="key")
# A tibble: 6 x 4
  sample   key       mean index
   <dbl> <chr>      <dbl> <dbl>
1      1  y1.1  0.2936335     1
2      1 y10.1  0.6170505     3
3      1  y2.1 -0.2250543     2
4      2  y1.1  1.3652070     1
5      2 y10.1  0.9889233     3
6      2  y2.1  0.5216553     2

Unhandy answered 27/10, 2017 at 18:17 Comment(0)

If your columns are really ordered by the number it contains, this should work :

library(readr)

df %>% 
  gather(key="key", value="value", -sample) %>%
  group_by(sample, key)         %>%
  summarize(mean = mean(value)) %>%
  arrange(parse_number(key))    %>%  # <- sorting by number contained in key
  filter(sample == 1)

# # A tibble: 3 x 3
# # Groups:   sample [1]
#     sample   key       mean
# <dbl> <chr>      <dbl>
#   1      1  y1.1 -0.9236688
#   2      1  y2.1 -0.2168337
#   3      1 y10.1  0.5041981

Corn answered 29/10, 2017 at 11:20 Comment(0)

Yet another way could be to arrange the dataframe using a customised version of the key column you want to sort on:

library(dplyr)
library(tidyr)

df %>% 
  gather(key="key", value="value", -sample) %>%
  group_by(sample, key) %>%
  summarize(mean = mean(value)) %>%
  arrange(as.numeric(stringr::str_replace(key, "y", "")), .by_group = TRUE)

#> # A tibble: 6 x 3
#> # Groups:   sample [2]
#>   sample   key        mean
#>    <dbl> <chr>       <dbl>
#> 1      1  y1.1  0.07001689
#> 2      1  y2.1  1.15349430
#> 3      1 y10.1  1.18266024
#> 4      2  y1.1  0.42616604
#> 5      2  y2.1  1.05891682
#> 6      2 y10.1 -0.12561209

Feculent answered 27/10, 2017 at 21:36 Comment(2)

Is this essentially converting the key to numeric and then sorting based on that? – Unhandy 30/10, 2017 at 13:12

Hi @LloydChristmas, yes you are right. It does the same thing as Moody_Mudskipper's more recent answer with the more readable parse_number function. You do need to add .by_group = TRUE though to arrange if you want the final result sorted by the grouped variables first followed by the key. – Feculent 30/10, 2017 at 23:5

If we combine the ideas of previous suggested answers AND we use pivot_longer() because it is not depricated, we could add a step to set the key as_factor() instead of class character. If we leave it as character, it'll get resorted into alphanumeric order.

library(tidyverse)

N <- 4
df <- data.frame(sample = c(1,1,2,2),
                 y1.1 = rnorm(N), y2.1 = rnorm(N), y10.1 = rnorm(N))


dfg <- df |> 
  pivot_longer(2:4, names_to = "key", values_to = "value") |> 
  mutate(key = as_factor(key)) |> 
  group_by(sample, key) |> 
  summarize(mean = mean(value)) |> 
  ungroup()

dfg

# A tibble: 6 × 3
  sample key     mean
   <dbl> <fct>  <dbl>
1      1 y1.1  -0.789
2      1 y2.1   1.16 
3      1 y10.1 -0.187
4      2 y1.1   0.962
5      2 y2.1   0.673
6      2 y10.1  0.502

Disappointed answered 28/1, 2023 at 15:6 Comment(2)

What is the |> syntax versus %>% – Unhandy 31/1, 2023 at 0:46

@LloydChristmas |> is the native R pipe. %>% is the magrittr pipe. r4ds.hadley.nz/workflow-pipes.html You'll see more of the native R pipe going forward. I like to use fonts with ligatures like Fira Code or Cascadia Code so that it displays as a neat triangle. – Disappointed 1/2, 2023 at 1:33

Recommended topics

Hot tags