loess regression on each group with dplyr::group_by()
Asked Answered
L

3

11

Alright, I'm waving my white flag.

I'm trying to compute a loess regression on my dataset.

I want loess to compute a different set of points that plots as a smooth line for each group.

The problem is that the loess calculation is escaping the dplyr::group_by function, so the loess regression is calculated on the whole dataset.

Internet searching leads me to believe this is because dplyr::group_by wasn't meant to work this way.

I just can't figure out how to make this work on a per-group basis.

Here are some examples of my failed attempts.

test2 <- test %>% 
  group_by(CpG) %>% 
  dplyr::arrange(AVGMOrder) %>% 
  do(broom::tidy(predict(loess(Meth ~ AVGMOrder, span = .85, data=.))))

> test2
# A tibble: 136 x 2
# Groups:   CpG [4]
   CpG            x
   <chr>      <dbl>
 1 cg01003813 0.781
 2 cg01003813 0.793
 3 cg01003813 0.805
 4 cg01003813 0.816
 5 cg01003813 0.829
 6 cg01003813 0.841
 7 cg01003813 0.854
 8 cg01003813 0.866
 9 cg01003813 0.878
10 cg01003813 0.893

This one works, but I can't figure out how to apply the result to a column in my original dataframe. The result I want is column x. If I apply x as a column in a separate line, I run into issues because I called dplyr::arrange earlier.

test2 <- test %>% 
  group_by(CpG) %>% 
  dplyr::arrange(AVGMOrder) %>% 
  dplyr::do({
    predict(loess(Meth ~ AVGMOrder, span = .85, data=.))
  })

This one simply fails with the following error.

"Error: Results 1, 2, 3, 4 must be data frames, not numeric"

Also it still isn't applied as a new column with dplyr::mutate

fems <- fems %>% 
  group_by(CpG) %>% 
  dplyr::arrange(AVGMOrder) %>% 
  dplyr::mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.)))

This was my fist attempt and mostly resembles what I want to do. Problem is that this one performs the loess prediction on the entire dataframe and not on each CpG group.

I am really stuck here. I read online that the purr package might help, but I'm having trouble figuring it out.

data looks like this:

> head(test)
    X geneID        CpG                                        CellLine       Meth AVGMOrder neworder Group SmoothMeth
1  40     XG cg25296477 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.81107210         1        1     5  0.7808767
2  94     XG cg01003813 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.97052120         1        1     5  0.7927130
3 148     XG cg13176022 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.06900448         1        1     5  0.8045080
4 202     XG cg26484667 iPS__HDF51IPS14_passage27_Female____165.592.1.2 0.84077890         1        1     5  0.8163997
5  27     XG cg25296477  iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.81623880         2        2     3  0.8285259
6  81     XG cg01003813  iPS__HDF51IPS6_passage33_Female____157.647.1.2 0.95569240         2        2     3  0.8409501

unique(test$CpG) [1] "cg25296477" "cg01003813" "cg13176022" "cg26484667"

So, to be clear, I want to do a loess regression on each unique CpG in my dataframe, apply the resulting "regressed y axis values" to a column matching the original y axis values (Meth).

My actual dataset has a few thousand of those CpG's, not just the four.

https://docs.google.com/spreadsheets/d/1-Wluc9NDFSnOeTwgBw4n0pdPuSlMSTfUVM0GJTiEn_Y/edit?usp=sharing

Leastwise answered 3/5, 2018 at 20:8 Comment(6)
Have you looked at the Many Models chapter of R for Data Science? It walks through a very similar exerciseNadenenader
I will take a look. Thanks.Leastwise
So you want the loess predicted values as an additional column in your dataset? I think you can just change do(broom::tidy...) in your first example to do(x = broom::tidy...), or use broom::augment. WIll test when i can make some data up or if you supply someWyckoff
trying now. also added google sheets link to test dataframeLeastwise
This might by helpful for you: https://mcmap.net/q/103656/-error-when-using-broom-augment-and-dplyr-with-loess-fitReaper
hi Markus, Yes that post did help. I was able to modify my command to this: test2 <- test %>% nest(-CpG) %>% mutate(fit = map(data, ~ loess(Meth ~ AVGMOrder, .))) %>% unnest(map2(fit, data, augment)) It returned my original dataframe with the column .fitted. That generated a gorgeous line. Thank You. Add your answer and I will choose it as best.Leastwise
P
3

You may have already figured this out -- but if not, here's some help.

Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.

So for your case:

fems <- fems %>% 
  group_by(CpG) %>% 
  arrange(CpG, AVGMOrder) %>% 
  mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
    data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))

Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.

Pena answered 20/11, 2018 at 19:16 Comment(0)
C
9

This is a neat Tidyverse way to make it work:

library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)

models <- fems %>%
        tidyr::nest(-CpG) %>%
        dplyr::mutate(
                # Perform loess calculation on each CpG group
                m = purrr::map(data, loess,
                               formula = Meth ~ AVGMOrder, span = .5),
                # Retrieve the fitted values from each model
                fitted = purrr::map(m, `[[`, "fitted")
        )

# Apply fitted y's as a new column
results <- models %>%
        dplyr::select(-m) %>%
        tidyr::unnest()

# Plot with loess line for each group
ggplot(results, aes(x = AVGMOrder, y = Meth, group = CpG, colour = CpG)) +
        geom_point() +
        geom_line(aes(y = fitted))

This is what the output looks like

Craggie answered 12/3, 2019 at 17:31 Comment(4)
It would be great if you could explain what's going on with the nesting and mapping. I was quite surprised that the first map took data as the argument rather than a dot or the actual name of the dataframe. This worked for me though, so thank you!Valedictory
@Valedictory by default, nest will assign data as the name of the new column. nest basically makes a dataframe from each group to give a series of smaller dataframes. So by taking data as its first argument, map will take each nested dataframe in turn and compute loess on each.Craggie
How would someone apply this not only to different groups; but to different groups and different columns? I have many indicators that need detrending. So must fit many non-parametric splines.Salaam
How would you extract the x-value when, say, Meth == 0.8 for all CpG groups? I know cg13176022 will result in an NA but I have a similar situation where I need to find the x-value for a modeled y-value for each group. Curious where predict comes into play. Thanks.Aqaba
P
3

You may have already figured this out -- but if not, here's some help.

Basically, you need to feed the predict function a data.frame (a vector may work too but I didn't try it) of the values you want to predict at.

So for your case:

fems <- fems %>% 
  group_by(CpG) %>% 
  arrange(CpG, AVGMOrder) %>% 
  mutate(Loess = predict(loess(Meth ~ AVGMOrder, span = .5, data=.),
    data.frame(AVGMOrder = seq(min(AVGMOrder), max(AVGMOrder), 1))))

Note, loess requires a minimum number of observations to run (~4? I can't remember precisely). Also, this will take a while to run so test with a slice of your data to make sure it's working properly.

Pena answered 20/11, 2018 at 19:16 Comment(0)
A
0

Unfortunately, the approaches described above did not work in my case. Thus, I implemented the Loess prediction into a regular function, which worked very well. In the example below, the data is contained in the df data frame while we group by df$profile and want to fit the Loess prediction into the df$daily_sum values.

# Define important variables
span_60 <- 60/365  # 60 days of a year
span_365 <- 365/365  # a whole year


# Group and order the data set
df <- as.data.frame(
  df %>% 
    group_by(profile) %>%
    arrange(profile, day) %>%
    )
)


# Define the Loess function. x is the data frame that has to be passed
predict_loess <- function(x) {
  
  # Declare that the loess column exists, but is blank
  df$loess_60 <- NA
  df$loess_365 <- NA
  
  # Identify all unique profilee IDs
  all_ids <- unique(x$profile)
  
  # Iterate through the unique profilee IDs, determine the length of each vector (which should correspond to 365 days)
  # and isolate the according rows that belong to the profilee ID.
  for (i in all_ids) {
    len_entries <- length(which(x$profile == i))
    queried_rows <- result <- x[which(x$profile == i), ]
    
    # Run the loess fit and write the result to the according column
    fit_60 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_60))
    fit_365 <- predict(loess(daily_sum ~ seq(1, len_entries), data=queried_rows, span = span_365))
    x[which(x$profile == i), "loess_60"] <- fit_60
    x[which(x$profile == i), "loess_365"] <- fit_365
  }
  
  # Return the initial data frame
  return(x)
}


# Run the Loess prediction and put the results into two columns - one for a short and one for a long time span
df <- predict_loess(df)
Acquainted answered 14/7, 2022 at 14:41 Comment(0)

© 2022 - 2025 — McMap. All rights reserved.