vertically distribute multiple lines with smart spacing

Asked 8/11, 2013 at 17:3 Answered 22/7, 2017 at 20:41

A common display of spectroscopic data (intensity vs wavelength) is used below to compare the position of peaks in the data across multiple spectra. Assuming they all share a baseline at 0, it is convenient to offset the multiple lines vertically by a constant spacing, to avoid the distraction of overlapping lines.

enter image description here

Thus becomes

enter image description here

I'm looking for a better strategy to perform this vertical shift automatically, starting from data in long format. Here is a minimal example.

# fake data (5 similar-looking spectra)
spec <- function(){
  x <- runif(100, 0, 100)
  data.frame(x=x, y=jitter(dnorm(x, mean=jitter(50), sd=jitter(5)), amount=0.01))
}
require(plyr)
all <- ldply(1:5, function(ii) data.frame(spec(), id=ii))

My current strategy is as follows:

convert the spectra from long format to wide format. This involves interpolation, as the spectra do not necessarily have identical x axis values.
find the minimum offset between spectra to avoid overlap between neighbours
shift the spectra by multiples of this distance
melt back to long format

I implemented this using plyr,

# function that evenly spaces the spectra to avoid overlap
# d is in long format, s is a scaling factor for the vertical shift
require(plyr); require(ggplot2)

spread_plot <- function(d, s=1){
  ranges <- ddply(d, "id", with, each(min,max,length)(x))
  common_x <- seq(max(ranges$min), min(ranges$max), length=max(ranges$length))
  new_y <- dlply(d, "id", function(x) approx(x$x, x$y, common_x)$y)
  mat <- do.call(cbind, new_y)
  test <- apply(mat, 1, diff)
  shift <- max(-test[test < 0])
  origins <- s*seq(0, by=shift, length=ncol(mat))

  for(ii in seq_along(origins)){
    current <- unique(d[["id"]])[ii]
    d[d[["id"]] == current, "y"] <- 
      d[d[["id"]] == current, "y"] + origins[ii]
  }
  d
}

test <- spread_plot(all)

ggplot(test, aes(x, y, colour=id, group=id))+
  geom_line() + guides(colour=guide_legend())

This strategy suffers from a few shortcomings:

it is slow
the offset is not a pretty number; I do not know how to automatically round it well so that spectra are offset e.g. by 0.02, or 50, etc. depending on the range of the intensities. pretty(origins) is problematic in that it can return a different number of values.

I feel I'm missing a simpler solution, perhaps working directly with the original data in long format.

Krasnoyarsk answered 8/11, 2013 at 17:3 Comment(3)

Typically such spectra exhibit identical x values. Is that really not the case for yours? – Harder 8/11, 2013 at 17:41

in my case it's Raman spectra acquired at different laser excitation wavelengths, so the dispersion of the gratings results in slightly different wavenumbers. – Krasnoyarsk 8/11, 2013 at 18:26

now, bonus points if this is made into a new position_xxx() function for ggplot2. – Krasnoyarsk 8/11, 2013 at 19:12

Interesting question.

Here's a possibility, offered without detailed comment, except to point out that it:

Should be very fast, due to a combo of its avoidance of plyr, use of data.table, and operation on data in its original long format.
Uses pretty() to pick a pretty offset.
Like your code, is not guaranteed to produce no intersections of lines, since overlap can happen between the lattice of points formed by common_x.

Here's the code

## Setup
library(data.table)
library(plyr)
library(ggplot2)

spec <- function(){
  x <- runif(100, 0, 100)
  data.frame(x=x, y=jitter(dnorm(x, mean=jitter(50), sd=jitter(5)), amount=0.01))
}
all <- ldply(1:5, function(ii) data.frame(spec(), id=ii))

## Function that uses data.table rather than plyr to compute and add offsets
spread_plot <- function(d, s=1){
    d <- data.table(d, key="id")
    ranges <- d[, list(min=min(x), max=max(x), length=length(x)),by="id"]
    common_x <- seq(max(ranges$min), min(ranges$max), length=max(ranges$length))
    new_y <- d[,list(y=approx(x, y, common_x)$y, N=seq_along(common_x)),
               by="id"]
    shift <- max(new_y[, max(abs(diff(y))), by = "N"][[2]])
    shift <- pretty(c(0, shift), n=0)[2]
    origins <- s*seq(0, by=shift, length=length(unique(d$id)))
    d[,y:=(y + origins[.GRP]),by="id"]
    d
}

## Try it out
test <- spread_plot(all)
ggplot(test, aes(x, y, colour=id, group=id))+
  geom_line() + guides(colour=guide_legend())

enter image description here

Petrography answered 8/11, 2013 at 18:8 Comment(1)

thanks, that looks pretty straight-forward even for a non-DT user. The only thing I would need to read up on is this .GRP, but its meaning is pretty obvious. – Krasnoyarsk 8/11, 2013 at 19:11

I still think you could rely on some assumptions about typical data from spectroscopy. Usually, x values are sorted, the number of them is equal for all spectra and they are quite similar:

# new fake data (5 similar-looking spectra)
spec <- function(){
  x <- jitter(seq(0,100,1),0.1)
  data.frame(x=x, y=jitter(dnorm(x, mean=jitter(50), sd=jitter(5)), amount=0.01))
}
require(plyr)
all <- ldply(1:5, function(ii) data.frame(spec(), id=ii))

If these assumptions are valid, you could treat the spectra as having identical x values:

library(ggplot2)
spread_plot  <- function(d, s=0.05) {
  #add some checks here, e.g., for equal length 
  d <- d[order(d$x),]
  d$id <- factor(d$id)
  l <- levels(d$id)
  pretty_offset <- pretty(s*min(tapply(d$y, d$id, function(x) abs(diff(range(x))))))[2]

  for (i in seq_len(length(l)-1)+1) {
      mean_delta_y <- mean(d[d$id == l[i], "y"] - d[d$id == l[i-1], "y"])
      d[d$id == l[i], "y"] <-  d[d$id == l[i], "y"] - mean_delta_y
      min_delta_y <- abs(1.05 * min(d[d$id == l[i], "y"] - d[d$id == l[i-1], "y"]))
      pretty_delta_y <- max(min_delta_y, pretty_offset)
      d[d$id == l[i], "y"] <-  d[d$id == l[i], "y"] + pretty_delta_y
      }
  p <- ggplot(d, aes(x=x, y=y, col=id)) + geom_line()
  print(p)
}
spread_plot(all, s=0)

enter image description here

spread_plot(all, s=0.5)

enter image description here

Harder answered 9/11, 2013 at 17:32 Comment(1)

good points regarding the set of reasonable assumptions. The data.table solution is more compact and presumably faster but it's good to see alternative approaches. – Krasnoyarsk 11/11, 2013 at 11:30

As suggested by hadley, the for loop can be avoided very simply,

d$y <- d$y + origins[d$id]

Full code:

spread_plot <- function(d, s=1){
  ranges <- ddply(d, "id", with, each(min,max,length)(x))
  common_x <- seq(max(ranges$min), min(ranges$max), length=max(ranges$length))
  new_y <- dlply(d, "id", function(x) approx(x$x, x$y, common_x)$y)
  mat <- do.call(cbind, new_y)
  test <- apply(mat, 1, diff)
  shift <- max(-test[test < 0])
  origins <- s*seq(0, by=shift, length=ncol(mat))

  d$y <- d$y + origins[d$id]

  d
}

test <- spread_plot(all)

ggplot(test, aes(x, y, colour=id, group=id))+
  geom_line() + guides(colour=guide_legend())

Krasnoyarsk answered 22/7, 2017 at 20:41 Comment(0)

Recommended topics

Hot tags