Piecewise regression with R: plotting the segments

Asked 6/1, 2012 at 13:37 Answered 6/1, 2012 at 18:52

I have 54 points. They represent offer and demand for products. I would like to show there is a break point in the offer.

First, I sort the x-axis (offer) and remove the values that appears twice. I have 47 values, but I remove the first and last ones (doesn't make sense to consider them as break points). Break is of length 45:

Break<-(sort(unique(offer))[2:46])

Then, for each of these potential break points, I estimate a model and I keep in "d" the residual standard error (sixth element in model summary object).

d<-numeric(45)
for (i in 1:45) {
model<-lm(demand~(offer<Break[i])*offer + (offer>=Break[i])*offer)
d[i]<-summary(model)[[6]] }

Plotting d, I notice that my smaller residual standard error is 34, that corresponds to "Break[34]": 22.4. So I write my model with my final break point:

model<-lm(demand~(offer<22.4)*offer + (offer>=22.4)*offer)

Finally, I'm happy with my new model. It's significantly better than the simple linear one. And I want to draw it:

plot(demand~offer)
i <- order(offer)
lines(offer[i], predict(model,list(offer))[i])

But I have a warning message:

Warning message:
In predict.lm(model, list(offer)) :
  prediction from a rank-deficient fit may be misleading

And more important, the lines are really strange on my plot.

My plot with the supposedly two segments, but not joining

Here are my data:

demand <- c(1155, 362, 357, 111, 703, 494, 410, 63, 616, 468, 973, 235,
            180, 69, 305, 106, 155, 422, 44, 1008, 225, 321, 1001, 531, 143,
            251, 216, 57, 146, 226, 169, 32, 75, 102, 4, 68, 102, 462, 295,
            196, 50, 739, 287, 226, 706, 127, 85, 234, 153, 4, 373, 54, 81,
            18)
offer <- c(39.3, 23.5, 22.4, 6.1, 35.9, 35.5, 23.2, 9.1, 27.5, 28.6, 41.3,
           16.9, 18.2, 9, 28.6, 12.7, 11.8, 27.9, 21.6, 45.9, 11.4, 16.6,
           40.7, 22.4, 17.4, 14.3, 14.6, 6.6, 10.6, 14.3, 3.4, 5.1, 4.1,
           4.1, 1.7, 7.5, 7.8, 22.6, 8.6, 7.7, 7.8, 34.7, 15.6, 18.5, 35,
           16.5, 11.3, 7.7, 14.8, 2, 12.4, 9.2, 11.8, 3.9)

Spiegleman answered 6/1, 2012 at 13:37 Comment(3)

54 points does not seem a very large amount of points to detect such a transistion. You could opt to also post this in stats.stackexchange.com with the specfic question if this is enough points to detect the break in the data. Just my 2 ct. – Maui 6/1, 2012 at 20:4

I think this is pretty statistically dubious. Better to estimate the breakpoint in the model itself (although that makes it non linear). You can't trust the p-vals or standard errors from the current informal estimation process. – Misconstrue 7/1, 2012 at 20:35

54 points isn't a large amount, I agree, but both my linear and my piecewise linear regressions are significant. And moreover, the residual standard error of the piecewise linear model is 103.9 compared to 121.3 for the linear model with two less degrees of freedom. The piecewise model is significantly better. – Spiegleman 9/1, 2012 at 10:52

Here is an easier approach using ggplot2.

require(ggplot2)
qplot(offer, demand, group = offer > 22.4, geom = c('point', 'smooth'), 
   method = 'lm', se = F, data = dat)

EDIT. I would also recommend taking a look at this package segmented which supports automatic detection and estimation of segmented regression models.

enter image description here

UPDATE:

Here is an example that makes use of the R package segmented to automatically detect the breaks

library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx - 35, 0) - 1.5*pmax(xx - 70, 0) + 15*pmax(zz - .5, 0) + 
  rnorm(100,0,2)
dati <- data.frame(x = xx, y = yy, z = zz)
out.lm <- lm(y ~ x, data = dati)
o <- segmented(out.lm, seg.Z = ~x, psi = list(x = c(30,60)),
  control = seg.control(display = FALSE)
)
dat2 = data.frame(x = xx, y = broken.line(o)$fit)

library(ggplot2)
ggplot(dati, aes(x = x, y = y)) +
  geom_point() +
  geom_line(data = dat2, color = 'blue')

segmented

Ectoderm answered 6/1, 2012 at 18:52 Comment(6)

Thank you for the idea about using the "segmented" package. "Muggeo, V.M.R. (2003) Estimating regression models with unknown break-points. Statistics inMedicine 22, 3055–3071" is an interesting paper to understand what's going on in the package. – Spiegleman 10/1, 2012 at 12:19

In particular, it has an advantage to the code I used: the two segments are connected! In "The R Book", the author is not mentioning his segments are not connected and is even showing a plot with connected segments... – Spiegleman 10/1, 2012 at 14:39

Where is an example of using ggplot() with segmented()? I cannot seem to find one anywhere. – Winniewinnifred 18/9, 2015 at 22:20

I have added an example that uses ggplot with segmented. – Ectoderm 20/9, 2015 at 3:5

After qplot(…) I get Error: Unknown parameters: method, se – Klingel 24/3, 2016 at 0:6

Is you solution applicable HERE as well? – Hungary 22/11, 2020 at 2:53

Vincent has you on the right track. The only thing "weird" about the lines in your resulting plot is that lines draws a line between each successive point, which means that "jump" you see if it simply connecting the two ends of each line.

If you don't want that connector, you have to split the lines call into two separate pieces.

Also, I feel like you can simplify your regression a bit. Here's what I did:

#After reading your data into dat
Break <- 22.4
dat$grp <- dat$offer < Break

#Note the addition of the grp variable makes this a bit easier to read
m <- lm(demand~offer*grp,data = dat)
dat$pred <- predict(m)

plot(dat$offer,dat$demand)
dat <- dat[order(dat$offer),]
with(subset(dat,offer < Break),lines(offer,pred))
with(subset(dat,offer >= Break),lines(offer,pred))

which produces this plot:

enter image description here

Agronomy answered 6/1, 2012 at 16:37 Comment(0)

The weird lines are simply due to the order in which the points are plotted. The following should look better:

i <- order(offer)
lines(offer[i], predict(model,list(offer))[i])

The warning comes from the fact that the * character is interpreted by lm.

> lm(demand~(offer<22.4)*offer + (offer>=22.4)*offer)
Call:
lm(formula = demand ~ (offer < 22.4) * offer + (offer >= 22.4) * offer)
Coefficients:
            (Intercept)         offer < 22.4TRUE                    offer  
                -309.46                   356.08                    29.86  
      offer >= 22.4TRUE   offer < 22.4TRUE:offer  offer:offer >= 22.4TRUE  
                     NA                   -20.79                       NA

In addition, (offer<22.4)*offer is a discontinuous function: this is where the discontinuity comes from.

The following should be closer to what you want.

model <- lm(
  demand ~ ifelse(offer<22.4,offer-22.4,0) 
           + ifelse(offer>=22.4,offer-22.4,0) )

Euromarket answered 6/1, 2012 at 13:51 Comment(2)

Thanks for answer! I still have the warning message: "Warning message: In predict.lm(model2, list(offer)) : prediction from a rank-deficient fit may be misleading". The result is a bit better: I've only one piecewise segment, but I don't understand why there are 3 segments (i.e., my two segments are not connected naturally...) – Spiegleman 6/1, 2012 at 14:21

Can you please explain why this makes it smooth. Intuitively thought it would be 22.4-offer in first part of function. And offer-22.4 in second part. Doesn't first part mean "if offer is less than 22.4 then calculated offer-22.4", meaning it will give a negative covariate? Therefore our coefficient will be multiplied against these negative values when we estimate the input? – Cosmography 19/12, 2018 at 6:30

Recommended topics

Hot tags