Linear model singular because of large integer datetime in R?

A simple regression of random normal values on a datetime variable fails, but the same data with small integers in place of the datetimes works as expected.

# Example dataset with 100 observations at 2 second intervals.
set.seed(1)
df <- data.frame(x=as.POSIXct("2017-03-14 09:00:00") + seq(0, 199, 2),
                 y=rnorm(100))

#> head(df)
#                     x          y
# 1 2017-03-14 09:00:00 -0.6264538
# 2 2017-03-14 09:00:02  0.1836433
# 3 2017-03-14 09:00:04 -0.8356286

# Simple regression model.
m <- lm(y ~ x, data=df)

The slope is missing due to singularities in the data. Calling the summary demonstrates this:

summary(m)

# Coefficients: (1 not defined because of singularities)
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)  0.10889    0.08982   1.212    0.228
# x                 NA         NA      NA       NA

Could this be because of the POSIXct class?

# Convert date variable to integer.
df$x2 <- as.integer(df$x)
lm(y ~ x2, data=df)

# Coefficients:
# (Intercept)           x2  
#      0.1089           NA

Nope, the coefficient for x2 is still missing.

What if we make the baseline of x2 zero?

# Subtract minimum of x.
df$x3 <- df$x2 - min(df$x2)
lm(y ~ x3, data=df)

# Coefficients:
# (Intercept)           x3  
#   0.1312147   -0.0002255

This works!

One more example to rule out that this is due to the datetime variable.

# Subtract large constant from date (data is now from 1985).
df$x4 <- df$x - 1000000000
lm(y ~ x4, data=df)

# Coefficients:
# (Intercept)           x4  
#   1.104e+05   -2.255e-04

Unexpectedly, this works too. Why would an otherwise identical dataset shifted by about 30 years behave differently?

It could be that .Machine$integer.max (2147483647 on my PC) has something to do with it, but I can't figure it out. I would greatly appreciate an explanation of what's going on here.

Peden answered 14/3, 2017 at 9:7 Comment(1)
Since the POSIXct origin is completely arbitrary, subtracting the minimum time is usually advisable. It also makes interpreting the coefficients a bit easier. (Breunig)

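Yes, the large offset is the cause. QR factorization is numerically stable, but it is not all-powerful: with an offset this large, the datetime column is almost a multiple of the intercept column, and at its default tolerance (tol = 1e-07) qr() treats the matrix as rank-deficient.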

X <- cbind(1, 1e+11 + 1:10000)
qr(X)$rank
# 1

Here X plays the role of the model matrix in your linear regression: an all-ones column for the intercept and a sequence for the datetime (note the large offset).
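
To see why the offset matters for the data in the question, the sketch below compares the size of the datetime column before and after the intercept direction is removed. It assumes, as I understand the LINPACK-based routine behind qr() and lm.fit() to work, that a column is treated as linearly dependent when its reduced norm falls below tol (default 1e-07) times its original norm. The names x_2017, x_1985 and norm_ratio are made up for illustration.

```r
# Seconds since the epoch for the question's timestamps. The exact value depends
# on the local time zone, but the magnitude (about 1.49e9) does not.
x_2017 <- as.numeric(as.POSIXct("2017-03-14 09:00:00")) + seq(0, 199, 2)
x_1985 <- x_2017 - 1e9   # the shifted variable x4 from the question

# Hypothetical helper: norm of the column after removing the intercept
# direction, relative to the norm of the raw column.
norm_ratio <- function(x) sqrt(sum((x - mean(x))^2)) / sqrt(sum(x^2))

norm_ratio(x_2017)   # roughly 3.9e-8, below the default tol of 1e-07
norm_ratio(x_1985)   # roughly 1.2e-7, just above it

qr(cbind(1, x_2017))$rank   # 1: the datetime column is dropped
qr(cbind(1, x_1985))$rank   # 2: both columns are kept
```

Under that assumption, the 1985 version works only because it lands barely on the right side of the threshold, not because 30 years make the data fundamentally different.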

If you center the datetime column, the two columns become orthogonal and the problem is very stable (even when solving the normal equations directly!).
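
The effect of centering can be checked on the same toy matrix. This is only a minimal sketch; Xc is a name introduced here.

```r
X  <- cbind(1, 1e+11 + 1:10000)
Xc <- cbind(1, X[, 2] - mean(X[, 2]))   # centre the second column

qr(Xc)$rank            # 2: full rank is recovered
crossprod(Xc)[1, 2]    # inner product of the two columns, numerically zero
```

For the data in the question, the equivalent step would be regressing on something like x - mean(x), or on x - min(x) as in the x3 example, which is not orthogonal to the intercept but removes the large offset just as well.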

Turgescent answered 14/3, 2017 at 9:24 Comment(1)
See also qr(X, tol = 1e-16)$rank or lm.fit(cbind(1, df$x), df$y, tol = 1e-16) for OP's example. (Breunig)
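
The tolerance suggestion in the comment above can be tried on the question's data roughly as follows. This is a sketch rather than a recommendation: loosening tol only hides the poor conditioning, and the variable name fit is introduced here.

```r
set.seed(1)
df <- data.frame(x = as.POSIXct("2017-03-14 09:00:00") + seq(0, 199, 2),
                 y = rnorm(100))

# lm.fit() takes the tolerance directly; the design matrix is built by hand,
# with the datetime converted to plain seconds since the epoch.
fit <- lm.fit(cbind(1, as.numeric(df$x)), df$y, tol = 1e-16)

fit$rank           # 2: the datetime column is no longer dropped
fit$coefficients   # the slope should agree with the x3 fit in the question
```

The intercept now refers to time zero of the POSIXct origin (1970), so it is large and hard to interpret, which is another reason to prefer centering.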
