I have relied on the qr()
function a lot in dealing with rank-deficient situations, but have recently run into some examples where it doesn't work correctly. Consider the
matrix badX
below:
badX <-
structure(c(-1.641906809157e-10, 0, 0, 0, 0, -0.5, 0, 0, -1.10482935525559e-16,
0, -3.06266685765538e-17, 0, -4.83736007092039e-17, 0, -3.14414492582296e-18,
-3.06158275836099e-18), dim = c(4L, 4L), dimnames = list(c("(Intercept)",
"A2", "A3", "B2"), NULL))
We cannot invert this matrix using solve():
solve(badX)
## Error in solve.default(badX): system is computationally singular: reciprocal condition number = 5.55308e-18
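As a check, base R's rcond() returns essentially the same reciprocal condition estimate that appears in the error message (this is just a diagnostic, not a fix):
rcond(badX)   # expect something on the order of 1e-18, per the message above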
Yet qr()
and its associated routines think this matrix has rank 4 and can invert it:
qr(badX)$rank
## [1] 4
qr.solve(badX)
## [,1] [,2] [,3] [,4]
## [1,] -6090479645 0 2.197085e+10 7.366741e+10
## [2,] 0 -2 0.000000e+00 0.000000e+00
## [3,] 0 0 -3.265128e+16 3.353179e+16
## [4,] 0 0 0.000000e+00 -3.266284e+17
This is a pretty ugly result. I have tried varying the tol
argument, with no change in results.
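For instance, loosening tol by several orders of magnitude still reports full rank, even though the diagonal of R spans something like 17 orders of magnitude (an illustration of the sort of thing I tried):
qr(badX, tol = 1e-4)$rank    # still 4, per the above
abs(diag(qr.R(qr(badX))))    # entries range from 0.5 down to ~1e-18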
For context, the origin of this result is this contrast matrix:
badL <-
structure(c(0, 0, 0, 0, 0, -9.89189274870351e-11, 0, -5.55111512312578e-17,
-2.77555756156289e-17, 1.11022302462516e-16, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -0.25, 0, 0, 0, 0, -0.25, 0, 0,
0, 9.89189274870351e-11, 0, 5.55111512312578e-17, 2.77555756156289e-17,
-1.11022302462516e-16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, -4.23939184015843e-11, 0, -4.16333634234434e-17, -1.38777878078145e-17,
5.55111512312578e-17, 0, 0, 0, 0, 0, -4.23939184015843e-11, 0,
-4.16333634234434e-17, -1.38777878078145e-17, 5.55111512312578e-17,
0, 0, 0, 0, 0, 0, 0.25, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.25, 0, 0,
0, 0, 0, 0, 0, 0, 4.23939184015843e-11, 0, 4.16333634234434e-17,
1.38777878078145e-17, -5.55111512312578e-17, 0, 0, 0, 0, 0, -1.41313127284714e-11,
0, -6.93889390390723e-18, -6.93889390390723e-18, 1.38777878078145e-17,
4.23939184015843e-11, 0, 4.16333634234434e-17, 1.38777878078145e-17,
-5.55111512312578e-17, 0, 0, 0, 0, 0), dim = c(5L, 24L), dimnames = list(
NULL, c("(Intercept)", "A2", "A3", "B2", "B3", "C2", "C3",
"A2:B2", "A3:B2", "A2:B3", "A3:B3", "A2:C2", "A3:C2", "A2:C3",
"A3:C3", "B2:C2", "B3:C2", "B2:C3", "B3:C3", "A2:B2:C2",
"A3:B2:C2", "A3:B3:C2", "A2:B2:C3", "A3:B2:C3")))
... from which I obtained the QR decomposition of its transpose, to find that it is supposedly of rank 4:
badQR <- qr(t(badL))
badQR$rank
## [1] 4
The above matrix badX
is equal to qr.R(badQR)[1:4, 1:4]
which, based on the rank calculation, was supposed to be a full-rank upper-triangular matrix.
My remedy seems to be to use zapsmall()
so that I get the rank right...
qr(zapsmall(t(badL)))$rank
## [1] 1
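An SVD-based numerical rank agrees (a sketch; the relative tolerance here is my own choice, not a setting from any package):
# Count singular values that are non-negligible relative to the largest one
rank_svd <- function(m, tol = sqrt(.Machine$double.eps)) {
  d <- svd(m, nu = 0, nv = 0)$d
  sum(d > tol * d[1])
}
rank_svd(t(badL))   # 1, matching the zapsmall() result above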
My question is: why does this happen? If you look at badL
, it's pretty clear that, up to tiny rounding amounts, four of its rows are zero and only the second row is materially nonzero. I would have thought that qr()
's pivoting methods would handle this better. Is there a better way to get more reliable results?
I am running Windows 11 Pro, version 10.0.22000 build 22000. Here's my R system information.
R.Version()
## $platform
## [1] "x86_64-w64-mingw32"
##
## $arch
## [1] "x86_64"
##
## $os
## [1] "mingw32"
##
## $crt
## [1] "ucrt"
##
## $system
## [1] "x86_64, mingw32"
##
## $status
## [1] ""
##
## $major
## [1] "4"
##
## $minor
## [1] "2.0"
##
## $year
## [1] "2022"
##
## $month
## [1] "04"
##
## $day
## [1] "22"
##
## $`svn rev`
## [1] "82229"
##
## $language
## [1] "R"
##
## $version.string
## [1] "R version 4.2.0 (2022-04-22 ucrt)"
##
## $nickname
## [1] "Vigorous Calisthenics"
Created on 2022-06-21 by the reprex package (v2.0.1)
More context
This question came up because I was trying to produce results like this (for a simpler example) in the emmeans package:
> (jt = joint_tests(warpx.emm))
model term df1 df2 F.ratio p.value note
tension 1 37 5.741 0.0217 e
wool:tension 1 37 5.867 0.0204 e
(confounded) 2 37 7.008 0.0026 d e
d: df1 reduced due to linear dependence
e: df1 reduced due to non-estimability
... and in particular, the (confounded)
part. This example is with a two-factor model with wool
at 2 levels and tension
at 3 levels; however, one of the factor combinations is omitted in the data, which means that we can estimate only 1 d.f. for each of the tension
main effect and the wool:tension
interaction effect, and no main effect at all for wool
. There being 4 d.f. for all possible contrasts of the 5 nonempty cells, there are 2 d.f. left over, and those are in the (confounded)
part.
The computation is based on the associated estimable functions:
> attr(jt, "est.fcns")
$tension
(Intercept) woolB tensionM tensionH woolB:tensionM woolB:tensionH
[1,] 0 0 1 0 0.5 0
$`wool:tension`
(Intercept) woolB tensionM tensionH woolB:tensionM woolB:tensionH
[1,] 0 0 0 0 1 0
$`(confounded)`
(Intercept) woolB tensionM tensionH woolB:tensionM woolB:tensionH
[1,] 0 -1 0 0 0 0
[2,] 0 1 0 0 0 0
[3,] 0 -1 0 0 0 0
[4,] 0 -1 0 1 0 0
... and on the contrasts among all cells in the design:
> contrast(warpx.emm, "consec")@linfct
(Intercept) woolB tensionM tensionH woolB:tensionM woolB:tensionH
[1,] 0 1 0 0 0 0
[2,] 0 -1 1 0 0 0
[3,] 0 1 0 0 1 0
[4,] 0 -1 -1 1 -1 0
[5,] 0 1 0 0 0 1
The method I use is to combine the estimable functions for tension
and wool:tension
, and obtain the QR decomposition of its transpose. Then I use qr.resid()
with that and the transpose of the above cell contrasts. That leaves us (after transposing yet again) with the estimable functions shown for (confounded)
. That matrix has 4 rows, but its rank is only 2, as determined by the QR decomposition of this result; then I extract the 2x2 portion of the R part to complete the computation of the F statistic.
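Here is a self-contained sketch of that computation, with the matrices re-typed from the warpx output above. Because the displayed (confounded) matrix has four rows, I drop the fifth cell contrast (the one involving the non-estimable woolB:tensionH coefficient) before projecting; that reading, and all object names, are mine rather than actual emmeans code:
est <- rbind(tension        = c(0, 0, 1, 0, 0.5, 0),
             `wool:tension` = c(0, 0, 0, 0, 1,   0))
cells <- rbind(c(0,  1,  0, 0,  0, 0),
               c(0, -1,  1, 0,  0, 0),
               c(0,  1,  0, 0,  1, 0),
               c(0, -1, -1, 1, -1, 0))
qr_est <- qr(t(est))                         # QR of the transposed estimable functions
confounded <- t(qr.resid(qr_est, t(cells)))  # part of 'cells' not spanned by 'est'
qr(t(confounded))$rank                       # 2, as stated above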
The example at the beginning of this question is similar, but with a larger, more complex model; the badL
matrix is the result of the qr.resid()
procedure described above. In this context, some of those rows arguably should be zero. My workaround at present is to examine the diagonal elements of R (the matrix badX
above) and select those that exceed an absolute threshold.
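A sketch of that workaround (the threshold value here is my own choice):
Rdiag <- abs(diag(qr.R(badQR)))
sum(Rdiag > 1e-6)   # number of non-negligible pivots; should be 1 here, matching zapsmall()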
The essential idea here is that I need to decompose that matrix of all contrasts into two parts -- the known estimable functions and the leftovers. An interesting aspect is that the rank of the leftover part is known in advance, a fact that I have not yet taken advantage of. In future development, it may well be better, per @duffymo's suggestion, to use an SVD rather than these gyrations with qr.resid()
. There's always new stuff to learn...
The point about solve()
is well taken, and thresholding does seem to be the most useful suggestion. But a couple of quick questions: 1. How do we know that sqrt(.Machine$double.eps)
is an appropriate threshold? 2. Supposing it is, isn't that evidence that those columns of X
are just noise? (I'm not really preferring solve()
; I'm preferring to toss out columns that look like just noise.) – Gallfly