Something weird in pheatmap (a bug?)

Reproducible Data:

data(crabs, package = "MASS")
df <- crabs[-(1:3)]
set.seed(12345)
df$GRP <- kmeans(df, 4)$cluster
df.order <- dplyr::arrange(df, GRP)

Data Description:

df has 5 numeric variables. I ran k-means on these 5 variables to produce a new categorical variable, GRP, with 4 levels. I then sorted the data by GRP and named the result df.order.

What I did with pheatmap:

## 5 numerical variables for coloring
colormat <- df.order[c("FL", "RW", "CL", "CW", "BD")]

## Specify the annotation variable `GRP` shown on left side of the heatmap
ann_row <- df.order["GRP"]

## gap indices
gapRow <- cumsum(table(ann_row$GRP))

library(pheatmap)
pheatmap(colormat, cluster_rows = FALSE, show_rownames = FALSE,
         annotation_row = ann_row, gaps_row = gapRow)

Error in annotation_colors[[colnames(annotation)[i]]] : subscript out of bounds

Here is where I got something weird:

At first, I guessed the problem came from the annotation_row argument, so I checked the row names of the two data frames.

all.equal(rownames(colormat), rownames(ann_row))
# [1] TRUE

You can see that they are equal. However, when I ran the following code, the heatmap worked.

rownames(colormat) <- rownames(ann_row)
pheatmap(colormat, cluster_rows = FALSE, show_rownames = FALSE,
         annotation_row = ann_row, gaps_row = gapRow)

In theory, the line "rownames(colormat) <- rownames(ann_row)" should have no effect, because the row names of the two objects were already equal. So why does it make pheatmap() work?

Edit: Following @steveb's comment, I don't even need ann_row to set the row names. I just ran

rownames(colormat) <- rownames(colormat)

and pheatmap() also works. This is still counterintuitive.

Final Output:

In short, colormat does not have row names before rownames(colormat) <- rownames(colormat) but does afterwards. This answer touches on the nature of the issue but does not dig into why or how pheatmap runs into it, or into the details of how R handles row names.

The issue comes from rownames() returning a default vector of row numbers when no row names are set; each element is a row number represented as a string, so row 10 gets the row name "10". With attributes(colormat), you will see that $row.names is an integer vector before rownames(colormat) <- rownames(colormat) and a character vector after (it now has row names). It is not clear to me why anything (other than NULL or NA) is returned when something doesn't have row names set.

attributes(colormat)
## $names
## [1] "FL" "RW" "CL" "CW" "BD"
## 
## $row.names
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38
##  [39]  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76
##  [77]  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114
## [115] 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152
## [153] 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190
## [191] 191 192 193 194 195 196 197 198 199 200
## 
## $class
## [1] "data.frame"

rownames(colormat) <- rownames(colormat)

attributes(colormat)
## $names
## [1] "FL" "RW" "CL" "CW" "BD"
## 
## $row.names
##   [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"  "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"  "25" 
##  [26] "26"  "27"  "28"  "29"  "30"  "31"  "32"  "33"  "34"  "35"  "36"  "37"  "38"  "39"  "40"  "41"  "42"  "43"  "44"  "45"  "46"  "47"  "48"  "49"  "50" 
##  [51] "51"  "52"  "53"  "54"  "55"  "56"  "57"  "58"  "59"  "60"  "61"  "62"  "63"  "64"  "65"  "66"  "67"  "68"  "69"  "70"  "71"  "72"  "73"  "74"  "75" 
##  [76] "76"  "77"  "78"  "79"  "80"  "81"  "82"  "83"  "84"  "85"  "86"  "87"  "88"  "89"  "90"  "91"  "92"  "93"  "94"  "95"  "96"  "97"  "98"  "99"  "100"
## [101] "101" "102" "103" "104" "105" "106" "107" "108" "109" "110" "111" "112" "113" "114" "115" "116" "117" "118" "119" "120" "121" "122" "123" "124" "125"
## [126] "126" "127" "128" "129" "130" "131" "132" "133" "134" "135" "136" "137" "138" "139" "140" "141" "142" "143" "144" "145" "146" "147" "148" "149" "150"
## [151] "151" "152" "153" "154" "155" "156" "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168" "169" "170" "171" "172" "173" "174" "175"
## [176] "176" "177" "178" "179" "180" "181" "182" "183" "184" "185" "186" "187" "188" "189" "190" "191" "192" "193" "194" "195" "196" "197" "198" "199" "200"
## 
## $class
## [1] "data.frame"

It is not the numeric vs. character type of the row names that matters; it is whether row names are set at all. If you ran the following:

rownames(colormat) <- 1:nrow(colormat)

you would find that it fixes the issue too, as the row names are now explicitly set to the row numbers (see attributes(colormat) output).

If you use tibble::has_rownames(colormat) before rownames(colormat) <- rownames(colormat), then you will get FALSE. After assignment, you will get TRUE.

tibble::has_rownames(colormat)
## [1] FALSE
rownames(colormat) <- rownames(colormat)
tibble::has_rownames(colormat)
## [1] TRUE
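
The same check can be done in base R without tibble. This is a minimal sketch on a small toy data frame d (hypothetical, not the crabs data): .row_names_info() returns a negative row count while the row names are automatic and a positive one once they have been set explicitly.

```r
# Toy data frame (hypothetical); data.frame() creates automatic row names
d <- data.frame(x = 1:3, y = 4:6)

before <- .row_names_info(d)   # negative count: row names are automatic
rownames(d) <- rownames(d)     # store the generated names explicitly
after <- .row_names_info(d)    # positive count: row names are now set

before < 0
## [1] TRUE
after > 0
## [1] TRUE
```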

I am not sure how pheatmap uses colormat internally, but it must be running into this issue of the row names not being set. If you report this to the package authors (probably through GitHub: https://github.com/raivokolde/pheatmap), they may update the code to handle this corner case in the next release.
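
One possible mechanism, offered as an assumption rather than something confirmed from the pheatmap source: if the input data frame is converted with as.matrix() at some point, automatic row names are dropped by default, so the resulting matrix has NULL row names and cannot be matched against the rows of annotation_row. The sketch below shows only the base R behaviour, using a hypothetical toy data frame d:

```r
# Toy data frame (hypothetical) with automatic row names
d <- data.frame(x = 1:3, y = 4:6)

# By default, as.matrix() drops automatic row names
m_auto <- as.matrix(d)
is.null(rownames(m_auto))
## [1] TRUE

# Asking as.matrix() to keep them gives character row names
m_forced <- as.matrix(d, rownames.force = TRUE)
rownames(m_forced)
## [1] "1" "2" "3"

# Once row names are set explicitly, the default conversion keeps them
rownames(d) <- rownames(d)
m_set <- as.matrix(d)
rownames(m_set)
## [1] "1" "2" "3"
```

If that assumption holds, it would explain why the seemingly redundant assignment matters: it changes what survives the conversion to a matrix, not what rownames() reports on the data frame.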
