partykit: Displaying terminal node percentile values above terminal node boxplots
I'm trying to plot a regression tree generated with rpart using partykit. Let's suppose the formula used is y ~ x1 + x2 + x3 + ... + xn. What I would like to achieve is a tree with boxplots in terminal nodes, with a label on top listing the 10th, 50th, and 90th percentiles of the distribution of the y values for the observations assigned to each node, i.e., above the boxplot representing each terminal node, I would like to display a label like "10th percentile = $200, mean = $247, 90th percentile = $292."

The code below generates the desired tree:

library("rpart")
fit <- rpart(Price ~ Mileage + Type + Country, cu.summary)
library("partykit")
tree.2 <- as.party(fit)

The following code generates the terminal plots but without the desired labels on the terminal nodes:

plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
  col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
  ylines = 3, cex = 0.5, id = TRUE))

If I can display a mean y-value for a node, then it should be easy enough to augment the label with percentiles, so my first step is to display, above each terminal node, just its mean y-value.

I know I can retrieve the mean y-value within a node (here node #12) with code such as this:

colMeans(tree.2[12]$fitted[2])

So I tried to write a function and pass it via the mainlab parameter of the boxplot panel-generating function to generate a label containing this mean:

labf <- function(node) colMeans(node$fitted[2])
plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
  col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
  ylines = 3, cex = 0.5, id = TRUE, mainlab = labf))

Unfortunately, this generates the error message:

Error in mainlab(names(obj)[nid], sum(wn)) : unused argument (sum(wn)).

But it seems this is on the right track, since if I use:

plot(tree.2, type = "simple", terminal_panel = node_boxplot(tree.2,
  col = "black", fill = "lightgray", width = 0.5, yscale = NULL,
  ylines = 3, cex = 0.5, id = TRUE, mainlab = colMeans(tree.2$fitted[2])))

then I get the correct mean y-value at the root node displayed. I would appreciate help with fixing the error described above so that I show the mean y-values for each separate terminal node. From there, it should be easy to add in the other percentiles and format things nicely.

Martell answered 24/10, 2015 at 3:22 Comment(2)
Could you try to make a reproducible version of the problem? Then I'll try to have a look at it. - Collateral
Sure. Thanks @AchimZeileis! The code below uses the cu.summary Consumer Reports dataset that comes with rpart: fit <- rpart(Price ~ Mileage + Type + Country, cu.summary); par(xpd = TRUE); plot(fit, compress = TRUE); text(fit, use.n = TRUE); tree.2 <- as.party(fit); plot(tree.2). This will generate a tree plot with boxplots at the terminal nodes. What I'm trying to do is to put the mean (and later some other percentiles) above each of the terminal nodes in a label. So instead of "Node 4 (n=21)" the leftmost terminal node would have a label saying something like "mean = 7629.048". - Martell

In principle, you are on the right track. However, when mainlab is a function, it must be a function of id and nobs rather than of the node; see ?node_boxplot. You can also compute the table of means (or some quantiles) for all terminal nodes at once, using the fitted data for the whole tree:

tab <- tapply(tree.2$fitted[["(response)"]],
  factor(tree.2$fitted[["(fitted)"]], levels = 1:length(tree.2)),
  FUN = mean)

Then you can prepare this for plotting by rounding/formatting:

tab <- format(round(tab, digits = 3))
tab
##           1           2           3           4           5           6 
## "       NA" "       NA" "       NA" " 7629.048" "       NA" "12241.552" 
##           7           8           9          10          11          12 
## "14846.895" "22317.727" "       NA" "       NA" "17607.444" "21499.714" 
##          13 
## "27646.000" 

And for adding this into the display, write your own helper function for the mainlab:

mlab <- function(id, nobs) paste("Mean =", tab[id])
plot(tree.2, tp_args = list(mainlab = mlab))
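
The same approach should extend to the percentiles asked about in the question. The following is a minimal sketch rather than a definitive solution: the names y, nd, qs, qlab, and qmainlab are illustrative, and the label format is an assumption. It relies on mainlab being called with the node id as a character string, which is what the error message in the question suggests (names(obj)[nid]).

```r
library("rpart")
library("partykit")

fit <- rpart(Price ~ Mileage + Type + Country, data = cu.summary)
tree.2 <- as.party(fit)

# Response values and terminal node ids for the learning sample
y  <- tree.2$fitted[["(response)"]]
nd <- tree.2$fitted[["(fitted)"]]

# 10th, 50th, and 90th percentiles of the response within each terminal node
qs <- lapply(split(y, nd), quantile, probs = c(0.1, 0.5, 0.9))

# One label per terminal node, named by node id ("4", "6", ...)
qlab <- vapply(qs, function(q) {
  paste0("P10 = ", round(q[1]), ", P50 = ", round(q[2]), ", P90 = ", round(q[3]))
}, character(1))

# mainlab is called with the node id and the number of observations
qmainlab <- function(id, nobs) unname(qlab[as.character(id)])
plot(tree.2, tp_args = list(mainlab = qmainlab))
```

Long labels may overlap between neighboring terminal nodes, so shortening the text or enlarging the plotting device may be necessary.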


Collateral answered 26/10, 2015 at 21:24 Comment(4)
Thank you @AchimZeileis! This solved my problem and I was able to extend the example you provided to include the percentiles. I really appreciate the assistance and the detailed example code. Is there any way to similarly modify the labels for the edges (to replace the commas with newline characters, for example) via an ep_args argument? I found a split parameter but don't see its impact. Setting justmin=3 prevented overlaps of the edge labels, but they're still quite long. Also, what is nobs? Number of observations? I can't seem to find details on that parameter. Many thanks again! - Martell
At the moment newlines instead of commas are not supported; you would have to hack your own version of edge_simple for that. I'll try to think about it when working on the next revision of partykit. As for nobs: this stands for "number of observations", as in the ?nobs extractor function. This should probably be documented better. - Collateral
Thanks again! I'm finding partykit to be incredibly useful. - Martell
Great, glad if it's useful for you. Please also accept the answer if it solved the original question. - Collateral
