What does "training" the data mean in the internals of ggplot2?

I'm following along with the internals of the ggplot2 library and I'm trying to understand how non-positional aesthetics get mapped to the values that get passed to grid. The book describes this process as

"The last part of the data transformation is to train and map all non-positional aesthetics, i.e. convert whatever discrete or continuous input that is mapped to graphical parameters such as colours, linetypes, sizes etc."

However, this is the first time that the idea of "training" the data appears in the text.

The code for this process (from ggplot2:::ggplot_build.ggplot) appears to be:

  # Train and map non-position scales and guides
  npscales <- scales$non_position_scales()
  if (npscales$n() > 0) {
    lapply(data, npscales$train_df)
    plot$guides <- plot$guides$build(npscales, plot$layers, plot$labels, data)
    data <- lapply(data, npscales$map_df)
  } else {
    # Only keep custom guides if there are no non-position scales
    plot$guides <- plot$guides$get_custom()
  }

but I'm unable to follow along with what's actually happening here. Does the lapply(data, npscales$train_df) actually do anything? It doesn't seem to be saved and I would've expected it to be data <- lapply(data, npscales$train_df) instead, but the function seems to always return NULLs no matter what plot I try it with.

What does "training" non-positional data mean in the ggplot2 package?

Selfrespect answered 18/7, 2024 at 1:28 Comment(3)
I don't have an answer, but usually if you see an lapply loop whose result isn't assigned, there are two possible reasons: (i) the function in the loop is called for a side effect, or (ii) it is done to force evaluation of something (which is a specific type of side effect). Personally, I prefer using a for loop for such purposes. – Kenzie
In the book you link to, all chapters have a caveat at the top of the page: "You are reading the work-in-progress third edition of the ggplot2 book. This chapter is currently a dumping ground for ideas, and we don’t recommend reading it." – Valerie
@Valerie that part is the same for version 2, too: github.com/hadley/ggplot2-book:L190 – Icy

In ggplot2 terms, 'training' means keeping track of possible values. For continuous variables, this means keeping track of the range and for discrete variables, that means keeping track of the levels. 'Keeping track' here means to go over every layer's data and update the possible values based on the values encountered in the data.

Under the hood, this is all orchestrated by the {scales} package's DiscreteRange and ContinuousRange classes. See below for examples of how these are updated.

# At first, tracked variable is empty
range <- scales::DiscreteRange$new()
range$range
#> NULL

# Observe data in first layer
range$train(c("A", "X"))
range$range
#> [1] "A" "X"

# Observe data in second layer
range$train(c("B"))
range$range
#> [1] "A" "B" "X"

The same works for continuous ranges:

# Again empty at first
range <- scales::ContinuousRange$new()
range$range
#> NULL

# Observe data in first layer
range$train(c(0, 10))
range$range
#> [1]  0 10

# Observe data in second layer
range$train(c(100))
range$range
#> [1]   0 100

Created on 2024-07-18 with reprex v2.1.1
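
To illustrate why training has to see every layer before anything is mapped, here is a minimal sketch. It assumes values are mapped by plain linear rescaling onto a made-up output range of 1 to 6; ggplot2's real scales use their own palettes and defaults. The `layers` list and the output range are hypothetical, while `ContinuousRange` and `rescale()` come from {scales}.

```r
library(scales)

# Hypothetical values of one aesthetic in two layers
layers <- list(c(0, 10), c(100))

# Train: accumulate one shared range across all layers
rng <- ContinuousRange$new()
for (values in layers) rng$train(values)
rng$range

# Map: rescale every layer against the shared range (assumed output range 1 to 6),
# so that the same input value gets the same output in every layer
mapped <- lapply(layers, rescale, to = c(1, 6), from = rng$range)
mapped
```

If each layer were rescaled against its own range instead, the value 10 in the first layer and the value 100 in the second would both land on the maximum, and the layers would no longer be comparable.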

In the code you present, lapply(data, npscales$train_df) is doing this job. The train_df method is called for the side effect of updating the scale's ranges and returns NULL as it does not alter the data itself and the function result is not needed.
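
To see the side effect in isolation, here is a minimal base R sketch. `make_scale()` and its fields are hypothetical stand-ins and not ggplot2's actual implementation; they only mimic the pattern of a mutable object whose training method returns NULL.

```r
# Hypothetical stand-in for a scale: an environment holding a mutable range
make_scale <- function(aesthetic) {
  self <- new.env()
  self$range <- NULL
  # Widen the stored range with the values found in `df`; return NULL invisibly
  self$train_df <- function(df) {
    if (!aesthetic %in% names(df)) return(invisible(NULL))
    self$range <- range(c(self$range, df[[aesthetic]]), na.rm = TRUE)
    invisible(NULL)
  }
  self
}

# Made-up data for two layers
layer_data <- list(
  data.frame(x = c(1, 2), size = c(0, 10)),
  data.frame(x = 3, size = 100)
)

size_scale <- make_scale("size")
result <- lapply(layer_data, size_scale$train_df)

result            # a list of NULLs: nothing worth assigning
size_scale$range  # the state that training actually changed
```

The list returned by lapply() carries no information, which is why the real code does not assign it; the useful outcome lives in the scale object.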

The 'non-positional' part means that the x and y aesthetics (and related ones such as xmin, yend) don't participate, as they need special treatment and are trained much earlier in the plot building process.

Hippocampus answered 18/7, 2024 at 7:40 Comment(0)
