What does "training" the data mean in the internals of ggplot2?

# Train and map non-position scales and guides npscales <- scales$non_position_scales() if (npscales$n() > 0) { lapply(data, npscales$train_df) plot$guides <- plot$guides$build(npscales, plot$layers, plot$labels, data) data <- lapply(data, npscales$map_df) } else { # Only keep custom guides if there are no non-position scales plot$guides <- plot$guides$get_custom() }

In ggplot2 terms, 'training' means keeping track of possible values. For continuous variables, this means keeping track of the range and for discrete variables, that means keeping track of the levels. 'Keeping track' here means to go over every layer's data and update the possible values based on the values encountered in the data.

Under the hood, this is all orchestrated by {scale's} DiscreteRange and ContinuousRange classes. See below for examples how these are updated.

# At first, tracked variable is empty
range <- scales::DiscreteRange$new()
range$range
#> NULL

# Observe data in first layer
range$train(c("A", "X"))
range$range
#> [1] "A" "X"

# Observe data in second layer
range$train(c("B"))
range$range
#> [1] "A" "B" "X"

For continuous ranges.

# Again empty at first
range <- scales::ContinuousRange$new()
range$range
#> NULL

# Observe data in first layer
range$train(c(0, 10))
range$range
#> [1]  0 10

# Observe data in second layer
range$train(c(100))
range$range
#> [1]   0 100

^{Created on 2024-07-18 with reprex v2.1.1}

In the code you present, lapply(data, npscales$train_df) is doing this job. The train_df method is called for the side effect of updating the scale's ranges and returns NULL as it does not alter the data itself and the function result is not needed.

The 'non-positional' part means that the x and y aesthetics (and related ones such as xmin, yend) don't participate as they need special treatment and be trained much earlier in the plot building process.

Recommended topics

Hot tags