To address the slow performance of raster::calc
for calculating the standard deviation of large rasters, I wrote a gist (here) that does it using GDAL's gdal_calc.py.
To calculate the sd of 5 rasters with 4 million cells (2000 x 2000), gdal_calc takes about 2 seconds, compared to about 60 seconds for raster::calc
(see example below).
You'll need gdal_calc.py on the system path so that it can be found by Sys.which
, or else modify gdal_calc <- Sys.which('gdal_calc.py')
to specify the full path to gdal_calc.py.
Note that, from memory, this function supports a maximum of 26 input rasters, since gdal_calc refers to each by a letter of the alphabet. Perhaps this constraint can be circumvented - I'm not familiar enough with gdalnumeric syntax to work it out.
E.g.:
First, let's create a stack of five random, 2000 x 2000 rasters. We'll use raster::calc
on this stack, but we'll also write them out to temp files, because gdal_calc needs file paths:
s <- stack(replicate(5, raster(matrix(runif(4000000), 2000))))
ff <- replicate(5, tempfile(fileext='.tif')) # a vector of temp file paths
writeRaster(s, ff, bylayer=TRUE)
Here's the raster::calc
version:
system.time(sd_calc <- calc(s, sd))
## user system elapsed
## 79.83 0.08 80.00
and the gdal_sd
(i.e., using gdal_calc.py) version:
devtools::source_gist('61c8062938e05a4c6b92') # source the gdal_sd gist
system.time(gdal_sd(infile=ff, outfile=f <- tempfile(fileext='.tif')))
## Calculating standard deviation and writing to file17b0146640ac.tif
## user system elapsed
## 0.00 0.03 2.05
And a comparison of their values:
range(sd_calc[] - raster(f)[])
## [1] -1.110223e-16 1.110223e-16
Note that raster::calc
is likely to be faster for small rasters, but clearly gdal_sd
is a huge improvement for large rasters.
Documentation for the standard deviation function used by gdal_calc is given here. I've specified ddof=1
(i.e. delta degrees of freedom) such that results are consistent with those returned by R's sd
.
For posterity, here is the source of gdal_sd
as it was at the time of posting:
gdal_sd <- function(infile, outfile, quiet=TRUE) {
require(rgdal)
# infile: The multiband raster file (or a vector of paths to multiple raster
# files) for which to calculate cell standard deviations.
# outfile: Path to raster output file.
# quiet: Logical. Should gdal_calc.py output be silenced?
gdal_calc <- Sys.which('gdal_calc.py')
if(gdal_calc=='') stop('gdal_calc.py not found on system.')
if(file.exists(outfile)) stop('outfile already exists.')
nbands <- sapply(infile, function(x) nrow(attr(GDALinfo(x), 'df')))
if(length(infile) > 26 || nbands > 26) stop('Maximum number of inputs is 26.')
if(length(nbands) > 1 & any(nbands > 1))
warning('One or more rasters have multiple bands. First band used.')
if(length(infile)==1) {
inputs <- paste0('-', LETTERS[seq_len(nbands)], ' ', infile, ' --',
LETTERS[seq_len(nbands)], '_band ', seq_len(nbands), collapse=' ')
n <- nbands
} else {
inputs <- paste0('-', LETTERS[seq_along(nbands)], ' ', infile, ' --',
LETTERS[seq_along(nbands)], '_band 1', collapse=' ')
n <- length(infile)
}
message('Calculating standard deviation and writing to ', basename(outfile))
cmd <- 'python %s %s --outfile=%s --calc="std([%s], 0, ddof=1)"'
out <- system(
sprintf(cmd, gdal_calc, inputs, outfile,
paste0(LETTERS[seq_len(n)], collapse=',')),
show.output.on.console=!quiet, intern=TRUE
)
if(any(grepl('Error', out))) stop(out, call.=FALSE) else NULL
}
calc(qq2stack, fun=sd)
– Eran