Working with lots of data and lots of rasters in R?

G'day, I am working with a large dataset of ~125,000 lon/lat locations, each with a date, for species presence/absence records. For each location I want to work out what the weather was like on the date of the record and during the 3 months prior to it. To do this I have downloaded daily weather data for a given weather variable (e.g., max temperature) covering the 5-year period over which the records were collected. I have a total of 1,826 raster files, all between 2 and 3 MB.

I had planned to stack all the raster files, then extract a value from every raster (all 1,826) for each point. This would produce a massive file I could then use to look up the dates I need. It turns out, however, that I can't stack that many rasters. I tried splitting the rasters into stacks of 500; this works, but the files it produces are about 1 GB each and very slow to work with (125,000 rows by 500 columns). And when I try to bring all of these files back into R to create one big data frame, it doesn't work.

I would like to know if there is a way to work with this amount of data in R, or if there is a package that could help. Could I use a package like ff? Does anyone have suggestions for a less resource-intensive way to do what I want? I have thought about something like lapply, but have never used it before and am not really sure where to begin.

Any help would be really great, thanks in advance for your time. The code I am currently using without success is below.

Kind regards, Adam

library(raster)
library(rgdal)
library(maptools)
library(shapefiles)

# To create weather data files, first set the working directory to the appropriate location (i.e., maxt)
# list of raster weather files
files<- list.files(getwd(), pattern='asc')
length(files)

memory.size(4000)  
memory.limit(4000)

# read in lon/lat data
X<-read.table(file.choose(), header=TRUE, sep=',')
SP<- SpatialPoints(cbind(X$lon, X$lat)) 

#separate stacks into mannageable sizes
s1<- stack(files[1:500])
i1 <- extract( s1,SP, cellnumbers = True, layer = 1, nl = 500)
write.table(i1, file="maxt_vals_all_points_all_dates_1.csv", sep=",", row.names= FALSE, col.names= TRUE)
rm(s1,i1)
s2<- stack(files[501:1000])
i2 <- extract( s2,SP, cellnumbers = True, layer = 1, nl = 500)
write.table(i2, file="maxt_vals_all_points_all_dates_2.csv", sep=",", row.names= FALSE, col.names= TRUE)
rm(s2,i2)
s3<- stack(files[1001:1500])
i3 <- extract( s3,SP, cellnumbers = True, layer = 1, nl = 500)
write.table(i3, file="maxt_vals_all_points_all_dates_3.csv", sep=",", row.names= FALSE, col.names= TRUE)
rm(s3,i3)
s4<- stack(files[1501:1826])
i4 <- extract( s4,SP, cellnumbers = True, layer = 1, nl =325)
write.table(i4, file="maxt_vals_all_points_all_dates_4.csv", sep=",", row.names= FALSE, col.names= TRUE)
rm(s4,i4)

# read files back in to bind into final file !!! NOT WORKING FILES ARE TOO BIG!!
i1<-read.table(file.choose(),header=TRUE,sep=',')
i2<-read.table(file.choose(),header=TRUE,sep=',')
i3<-read.table(file.choose(),header=TRUE,sep=',')
i4<-read.table(file.choose(),header=TRUE,sep=',')

vals<-data.frame(X, i1, i2, i3 ,i4)
write.table(vals, file="maxt_master_lookup.csv", sep=",", row.names= FALSE, col.names= TRUE)
Arcboutant asked 7/4, 2012 at 0:14 Comment(6)
I can't recall the sheer size of data I've done this with, but I have had luck bringing a ton of rasters into a named list. You may have figured out that extract will probably be your time bottleneck, so I would try to limit its use as much as possible. I have experimented with using the tapply/by/ddply family of functions to split a large data frame into groups, then using extract on each group with the appropriate file (which, in your case, would be some sort of date grouping), then reassembling, but I have not had a ton of success with this.Camera
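A rough sketch of the date-grouping idea from the comment above. Everything in it is illustrative rather than taken from the question: the point table is assumed to have lon, lat and date columns, the raster filenames are assumed to encode the date (e.g. maxt_2008-01-31.asc), and a fixed 90-day window stands in for "3 months prior".

# illustrative only: group the points by date, then extract from just the
# rasters inside each group's 90-day window instead of all 1,826 at once
library(raster)

X <- read.csv("points.csv")                  # assumed columns: lon, lat, date
X$date <- as.Date(X$date)
pts_by_date <- split(X, X$date)

files <- list.files(pattern = "\\.asc$")     # assumed names like maxt_2008-01-31.asc
file_dates <- as.Date(gsub("maxt_|\\.asc", "", files))

vals_by_date <- lapply(names(pts_by_date), function(d) {
  grp  <- pts_by_date[[d]]
  keep <- file_dates >= as.Date(d) - 90 & file_dates <= as.Date(d)
  s    <- stack(files[keep])                 # ~90 layers rather than 1,826
  data.frame(grp, extract(s, cbind(grp$lon, grp$lat)))
})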
It would be simple enough to read those files into an ff array, then set up the extraction for points from those. Can you provide a link to one or two files to work with, or produce usable data with code? Also, you use file.choose() for reading the files back in, but why not the names you used to write them?Subaltern
But, why read them all at once anyway? Why not just do the extract one raster file at a time? And, if the final result is too big to pre-initialize as a single object just append to a file as you go.Subaltern
Hmm, also there is no such thing as True defined here; reproducible and tested code, please.Subaltern
Hi @mdsummer, the answer to most of those questions is that I simply don't know how yet. I don't know enough about R and coding to make things simple. A link to a Dropbox folder with some of the weather data files is here: link.Arcboutant
@mdsummer, also in the same link are the points that I am working with. All 125,000.Arcboutant

I would do the extract one raster file at a time, and append the results to file as you go.

I cheat here by making a list of matrices, but since raster() can take a filename or a matrix (amongst other things), and "[[" indexing works on a character vector of filenames just as it does on a list, it should work pretty much the same in your case.

files <- list(volcano, volcano * 2, volcano * 3)
library(sp)
SP <- SpatialPoints(structure(c(0.455921585146703, 0.237608166502031, 0.397704673508124, 0.678393354622703, 0.342820219769366, 0.554888036966903, 0.777351335399613, 0.654684656824567), .Dim = c(4L, 2L)))

library(raster)
for (i in seq_len(length(files))) {

    r <- raster(files[[i]])
    e <- extract(r, SP)
    ## print(e)  ## print for debugging
    ## write the header on the first pass, then append one block of rows per raster
    write.table(data.frame(file = i, extract = e), "cellSummary.csv",
                col.names = i == 1, append = i > 1, sep = ",", row.names = FALSE)
}
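For what it's worth, a minimal adaptation of that loop to the question's setup. It assumes SP is the SpatialPoints object built from the point table and that each .asc file reads as a single-band grid with raster(); the cell number column is included because it comes up in the comments below. The output file name is made up for illustration.

# sketch: the same one-file-at-a-time loop, but over the real .asc files and
# recording the file name and cell number alongside the extracted value
library(raster)

files <- list.files(getwd(), pattern = "\\.asc$", full.names = TRUE)

for (i in seq_along(files)) {
  r   <- raster(files[i])          # assumes each .asc is a readable single-band grid
  out <- data.frame(file  = basename(files[i]),
                    cell  = cellFromXY(r, SP),
                    value = extract(r, SP))
  write.table(out, "maxt_cellSummary.csv", sep = ",",
              col.names = i == 1, append = i > 1, row.names = FALSE)
}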
Subaltern answered 7/4, 2012 at 7:1 Comment(6)
Hi @mdsummer, thanks for this answer. I am pretty sure it will work perfectly. At the moment it is creating a data.frame with only two columns, meaning after a few extractions it runs out of rows to write into. I am trying to figure out how to change your code to get cell numbers first, then the extracted info in each succeeding column. To me your code looks like it should do that; I'm not sure what's wrong. Thanks again for your help, much appreciated.Arcboutant
PS. this is the error I get. Error in .rasterObjectFromFile(x, band = band, objecttype = "RasterLayer", : Cannot create a RasterLayer object from this file.Arcboutant
We cannot help if you don't provide the file or details about it. This is just one of your files not being what you expect I imagine.Subaltern
Hi @mdsummer, I have linked some files to a Dropbox folder in a comment under my original question. Is there an easier way to share these files? I have used these files with the code you have suggested. Instead of writing each raster's extracted data in a separate column, all extracted data are written in the same column until it reaches a huge number of rows, at about 1,025,000. I think that your suggestion is the best way to do the extraction, but I would need each raster's data in a new column, i.e., 125,000 rows (one for each presence/absence record) x 1,826 columns (one for each raster). Thanks lotsArcboutant
@mdsummer here are the links, link, link, link, link, linkArcboutant
You just want presence/absence for every cell? It's the same for every raster file. If you want the values, just create the big matrix upfront and populate each column with extract. If it's too big for memory, use ff or bigmemory.Subaltern
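A minimal sketch of the "big matrix upfront" suggestion in the last comment, using ff so the 125,000 x 1,826 matrix of values lives on disk rather than in RAM. The filename pattern and output layout are assumptions, and bigmemory::big.matrix could be swapped in the same way.

# sketch: one on-disk column per raster, filled by extracting one raster at a time
library(raster)
library(ff)

files <- list.files(pattern = "\\.asc$")
vals  <- ff(vmode = "double", dim = c(length(SP), length(files)))  # rows = points, cols = rasters

for (i in seq_along(files)) {
  vals[, i] <- extract(raster(files[i]), SP)
}
# e.g. the column for one date can later be pulled out by matching its filename:
# vals[, match("maxt_2008-01-31.asc", files)]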

I am using parallel processing and a form of cropping based on cell number. This function will take any spatial points or polygons and return the values from a large raster stack. It is a variant on a good example of code for large polygons.

For my data this takes about 350 seconds using a plain extract, versus 32 seconds with this function on a 16-core Linux server. Hope it helps someone!

# Define function
extract_value_point_polygon <- function(point_or_polygon, raster_stack, num_workers){
    # Returns a list with one data frame of raster-stack values per point/polygon
    lapply(c('raster', 'foreach', 'doParallel'), require, character.only = TRUE)
    registerDoParallel(num_workers)
    ply_result <- foreach(j = 1:length(point_or_polygon), .inorder = TRUE) %do% {
        print(paste('Working on feature:', j, 'out of', length(point_or_polygon)))
        get_class <- class(point_or_polygon)[1]
        # find the cell numbers covered by this feature
        if(get_class == 'SpatialPolygons' | get_class == 'SpatialPolygonsDataFrame'){
            cell <- as.numeric(na.omit(cellFromPolygon(raster_stack, point_or_polygon[j], weights = FALSE)[[1]]))}
        if(get_class == 'SpatialPointsDataFrame' | get_class == 'SpatialPoints'){
            cell <- as.numeric(na.omit(cellFromXY(raster_stack, point_or_polygon[j, ])))}
        if(length(cell) == 0) return(NA)
        # build a small raster covering only those cells, then crop every layer to it in parallel
        r <- rasterFromCells(raster_stack, cell, values = FALSE)
        result <- foreach(i = 1:dim(raster_stack)[3], .packages = 'raster', .inorder = TRUE) %dopar% {
            crop(raster_stack[[i]], r)
        }
        result <- as.data.frame(getValues(stack(result)))
        return(result)
    }
    # registerDoParallel() is paired with stopImplicitCluster(); endCluster() belongs to raster::beginCluster()
    stopImplicitCluster()
    return(ply_result)
}
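A hypothetical call, just to show the expected shapes: wstack stands in for a RasterStack built from the daily weather files (assuming the full stack can be built from the filenames) and SP for the SpatialPoints from the question. The result is a list with one single-row data frame of layer values per point.

# illustrative usage only; names and worker count are made up
library(raster)
wstack    <- stack(list.files(pattern = "\\.asc$"))
vals_list <- extract_value_point_polygon(SP, wstack, num_workers = 8)
vals_list[[1]]   # values of every layer at the first point

Note that the outer loop over features runs serially (%do%); only the per-layer cropping is parallel, so with 125,000 single points it may be worth batching points into groups rather than looping feature by feature.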
Burdelle answered 11/4, 2016 at 23:44 Comment(0)
