Efficient way to plot data on an irregular grid

Asked 23/2, 2018 at 16:36 Answered 29/6, 2018 at 22:5

I work with satellite data organized on an irregular two-dimensional grid whose dimensions are scanline (along track dimension) and ground pixel (across track dimension). Latitude and longitude information for each centre pixel are stored in auxiliary coordinate variables, as well as the four corners coordinate pairs (latitude and longitude coordinates are given on the WGS84 reference ellipsoid). The data is stored in netCDF4 files.

What I am trying to do is efficiently plotting these files (and possibly a combination of files—next step!) on a projected map.

My approach so far, inspired by Jeremy Voisey's answer to this question, has been to build a data frame that links my variable of interest to the pixel boundaries, and to use ggplot2 with geom_polygon for the actual plot.

Let me illustrate my workflow, and apologies in advance for the naive approach: I just started coding with R since a week or two.

Note

To fully reproduce the problem:
1. download the two dataframes: so2df.Rda (22M) and pixel_corners.Rda (26M)
2. load them in your environment, e.g.

so2df <- readRDS(file="so2df.Rda")
pixel_corners <- readRDS(file="pixel_corners.Rda")

jump to the "Merge the dataframes" step.

Initial setup

I'm going to read the data, and the latitude/longitude boundaries from my file.

library(ncdf4)
library(ggplot2)
library(ggmap) 
# set path and filename
ncpath <- "/Users/stefano/src/s5p/products/e1dataset/L2__SO2/"
ncname <- "S5P_OFFL_L2__SO2____20171128T234133_20171129T003956_00661_01_022943_00000000T000000"  
ncfname <- paste(ncpath, ncname, ".nc", sep="")
nc <- nc_open(ncfname)

# save fill value and multiplication factors
mfactor = ncatt_get(nc, "PRODUCT/sulfurdioxide_total_vertical_column", 
                    "multiplication_factor_to_convert_to_DU")
fillvalue = ncatt_get(nc, "PRODUCT/sulfurdioxide_total_vertical_column", 
                      "_FillValue")

# read the SO2 total column variable
so2tc <- ncvar_get(nc, "PRODUCT/sulfurdioxide_total_vertical_column")

# read lat/lon of centre pixels
lat <- ncvar_get(nc, "PRODUCT/latitude")
lon <- ncvar_get(nc, "PRODUCT/longitude")

# read latitude and longitude bounds
lat_bounds <- ncvar_get(nc, "GEOLOCATIONS/latitude_bounds")
lon_bounds <- ncvar_get(nc, "GEOLOCATIONS/longitude_bounds")

# close the file
nc_close(nc)
dim(so2tc)
## [1]  450 3244

So for this file/pass there are 450 ground pixels for each of the 3244 scanlines.

Create the dataframes

Here I create two dataframes, one for the values, with some post-processing, and one for the latitude/longitude boundaries, I then merge the two dataframes.

so2df <- data.frame(lat=as.vector(lat), lon=as.vector(lon), so2tc=as.vector(so2tc))
# add id for each pixel
so2df$id <- row.names(so2df)
# convert to DU
so2df$so2tc <- so2df$so2tc*as.numeric(mfactor$value)
# replace fill values with NA
so2df$so2tc[so2df$so2tc == fillvalue] <- NA
saveRDS(so2df, file="so2df.Rda")
summary(so2df)

##       lat              lon              so2tc              id           
##  Min.   :-89.97   Min.   :-180.00   Min.   :-821.33   Length:1459800    
##  1st Qu.:-62.29   1st Qu.:-163.30   1st Qu.:  -0.48   Class :character  
##  Median :-19.86   Median :-150.46   Median :  -0.08   Mode  :character  
##  Mean   :-13.87   Mean   : -90.72   Mean   :  -1.43                     
##  3rd Qu.: 31.26   3rd Qu.: -27.06   3rd Qu.:   0.26                     
##  Max.   : 83.37   Max.   : 180.00   Max.   :3015.55                     
##                                     NA's   :200864

I saved this dataframe as so2df.Rda here (22M).

num_points = dim(lat_bounds)[1]
pixel_corners <- data.frame(lat_bounds=as.vector(lat_bounds), lon_bounds=as.vector(lon_bounds))
# create id column by replicating pixel's id for each of the 4 corner points
pixel_corners$id <- rep(so2df$id, each=num_points)
saveRDS(pixel_corners, file="pixel_corners.Rda")
summary(pixel_corners)


##    lat_bounds       lon_bounds           id           
##  Min.   :-89.96   Min.   :-180.00   Length:5839200    
##  1st Qu.:-62.29   1st Qu.:-163.30   Class :character  
##  Median :-19.86   Median :-150.46   Mode  :character  
##  Mean   :-13.87   Mean   : -90.72                     
##  3rd Qu.: 31.26   3rd Qu.: -27.06                     
##  Max.   : 83.40   Max.   : 180.00

As expected, the lat/lon boundaries dataframe is four time as big as the value dataframe (four points for each pixel/value).
I saved this dataframe as pixel_corners.Rda here (26M).

Merge the dataframes

I then merge the two data frames by id:

start_time <- Sys.time()
so2df <- merge(pixel_corners, so2df, by=c("id"))
time_taken <- Sys.time() - start_time
print(paste(dim(so2df)[1], "rows merged in", time_taken, "seconds"))

## [1] "5839200 rows merged in 42.4763631820679 seconds"

As you can see, it's a somehow CPU intensive process. I wonder what would happen if I were to work with 15 files at once (global coverage).

Plotting the data

Now that I've got my pixels corners linked to the pixel value, I can easily plot them. Usually, I'm interested in a particular region of the orbit, so I made a function that subsets the input dataframe before plotting it:

PlotRegion <- function(so2df, latlon, title) {
  # Plot the given dataset over a geographic region.
  #
  # Args:
  #   df: The dataset, should include the no2tc, lat, lon columns
  #   latlon: A vector of four values identifying the botton-left and top-right corners 
  #           c(latmin, latmax, lonmin, lonmax)
  #   title: The plot title

  # subset the data frame first
  df_sub <- subset(so2df, lat>latlon[1] & lat<latlon[2] & lon>latlon[3] & lon<latlon[4])

  subtitle = paste("#Pixel =", dim(df_sub)[1], "- Data min =", 
                   formatC(min(df_sub$so2tc, na.rm=T), format="e", digits=2), "max =", 
                   formatC(max(df_sub$so2tc, na.rm=T), format="e", digits=2))

  ggplot(df_sub) + 
    geom_polygon(aes(y=lat_bounds, x=lon_bounds, fill=so2tc, group=id), alpha=0.8) +
    borders('world', xlim=range(df_sub$lon), ylim=range(df_sub$lat), 
            colour='gray20', size=.2) + 
    theme_light() + 
    theme(panel.ontop=TRUE, panel.background=element_blank()) +
    scale_fill_distiller(palette='Spectral') +
    coord_quickmap(xlim=c(latlon[3], latlon[4]), ylim=c(latlon[1], latlon[2])) +
    labs(title=title, subtitle=subtitle, 
         x="Longitude", y="Latitude", 
         fill=expression(DU)) 
}

I then invoke my function over a region of interest, for instance let's see what happens over the Hawaii:

latlon = c(17.5, 22.5, -160, -154)
PlotRegion(so2df, latlon, expression(SO[2]~total~vertical~column))

There they are, my pixels, and what appears to be a SO2 plume from the Mauna Loa. Please ignore the negative values for now. As you can see, the pixels' area vary towards the edge of the swath (different binning scheme).

I tried to show the same plot over google's maps, using ggmap:

PlotRegionMap <- function(so2df, latlon, title) {
  # Plot the given dataset over a geographic region.
  #
  # Args:
  #   df: The dataset, should include the no2tc, lat, lon columns
  #   latlon: A vector of four values identifying the botton-left and top-right corners 
  #           c(latmin, latmax, lonmin, lonmax)
  #   title: The plot title

  # subset the data frame first
  df_sub <- subset(so2df, lat>latlon[1] & lat<latlon[2] & lon>latlon[3] & lon<latlon[4])

  subtitle = paste("#Pixel =", dim(df_sub)[1], "Data min =", formatC(min(df_sub$so2tc, na.rm=T), format="e", digits=2), 
                   "max =", formatC(max(df_sub$so2tc, na.rm=T), format="e", digits=2))
  base_map <- get_map(location = c(lon = (latlon[4]+latlon[3])/2, lat = (latlon[1]+latlon[2])/2), zoom = 7, maptype="terrain", color="bw")

  ggmap(base_map, extent = "normal")  +
    geom_polygon(data=df_sub, aes(y=lat_bounds, x=lon_bounds,fill=so2tc, group=id),  alpha=0.5) +
    theme_light() + 
    theme(panel.ontop=TRUE, panel.background=element_blank()) +
    scale_fill_distiller(palette='Spectral') +
    coord_quickmap(xlim=c(latlon[3], latlon[4]), ylim=c(latlon[1], latlon[2])) +
    labs(title=title, subtitle=subtitle, 
         x="Longitude", y="Latitude", 
         fill=expression(DU)) 

}

And this is what I get:

latlon = c(17.5, 22.5, -160, -154)
PlotRegionMap(so2df, latlon, expression(SO[2]~total~vertical~column))

Questions

Is there a more efficient way to approach this problem? I was reading about the sf package, and I was wondering if I could define a dataframe of points (values + centre pixel coordinates), and have sf automatically infer the pixel boundaries. That would save me from having to rely on the lat/lon boundaries defined in my original dataset and having to merge them with my values. I could accept a loss of precision on the transition areas towards the edge of the swath, the grid is otherwise pretty much regular, each pixel being 3.5x7 km^2 big.
Would re-gridding my data to a regular grid (how?), possibly by aggregating adjacent pixels, improve performances? I was thinking about using the raster package, which, as I understood, requires data on a regular grid. This should be useful going global scale (e.g. plots over Europe), where I don't need the to plot the individual pixels–in fact, I wouldn't even been able to see them.
Do I need to re-project my data when plotting over google map?

[bonus cosmetic questions]

Is there a more elegant way to subset my dataframe on a region identified by its four corner points?
How could I change the color scale to make the higher values stand out with respect to the lower values? I've experienced with log scale with poor results.

Electrochemistry answered 23/2, 2018 at 16:36 Comment(12)

This is an interesting looking problem, but right now it's a bit much - you may have more luck getting answers if you ask multiple distinct questions instead of five questions in one. Also, please provide data to make your problem reproducible (use dput(), or if your dataset is too large, reproduce your problem using a simulated data or a dataset built into one of the packages you are using) so others can run your code. – Cantrip 23/2, 2018 at 16:46

Hi Jan, thanks for your reply, I saved the two data frames on the cloud, the links are in the questions, so it should be possible to fully reproduce the workflow by loading them first and start from the merge step. As for the questions, maybe the ones I really care about right now are 1. and 2. Thanks! – Electrochemistry 23/2, 2018 at 16:56

People are generally pretty unwilling to follow links to large offsite data files that will take time to download and may or may not contain viruses. – Cantrip 23/2, 2018 at 17:1

I edited my question to make it more clear how to reproduce the workflow. As for the virus, what is the standard way to attach data files to questions if not by uploading them to some kind of file sharing service? – Electrochemistry 23/2, 2018 at 17:11

In general as well as making examples reproducible, it is good to make them minimal. This often means taking only a subset of the data that is large enough to capture all issues with the data using head() and then dput(). I'll try and work through this when I get some more time though. – Theatricals 23/2, 2018 at 23:24

I agree with the other commenters. I recommend splitting this into several independent questions, each one containing a small example data set with a clearly defined single question and required outcome. – Yalonda 24/2, 2018 at 0:7

Agreed with splitting the original question. Though I'd like to retain the original data set, to allow people to demonstrate how they efficiently deal with perhaps millions of points with sf, subsetting them and plotting them on a map. Bear in mind that the original netCDF file is over 600MB so, reducing it to two data frame summing up to 48MB is already quite a reduction in size. That said, I presume I could reduce the data even further, by preliminary cropping the orbit to a much smaller region. I'll do that in my follow-up questions. Thanks all for the constructive feedback! – Electrochemistry 24/2, 2018 at 8:28

@Electrochemistry : Depending on how you split now your questions reducing more your data might be useful or not. Make it as simple as possible. You can propose a really minimal example and then check afterwards if the proposed solutions scale correctly to your real data – Skirl 25/2, 2018 at 14:56

@Electrochemistry (continued) : Some comments on your questions here : if you want to increase computing efficiency (merging, subsetting) on data.frames, you might look at the packages data.table and dplyr. If you can have a grid of equally sized pixels the raster package seems indeed the way to go... It is particularly well suited to handle huge datasets (the whole data is not loaded in the RAM in contrast with the usual way in R). The stars package might also be useful in the future (now the development seem to be in its infancy) – Skirl 25/2, 2018 at 14:59

With regard to your bonus question about the color: make all pixels below a certain threshold (e.g. 1 DU) transparent. If you do a google image search on TROPOMI SO2 you'll see some nice examples. For instance this one – Zooplankton 27/2, 2018 at 9:3

Hi, could you provide information concerning the characteristics of the "original" data. What kind of sensor/product is it ? Asking because knowing the characteristics of the "source" may help in understanding if there are better way to deal with it. – Schnorkle 28/2, 2018 at 13:17

It's SO2 Total Column measurement derived from data acquired by TROPOMI, an instrument on board of the ESA satellite Sentinel-5P. I'm currently working on new data acquired over Mount Sinabung during the recent eruption and will re-structure my question with extracts from this new data set, to make things more interesting. I believe I already solved the issue about the color scale applying a SQRT transform, and playing with varying alpha levels for different signal strengths. Here's a preview. – Electrochemistry 28/2, 2018 at 14:42

I think data.table might be helpful here. The merging is almost instantly.

"5839200 rows merged in 1.24507117271423 seconds"

library(data.table)
pixel_cornersDT <- as.data.table(pixel_corners)
so2dfDT <- as.data.table(so2df)

setkey(pixel_cornersDT, id)
setkey(so2dfDT, id)

so2dfDT <- merge(pixel_cornersDT, so2dfDT, by=c("id"), all = TRUE)

Having the data in a data.table, the subsetting in the plotting function is also gonna be considerably faster.

Questions 1 / 2 / 4:

I dont think the process would be faster with raster or sf but you can experiment with the functions rasterFromXYZ() or st_make_grid(). But most of the time will be spent on the conversion to raster/sf objects, as you would have to transform the whole dataset.

I would suggest doing all the data processing with data.table including the cropping, and from there you could switch to raster/sf objects for plotting purposes.

Question 3:

The google plot shows correctly, but you have specified a black/white map, and its overlayed with the "raster", so you won't see a lot. You could change the basemap to a satellite-background.

base_map <- get_map(location = c(lon = (latlon[4]+latlon[3])/2, lat = (latlon[1]+latlon[2])/2), 
                    zoom = 7, maptype="satellite")

Question 5:

You could use the rescale function from the scales package. I included two options below. The first (uncommented) takes the quantiles as breaks and the other breaks are individually-defined. I wouldn't use the log transformation (trans- argument) as you would create NA values, since you have negative values aswell.

ggplot(df_sub) + 
  geom_polygon(aes(y=lat_bounds, x=lon_bounds, fill=so2tc, group=id), alpha=0.8) +
  borders('world', xlim=range(df_sub$lon), ylim=range(df_sub$lat),
          colour='gray20', size=.2) +
  theme_light() + 
  theme(panel.ontop=TRUE, panel.background=element_blank()) +
  # scale_fill_distiller(palette='Spectral', type="seq", trans = "log2") +
  scale_fill_distiller(palette = "Spectral",
                       # values = scales::rescale(quantile(df_sub$so2tc), c(0,1))) +
                       values = scales::rescale(c(-3,0,1,5), c(0,1))) +
  coord_quickmap(xlim=c(latlon[3], latlon[4]), ylim=c(latlon[1], latlon[2])) +
  labs(title=title, subtitle=subtitle, 
       x="Longitude", y="Latitude", 
       fill=expression(DU))

The whole process takes now about 8 seconds for me, including the plotting without background-map, although the map rendering will also take additional 1-2 seconds.

Vilipend answered 29/6, 2018 at 22:5 Comment(2)

Thanks for taking the time to prepare such an insightful answer! – Electrochemistry 2/7, 2018 at 7:45

You're welcome, I hope it is still relevant for you! – Vilipend 2/7, 2018 at 8:27

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++