The CRAN policy limits R package size to 5 Mb, which is little for graphical applications such as mapping. There are multiple ways of handling the package size limitations, all of which come with their drawbacks. The alternatives have been listed below.
My question is: how to make an R package download data files only once (i.e. they are saved to a place where R finds them after restarting)? The solution should work for all common CRAN platforms.
I have been developing a mapping package for R which is supposed to plot bathymetric maps anywhere around the globe in ggplot2. I list alternatives to handle large data files in CRAN packages I have come across. The alternatives are written map-making in mind but apply for any case where large, single files are required:
Moving large files to a data package and making the original package depend on the data package.
- a) If the data package is <5 Mb, it can be uploaded to CRAN, and one can make the original depend or import the data package in the DESCRIPTION field. User can simply use the
install.packages()
function as they would with any other CRAN package. Things work CRANtastic and everyone is happy. - b) If the data package is >5 Mb, things get messy. One alternative, in theory, would be to make a separate data package for each file given that the data files are all <5 Mb. Then one could use the approach in 1a for each data package. This alternative is so hacky that I have not had the nerves to try it in practice. It would be interesting to hear in the comments if someone has.
- c) Another and better alternative is to use the drat package to make a data package, for example, to GitHub. This alternative has the benefit that the user can write
install.packages()
to install the original package from CRAN but also has quite a few disadvantages for the developer. Setting up the data package to pass all CRAN checks can be slightly challenging as all the steps have not been correctly specified anywhere online at the moment: the original package has to ask for permission to install the data package; the data package has to be distributed as separate binaries for the current development version of R at least for Windows and Mac, but possibly also for Fedora in the drat repository; the data package should be listed asSuggests:
with an URL underAdditional_repositories:
in the DESCRIPTION file; to mention some surprises I have encountered so far. All in all, this alternative is great for the user but requires maintenance from the developer.
- a) If the data package is <5 Mb, it can be uploaded to CRAN, and one can make the original depend or import the data package in the DESCRIPTION field. User can simply use the
Some mapping packages (such as marmap) download data to temporary files from external servers. This approach has the benefit that CRAN requirements are easy to fulfill, and the user does not have to store any more data than required for the application. The approach also allows specifying the resolution in the download function, which is great for "zooming" the maps. The disadvantages are that the process is bound to take more time than simply storing the map data locally. Another disadvantage is that the map data need to be distributed in raster format (or the server has to crop vectors). At the time of writing, vector data allow easier manipulation of colors and styles in R and ggplot2 than raster data. Vectors also make sharper figures as the elements are not bound to resolution. The third disadvantage is that the download method (to my knowledge) has to be targetted to temporary files (i.e. they get lost when R is restarted) when writing a CRAN package due to operating system differences. As far as I know, it is not allowed to add Rdata files to already downloaded and existing R packages, and finding a location to download data that works for all major CRAN operating systems can be difficult.
I keep on getting rejected by CRAN time after time because I have not managed to solve the data download problem. There is some help available online but I feel this issue has not been addressed sufficiently yet. The optimal solution would download sp vector shapefiles as needed when making maps (the objects can be stored in .Rdata format). This would allow the addition of detailed maps for certain frequently needed regions. The shapefiles could be stored on GitHub, which would allow quick and flexible modification of these files during development.
tar.gz
is < 5MB. – Garnetgarnett