How to make a CRAN package to download data only once regardless of OS?
P

5

17

The CRAN policy limits R package size to 5 MB, which is small for graphical applications such as mapping. There are multiple ways of handling the package size limitation, all of which come with drawbacks. The alternatives are listed below.

My question is: how to make an R package download data files only once (i.e. they are saved to a place where R finds them after restarting)? The solution should work for all common CRAN platforms.

I have been developing a mapping package for R which is supposed to plot bathymetric maps anywhere around the globe in ggplot2. Below I list the alternatives I have come across for handling large data files in CRAN packages. The alternatives are written with map-making in mind but apply to any case where large, single files are required:

  1. Moving large files to a data package and making the original package depend on the data package.

    • a) If the data package is <5 MB, it can be uploaded to CRAN, and the original package can depend on or import the data package via the DESCRIPTION file. Users can simply call install.packages() as they would for any other CRAN package. Things work CRANtastic and everyone is happy.
    • b) If the data package is >5 MB, things get messy. One alternative, in theory, would be to make a separate data package for each file, provided that each data file is <5 MB. Then one could use the approach in 1a for each data package. This alternative is so hacky that I have not had the nerve to try it in practice. It would be interesting to hear in the comments if someone has.
    • c) Another, better alternative is to use the drat package to host a data package, for example on GitHub. This alternative has the benefit that the user can call install.packages() to install the original package from CRAN, but it has quite a few disadvantages for the developer. Setting up the data package to pass all CRAN checks can be slightly challenging because the steps are not fully documented anywhere online at the moment. To mention some surprises I have encountered so far: the original package has to ask for permission to install the data package; the data package has to be distributed in the drat repository as separate binaries for the current development version of R, at least for Windows and Mac but possibly also for Fedora; and the data package should be listed under Suggests:, with a URL under Additional_repositories:, in the DESCRIPTION file. All in all, this alternative is great for the user but requires maintenance from the developer.
  2. Some mapping packages (such as marmap) download data to temporary files from external servers. This approach has the benefit that CRAN requirements are easy to fulfill, and the user does not have to store any more data than required for the application. The approach also allows specifying the resolution in the download function, which is great for "zooming" the maps. One disadvantage is that the process is bound to take more time than simply storing the map data locally. Another disadvantage is that the map data need to be distributed in raster format (or the server has to crop vectors). At the time of writing, vector data allow easier manipulation of colors and styles in R and ggplot2 than raster data. Vectors also make sharper figures as the elements are not bound to a resolution. The third disadvantage is that, when writing a CRAN package, the download (to my knowledge) has to be targeted to temporary files (i.e. they get lost when R is restarted) due to operating system differences. As far as I know, it is not allowed to add .Rdata files to already installed R packages, and finding a download location that works for all major CRAN operating systems can be difficult.

I keep on getting rejected by CRAN time after time because I have not managed to solve the data download problem. There is some help available online but I feel this issue has not been addressed sufficiently yet. The optimal solution would download sp vector shapefiles as needed when making maps (the objects can be stored in .Rdata format). This would allow the addition of detailed maps for certain frequently needed regions. The shapefiles could be stored on GitHub, which would allow quick and flexible modification of these files during development.
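To make the "download only once" idea concrete, here is a minimal sketch of what I am after. It assumes R >= 4.0.0, where tools::R_user_dir() returns a per-user, per-package directory on every platform; the helper name cached_download() and the package name "yourpackage" are made up for illustration, and a real package would still have to ask the user before writing anything.

```r
# Hypothetical helper: download `url` once and reuse the local copy afterwards.
# `dir` defaults to the per-user data directory of the (placeholder) package.
cached_download <- function(url, pkg = "yourpackage",
                            dir = tools::R_user_dir(pkg, which = "data")) {
  dest <- file.path(dir, basename(url))
  if (!file.exists(dest)) {
    # First call only: create the directory and fetch the file
    dir.create(dir, recursive = TRUE, showWarnings = FALSE)
    utils::download.file(url, dest, mode = "wb", quiet = TRUE)
  }
  dest  # path to the local copy, which survives an R restart
}
```

A plotting function could then call readRDS(cached_download(some_url)) and would only hit the network the first time.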

Ponzo answered 2/9, 2020 at 10:18 Comment(8)
But a data package could exceed 5 MB, e.g. cran.r-project.org/web/packages/geomapdata/index.html, which is > 20 MB. No need to be hacky. - Kragh
@Kragh Unfortunately this seems to be old information. I can no longer find the R-devel email where this was explained, but it seems that they keep on rejecting all packages larger than 5 MB. - Ponzo
I suppose if you submit a large package without comment then it will be rejected. But did you explain your situation? - Kragh
@Marek, to no avail. I have been ignored since I got this email in April: "If you have html pages there, can't you have the data there (or elsewhere), too, and provide an R function that fetches the data on demand, e.g. in a way the user only has to download these once?" - Ponzo
I submitted a package > 20 MB and had no problem. But the tar.gz is < 5 MB. - Garnetgarnett
@StéphaneLaurent Mine is 32 MB. - Ponzo
@Mikko, I have this exact issue right now. What solution did you end up using? I am thinking about using the drat R package as you mentioned; did you take this approach, and do you have any suggestions? - Dobruja
@PeterCalhoun, I used both drat and download. Please see here: mikkovihtakari.github.io/ggOceanMaps - Ponzo
S
2

Have you tried using xz compression to reduce the size of your sysdata? I believe the default is gzip, with the compression level set to 6. If you use either bzip2 or xz compression when saving your package data with save(), R will use these compression algorithms in conjunction with a compression level of 9. The upshot is that you get smaller package data objects.
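As a rough illustration (the object x and the temporary file names are made up for the example), you can save the same object with the default gzip and with xz and compare the resulting file sizes yourself:

```r
# Toy data standing in for a large package dataset
x <- rep(seq_len(1000), 100)

f_gz <- tempfile(fileext = ".rda")
f_xz <- tempfile(fileext = ".rda")

save(x, file = f_gz)                   # default: gzip compression
save(x, file = f_xz, compress = "xz")  # xz compression

# Compare the sizes on disk (in bytes)
file.size(f_gz)
file.size(f_xz)
```

For files that already sit in the data/ directory of a package, tools::resaveRdaFiles() can re-compress them in place; how much you gain depends on the data.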

Sheri answered 2/9, 2020 at 10:36 Comment(1)
Yes, this is one of the things R CMD check picks on. See e.g. r-pkgs.had.co.nz/check.html - Ponzo
H
2

The getNOAA.bathy() function from the marmap package has a keep argument which defaults to FALSE. If set to TRUE, the dataset downloaded from the ETOPO1 database on NOAA servers is stored locally, in the working directory of the current R session. The argument Path allows the user to specify where the dataset should be saved (version 1.0.5, available on GitHub but not on CRAN yet).

When the user calls getNOAA.bathy(), the function first checks if the requested data is available locally, either in the current working directory or in the user provided path. If it is (same bounding box and resolution), then the NOAA servers are not queried and the local data file is loaded instead. If not, the data is downloaded from NOAA servers. IMHO, this method has the following advantages:

  1. if keep=FALSE: nothing is stored locally, which avoids adding too much clutter to the user's disk when loading many different test datasets.
  2. if keep=TRUE: the data is stored locally. Loading the data will be much faster the next time (and it can be done offline) since everything happens locally.
  3. In a script, the same getNOAA.bathy() function is used both to download data from NOAA servers the first time and to load local files when available. The user does not have to worry about manually saving the data, nor about altering their script to load local data the next time, since the function automatically loads the data from the most appropriate source (web server or local disk).
  4. there's no need to pack any heavy data within the package.

As far as I can tell, the only drawback is that on Windows machines, paths are limited to 260 characters by default, which might cause some trouble when generating filenames to save the data. Indeed, depending on the bounding box and resolution of the data downloaded from NOAA servers, filenames can be pretty long due to floating-point arithmetic. An easy fix is to round the coordinates of the bounding box (using round(), ceiling() or floor()) to a few decimal places before generating the name of the file to save.
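The logic described above can be sketched roughly as follows. This is not marmap's actual code: the function names, the file naming scheme and the fetch argument (a stand-in for whatever queries the server) are all hypothetical.

```r
# Hypothetical naming scheme: round the bounding box so the file name
# stays short and reproducible.
bathy_filename <- function(lon1, lon2, lat1, lat2, res, digits = 2) {
  coords <- round(c(lon1, lon2, lat1, lat2), digits)
  paste0("bathy_", paste(coords, collapse = "_"), "_res_", res, ".csv")
}

# Load the local copy when present; otherwise call `fetch()` and,
# if keep = TRUE, store the result for next time.
get_bathy <- function(lon1, lon2, lat1, lat2, res, fetch,
                      keep = FALSE, path = ".") {
  file <- file.path(path, bathy_filename(lon1, lon2, lat1, lat2, res))
  if (file.exists(file)) return(utils::read.csv(file))
  dat <- fetch(lon1, lon2, lat1, lat2, res)
  if (keep) utils::write.csv(dat, file, row.names = FALSE)
  dat
}
```

With keep = TRUE, a second call with the same bounding box and resolution reads the saved file and never calls fetch().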

Hovis answered 4/9, 2020 at 14:18 Comment(1)
Hmm...interesting. I did not look at the code carefully enough. Saving to the current working directory would download the data to every project directory if one used RStudio and did not know much about programming (the average user of my package). That is better than downloading every time one reopens R but not optimal. I wonder whether every major CRAN operating system has a desktop? Is it located at "~/Desktop" under all operating systems? - Ponzo
D
2

In general I wouldn't make it too hacky. I think there could be ways to trick the package to load additional data online during installation and add it to the package itself. Would be somehow nice - but I don't think it is popular with the CRAN maintainers.

What about the following?

  1. CRAN package for the functions
  2. Github package for your data

In the CRAN package you import devtools and with the .onLoad method you install the Github data package with devtools::install_github. (on load is called, when the package is loaded with library()/require()). You see this sometimes with package startup messages.

I could imagine the following advantages:

  • is not done during installation but at package load
  • is somehow more transparent to the user (especially if you put a message)
  • has only to be done once (afterwards on load can just check if the data package is there and loads it)
  • the data is actually in a package and not a user path
  • the data is there for offline use once loaded
  • if you check for data package version in .onLoad, you could also trigger/make an update for the data without updating the CRAN package

An implementation could look like this:

#' @import devtools

.onLoad <- function(libname, pkgname) {
  # requireNamespace() returns FALSE instead of failing if the package is missing
  if (!requireNamespace("wordcloud", quietly = TRUE)) {
    message("installing the super dupa data package")
    devtools::install_github("ifellows/wordcloud")
  } else {
    # The data package is already installed and its namespace is now loaded
    message("Everything fine, ready for usage!")
  }
}

The .onLoad function just has to be put in any of your .R files. For your concrete implementation you could refine this further. I don't have anything to do with the wordcloud package; it was just the first thing I quickly found on GitHub as an example to install with install_github. If there is an error message mentioning staged install, you have to add StagedInstall: no to your DESCRIPTION file.

Designate answered 4/9, 2020 at 17:44 Comment(7)
Thank you for your answer. What are the benefits of the devtools GitHub installation over the drat way (see the links in my post)? I guess I mention one in my post: you need to maintain current binaries for Windows and Mac every time you update the CRAN release when doing it the drat way. With devtools you would not, but you would need to introduce a dependency and compile from source. In my experience Windows machines sometimes struggle with source compilation (when there is C code, for instance) and I often get complaints when my solutions depend on that. - Ponzo
Oh, and btw, it is not allowed to install packages without asking the user first. My latest submission got rejected because of that. You should add a query prompt (the menu function, for example) in .onLoad to pass CRAN checks (fails on Fedora at least). - Ponzo
Oh, utils::menu() does not seem to work either: error: menu() cannot be used non-interactively. I have no idea what to do... - Ponzo
Maybe open another question for the menu problem, with some more info? You see these menus pretty often, so I am quite sure this should work. - Designate
Yeah, complicated issue. To be honest, at least until now, I wouldn't have a clear favorite among the solutions suggested here. Mine is of course also just a suggestion for how you could possibly tackle the problem; I have no CRAN package with the same issue. In the end you maybe just have to go with a suboptimal solution. I think there are a few legacy packages with >5 MB on CRAN, but it is somehow understandable that they insist on their rules. - Designate
Maybe a good solution would be to search for a package that has the same issue and try to just replicate its solution 1:1. (Unfortunately I have no package in mind here, but I'd guess there must be some.) - Designate
True, there is no ideal solution for this atm. Hence this question. Trying to provoke one of the brilliant minds to make a solution ;) - Ponzo
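One possible way around the menu() error mentioned in these comments is to guard the prompt with interactive(), so that non-interactive sessions (such as R CMD check) never reach menu(). This is only a sketch: ask_to_install() and its is_interactive argument are hypothetical names, and whether CRAN accepts the overall approach is not something this snippet can guarantee.

```r
# Hypothetical helper: returns TRUE only if the user explicitly agrees.
# `is_interactive` is exposed as an argument purely to make testing easier.
ask_to_install <- function(pkg, is_interactive = interactive()) {
  if (!is_interactive) {
    # Never prompt (or install) in batch mode; just tell the user what to do
    message("Package '", pkg, "' is needed; please install it in an interactive session.")
    return(FALSE)
  }
  # menu() returns the index of the chosen item (0 if the user cancels)
  utils::menu(c("Yes", "No"), title = paste0("Install '", pkg, "'?")) == 1
}
```

The install call (devtools::install_github() or similar) would then run only when ask_to_install() returns TRUE, ideally from a function the user calls rather than from .onLoad.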
S
2

You could have a function to install the data at a chosen location, and have the path stored in an option defined in your .Rprofile: options(yourpackage.datapath = "your/path"). You might suggest that the user store it in your package installation path.

The installing function first prints the code above and suggests that you copy and paste it into your .Rprofile while the data is downloading:

if (is.null(getOption("yourpackage.datapath")))
  stop('You have not defined the "yourpackage.datapath" option. Please make sure the data is installed using `yourpackage::install_yourdata()`, then copy `options(yourpackage.datapath = "your/path")` to your .Rprofile.')

You could also open it using edit(), for instance. Or place it on your clipboard, but you don't want extra dependencies and I think you'd need some to do this. I don't think CRAN will let you edit the .Rprofile automatically, but this is not too bad of a manual action. The installation function could check that the option is set before even downloading.

The data can be stored in a global variable of your namespace. You just need to define an environment object in your package and a function to modify it:

globals <- new.env()
load_data <- function(path) globals$data <- readRDS(path) 

Then your functions will test if globals$data is NULL before either loading the data (after checking if path option was set properly) or moving on.

Once it's done, as long as the data or RProfile are not removed, it will work forever, and if they are removed the functions will catch it and give instructions as to how to fix the issue.
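Putting these pieces together, a minimal sketch could look like the following. The accessor name get_data() is made up, "yourpackage.datapath" is the placeholder option from above, and the data is assumed to be a single .rds file.

```r
# Package-level environment holding the data once it has been read
globals <- new.env(parent = emptyenv())

# Hypothetical accessor: read the file on first use, then serve it from memory
get_data <- function() {
  if (is.null(globals$data)) {
    path <- getOption("yourpackage.datapath")
    if (is.null(path) || !file.exists(path)) {
      stop("Data not found. Install the data and set ",
           "options(yourpackage.datapath = ...) in your .Rprofile.")
    }
    globals$data <- readRDS(path)
  }
  globals$data
}
```

Every function that needs the data calls get_data(), so the file is read at most once per session and a missing option or file produces the instructions instead of an obscure error.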


Another option here is to load the data in .onLoad, which means you'll need some logic there to deal with the first time the package is loaded. As .onLoad knows the installation path through the libname argument, you can even require that the data be downloaded there and load it right after checking that it's there (using a global variable as above), so there is no need for options or the .Rprofile.

As long as the user is prompted I think it will be fine with CRAN.

Spermaceti answered 10/9, 2020 at 19:20 Comment(0)
B
2

Two alternatives that might be of interest:

  1. Create an additional install function that installs from Github the data package(s). The rnaturalearth package has a great example with the install_rnaturalearthhires function.

  2. Use the pins package to register a board_url. The pins package works by downloading the file and storing it in a local cache. Whenever it is called, it checks the original URL to see if there were any changes. If there weren't, it uses the copy it already has in the cache. If there is no Internet connection, it also uses the cached copy. As an example, we use the pins package in our covidmx package to update COVID-19 data from the Internet.

Boz answered 6/9, 2022 at 21:52 Comment(2)
Item 1 in your answer just asks to install the rnaturalearthdata package from c("packages.ropensci.org", "cran.rstudio.com"). This is because CRAN does not allow automatically installing packages without confirmation using install.packages. Could we just deposit the data package to ropensci.org and add a Suggests to DESCRIPTION similarly to the drat example? Source code here: github.com/ropensci/rnaturalearth/blob/master/R/… - Ponzo
@Ponzo I linked the wrong function from the rnaturalearth package. You can see that the install_rnaturalearthhires function uses devtools. So you could install from any public GitHub repo. See the source code here: github.com/ropensci/rnaturalearth/blob/master/R/… - Boz
