Writing robust R code: namespaces, masking and using the `::` operator

Short version

For those that don't want to read through my "case", this is the essence:

  1. What is the recommended way of minimizing the chances of new packages breaking existing code, i.e. of making the code you write as robust as possible?
  2. What is the recommended way of making the best use of the namespace mechanism when

    a) just using contributed packages (say in just some R Analysis Project)?

    b) with respect to developing own packages?

  3. How best to avoid conflicts with respect to formal classes (mostly Reference Classes in my case) as there isn't even a namespace mechanism comparable to :: for classes (AFAIU)?


The way the R universe works

This is something that's been nagging in the back of my mind for about two years now, yet I don't feel as if I have come to a satisfying solution. Plus I feel it's getting worse.

We see an ever-increasing number of packages on CRAN, GitHub, R-Forge and the like, which is simply terrific.

In such a decentralized environment, it is natural that the code base that makes up R (let's say that's base R and contributed R, for simplicity) will deviate from an ideal state with respect to robustness: people follow different conventions, there's S3, S4, S4 Reference Classes, etc. Things can't be as "aligned" as they would be if there were a "central clearing instance" that enforced conventions. That's okay.

The problem

Given the above, it can be very hard to use R to write robust code. Not everything you need will be in base R. For certain projects you will end up loading quite a few contributed packages.

IMHO, the biggest issue in that respect is the way the namespace concept is put to use in R: R allows for simply writing the name of a certain function/method without explicitly requiring its namespace (i.e. foo vs. namespace::foo).

So for the sake of simplicity, that's what everyone is doing. But that way, name clashes, broken code and the need to rewrite/refactor your code are just a matter of time (or of the number of different packages loaded).

At best, you will know about which existing functions are masked/overloaded by a newly added package. At worst, you will have no clue until your code breaks.

A couple of examples:

  • try loading RMySQL and RSQLite at the same time; they don't get along very well
  • also RMongo will overwrite certain functions of RMySQL
  • forecast masks a lot of stuff with respect to ARIMA-related functions
  • R.utils even masks the base::parse routine

(I can't recall which functions in particular were causing the problems, but am willing to look it up again if there's interest)
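
Masking of this kind can be inspected from within a session. The sketch below is not tied to any of the packages listed above: it attaches a hypothetical environment called `fake_pkg` that defines its own `parse` (standing in for a contributed package) and then uses base R's `conflicts()` to list the clash.

```r
# `fake_pkg` is a hypothetical stand-in for a contributed package that
# defines its own `parse`; it is not a real package.
fake_pkg <- new.env()
fake_pkg$parse <- function(...) "fake parse"
attach(fake_pkg, name = "fake_pkg", warn.conflicts = FALSE)

# conflicts(detail = TRUE) returns, per search-path entry, the names that
# occur more than once on the search path.
clashes <- conflicts(detail = TRUE)
masked_here <- "parse" %in% clashes[["fake_pkg"]]

# An unqualified call now finds the attached copy first ...
unqualified <- parse()
# ... while the qualified call still reaches base R.
qualified <- base::parse(text = "1 + 1")

# Clean up the search path again.
detach("fake_pkg")
```

Running `conflicts(detail = TRUE)` right after loading a new package is a cheap way to see what it masks before any code breaks.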

Surprisingly, this doesn't seem to bother a lot of programmers out there. I tried to raise interest a couple of times at r-devel, to no significant avail.

Downsides of using the :: operator

  1. Using the :: operator might significantly hurt efficiency in certain contexts as Dominick Samperi pointed out.
  2. When developing your own package, you can't even use the :: operator throughout your own code as your code is no real package yet and thus there's also no namespace yet. So I would have to initially stick to the foo way, build, test and then go back to changing everything to namespace::foo. Not really.

Possible solutions to avoid these problems

  1. Reassign each function from each package to a variable that follows certain naming conventions, e.g. namespace..foo in order to avoid the inefficiencies associated with namespace::foo (I outlined it once here). Pros: it works. Cons: it's clumsy and you double the memory used.
  2. Simulate a namespace when developing your package. AFAIU, this is not really possible, at least I was told so back then.
  3. Make it mandatory to use namespace::foo. IMHO, that would be the best thing to do. Sure, we would lose some degree of simplicity, but then again the R universe just isn't simple anymore (at least it's not as simple as in the early 00's).
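
Option 1 above can be sketched as follows, using the `namespace..foo` naming convention from the text. The helper `moving_average` is a hypothetical example; the point is only that the function is looked up once via `::` and then called through the alias.

```r
# Resolve the function once and bind it to an unambiguous alias.
stats..filter <- stats::filter

# Hypothetical helper: a centred moving average that keeps using
# stats::filter even if another attached package defines its own `filter`.
moving_average <- function(x, k = 3) {
  as.numeric(stats..filter(x, rep(1 / k, k), sides = 2))
}

result <- moving_average(1:5)

# The alias refers to the same function as the qualified name.
same_function <- identical(stats..filter, stats::filter)
```

As far as I know, R does not copy an object on plain assignment, so an alias like this should not by itself duplicate the function in memory.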

And what about (formal) classes?

Apart from the aspects described above, the :: way works quite well for functions/methods. But what about class definitions?

Take package timeDate with its class timeDate. Say another package comes along which also has a class timeDate. I don't see how I could explicitly state that I would like a new instance of class timeDate from either of the two packages.

Something like this will not work:

new(timeDate::timeDate)
new("timeDate::timeDate")
new("timeDate", ns="timeDate")

That can be a huge problem as more and more people switch to an OOP-style for their R packages, leading to lots of class definitions. If there is a way to explicitly address the namespace of a class definition, I would very much appreciate a pointer!
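
One mechanism that may help here, going by the `methods` documentation: `new()` also accepts a class definition object instead of a bare name, and `getClass()` takes a `where` argument that can be a package namespace. The sketch below uses the `mle` class from `stats4` (shipped with R) purely as a stand-in for `timeDate`. Whether this fully resolves a clash between two attached packages defining the same class name is an assumption that has not been verified here.

```r
library(methods)

# Look the class definition up in one specific namespace instead of by
# bare name. `stats4` and its class "mle" are only stand-ins here.
mle_def <- getClass("mle", where = asNamespace("stats4"))

# The definition records which package it came from.
def_package <- mle_def@package

# new() accepts the definition object in place of a class name.
obj <- new(mle_def)
obj_is_mle <- is(obj, "mle")
```

For Reference Classes, a package can also export the generator object returned by `setRefClass()`, in which case the generator itself can be addressed with `::` like any other exported object.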

Conclusion

Even though this was a bit lengthy, I hope I was able to point out the core problem/question and that I can raise more awareness here.

I think devtools and mvbutils do have some approaches that might be worth spreading, but I'm sure there's more to say.

Waxen answered 8/6, 2012 at 10:30 Comment(20)
This is a nice summary of the state of things, but perhaps you can more explicitly state what exactly the question is?Willy
Yes, that's a point. Just put up an "Essence" section ;-)Waxen
To appease @Andrie, you should re-state your question at the end of your... um, question. Also, "Esssence" -> "tl;dr". :)Adrenalin
@JoshuaUlrich: Sorry, I don't quite understand what you mean by "Esssence" -> "tl;dr"Waxen
I meant that as a joke. tl;dr.Adrenalin
Related question: https://mcmap.net/q/371863/-ensuring-reproducibility-in-an-r-environment/602276Willy
@JoshuaUlrich: thanks for a new abbreviation I learned ;-)Waxen
another hidden way how things can go wrong: https://mcmap.net/q/371864/-reordering-factor-gives-different-results-depending-on-which-packages-are-loaded/602276Willy
@Waxen Please don't post follow-up questions inside the question. This gets confusing very quickly. Anyway, I think the question on imports vs depends has been covered on SO.Willy
I attempted to clean up your question, but it involved some massive edits. In particular, I simply removed your latest edit (see Andrie's comment above). If you have follow-ups or comments on Joris's answer, you can ask a new question, or comment on his answer. Feel free to touch it up again, but keep in mind my aim was to trim it down a bit to the essentials to keep this fairly focused.Ehr
@joran: I can live with that ;-)Waxen
Sorry to have yet another follow-up question on this, but: what if the use case is not building a new package, but just using base and contrib packages in an "R Project" (that might never turn into an actual package). For example, I'd go with require("R.utils") which would mask base::parse since I can't use Imports and/or importFrom in a "project only" environment, and then I'm back to having to resort to using :: in order to tell R which parse function I want in a specific situation. Or am I still getting this wrong?Waxen
Or say I'm importing both packageA and packageB via Imports and both have function foo. R will look in the respective imports::<pkg> namespaces before traversing the search list, but I'd still need to use :: to distinguish between my two "dependencies" in a specific piece of code. So still: wouldn't it, as a community, be much more straightforward to always use :: as at least that way R can never really guess wrong? Sorry to keep rambling on about this, but I guess I'm still not fully convinced ;-)Waxen
@Waxen - by "R Project" do you mean executing code in the Global Environment? I don't see a way to avoid :: in that case.Clamor
@SFun28: right, basically executing everything in .GlobalEnv as all your own functions are sourced to that envir and your package dependencies are just loaded via require and thus are attached to the search path. IMHO, that's probably the most likely "productive use case" for a lot of R users (including companies) and the "risk of fragility" associated with this might be a big deal-breaker for basing crucial processes on R.Waxen
@Waxen - You cannot import two functions of the same name (R will tell you that it is overwriting one with the other - or perhaps it WARNs when you build the package, I forget). From your comment "R will look in the respective imports::<pkg>" I think you are misunderstanding imports. All packages have a single imports namespace. As such, only one foo can be present in your package's Imports namespace. packageA's imports is for symbols that packageA itself depends on, not the symbols that your package depends on.Clamor
@Waxen - so if you really need foo you need to use ::Clamor
@SFun28: ah, okay, thanks for clarifying the imports::<pkg> again!Waxen
May I ask what I'm sure is a very noob-like question? Why doesn't a solution like f <- dplyr::filter work to get at the functionality we'd like while preventing masking (in this case, of filter from the stats package)?Phrasing
Not a noob question and this works. I don't like it that much as a general approach as it involves the manual selection of function objects. But I'm sure there are good use cases for this (e.g. when having a "tight" loop where the :: operator would induce too much overhead while still wanting to make absolutely sure that masking doesn't get in your way).Waxen

GREAT question.

Validation

Writing robust, stable, and production-ready R code IS hard. You said: "Surprisingly, this doesn't seem to bother a lot of programmers out there". That's because most R programmers are not writing production code. They are performing one-off academic/research tasks. I would seriously question the skillset of any coder that claims that R is easy to put into production. Aside from my post on search/find mechanism which you have already linked to, I also wrote a post about the dangers of warning. The suggestions will help reduce complexity in your production code.

Tips for writing robust/production R code

  1. Avoid packages that use Depends and favor packages that use Imports. A package with dependencies stuffed into Imports only is completely safe to use. If you absolutely must use a package that employs Depends, then email the author immediately after you call install.packages().

Here's what I tell authors: "Hi Author, I'm a fan of the XYZ package. I'd like to make a request. Could you move ABC and DEF from Depends to Imports in the next update? I cannot add your package to my own package's Imports until this happens. With R 2.14 enforcing NAMESPACE for every package, the general message from R Core is that packages should try to be "good citizens". If I have to load a Depends package, it adds a significant burden: I have to check for conflicts every time I take a dependency on a new package. With Imports, the package is free of side-effects. I understand that you might break other people's packages by doing this. I think it's the right thing to do to demonstrate a commitment to Imports and in the long run it will help people produce more robust R code."

  2. Use importFrom. Don't add an entire package to Imports, add only those specific functions that you require. I accomplish this with Roxygen2 function documentation and roxygenize() which automatically generates the NAMESPACE file. In this way, you can import two packages that have conflicts where the conflicts aren't in the functions you actually need to use. Is this tedious? Only until it becomes a habit. The benefit: you can quickly identify all of your 3rd-party dependencies. That helps with...

  3. Don't upgrade packages blindly. Read the changelog line-by-line and consider how the updates will affect the stability of your own package. Most of the time, the updates don't touch the functions you actually use.

  4. Avoid S4 classes. I'm doing some hand-waving here. I find S4 to be complex and it takes enough brain power to deal with the search/find mechanism on the functional side of R. Do you really need these OO features? Managing state = managing complexity - leave that for Python or Java =)

  5. Write unit tests. Use the testthat package.

  6. Whenever you R CMD build/test your package, parse the output and look for NOTE, INFO, WARNING. Also, physically scan with your own eyes. There's a part of the build step that notes conflicts but doesn't attach a WARN, etc. to it.

  7. Add assertions and invariants right after a call to a 3rd-party package. In other words, don't fully trust what someone else gives you. Probe the result a little bit and stop() if the result is unexpected. You don't have to go crazy - pick one or two assertions that imply valid/high-confidence results.
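
The last tip (assertions right after a 3rd-party call) might look like the sketch below. `fit_slope` is a hypothetical wrapper, and `stats::lm` merely stands in for "someone else's function" so that the example runs with base R only.

```r
# Hypothetical wrapper around a third-party call.
fit_slope <- function(data) {
  fit <- stats::lm(y ~ x, data = data)
  slope <- unname(stats::coef(fit)["x"])

  # Probe the result a little: one or two cheap invariants.
  if (!is.numeric(slope) || length(slope) != 1L || !is.finite(slope)) {
    stop("unexpected slope returned by lm(): ", format(slope))
  }
  slope
}

good <- fit_slope(data.frame(x = 1:10, y = 2 * (1:10) + 1))

# A degenerate input (constant x) yields an NA coefficient, which the
# check turns into an error instead of letting it propagate.
bad <- tryCatch(
  fit_slope(data.frame(x = rep(1, 10), y = 1:10)),
  error = function(e) "stopped"
)
```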

I think there's more but this has become muscle memory now =) I'll augment if more comes to me.

Clamor answered 8/6, 2012 at 14:58 Comment(11)
Nice post, but I don't think the "Avoid S4 classes" admonishment deserves to be on that list. Some excellent R packages, including lme4, Matrix, sp, and related spatial packages use S4 to good effect. It's useful when you want generic functions that dispatch on the classes of multiple arguments. (Type library(Matrix); showMethods("solve"), or library(sp); showMethods("over") to see what I mean.) Plus, my impression is that S4 coding tends to make for/enforce tighter and MORE robust code. I would say, though, "don't use S4 unless you know that you need to."Oslo
Ok, I went overboard with the S4 advice. I didn't know S4 supported multiple dispatch. Good comment, Josh! +1Clamor
Great answer, thanks a lot! I'd also go along with everything you say except the S4 related statement. Well, not quite: plain-vanilla S4 is kinda "lame", but S4 Reference Classes ROCK! It's true and easy pass-by-reference, multiple signature argument dispatch etc. I'd say the more complex your (production oriented) project/software will be, probably the better you are off with using Reference Classes to implement an OOP style.Waxen
I would avoid pass-by-reference unless there is memory concern. I try to keep R as functional as possible to reap the benefits of avoiding state. I posted another tip about assertions/invariants.Clamor
@SFun28: I'd very much like to discuss and hear your opinion on this sometime if you want to. For my context (webscraping/big data analysis) they've given me so much more flexibility.Waxen
happy to! let's connect on twitter or linked-in? that info is on my blog.Clamor
@SFun28: great, I'll get in touch with you next weekWaxen
@SFun28: I've added one/two follow-up questions via two comments to my original question. If you find the time, it'd be great to hear your comments on those.Waxen
You still need Depends if your package calls a generic function but needs a method (that is not exported from) in that package's namespace? I had this issue recently when trying to move packages from Depends to Imports. In this instance there is no reason for the maintainer to export the method I want - the whole point of namespaces is that methods get hidden, is it not?Message
I'd have to experiment to find out, but I think a package registers its methods with R and that mechanism is independent from exporting? Easy way to test is to create a skeleton package (Skeleton) with a method (say print.myclass), then import Skeleton in another package (MyConsumer). library(MyConsumer) and try to print an object of class myclass. I think it will work just fine?Clamor
Also this paper makes an important point on managing versioning of your dependencies: arxiv.org/abs/1303.2140Lucais

My take on it :

Summary : Flexibility comes with a price. I'm willing to pay that price.

1) I simply don't use packages that cause those kinds of problems. If I really, really need a function from that package in my own packages, I use importFrom() in my NAMESPACE file. In any case, if I have trouble with a package, I contact the package author. The problem is on their side, not R's.

2) I never use :: inside my own code. By exporting only the functions needed by the user of my package, I can keep my own functions inside the NAMESPACE without running into conflicts. Functions that are not exported won't hide functions with the same name either, so that's a double win.
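
A minimal sketch of points 1) and 2), assuming roxygen2 is used to generate the NAMESPACE file; `moving_average` and `normalise_weights` are hypothetical names. The tags below would produce `importFrom(stats, filter)` and `export(moving_average)` directives, while the untagged helper stays internal to the package.

```r
# Internal helper: no @export tag, so it stays inside the package
# namespace and cannot mask anything for the user.
normalise_weights <- function(w) w / sum(w)

#' Centred moving average
#'
#' @param x numeric vector
#' @param k window width
#' @importFrom stats filter
#' @export
moving_average <- function(x, k = 3) {
  w <- normalise_weights(rep(1, k))
  # Inside a built package, `filter` resolves through the package's
  # imports, not through whatever the user has attached.
  as.numeric(filter(x, w, sides = 2))
}
```

Run as a plain script, `filter` is simply found on the search path; the protection against masking only applies once the code lives in a package with that import declared.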

You can find a good guide on how exactly environments, namespaces and the like work here: http://blog.obeautifulcode.com/R/How-R-Searches-And-Finds-Stuff/

This is definitely a must-read for everybody writing packages and the like. After you read this, you'll realize that using :: in your package code is not necessary.

Extraterrestrial answered 8/6, 2012 at 11:14 Comment(4)
Ad 1): haven't looked into importFrom(), yet. Thanks for that one. Ad 2): I'm not only talking about using :: for my own functions in my packages, but also for those of contrib packages. If I don't, how could I be sure that, say, a year from now everything will still work properly? To me, it still feels like the most natural thing to do is to explicitly specify the namespace along with the function you're calling. Isn't that the way it is done in other programming languages as well?Waxen
@Waxen using import() and importFrom(), and the correct specification of Depends and Imports in the DESCRIPTION file you will be a whole lot more sure that things keep on working. Regarding other programming languages: also in Java you just import a library and then use the function without specifying each time from where it comes. Regarding the difference between Depends and Imports : Depends does a check on the version, Imports doesn't. But note that Imports in the DESCRIPTION file and import() in the NAMESPACE file are two different things.Extraterrestrial
Yap, I did and see clearer now ;-) Still don't quite get why dependencies aren't always specified via Imports (instead of Depends) in the DESCRIPTION file. thxWaxen
Alright, gotcha. The post you pointed to was really one of the most insightful posts for me this year, thanks again!Waxen