How can DVCS help scientific programming?

Asked 27/4, 2009 at 4:35 Answered 27/4, 2009 at 14:58

I'm doing some preliminary work investigating how DVCS (the likes of Git, Hg, Bazaar) can help in the process of scientific programming, especially for graduate students. I think I'm in quite a good position for this since I've been programming for quite a few years and am currently starting a Masters program in a natural science. The goal is to have a short presentation on this in a month or two.

As far as I see it, aside from the obvious advantage of source control, DVCS currently affords the following improvements to a grad student's daily life:

Branching:

This is the big one. From observing DVCS practices it is clear that cheap branching mainly encourages experimentation with new features. Scientific programming is ALL about experimentation. Different branches can be created to tweak parameters or algorithms. This is especially important because most scientific code hasn't seen a single iota of refactoring throughout its lifetime (most grad students won't even know what it is), so the ability to switch between branches will bring some method to the typical madness. Fast commits could also mean using commit messages as a surrogate for lab notebooks. Computational results could be tagged with specific commit hashes for reproducible research.
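
To make the last point concrete, here is a minimal sketch of stamping a result file with the commit that produced it. It assumes the code lives in a Git repository and that `git` is on the PATH; the helper names (`git_output`, `build_record`, `save_result`) and the JSON layout are hypothetical, chosen only for illustration.

```python
import json
import subprocess
from pathlib import Path


def git_output(args, repo_dir="."):
    """Run a git command in repo_dir and return its stripped stdout."""
    result = subprocess.run(
        ["git", *args],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def build_record(values, commit, dirty):
    """Bundle computed values with the commit hash that produced them."""
    return {"commit": commit, "dirty": dirty, "values": values}


def save_result(values, out_path, repo_dir="."):
    """Write values to out_path as JSON, stamped with the current HEAD."""
    commit = git_output(["rev-parse", "HEAD"], repo_dir)
    # Non-empty porcelain output means uncommitted or untracked files exist,
    # so the hash alone would not fully describe the code that ran.
    dirty = bool(git_output(["status", "--porcelain"], repo_dir))
    record = build_record(values, commit, dirty)
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record
```

Calling something like `save_result({"energy": -1.5}, "run01.json")` at the end of a run would leave a file that can later be traced back to a commit. The `dirty` flag is there because a result produced from uncommitted edits is not reproducible from the hash alone.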

Pushing to servers:

Since most scientific code nowadays is run on some sort of cluster, DVCS can be used as a more advanced rsync, which many are already using to push "production" code to the HPC clusters. This is combined with branching to easily run multiple versions of code without leaving

Collaboration of papers:

Need I say more? Papers that have multiple authors are run exactly like small open source projects. Collaboration on papers should be a natural fit when all authors write in LaTeX, with additional complications if the writing is done in something like Word. This is where commit messages could potentially play a bigger role.

My question is, what do you think DVCS can contribute to scientific programmers? I see a lot of talk about moving to source control in the community, but most are still looking into Subversion. From my cursory notes it sounds like DVCS should be the perfect workflow paradigm for new grad students. Is my thinking flawed? Or is scientific coding simply lagging too far behind to have even heard of DVCS tools?

Practices for programming in a scientific environment

Ce answered 27/4, 2009 at 4:35 Comment(4)

not really programming related, more about how people can collaborate on writing papers? – Feil 27/4, 2009 at 4:43

What is scientific programming? Programming done by scientists, or programming on large clusters? I've done lots of scientific programming in several labs, and most of it is not done collaboratively or on clusters. – Colman 27/4, 2009 at 5:32

@mmr: I'd go with "programming to simulate experiments and/or analyze data sets", and its collaborativeness and clusterosity vary enormously. (In my discipline essentially every project involves some collaboratively developed code, and the primary analysis generally runs on a cluster.) – Wilow 27/4, 2009 at 14:36

dmckee is right, though I might add that a scientific programming project typically exhibits the following traits: 1. Generally lower code refactoring compared to industry standards 2. Shorter code lifetime than most projects (usually shorter than the term of a graduate study) 3. Should be able to be scrutinized for accuracy 4. Typically coded for both workstation use (prototyping and testing) and cluster use (production experiment) – Ce 27/4, 2009 at 19:6

Training is a real concern. I know quite a few particle physicists (big science with big programming projects) whose sum total knowledge of source control is how to run the naive versions of cvs checkout, cvs update, and cvs commit.

Yes, CVS. I know a software group leader who has put off the move to Subversion because of these folks.

At the next tier of skill they also know the diff and stat commands and how to specify branches or tags, but may avoid creating or merging branches.

If you are planning to introduce a DVCS, plan on an intensive, ongoing training and support program. Scientists (or at least physicists) typically have little formal training in computer science, and may have only the vaguest conception of software process.

Wilow answered 27/4, 2009 at 14:58 Comment(0)

Regarding your main points:

"obvious advantage of DVCS": it bears repeating that, especially in an academic environment with potentially strict IT rules against external connections, a DVCS allows working with a local repository. That means you do not have to be "connected" to one central repo to access the full history of a project, and that right there could be the main contribution of DVCS to scientific programmers.

But that also means you must have some kind of policy to allow any given work to "come together" and be consolidated into one repository, which does not mean there will be only one "central" base: one could imagine several central repos for several big projects. Still, that requires administration (not to be underestimated).

And that "consolidation" process can be quite difficult due to your main first point:

branching: the student needs to branch carefully (since it is so easy). I have seen my share of branches named 'toto', 'Monday', 'myName', ...: once published into another (more central) repo, what are we supposed to do with those? If 20+ branches are to be merged in order to finalize one common code base, the process can become error-prone very quickly.

Quick comments on your other points:

deployment (what you call "pushing to server"): yes, DVCS can be used for some kind of deployment, but that means you have organized your repo to include some kind of "release component" (the set of files you want to push to the server) and you have versioned them. And release management includes many other steps which cannot all be recorded in the DVCS, for instance the de-variabilization process where you replace variables within configuration files with actual values adapted to the target server (port numbers, local paths, ...). You can attempt to manage those configuration files with their values filled in directly through branches, but in my experience that quickly becomes too complex to follow.

collaboration: that is not reserved to DVCS (centralized VCS offer it too). Note that for some formats (Word documents), their internal revision system could be more efficient.
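
For the de-variabilization step mentioned under deployment, a minimal sketch could keep one versioned template and fill it in per target at release time, instead of one branch per server. Everything here is an assumption for illustration: the `CONFIG_TEMPLATE` contents, the target names, and the port and path values are all made up.

```python
from string import Template

# Versioned once in the repo; $port and $data_dir are the variables
# to be replaced per target server (hypothetical keys).
CONFIG_TEMPLATE = Template(
    "port = $port\n"
    "data_dir = $data_dir\n"
)

# Per-target values, kept outside the branch structure (hypothetical values).
TARGETS = {
    "workstation": {"port": 8080, "data_dir": "/tmp/run"},
    "cluster": {"port": 9090, "data_dir": "/scratch/run"},
}


def render_config(target):
    """Return the config text for one target.

    Template.substitute raises KeyError if a placeholder has no value,
    so an incomplete target definition fails loudly instead of silently.
    """
    return CONFIG_TEMPLATE.substitute(TARGETS[target])
```

With this layout only the template and the table of values are versioned, and the filled-in file is generated when deploying, which avoids carrying one configuration branch per server.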

Molehill answered 27/4, 2009 at 5:56 Comment(0)

One big problem with DVCS for scientific programming is binary data. Scientific programming often requires input/output of gigantic files, and that kills performance very quickly on every DVCS I know of (bzr, hg, git). That's one area where svn is much better currently.

I think DVCS can be quite useful for papers as well, but that requires your collaborators to know the DVCS too.

Appalachian answered 27/4, 2009 at 4:49 Comment(4)

Who's going to put the data under version control? That would kill a traditional RCS just as fast... In particle physics we typically use separate warehousing arrangements for the data. – Wilow 27/4, 2009 at 14:42

Depending on the data, it makes a lot of sense to put it under revision control - or not. I certainly have met many researchers doing just that. And a DVCS is fundamentally different from, say, svn here, because whereas svn tracks each file individually, most DVCS I know track the whole tree. You can't do a partial checkout with git, bzr or hg (by partial, I mean only a subtree, not a subset of the history), so once you add one file of 1 GB, you have to carry the associated burden forever after. – Appalachian 27/4, 2009 at 15:15

With Git at least you can get around this with submodules, but obviously that further complicates the whole thing. One thing I personally do when dealing with output that is too big to be revision controlled is to label it with the commit id that generated the result. In svn this is the rev number, while in things like git it would be the SHA-1 hash. – Ce 27/4, 2009 at 19:2

See also #677936 for a better id than SHA-1 alone – Molehill 29/4, 2009 at 13:49
