Scalable (half-million files) version control system

We use SVN for our source-code revision control and are experimenting with using it for non-source-code files.

We are working with a large set (300k-500k) of short (1-4 kB) text files that will be updated on a regular basis and need to be version controlled. We tried SVN in flat-file mode, and it struggled with the first commit: checking in 500k files took about 36 hours.

On a daily basis, we need the system to be able to handle 10k modified files per commit transaction in a short time (<5 min).

My questions:

  1. Is SVN the right solution for my purpose? The initial speed seems too slow for practical use.
  2. If yes, is there a particular SVN server implementation that is fast? (We are currently using the default GNU/Linux svn server and command-line client.)
  3. If no, what are the best free/open-source or commercial alternatives?

Thanks


Edit 1: I need version control because multiple people will be concurrently modifying the same files and will be doing manual diff/merge/conflict resolution in exactly the same way as programmers edit source code. Thus I need a central repository to which people can check in their work and from which they can check out others' work. The workflow is virtually identical to a programming workflow, except that the users are not programmers and the file content is not source code.


Update 1: It turns out that the primary issue was more of a filesystem issue than an SVN issue. For SVN, committing a single directory with half a million new files did not finish even after 24 hours. Splitting the same files across 500 folders arranged in a 1x5x10x10 tree, with about 1000 files per folder, brought the commit time down to 70 minutes. Commit speed drops significantly over time for a single folder with a large number of files. Git seems a lot faster. I will update with timings.
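
For illustration, here is a rough Python sketch of one way to shard a flat directory into such a 5x10x10 layout. The source and destination paths and the hash-based placement are placeholders for this example, not our actual script:

    import hashlib
    import shutil
    from pathlib import Path

    SRC = Path("flat")     # hypothetical flat directory holding ~500k files
    DST = Path("sharded")  # hypothetical root of the 1x5x10x10 tree

    def shard_path(name: str) -> Path:
        """Map a file name onto one of 5 x 10 x 10 = 500 leaf folders."""
        h = int(hashlib.md5(name.encode()).hexdigest(), 16)
        return DST / f"a{h % 5}" / f"b{(h // 5) % 10}" / f"c{(h // 50) % 10}"

    for f in SRC.iterdir():
        if f.is_file():
            leaf = shard_path(f.name)
            leaf.mkdir(parents=True, exist_ok=True)
            shutil.copy2(f, leaf / f.name)

With roughly 500k files this yields about 1000 files per leaf folder, which is the layout that brought the commit time down to 70 minutes.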

Innutrition asked 31/3, 2010 at 17:55 Comment(10)
If you are doing what I think you are doing, I'd look into some kind of CMS.Cervicitis
As others pointed out: it might be worth explaining what you are trying to solve in general, as a version control system might be the wrong (at least not the most efficient) solution to your problem.Spirketing
Either what erikkallen said above, or a filesystem with builtin snapshot support. More details about the problem would be good to determine if version control is the correct solution for the problem.Studner
@hasen No, I am not mirroring wikipedia or anything else. The content of the files is manually generated. Look at Edit 1.Innutrition
If you don't need tree-wide commit atomicity, git or mercurial aren't your best solution (especially if you have lots of merges/conflicts).Lanfranc
How are your files structured? Many filesystems will bog down handling directories with a large number of files.Zealous
This blows my mind. People manually modify 10,000 files every day? Each one needs to be manually diffed and merged? Each file can be up to 4 kB? Wow.Cutworm
There has been progress on this issue: mail-archives.apache.org/mod_mbox/subversion-dev/201004.mbox/…Kamalakamaria
@gbjbaanb. Thanks for that valuable insight into the source of the problem.Innutrition
@hashable, any update or new insights on this?Anglophobia

Is SVN suitable? As long as you're not checking out or updating the entire repository, then yes, it is.

SVN is quite bad at committing very large numbers of files (especially on Windows), as all those .svn directories are written to update a lock whenever you operate on the working copy. If you have a small number of directories you won't notice, but the time taken seems to increase exponentially.

However, once everything is committed (in chunks, perhaps directory by directory), things become very much quicker. Updates don't take as long, and you can use the sparse checkout feature (highly recommended) to work on sections of the repository. Assuming you don't need to modify thousands of files at once, you'll find it works quite well.
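
For instance, a minimal sketch of the sparse checkout flow (SVN 1.5 or later), driven from Python here; the repository URL and subdirectory names are placeholders:

    import subprocess

    REPO = "http://svn.example.com/repo/trunk"  # placeholder repository URL

    # Check out only the top level of the tree, with no children.
    subprocess.run(["svn", "checkout", "--depth", "empty", REPO, "wc"], check=True)

    # Pull in just the top-level folders you actually need to work on.
    for subdir in ["a0", "a1"]:  # illustrative folder names
        subprocess.run(["svn", "update", "--set-depth", "infinity", subdir],
                       cwd="wc", check=True)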

Committing 10,000 files all at once is, again, not going to be speedy, but committing 1,000 files ten times a day will be much more manageable.
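
A rough sketch of that chunking idea, again from Python; the batch size and commit message are arbitrary, and --targets simply feeds svn a file listing the paths to commit:

    import os
    import subprocess
    import tempfile

    def commit_in_batches(changed_paths, batch_size=1000):
        """Commit a large set of modified files as several smaller transactions."""
        for i in range(0, len(changed_paths), batch_size):
            batch = changed_paths[i:i + batch_size]
            with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
                f.write("\n".join(batch))
                targets = f.name
            try:
                subprocess.run(
                    ["svn", "commit", "--targets", targets,
                     "-m", f"daily update, batch {i // batch_size + 1}"],
                    check=True)
            finally:
                os.unlink(targets)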

So try it once you've got all the files in there, and see how it works. All of this will be fixed in 1.7, as the working-copy mechanism is being modified to remove those .svn directories (so keeping locks is simpler and much quicker).

Kamalakamaria answered 31/3, 2010 at 20:8 Comment(4)
It's not really the large number of files, it's the large number of directories that impacts the performance the most.Levin
@Kamalakamaria @Sander Too many files in a single folder seems to be the problem. Please look at Update 1.Innutrition
I was referring to the slowdowns described by @Kamalakamaria caused by .svn directories. That slowdown is caused by having many directories, not by having many files. Even locking the working copy before the operation and unlocking it afterwards takes a lot of time if there are many directories.Levin
Too many files in one directory... try your timing with the virus checker turned off. That .svn directory needs to be updated per file when you commit. Not good. Also, post your timings on the svn dev mailing list - you may get some help there, or at least prompt someone to take a look at what's going on.Kamalakamaria

As of July 2008, the Linux kernel git repo had about 260,000 files. (2.6.26)

http://linuxator.wordpress.com/2008/07/22/5-things-you-didnt-know-about-linux-kernel-code-metrics/

At that number of files, the kernel developers still say git is really fast. I don't see why it'd be any slower at 500,000 files. Git tracks content, not files.

Verda answered 31/3, 2010 at 18:7 Comment(4)
To reaffirm this: I just tested a commit which essentially rewrote all the contents of an enormous repository (26000 files, 5GB). It took about 6 minutes, mostly I/O-limited over a not-that-fast network mount. In your use case, the diffs are more like 50MB, so you should see much faster commit times. (Your initial commit could still take a while - wild guess five minutes to an hour depending on your system.)Aramaic
Be aware. Git has a steep learning curve for programmers and can be baffling to non-coders. I now use git all the time and couldn't work without it, but it took me a few months to get comfy. Make sure you are ready to sink some hours into training your non-programmer colleagues if you commit to Git-- no pun intended :)Beaumarchais
@Andy Thanks for that valuable comment about Git's learning curve.Innutrition
When I look at the linux-kernel-metrics-link, there are only about 26,000 files (instead of 260,000).Displode

For such short files, I'd look into using a database instead of a filesystem.
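
For example, here is a minimal sketch with SQLite that keeps every revision of every document in a single table; the schema and names are made up for illustration:

    import sqlite3

    conn = sqlite3.connect("documents.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS revisions (
            path    TEXT NOT NULL,
            rev     INTEGER NOT NULL,
            author  TEXT,
            content TEXT,
            PRIMARY KEY (path, rev)
        )
    """)

    def save(path, content, author):
        """Store a new revision of a document."""
        cur = conn.execute(
            "SELECT COALESCE(MAX(rev), 0) FROM revisions WHERE path = ?", (path,))
        next_rev = cur.fetchone()[0] + 1
        conn.execute(
            "INSERT INTO revisions (path, rev, author, content) VALUES (?, ?, ?, ?)",
            (path, next_rev, author, content))
        conn.commit()

    def latest(path):
        """Fetch the newest revision of a document, or None if it doesn't exist."""
        cur = conn.execute(
            "SELECT content FROM revisions WHERE path = ? ORDER BY rev DESC LIMIT 1",
            (path,))
        row = cur.fetchone()
        return row[0] if row else None

Diffing and merging concurrent edits would still have to be handled at the application level, though.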

Discontented answered 31/3, 2010 at 18:39 Comment(0)

Git is your best bet. You can check in an entire operating system (two gigabytes of code in a few hundred thousand files) and it remains usable, although the initial check-in will take quite a while, around 40 minutes.

Advantageous answered 31/3, 2010 at 18:9 Comment(4)
Presuming the system has fast disk, yes. I suppose SSD would be the way to go for ultimate speed of revision control systems.Advantageous
Thanks for that tip. Yes. Using an SSD as the SVN server HDD would speed up things.Innutrition
@hashable: you'd have to research that. I think that the harddisk on the client is more critical than that in the server, when using SVN.Levin
The client would be more critical with git too.Advantageous
  1. For svn "flat file mode", meaning FSFS I presume:

    • Make sure you're running the latest svn. FSFS had sharding added in ~1.5, IIRC, which will be a night-and-day difference at 500k files. The particular filesystem you run will also have a huge effect. (Don't even think about this on NTFS.)
    • You're going to be IO-bound with that many file transactions. SVN is not very efficient with this, having to stat files in .svn/ as well as the real files.
  2. Git has way better performance than SVN, and you owe it to yourself to at least compare the two.

Panjandrum answered 31/3, 2010 at 18:13 Comment(4)
@Nathan Yes. I believe we are using version 1.6.x of SVN.Innutrition
and with the number of files, svn 1.7 will have much better support by scrapping the .svn directories that have a significant impact with a very large number of files. Of course, this isn't out yet.Kamalakamaria
Sharding will help when you have a large number of revisions; it doesn't improve anything for the number of files. It's the revisions that are sharded in the repository.Levin
@Sander: Right, good point. I guess I was imagining "updating on a regular basis" as individual commits, but that's not so likely with that number of files. The real slow-down is client side.Panjandrum

I recommend Mercurial, as it still leads git in the usability department (git's been getting better, but, eh).

bzr has made leaps forward in usability as well.

Motherly answered 1/4, 2010 at 22:18 Comment(0)

Do you really need a file system with cheap snapshots, like ZFS? You could configure it to save the state of the filesystem every 5 minutes to avail yourself of some level of change history.
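
As a rough sketch of what snapshotting every 5 minutes could look like (the dataset name is a placeholder, and in practice you would schedule this with cron or a systemd timer rather than a long-running loop):

    import subprocess
    import time
    from datetime import datetime

    DATASET = "tank/documents"  # placeholder ZFS dataset

    while True:
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        # Each ZFS snapshot is a cheap copy-on-write operation over the whole dataset.
        subprocess.run(["zfs", "snapshot", f"{DATASET}@auto-{stamp}"], check=True)
        time.sleep(300)  # 5 minutes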

Ardussi answered 31/3, 2010 at 18:21 Comment(2)
Your answer sounds like a question (typo?). Anyway, good pointer!Spirketing
It's called the Socratic method ;-)Ardussi

Is there any reason you need to commit 10k modified files per commit? Subversion would scale much better if every user checked in his or her own files right away. Then, the one time a day you need to 'publish' the files, you can tag them very quickly and run the published version from the tag.
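
A sketch of that daily 'publish' step; a tag in Subversion is just a cheap server-side copy, so tagging even a very large tree is fast (the repository URL is a placeholder):

    import subprocess
    from datetime import date

    REPO = "http://svn.example.com/repo"  # placeholder repository URL

    tag_url = f"{REPO}/tags/publish-{date.today().isoformat()}"
    # 'svn copy' between two URLs happens entirely on the server, regardless of tree size.
    subprocess.run(["svn", "copy", f"{REPO}/trunk", tag_url,
                    "-m", "daily published snapshot"], check=True)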

Levin answered 1/4, 2010 at 18:9 Comment(3)
@Sander 10k is the upper bound. A user cannot check in just one file at a time due to inter-file dependencies.Innutrition
Do you mean that by manually doing their work, they produce up to 10k files that need to be one commit? That sounds pretty much impossible unless the files are generated, in which case it's generally better to store the source files in source control.Levin
The manual work is not done at a file level. Small edits (to the information represented in all the files collectively) can result in several files being modified. Yes, for the upper-bound case of 10,000 file modifications, the changes are likely to be due to programmatic file modification. (There is both human and automatic editing of the files.)Innutrition
