How much space is needed to download the entire CRAN repository?

How much space is needed to download the entire CRAN Repository? Keeping all the files zipped, how large would a folder holding all the packages be? I can't find a clear answer to this question. I've read about 3GB, but I've also come across 200GB.

Oquinn answered 22/9, 2016 at 22:29 Comment(1)
The real answer is "it depends". Do you want sources only? Binaries for one or a few platforms? Full or partial history? HTML files? Accounting RDS files? Historical and current R source? You don't have to mirror all of CRAN to have the ability to have a CRAN repo locally. I have a custom rsync configuration (daily) and it's now <60GB on disk for the subset I've chosen to mirror, which is pkg sources, macOS binaries, full R sources, all HTML (including CRAN checks) and some other bits. – Pharisaism

Per my comment:

rsync -rtlzv --delete  cran.r-project.org::CRAN/bin/macosx/mavericks/contrib/3.2/ /cran/bin/macosx/mavericks/contrib/3.2/
rsync -rtlzv --delete  cran.r-project.org::CRAN/bin/macosx/mavericks/contrib/3.3/ /cran/bin/macosx/mavericks/contrib/3.3/
rsync -rtlzv --delete  cran.r-project.org::CRAN/doc/ /cran/doc/
rsync -rtlzv --delete  cran.r-project.org::CRAN/bin/macosx/tools/ /cran/bin/macosx/tools/
rsync -rtlzv --delete  cran.r-project.org::CRAN/web/ /cran/web/
rsync -rtlzv --delete  cran.r-project.org::CRAN/src/ /cran/src/
rsync -tlzv --delete  -a --include="NEWS" --include="*.shtml" --include="*.html" --include="*.pkg" --include="*.dmg" --include="*.gz" --exclude="*" cran.r-project.org::CRAN/bin/macosx/ /cran/bin/macosx/
rsync -tlzv --delete  -a --include="*.html" --include="*.shtml" --include="*.svg" --include="*.png" --exclude="*" cran.r-project.org::CRAN/ /cran/
rsync -rtlzv --delete  cran.r-project.org::CRAN/src/contrib/PACKAGES.gz /cran/src/contrib/PACKAGES.gz

(which is not an optimized set of rsync statements) gets me a fully functional local CRAN repo that supports all of my systems quite well. I let the sole, nigh useless Windows VM I keep for testing use RStudio's mirror, since I have no use for its cruft on this system, but my Linux and macOS systems work flawlessly with this when it comes to pkgs.
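
If you want to trim the repetition in the fully recursive lines above, here is a minimal sketch that loops over a list of subtrees. The function name `mirror_subtrees` and the `CRAN_HOST`, `DEST` and `RSYNC` variables are my own placeholders, not anything CRAN provides. It only covers the plain `-rtlzv --delete` lines, not the include/exclude ones.

```shell
#!/bin/sh
# Sketch only: CRAN_HOST, DEST and RSYNC are hypothetical knobs for this example.
CRAN_HOST="${CRAN_HOST:-cran.r-project.org}"   # rsync host from the commands above
DEST="${DEST:-/cran}"                          # local mirror root
RSYNC="${RSYNC:-rsync}"                        # set RSYNC=echo to preview the commands

# Mirror each subtree named on the command line, with the same flags
# as the fully recursive rsync lines in the answer.
mirror_subtrees() {
    for path in "$@"; do
        # Make sure the local parent directories exist before syncing.
        mkdir -p "$DEST/$path" || return 1
        "$RSYNC" -rtlzv --delete "${CRAN_HOST}::CRAN/$path/" "$DEST/$path/" || return 1
    done
}

# Example call (commented out so that sourcing this file transfers nothing):
# mirror_subtrees doc web src bin/macosx/tools
```

Running it with `RSYNC=echo` prints the argument list for each subtree instead of transferring anything, which is a cheap way to check the paths first.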

As I said in the comment, this is under 60GB.
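
Since the size depends entirely on which subset you pick, it may be easier to measure than to guess. A rough sketch, assuming a reasonably standard rsync and du: the function names below are hypothetical, and the exact wording of rsync's `--stats` output can vary between versions.

```shell
#!/bin/sh
# Sketch only: both function names are made up for this example.

# Ask rsync what a subtree would cost without downloading it.
# --dry-run transfers nothing; --stats prints a summary that should
# include a "Total file size" line for the files in that subtree.
estimate_remote_size() {
    # $1 = path under the CRAN module, e.g. src/contrib
    # $2 = local directory to compare against (may be empty)
    rsync -rtl --dry-run --stats "cran.r-project.org::CRAN/$1/" "$2/" |
        grep -i 'total file size'
}

# Report how much disk an existing local mirror uses, in KiB.
local_mirror_kib() {
    du -sk "$1" | cut -f1
}

# Example calls (commented out so that sourcing this file does nothing):
# estimate_remote_size src/contrib /cran/src/contrib
# local_mirror_kib /cran
```

The dry run still has to walk the remote file list, so it is not instant on the larger subtrees.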

To make it fully functional, you have to set up a web server, and it's a PITA to use anything but Apache given the 1990s web tech setup CRAN seems determined to maintain. Said config is an exercise left to the reader.

Of note: it's worth the time doing the mirror and exploring the nuggets around the filesystem. There are many RDS files for "accounting" and other insights you won't get from staring at the 1990s HTML files on the web site.

Using your own local mirror reduces information leakage: mirrors that keep logs or mine your pkg usage never see your requests, so your privacy is preserved. It also stops you from contributing to the (IMO very inaccurate) "# downloads" package counts that show up in GitHub README.md badges.

Pharisaism answered 22/9, 2016 at 23:7 Comment(3)
Why do you believe the # downloads are very inaccurate? – Culp
I'm pretty sure it includes Travis pkg installs (or other CI installs) and that definitely skews results if so. Plus RStudio isn't the only mirror. – Pharisaism
It seems you are asking the R foundation to foot a huge bandwidth bill for a trivial gain. Why not just use random repos? – Culp
