available.packages by publication date

Asked 4/1, 2012 at 5:0 Answered 12/1, 2012 at 5:13

Is it possible to get the publication date of CRAN packages from within R? I would like to get a list of the k most recently published CRAN packages, or alternatively all packages published after a given date (dd-mm-yy), similar to the information on the available_packages_by_date.html page.

The available.packages() command has a "fields" argument, but this only extracts fields from the DESCRIPTION. The date field on the package description is not always up-to-date.

I can get it with a smart regex from the HTML page, but I am not sure how reliable and up-to-date this HTML file is. At some point Kurt might decide to give the layout a makeover, which would break the script. An alternative is to use timestamps from the CRAN FTP, but I am not sure how good this solution is either. Is there a formally structured file with publication dates somewhere? I assume the HTML page is automatically generated from some DB.

Occultism answered 4/1, 2012 at 5:0 Comment(2)

you can read the contents of the html table using XML::readHTMLTable. is this what you were looking for? – Gorilla 4/1, 2012 at 5:11

CRANberries produces a SQLite database with package metadata, including when added to CRAN etc. It would be trivial to export, and/or CRAN could just make it available. There are some 'hidden' RData files on CRAN, the information may well exist... – Fourhanded 4/1, 2012 at 23:35

It turns out there is an undocumented file "packages.rds" which contains the publication dates (not times) of all packages. I suppose these data are used to recreate the HTML file every day.

Below is a simple function that extracts publication dates from this file:

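recent.packages.rds <- function(){
    mytemp <- tempfile();
    #mode="wb" keeps the binary .rds file intact on Windows
    download.file("http://cran.r-project.org/web/packages/packages.rds", mytemp, mode="wb");
    mydata <- as.data.frame(readRDS(mytemp), row.names=NA);
    mydata$Published <- as.Date(mydata[["Published"]]);

    #sort by publication date (oldest first) and keep the fields you like:
    mydata <- mydata[order(mydata$Published),c("Package", "Version", "Published")];
    #return visibly; a bare assignment as the last expression returns invisibly
    return(mydata);
}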
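
To get from such a table to what the question asks for (the k most recent packages, or everything published after a given date), something like the following might do. This is only a sketch: most_recent and published_after are hypothetical helper names, and the small pubs data frame is made-up sample data standing in for the real table, assumed to have Package, Version and Published (Date) columns.

```r
# Hypothetical helpers; `pubs` is assumed to have the columns
# Package, Version and Published (of class Date).
most_recent <- function(pubs, k = 10) {
  # newest first, then keep the top k rows
  head(pubs[order(pubs$Published, decreasing = TRUE), ], k)
}

published_after <- function(pubs, date) {
  # `date` is assumed to be a Date or an ISO string such as "2011-12-31"
  cutoff <- as.Date(date)
  hits <- pubs[!is.na(pubs$Published) & pubs$Published > cutoff, ]
  hits[order(hits$Published, decreasing = TRUE), ]
}

# Made-up sample data, only to illustrate the calls
pubs <- data.frame(
  Package = c("a", "b", "c"),
  Version = c("1.0", "2.0", "0.1"),
  Published = as.Date(c("2012-01-01", "2012-01-03", "2011-12-25")),
  stringsAsFactors = FALSE
)
newest_two <- most_recent(pubs, 2)
since_new_year <- published_after(pubs, "2011-12-31")
```

Rows with a missing publication date are dropped by published_after, since they cannot be compared with the cutoff.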

Occultism answered 12/1, 2012 at 5:13 Comment(0)

The best approach is to take advantage of the fact that the package DESCRIPTION is published on the CRAN mirror, and since the DESCRIPTION comes from the built package, it contains information about exactly when it was packaged:

pkgs <- unname(available.packages()[, 1])[1:20]
desc_urls <- paste("http://cran.r-project.org/web/packages/", pkgs, "/DESCRIPTION", sep = "")
desc <- lapply(desc_urls, function(x) read.dcf(url(x)))

sapply(desc, function(x) x[, "Packaged"])
sapply(desc, function(x) x[, "Date/Publication"])

(I'm restricting it to the first 20 packages here to illustrate the basic idea)
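
To turn those fields into actual dates, a rough sketch could look like this. It runs offline on a made-up DESCRIPTION (the package name, version and timestamps are invented for illustration), and description_dates is a hypothetical helper. It assumes both fields start with a YYYY-MM-DD date, which matches the DESCRIPTION files I have seen but is not something I can guarantee.

```r
# Made-up DESCRIPTION content, standing in for a file read from a mirror
desc_text <- c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "Packaged: 2012-01-03 10:15:02 UTC; someone",
  "Date/Publication: 2012-01-03 12:30:11"
)
con <- textConnection(desc_text)
desc <- read.dcf(con, fields = c("Package", "Packaged", "Date/Publication"))
close(con)

# Hypothetical helper: takes the character matrix returned by read.dcf
description_dates <- function(desc) {
  data.frame(
    package = unname(desc[, "Package"]),
    # assumption: both fields begin with a YYYY-MM-DD date
    packaged = as.Date(substr(desc[, "Packaged"], 1, 10)),
    published = as.Date(substr(desc[, "Date/Publication"], 1, 10)),
    stringsAsFactors = FALSE
  )
}

dates <- description_dates(desc)
```

A field that is missing from a DESCRIPTION comes back as NA from read.dcf and stays NA after the conversion.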

Elizbeth answered 5/1, 2012 at 21:9 Comment(3)

+1 for pointing out that the package date and the last modified date could be different. – Ecker 6/1, 2012 at 16:48

Hmz that means downloading 3000+ DESCRIPTION files every time I want to check for something new. I was planning on running this as a cron job every 15 minutes. Not sure that is a nice solution. – Occultism 7/1, 2012 at 4:29

If you just want to monitor for changes, I think there's a root level file you can inspect. Crantastic does this somehow. – Elizbeth 11/1, 2012 at 4:30

Here is a function that uses the HTML and regular expressions. I would still rather get the information from a more formal place, though, in case the HTML ever changes layout.

recent.packages <- function(number=10){

    #html is malformed
    maxlines <- number*2 + 11
    mytemp <- tempfile()
    #getOption("repos") can hold several repositories; use the first one
    repo <- getOption("repos")[1]
    if(repo == "@CRAN@"){
        repo <- "http://cran.r-project.org"
    }
    newurl <- paste(repo,"/web/packages/available_packages_by_date.html", sep="");
    download.file(newurl, mytemp);
    datastring <- readLines(mytemp, n=maxlines)[12:maxlines];

    #the patterns below only match dates from 2010 through 2019
    myexpr1 <- '201[0-9]-[0-9]{2}-[0-9]{2} </td> <td> <a href="../../web/packages/[a-zA-Z0-9\\.]{2,}/'
    myexpr2 <- '^201[0-9]-[0-9]{2}-[0-9]{2}'
    myexpr3 <- '[a-zA-Z0-9\\.]{2,}/$'
    newpackages <- unlist(regmatches(datastring, gregexpr(myexpr1, datastring)));
    newdates <- unlist(regmatches(newpackages, gregexpr(myexpr2, newpackages)));
    newnames <- unlist(regmatches(newpackages, gregexpr(myexpr3, newpackages)));

    newdates <- as.Date(newdates);
    newnames <- substring(newnames, 1, nchar(newnames)-1);
    returndata <- data.frame(name=newnames, date=newdates);
    return(head(returndata, number));
}
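
As a variation on the function above, the date and the package name can be pulled out in one pass with capture groups via regexec instead of three separate patterns. This is only a sketch: the sample row is an assumption about the page layout, inferred from the pattern used above, and parse_by_date_lines is a hypothetical helper name. It also accepts any four-digit year, which is a change from the original pattern.

```r
# Hypothetical helper: extract (date, name) pairs from lines of the by-date page
parse_by_date_lines <- function(lines) {
  # group 1 = publication date, group 2 = package name
  pattern <- "([0-9]{4}-[0-9]{2}-[0-9]{2}) </td> <td> <a href=\"[^\"]*/packages/([A-Za-z0-9.]+)/"
  hits <- regmatches(lines, regexec(pattern, lines))
  # a matching line yields 3 strings: full match plus the two groups
  hits <- hits[lengths(hits) == 3]
  data.frame(
    name = vapply(hits, `[`, character(1), 3),
    date = as.Date(vapply(hits, `[`, character(1), 2)),
    stringsAsFactors = FALSE
  )
}

# Assumed layout of one table row, plus a line that should be ignored
sample_lines <- c(
  "<tr> <td> 2012-01-04 </td> <td> <a href=\"../../web/packages/ggplot2/index.html\">ggplot2</a> </td> </tr>",
  "<table>"
)
parsed <- parse_by_date_lines(sample_lines)
```

This is just as fragile as the original with respect to layout changes; it only removes the second and third regex passes.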

Occultism answered 4/1, 2012 at 6:20 Comment(0)

So here is a solution that uses the dir listing from the FTP. It is a little tricky because the FTP gives the date in linux format, with either a timestamp or a year. Other than that it does its job. I'm still not convinced this is reliable though. If packages are copied over to another server, all timestamps might be reset.

recent.packages.ftp <- function(){
    #work in the temp dir, and restore the working directory on exit
    oldwd <- setwd(tempdir())
    on.exit(setwd(oldwd))
    download.file("ftp://cran.r-project.org/pub/R/src/contrib/", destfile=tempfile(), method="wget", extra="--no-htmlify");

    #because of --no-htmlify the destfile argument does not work
    datastring <- readLines(".listing");
    unlink(".listing");

    myexpr1 <- "(?<date>[A-Z][a-z]{2} [0-9]{2} [0-9]{2}:[0-9]{2}) (?<name>[a-zA-Z0-9\\.]{2,})_(?<version>[0-9\\.-]*).tar.gz$"
    matches <- gregexpr(myexpr1, datastring, perl=TRUE);
    packagelines <- as.logical(sapply(regmatches(datastring, matches), length));

    #subset proper lines
    matches <- matches[packagelines];
    datastring <- datastring[packagelines];
    N <- length(matches)

    #from the ?regexpr manual       
    parse.one <- function(res, result) {
        m <- do.call(rbind, lapply(seq_along(res), function(i) {
            if(result[i] == -1) return("")
            st <- attr(result, "capture.start")[i, ]
            substring(res[i], st, st + attr(result, "capture.length")[i, ] - 1)
        }))
        colnames(m) <- attr(result, "capture.names")
        m
    }

    #parse all records
    mydf <- data.frame(date=rep(NA, N), name=rep(NA, N), version=rep(NA,N))
    for(i in 1:N){
        mydf[i,] <- parse.one(datastring[i], matches[[i]]);
    }
    row.names(mydf) <- NULL;
    #convert dates
    mydf$date <- strptime(mydf$date, format="%b %d %H:%M");

    #A linux-style listing only shows a time of day for files less than six months old (older files show the year instead).
    #The timestamp has no year, so strptime assumes the current year.
    #Therefore for dates that are in the future, we subtract a year. We can use some margin for timezones.
    infuture <- (mydf$date > Sys.time() + 31*24*60*60);
    mydf$date[infuture] <- mydf$date[infuture] - 365*24*60*60;

    #sort and return
    mydf <- mydf[order(mydf$date),];
    row.names(mydf) <- NULL;
    return(mydf);
}
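
The year guessing is the fiddly part, so here it is in isolation as a sketch. resolve_listing_date is a hypothetical helper, and the timestamps are made-up examples. Instead of subtracting 365 days as above, it re-parses the stamp with the previous year, which is a small change meant to sidestep leap years. Note that %b depends on the locale, so this assumes English month abbreviations.

```r
# Hypothetical helper: turn a year-less listing stamp such as "Dec 20 10:00"
# into a full date-time, relative to a reference time `now`
resolve_listing_date <- function(stamp, now = Sys.time()) {
  fmt <- "%Y %b %d %H:%M"
  year <- as.integer(format(now, "%Y", tz = "UTC"))
  # first guess: the stamp belongs to the current year
  parsed <- as.POSIXct(strptime(paste(year, stamp), format = fmt, tz = "UTC"))
  # fallback: the same stamp in the previous year
  previous <- as.POSIXct(strptime(paste(year - 1L, stamp), format = fmt, tz = "UTC"))
  # anything more than 31 days ahead of `now` must be from last year
  in_future <- !is.na(parsed) & parsed > now + 31 * 24 * 60 * 60
  parsed[in_future] <- previous[in_future]
  parsed
}

# Made-up stamps, resolved as if today were 2012-01-04
now <- as.POSIXct("2012-01-04 12:00:00", tz = "UTC")
resolved <- resolve_listing_date(c("Jan 03 09:30", "Dec 20 10:00"), now = now)
```

The 31 day margin is the same one used in the function above; it is there to absorb timezone differences and clock skew, not because a month has any special meaning.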

Occultism answered 4/1, 2012 at 23:55 Comment(0)

You could process the page http://cran.r-project.org/src/contrib/, and split the fields by whitespace in order to obtain the fully specified package source filename, which includes the version # and a .gz suffix.

There are a few other items in the list that are not package files, such as the .rds files, various subdirectories, and so on.

Barring changes in how the directory structure is presented or the locations of the files, I can't think of anything more authoritative than this.
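
Once the file names are extracted from that listing, splitting them into package name and version and dropping the non-package entries could look roughly like this. It is a sketch: split_source_filenames is a hypothetical helper, and the file names are examples I made up to cover the cases mentioned above.

```r
# Hypothetical helper: keep only <name>_<version>.tar.gz entries and split them
split_source_filenames <- function(files) {
  is_pkg <- grepl("^[A-Za-z0-9.]+_[0-9.-]+\\.tar\\.gz$", files)
  base <- sub("\\.tar\\.gz$", "", files[is_pkg])
  data.frame(
    name = sub("_.*$", "", base),       # everything before the underscore
    version = sub("^[^_]*_", "", base), # everything after it
    stringsAsFactors = FALSE
  )
}

# Example entries: two packages, an .rds file and a subdirectory
files <- c("ggplot2_0.9.0.tar.gz", "PACKAGES.rds", "Archive/", "plyr_1.7.1.tar.gz")
pkgs <- split_source_filenames(files)
```

This only gives names and versions; the dates would still have to come from the modification times shown in the listing.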

Ecker answered 4/1, 2012 at 23:26 Comment(0)
