List files on HTTP/FTP server in R

Asked 25/8, 2015 at 20:37 Answered 23/6, 2020 at 18:12

Solved regex r html-parsing text-parsing

I'm trying to get a list of files on an HTTP/FTP server from R, so that in the next step I can download them (or download only the files that meet my criteria).

I know that I could use an external program in the web browser (a download manager) that lets me select files to download from the current web page or FTP listing. However, I want everything scripted so that it is easier to reproduce.

I thought about calling Python from R (since it seems much easier), but I tried to do it entirely in R.

I wrote the following lines:

require("RCurl")
result <- getURL("http://server", verbose = TRUE, ftp.use.epsv = TRUE, dirlistonly = TRUE)

The result variable is of type character:

typeof(result)
[1] "character"

Sample content is as follows:

Interesting file_20150629.txt20 Aug-2015 09:31 289K\nInteresting file_20150630.txt20 Aug-2015 09:31 293K\nInteresting file_20150701.txt20 Aug-2015 09:31 301K\nInteresting file_20150702.txt20 Aug-2015 09:31 304K\nInteresting file_20150703.txt20 Aug-2015 09:31 301K\nInteresting file_20150704.txt20 Aug-2015 09:31 300K\nInteresting file_20150705.txt20 Aug-2015 09:31 300K\nInteresting file_20150706.txt20 Aug-2015 09:31 305K\nInteresting file_20150707.txt20 Aug-2015 09:31 305K\nInteresting file_20150708.txt20 Aug-2015 09:31 301K\nInteresting file_20150709.txt20 Aug-2015 09:31 294K\n
\n\n\n"

So now I'm trying to parse the content of result:

myFiles <- strsplit(result,'<a[^>]* href=\\"([^"]*.txt)\\"')[[1]]

hoping to get the .txt file name (since it is inside the parentheses: ()). But it returns:

">Interesting file_20150706.txt</a></td><td align=\"right\">20 Aug-2015 09:31  </td><td align=\"right\">305K</td></tr>\n<tr><td valign=\"top\"><img src=\"/apacheIcons/text.gif\" alt=\"[TXT]\"></td><td>

instead.

What is wrong? (I tested my expression on https://regex101.com/.) Or, perhaps the more appropriate question: is there a much easier way to obtain a list of files with a specific extension on the server in R?

Metric answered 25/8, 2015 at 20:37 Comment(0)

You really shouldn't use regex on HTML. The XML package makes this pretty simple. We can use getHTMLLinks() to gather any links we want.

library(XML)
getHTMLLinks(result)
#  [1] "Interesting file_20150629.txt"   "Interesting file_20150630.txt"  
#  [3] "Interesting file_20150701.txt"   "Interesting file_20150702.txt"  
#  [5] "Interesting file_20150703.txt"   "Interesting file_20150704.txt"  
#  [7] "Interesting file_20150705.txt"   "Interesting file_20150706.txt"  
#  [9] "Interesting file_20150707.txt"   "Interesting file_20150708.txt"  
# [11] "Interesting file_20150709.txt"

That returns the href attribute of every <a> element (the default XPath query is //a/@href). To grab only the ones that contain .txt, you can pass a different XPath query instead of the default.

getHTMLLinks(result, xpQuery = "//a/@href[contains(., '.txt')]")

Or even more precisely, to get those files that end with .txt, you can do

getHTMLLinks(
    result,
    xpQuery = "//a/@href['.txt'=substring(., string-length(.) - 3)]"
)
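As for why the strsplit() call in the question returned markup instead of file names: strsplit() treats the pattern as a delimiter and returns the text between the matches, so whatever the capture group matched is discarded. If a regex is still wanted, a minimal base-R sketch is to extract the matches with gregexpr() and regmatches() instead. The html string below is a made-up three-row sample, not the real server output, and the pattern is a simplified assumption that will not cope with arbitrary HTML:

```r
# Hypothetical sample resembling a directory-listing page (not real server output).
html <- paste0(
  '<tr><td><a href="Interesting%20file_20150629.txt">Interesting file_20150629.txt</a></td></tr>\n',
  '<tr><td><a href="notes.pdf">notes.pdf</a></td></tr>\n',
  '<tr><td><a href="Interesting%20file_20150630.txt">Interesting file_20150630.txt</a></td></tr>\n'
)

# strsplit() uses the pattern as a separator, so it returns what lies BETWEEN matches.
pieces <- strsplit(html, '<a[^>]* href="([^"]*\\.txt)"')[[1]]

# gregexpr() + regmatches() return the matched text itself.
pattern <- '<a[^>]* href="[^"]*\\.txt"'
tags <- regmatches(html, gregexpr(pattern, html))[[1]]

# Strip everything around the quoted href value to keep only the link target.
hrefs <- sub('^.*href="([^"]*)"$', "\\1", tags)
print(hrefs)
```

Here pieces holds the leftover markup (the same kind of output the question shows), while hrefs holds only the link targets ending in .txt. The XPath approach above remains the more robust choice for real pages.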

Frontward answered 25/8, 2015 at 20:50 Comment(2)

Great response! I still don't know what was wrong in my regex, but your solution works perfectly!!! – Metric 26/8, 2015 at 6:35

Yeah, nice code. This helped me a lot, especially the last two sections. – Obeded 27/1, 2018 at 23:22

An alternative that needs no additional libraries is to set ftp.use.epsv = FALSE and crlf = TRUE. This will instruct libcurl to change \n's to \r\n's:

require("RCurl")
result <- getURL("http://server", verbose = TRUE, ftp.use.epsv = FALSE, dirlistonly = TRUE, crlf = TRUE)

Then build the individual file URLs using paste and strsplit. Note that sep = "" adds no separator, so the base URL must end with a slash:

result2 <- paste("http://server/", strsplit(result, "\r*\n")[[1]], sep = "")
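To see the splitting step in isolation, here is a minimal base-R sketch that runs without a server. The listing string, the README entry, and the base_url value are hypothetical stand-ins for what a name-only directory listing might return. Filtering on the .txt extension and percent-encoding the names with URLencode() are additions for illustration, not part of the answer above:

```r
# Hypothetical listing text, as a name-only directory listing might look.
listing <- "Interesting file_20150629.txt\r\nREADME\r\nInteresting file_20150630.txt\r\n"
base_url <- "http://server/"  # placeholder host, with a trailing slash

# One entry per line; tolerate both LF and CRLF line endings.
entries <- strsplit(listing, "\r*\n")[[1]]

# Keep only names ending in .txt.
txt_files <- entries[grepl("\\.txt$", entries)]

# Percent-encode spaces and other unsafe characters before building URLs.
encoded <- vapply(txt_files, utils::URLencode, character(1), USE.NAMES = FALSE)
file_urls <- paste0(base_url, encoded)
print(file_urls)
```

Each element of file_urls could then be handed to download.file(), assuming the server accepts the encoded names.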

Doloroso answered 23/6, 2020 at 18:12 Comment(0)
