I'm trying to get list of files on HTTP/FTP server from R!, so that in next step I will be able to download them (or select some of files which meet my criteria to download).
I know that it is possible to use external program in web browser (download manager) which will allow me to select files to download from current web page/ftp. However, I wish to have everything scripted, so that it will be easier for me to reproduce.
I thought about calling Python from R! (since it seems much easier), but I tried to do that entirely in R!
I wrote following lines
require("RCurl")
result <- getURL("http://server",verbose=TRUE,ftp.use.epsv=TRUE, dirlistonly = TRUE)
Result variable is character type:
typeof(result)
[1] "character"
Sample content is as follows:
Interesting file_20150629.txt20 Aug-2015 09:31 289K\nInteresting file_20150630.txt20 Aug-2015 09:31 293K\nInteresting file_20150701.txt20 Aug-2015 09:31 301K\nInteresting file_20150702.txt20 Aug-2015 09:31 304K\nInteresting file_20150703.txt20 Aug-2015 09:31 301K\nInteresting file_20150704.txt20 Aug-2015 09:31 300K\nInteresting file_20150705.txt20 Aug-2015 09:31 300K\nInteresting file_20150706.txt20 Aug-2015 09:31 305K\nInteresting file_20150707.txt20 Aug-2015 09:31 305K\nInteresting file_20150708.txt20 Aug-2015 09:31 301K\nInteresting file_20150709.txt20 Aug-2015 09:31 294K\n
\n\n\n"
So now, I'm trying to parse result content:
myFiles <- strsplit(result,'<a[^>]* href=\\"([^"]*.txt)\\"')[[1]]
hoping that I will match txt file (since it's in brackets: ()). but it matches:
">Interesting file_20150706.txt</a></td><td align=\"right\">20 Aug-2015 09:31 </td><td align=\"right\">305K</td></tr>\n<tr><td valign=\"top\"><img src=\"/apacheIcons/text.gif\" alt=\"[TXT]\"></td><td>
instead.
What is wrong (I tested my expression on https://regex101.com/) or (maybe this question is more appropriate) there is much easier way to obtain list of files with specific extension on the server in R! ?