How can I download link targets from a web site using Perl?

I just made a script that grabs links from a website and saves them to a text file.

Now I'm working on my regexes so it will pick out the links from that text file which contain php?dl= in the URL:

E.g.: www.example.com/site/admin/a_files.php?dl=33931

It's pretty much the address you get when you hover over the dl button on the site, which you can click to download or right-click and save.

I'm just wondering how to achieve this: downloading the content at each of those addresses (each one serves a *.txt file), all from within the script of course.
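For reference, a minimal sketch of the filtering step, assuming the first script saved one link per line to a file called links.txt (the filename and the printing at the end are just placeholders):

use strict;
use warnings;

# Read the saved links and keep only those whose URL contains php?dl=
open my $in, '<', 'links.txt' or die "Can't open links.txt: $!";
my @downloadLinks;
while (my $line = <$in>) {
    chomp $line;
    push @downloadLinks, $line if $line =~ /php\?dl=/;
}
close $in;

print "$_\n" for @downloadLinks;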

Lymphangitis answered 6/7, 2010 at 11:34 Comment(3)
What is the question here? You made a script and now want it to only download certain URLs? Are you looking for a regexp?Snob
Crawling in Perl - A Quick TutorialYell
I'm trying to figure out how you download the file associated with a URL. For example, on the website you click the 'dl' icon/button and your browser automatically downloads the file for you, i.e. example.com/site/admin/a_files.php?dl=33931 would download "file1.txt". I'm just wondering how you can download the file in Perl. The regexp part is not a problem. Or have I missed a function that can do all of this with ease hahaLymphangitis

Make WWW::Mechanize your new best friend.

Here's why:

  • It can identify links on a webpage that match a specific regex (/php\?dl=/ in this case)
  • It can follow those links through the follow_link method
  • It can get the targets of those links and save them to file

All this without needing to save your wanted links in an intermediate file! Life's sweet when you have the right tool for the job...


Example

use strict;
use warnings;
use WWW::Mechanize;

my $url  = 'http://www.example.com/';
my $mech = WWW::Mechanize->new();

$mech->get ( $url );

# Find every link whose URL matches php?dl=
my @linksOfInterest = $mech->find_all_links ( url_regex => qr/php\?dl=/ );

my $fileNumber = 1;

foreach my $link (@linksOfInterest) {

    # Fetch the link target and save it straight to a numbered file
    $mech->get ( $link, ':content_file' => "file".($fileNumber++).".txt" );
    $mech->back();
}
Psychologist answered 6/7, 2010 at 11:55 Comment(6)
Awesome! you stated all the things I have been looking for, for the past 2 hours lol. Thank you :DLymphangitis
This helped alot. Thank you very much :D. I have so much to learn still, thnx for pointing out this very helpful module :DLymphangitis
I see no reason in this example to do the ->back() and ->reload().Jenisejenkel
@Andy : I suppose it depends on the page in question. If it updates frequently, a reload() may be prudent.Psychologist
@Zaid: You're not doing anything with the reloaded page. @linksofInterest doesn't change.Jenisejenkel
@Andy : Good point. The ->reload() is useless for the example in question.Psychologist

You can download the file with LWP::UserAgent:

use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $response = $ua->get($url, ':content_file' => 'file.txt');

Or if you need a filehandle:

# Open an in-memory filehandle on the downloaded content
open my $fh, '<', $response->content_ref or die $!;
Fariss answered 6/7, 2010 at 11:56 Comment(1)
Or, just use `LWP::Simple::getstore($url, $file)`.Boart
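For completeness, a minimal sketch of that LWP::Simple::getstore suggestion (the URL and output filename here are just placeholders):

use strict;
use warnings;
use LWP::Simple;

# getstore() fetches the URL and writes the response body straight to the
# named file, returning the HTTP status code
my $status = getstore('http://www.example.com/site/admin/a_files.php?dl=33931', 'file.txt');
die "Download failed: $status" unless is_success($status);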

Old question, but when I'm writing quick scripts, I often use "wget" or "curl" and a pipe. This perhaps isn't portable across systems, but if I know my system has one or the other of these commands, it's generally good enough.

For example:

#! /usr/bin/env perl
use strict;
use warnings;

# Pipe the output of curl into a filehandle and print it line by line
open my $fp, "curl http://www.example.com/ |" or die "Can't run curl: $!";
while (<$fp>) {
  print;
}
close $fp;
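On systems that support it, the list form of a pipe open does the same thing without involving the shell, which avoids any quoting headaches (a sketch; the -s flag just silences curl's progress output):

#! /usr/bin/env perl
use strict;
use warnings;

# List-form pipe open: curl's arguments are passed directly, no shell involved
open my $fp, '-|', 'curl', '-s', 'http://www.example.com/'
  or die "Can't run curl: $!";
print while <$fp>;
close $fp;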
Chapatti answered 30/10, 2013 at 12:57 Comment(0)
