Get all the URLs using multi curl
Asked Answered
E

6

8

I'm working on an app that gets all the URLs from an array of sites and displays it in array form or JSON.

I can do it using for loop, the problem is the execution time when I tried 10 URLs it gives me an error saying exceeds maximum execution time.

Upon searching I found this multi curl

I also found this Fast PHP CURL Multiple Requests: Retrieve the content of multiple URLs using CURL. I tried to add my code but didn't work because I don't how to use the function.

Hope you help me.

Thanks.

This is my sample code.

<?php

$urls=array(
'http://site1.com/',
'http://site2.com/',
'http://site3.com/');


$mh = curl_multi_init();
foreach ($urls as $i => $url) {

        $urlContent = file_get_contents($url);

        $dom = new DOMDocument();
        @$dom->loadHTML($urlContent);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        for($i = 0; $i < $hrefs->length; $i++){
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');
            $url = filter_var($url, FILTER_SANITIZE_URL);
            // validate url
            if(!filter_var($url, FILTER_VALIDATE_URL) === false){
                echo '<a href="'.$url.'">'.$url.'</a><br />';
            }
        }

        $conn[$i]=curl_init($url);
        $fp[$i]=fopen ($g, "w");
        curl_setopt ($conn[$i], CURLOPT_FILE, $fp[$i]);
        curl_setopt ($conn[$i], CURLOPT_HEADER ,0);
        curl_setopt($conn[$i],CURLOPT_CONNECTTIMEOUT,60);
        curl_multi_add_handle ($mh,$conn[$i]);
}
do {
    $n=curl_multi_exec($mh,$active);
}
while ($active);
foreach ($urls as $i => $url) {
    curl_multi_remove_handle($mh,$conn[$i]);
    curl_close($conn[$i]);
    fclose ($fp[$i]);
}
curl_multi_close($mh);
?>
Emphatic answered 2/1, 2019 at 8:49 Comment(7)
"I tried to add my code but didn't work" What does "didn't work" mean? White page? Getting the wrong urls? Any error on screen? Or in your logs?Rajah
Your problem most likely is not with the curl but with your do/while running (by some reason) forever... Try debugging that possibility.Pliner
@kerbholz no sir, I cant use the the function properlyEmphatic
If you make it faster it'll hit execution time limit after, say, 50 urls.Chammy
What more have to you tried in these 4 days please share?Cosme
@Cosme I post a codeEmphatic
I will strongly suggest you to use some library like docs.guzzlephp.org/en/stable . It has many examples for concurrent requests etc.Coley
C
8

Here is a function that I put together that will properly utilize the curl_multi_init() function. It is more or less the same function that you will find on PHP.net with some minor tweaks. I have had great success with this.

function multi_thread_curl($urlArray, $optionArray, $nThreads) {

  //Group your urls into groups/threads.
  $curlArray = array_chunk($urlArray, $nThreads, $preserve_keys = true);

  //Iterate through each batch of urls.
  $ch = 'ch_';
  foreach($curlArray as $threads) {      

      //Create your cURL resources.
      foreach($threads as $thread=>$value) {

      ${$ch . $thread} = curl_init();

        curl_setopt_array(${$ch . $thread}, $optionArray); //Set your main curl options.
        curl_setopt(${$ch . $thread}, CURLOPT_URL, $value); //Set url.

        }

      //Create the multiple cURL handler.
      $mh = curl_multi_init();

      //Add the handles.
      foreach($threads as $thread=>$value) {

      curl_multi_add_handle($mh, ${$ch . $thread});

      }

      $active = null;

      //execute the handles.
      do {

      $mrc = curl_multi_exec($mh, $active);

      } while ($mrc == CURLM_CALL_MULTI_PERFORM);

      while ($active && $mrc == CURLM_OK) {

          if (curl_multi_select($mh) != -1) {
              do {

                  $mrc = curl_multi_exec($mh, $active);

              } while ($mrc == CURLM_CALL_MULTI_PERFORM);
          }

      }

      //Get your data and close the handles.
      foreach($threads as $thread=>$value) {

      $results[$thread] = curl_multi_getcontent(${$ch . $thread});

      curl_multi_remove_handle($mh, ${$ch . $thread});

      }

      //Close the multi handle exec.
      curl_multi_close($mh);

  }


  return $results;

} 



//Add whatever options here. The CURLOPT_URL is left out intentionally.
//It will be added in later from the url array.
$optionArray = array(

  CURLOPT_USERAGENT        => 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0',//Pick your user agent.
  CURLOPT_RETURNTRANSFER   => TRUE,
  CURLOPT_TIMEOUT          => 10

);

//Create an array of your urls.
$urlArray = array(

    'http://site1.com/',
    'http://site2.com/',
    'http://site3.com/'

);

//Play around with this number and see what works best.
//This is how many urls it will try to do at one time.
$nThreads = 20;

//To use run the function.
$results = multi_thread_curl($urlArray, $optionArray, $nThreads);

Once this is complete you will have an array containing all of the html from your list of websites. It is at this point where I would loop through them and pull out all of the urls.

Like so:

foreach($results as $page){

  $dom = new DOMDocument();
  @$dom->loadHTML($page);
  $xpath = new DOMXPath($dom);
  $hrefs = $xpath->evaluate("/html/body//a");

  for($i = 0; $i < $hrefs->length; $i++){

    $href = $hrefs->item($i);
    $url = $href->getAttribute('href');
    $url = filter_var($url, FILTER_SANITIZE_URL);
    // validate url
    if(!filter_var($url, FILTER_VALIDATE_URL) === false){
    echo '<a href="'.$url.'">'.$url.'</a><br />';
    }

  }

}

It is also worth keeping in the back of you head the ability to increase the run time of your script.

If your using a hosting service you may be restricted to something in the ball park of two minutes regardless of what you set your max execution time to. Just food for thought.

This is done by:

ini_set('max_execution_time', 120);

You can always try more time but you'll never know till you time it.

Hope it helps.

Collegian answered 9/1, 2019 at 7:24 Comment(7)
Yup no problem, let me know if you have any issues.Collegian
I got a problem Warning: Use of undefined constant ch - assumed 'ch' (this will throw an Error in a future version of PHP) this 3 lines ${ch . $thread} = curl_init(); curl_setopt_array(${ch . $thread}, $optionArray); //Set your main curl options. curl_setopt(${ch . $thread}, CURLOPT_URL, $value); //Set url.Emphatic
Thank you for pointing that out. I have addressed the issue. Please check updated answer.Collegian
I made some changes to the updated code. Use the current code. Sorry for the rapid changes.Collegian
thanks for help, sorry too, I'm the one who's late to reply.Emphatic
Yup no problem, let me know if helps you.Collegian
Let us continue this discussion in chat.Collegian
B
1

You may be using an endless loop - if not, you can can increase the maximum execution time in php.ini or with:

ini_set('max_execution_time', 600); // 600 seconds = 10 minutes

Belsky answered 6/1, 2019 at 9:20 Comment(1)
There is a design flaw if you have to increase execution time to 10 minutes.Cosme
E
1

This is what I achieved after working on the code, It worked but not sure if this is the best answer. Kindly check my code.

<?php

$array = array('https://www.google.com/','https://www.google.com/','https://www.google.com/','https://www.google.com/','https://www.google.com/','https://www.google.com/','https://www.google.com/','https://www.google.com/','https://www.google.com/','https://www.google.com/');

print_r (getUrls($array));

function getUrls($array) { 

  $arrUrl = array();
  $arrList = array();
  $url_count = count($array);
  $curl_array = array();
  $ch = curl_multi_init();

  foreach($array as $count => $url) {
      $curl_array[$count] = curl_init($url);
      curl_setopt($curl_array[$count], CURLOPT_RETURNTRANSFER, true);
      curl_multi_add_handle($ch, $curl_array[$count]);
  }


  do{
    curl_multi_exec($ch, $exec);
    curl_multi_select($ch,1);
  }while($exec);


  foreach($array as $count => $url) {


      $arrUrl = array();

      $urlContent = curl_multi_getcontent($curl_array[$count]);

      $dom = new DOMDocument();
        @$dom->loadHTML($urlContent);
        $xpath = new DOMXPath($dom);
        $hrefs = $xpath->evaluate("/html/body//a");

        for($i = 0; $i < $hrefs->length; $i++){
            $href = $hrefs->item($i);
            $url = $href->getAttribute('href');
            $url = filter_var($url, FILTER_SANITIZE_URL);
            // validate url
            if (filter_var($url, FILTER_VALIDATE_URL) !== false) {

              if (strpos($url, 'mailto') === false) {


                      $arrUrl[] = $url;

              }
          }
        }

        array_push($arrList, array_unique($arrUrl));


  }

  foreach($array as $count => $url) {
      curl_multi_remove_handle($ch, $curl_array[$count]);
  }

  curl_multi_close($ch); 

  foreach($array as $count => $url) {
      curl_close($curl_array[$count]);
  }

  return $arrList;

}
Emphatic answered 8/1, 2019 at 2:49 Comment(0)
E
0

First of all i know that OP does asking about multi_curl but i just adding another alternative if the OP may changes his mind. What i do here is splitting the urls into many request so the cpu usage will not that big. If the OP still wants use multi_curl maybe the PHP master here could gives more better solution.

<?php
$num = preg_replace('/[^0-9]/','',$_GET['num']);
$num = empty($num) ? 0 : $num;

$urls=array(
'http://site1.com/',
'http://site2.com/',
'http://site3.com/');

if(!empty($urls[$num]))
{
    /* do your single curl stuff here and store its data here*/

    /*now redirect to the next url. dont use header location redirect, it would ends up too many redirect error in browser*/
    $next_url = !empty($urls[$num+1]) ? $urls[$num+1] : 'done';

    echo '<html>
    <head>
    <meta http-equiv="refresh" content="0;url=http://yourcodedomain.com/yourpath/yourcode.php?num='.$next_url.'" />
    </head>
    <body>
    <p>Fetching: '.$num+1.' / '.count($urls).'</p>
    </body> 
    </html>';
}
elseif($_GET['num'] == 'done')
{
    /*if all sites have been fetched, do something here*/
}
else
{
    /*throws exception here*/
}

?>
Extrovert answered 8/1, 2019 at 1:46 Comment(0)
B
0

i had same issue then i solved using usleep() this try and let me know

do {
    usleep(10000);
    $n=curl_multi_exec($mh,$active);
}
Barbet answered 10/1, 2019 at 8:26 Comment(0)
P
-1

Try this simplified version:

$urls = [
    'https://en.wikipedia.org/',
    'https://secure.php.net/',
];

set_time_limit(0);
libxml_use_internal_errors(true);

$hrefs = [];
foreach ($urls as $url) {
    $html = file_get_contents($url);
    $doc = new DOMDocument;
    $doc->loadHTML($html);

    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = filter_var($link->getAttribute('href'), FILTER_SANITIZE_URL);
        if (filter_var($href, FILTER_VALIDATE_URL)) {
            echo "<a href='{$href}'>{$href}</a><br/>\n";
        }
    }
}
Privily answered 6/1, 2019 at 12:13 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.