PHP Simple HTML DOM Parser returning false on valid url
I'm trying the following:

$url = 'https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html';

$ta_html = file_get_html($url);
var_dump($ta_html);

It returns false. The same code works and correctly gets the HTML for:

$url = 'https://www.tripadvisor.es/Hotels-g294316-Lima_Lima_Region-Hotels.html#ACCOM_OVERVIEW';

My first thought was that it had a redirect, but I checked the headers with curl and it's 200 OK, and it seemed the same in both cases. What can be happening? How can it be solved?

This seems to be a duplicate of this problem: Simple HTML DOM returning false, which is also unanswered.

Cleome answered 22/4, 2017 at 17:0 Comment(2)
What are you trying to scrape from that page? I prefer to use the DOMDocument PHP built-in class.Bisitun
I'm just experimenting with Simple HTML DOM Parser. But I'd like to know why, on the same website, one of two seemingly identical URLs works and the other doesn't.Cleome
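The DOMDocument class mentioned in the comment above is built into PHP and has no file-size cap, so it can serve as an alternative here. A minimal sketch, using a literal HTML snippet to stand in for the fetched TripAdvisor page:

```php
<?php
// Sketch of the DOMDocument approach suggested in the comment above.
// A literal snippet stands in for the downloaded TripAdvisor HTML.
$str = '<html><body><a href="/hotel1">Hotel One</a><a href="/hotel2">Hotel Two</a></body></html>';

$doc = new DOMDocument();
// Suppress warnings about imperfect real-world markup.
@$doc->loadHTML($str);

$links = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $links[] = $a->getAttribute('href');
}

print_r($links); // the two hrefs from the snippet
```

For the real page you would fetch the HTML first (e.g. with file_get_contents() or cURL) and pass that string to loadHTML().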
15

It looks like HTML DOM Parser is failing because the HTML file size is greater than the library's maximum file size. When you call file_get_html(), it performs a file-size check based on its MAX_FILE_SIZE constant. So before calling any HTML DOM Parser methods, increase the maximum file size used by the library by calling:

define('MAX_FILE_SIZE', 1200000); // or larger if needed, default is 600000
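The order matters: the define() has to run before simple_html_dom.php is loaded, because newer releases of the library set the default with a guarded define, so an earlier definition wins. A self-contained sketch of that mechanism (the library line is paraphrased, not copied):

```php
<?php
// Your override must run before simple_html_dom.php is included.
define('MAX_FILE_SIZE', 1200000);

// Inside the library the default is set with a guard roughly like this
// (paraphrased), so the earlier define() above is the one that sticks:
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);

echo MAX_FILE_SIZE . "\n"; // 1200000, not the 600000 default
```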

Also, as you found out, you can work around the file-size check by doing this:

$html = new simple_html_dom();
$html->load($str);
Understandable answered 3/9, 2018 at 14:33 Comment(0)
2

So I found a workaround by doing this:

$base = $url;
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, false); // note: disables SSL certificate verification
curl_setopt($curl, CURLOPT_HEADER, false);         // don't include headers in the output
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the body as a string
$str = curl_exec($curl);
curl_close($curl);

$html = new simple_html_dom();
$html->load($str);

Truth be told, I don't know exactly why this works or what the original problem was, and I would appreciate it if anyone could point that out.

Cleome answered 22/4, 2017 at 20:42 Comment(0)
0

It looks like this is happening because of this check in the file_get_html() function in simple_html_dom.php:

if (empty($contents) || strlen($contents) > MAX_FILE_SIZE)
{
    return false;
}

It might be that the length of the content is greater than MAX_FILE_SIZE.
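You can reproduce that predicate on any fetched string to confirm the diagnosis. A sketch, with the library's 600000-byte default cap hard-coded as an assumption and a repeated-character string standing in for the downloaded page:

```php
<?php
// Re-create the library's guard to see why file_get_html() bails out.
// 600000 is the library's default MAX_FILE_SIZE (an assumption here);
// the test strings stand in for the downloaded TripAdvisor pages.
const DEFAULT_MAX_FILE_SIZE = 600000;

function would_be_rejected(string $contents, int $cap = DEFAULT_MAX_FILE_SIZE): bool
{
    // Same condition file_get_html() checks before parsing.
    return empty($contents) || strlen($contents) > $cap;
}

$small = str_repeat('x', 1000);    // well under the cap: parsed normally
$large = str_repeat('x', 700000);  // over the default cap: returns false

var_dump(would_be_rejected($small)); // false
var_dump(would_be_rejected($large)); // true
```

Running the same strlen() comparison on the body fetched from the Madrid URL would show whether the size cap is really what trips the check.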

Roadbed answered 15/11, 2017 at 12:12 Comment(0)
0

Hope this helps:

$base = $url;
$curl = curl_init();
curl_setopt($curl, CURLOPT_SSL_VERIFYPEER, FALSE);
curl_setopt($curl, CURLOPT_HEADER, false);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_URL, $base);
curl_setopt($curl, CURLOPT_REFERER, $base);
curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
$str = curl_exec($curl);
curl_close($curl);

$html = new simple_html_dom();
$html->load($str);
Musgrave answered 24/1, 2020 at 22:49 Comment(0)
-1

Use file_get_contents() instead; it works for me.

$url = "https://www.tripadvisor.es/Hotels-g187514-Madrid-Hotels.html";
file_put_contents("hello.html", file_get_contents($url));

$html = file_get_html("hello.html");
Griffy answered 22/4, 2017 at 17:2 Comment(4)
The OP wrote that it works for another url. This isn't the answer, nor the correct solutionLundy
The url I used in the example, works, don't talk shit when you didn't test it.Griffy
This works but I need to use file_get_html from simplehtmldom.sourceforge.net Not sure if my question is not well writtenCleome
Check my answer againGriffy
