Trouble getting source code from a webpage

I've written a PHP script to get the HTML content (the source code) of a webpage, but it doesn't do what I want. When I execute the script, it displays the rendered page itself rather than its markup. How can I get the HTML elements or the source code?

This is the script:

<?php
include "simple_html_dom.php";
function get_source($url)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
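    // Return the response as a string instead of printing it directly.
    // (CURLOPT_BINARYTRANSFER is omitted: it has had no effect since PHP 5.1.3.)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);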
    $htmlContent = curl_exec($ch);
    curl_close($ch);
    $dom = new simple_html_dom();
    $dom->load($htmlContent);
    return $dom;
}
$scraped_page = get_source("https://stackoverflow.com/questions/tagged/web-scraping");
echo $scraped_page;
?>

This is what I currently get:

[image 1]

My expected output is something like this:

[image 2]

Btw, echoing $htmlContent also gives me what you can see in image 1.

Fixate answered 18/9, 2018 at 18:55 Comment(15)
The last line of code, echo $scraped_page;, displays the document you've loaded, so you should be able to use this to extract the data instead. - Enterostomy
Yes, I know, but how can I get the source code then? Thanks for your comment, @Nigel Ren. - Fixate
That is the source code; I'm not sure what you are expecting to get. If you want to display the source, either put echo '<pre>'; before and echo '</pre>'; after the echo, or view the source in your browser. - Enterostomy
Read the docs for the library you're using. The reason you're getting what you're getting is that the object you're echoing has a __toString() function that just returns the bare source. If you want to do something else, you need to do something else. - Essieessinger
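To illustrate the __toString() point in the comment above, here is a minimal sketch. ParsedPage is a hypothetical stand-in class, not the simple_html_dom class itself; it only shows why echoing an object can end up printing raw markup.

```php
<?php
// Hypothetical stand-in for a parser object (NOT simple_html_dom itself).
class ParsedPage
{
    private $source;

    public function __construct($source)
    {
        $this->source = $source;
    }

    // PHP calls this automatically whenever the object is used as a string,
    // for example when it is passed to echo.
    public function __toString()
    {
        return $this->source;
    }
}

$page = new ParsedPage('<p>hi</p>');

// echo converts the object to a string via __toString(), so the markup is
// sent as-is; a browser receiving it will render it as HTML.
echo $page;
```

So echoing the parsed object and echoing the fetched string can produce the same output, which matches what the question reports for $htmlContent.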
I never asked why I don't get the source code using my script above; rather, I asked how I can get it, meaning which way. The script above is just a placeholder to let you know that I tried myself before posting. Thanks. - Fixate
Please give an example of the desired output. - Ahmed
Possible duplicate of PHP Parse HTML code - Enterostomy
What we see when we inspect an element or click the View page source button. - Fixate
This is the most basic thing that other languages provide in the first place. However, the Possible duplicate flag is wrongly applied, since the question there is totally different from what I've asked here. Thanks anyway. - Fixate
Is echo $scraped_page not showing what you expected? What is it showing? What did you expect? If the curl request succeeded, it should be showing you some HTML. If it isn't, you probably need to find out why the request failed, or what else went wrong with your script. "Didn't succeed" as a description of your problem doesn't really give us much to go on. What do you mean by "opens the page itself"? Which page? Opens how, exactly? You're just echoing the result of the curl request, that's all. We would really like to help, but we need you to be more specific about your problem. Thank you. - Khalilahkhalin
It strikes me that if you want the raw HTML returned by the curl request, you should echo $htmlContent rather than $dom, which is likely to be an object. - Khalilahkhalin
Please check out the edit, @ADyson. - Fixate
OK, thanks. I guess that's because you are echoing it into an existing HTML document, so the browser treats it like any other HTML that forms part of the page, i.e. it parses it and renders it. I didn't know whether you were executing this from the command line, echoing it into a textbox, or something else. Now we have some context. If you want to see the raw HTML in this context, you need to HTML-encode it so that the browser sees it as text rather than as HTML to be interpreted and rendered. - Khalilahkhalin
There are potentially a couple of different ways to do that. See google.co.uk/… - Khalilahkhalin
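One way to do the HTML-encoding described in the comments above is PHP's built-in htmlspecialchars(). The following is a minimal sketch, assuming the fetched markup is UTF-8. source_as_text() is a hypothetical helper name, and the $htmlContent value here is a short stand-in for the string returned by curl_exec() in the question.

```php
<?php
// Hypothetical helper: escape markup so that a browser displays it as text
// instead of parsing and rendering it.
function source_as_text($html)
{
    // ENT_QUOTES escapes both double and single quotes; ENT_SUBSTITUTE
    // replaces invalid byte sequences instead of returning an empty string.
    $escaped = htmlspecialchars($html, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // <pre> only preserves whitespace and line breaks; the escaping above
    // is what stops the browser from interpreting the tags.
    return '<pre>' . $escaped . '</pre>';
}

// Stand-in for the $htmlContent string returned by curl_exec().
$htmlContent = '<h1 class="title">Hello & welcome</h1>';

echo source_as_text($htmlContent);
// prints: <pre>&lt;h1 class=&quot;title&quot;&gt;Hello &amp; welcome&lt;/h1&gt;</pre>
```

Another option, if the script outputs nothing else, is to send a Content-Type: text/plain header with header() before echoing, so that the browser shows the response body without interpreting it as HTML.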
Can you please be clearer about the expected output, e.g. by providing an example of the desired output in text form, not as an image? - Ahmed