Trouble getting the name of a product from a webpage

I've written a script in PHP to scrape the title of a product located in the top right corner of a webpage. The title is visible as Gucci.

When I execute the script below, I get the error Notice: Trying to get property 'plaintext' of non-object in C:\xampp\htdocs\runcode\testfile.php on line 16.

How can I get only the name Gucci from that webpage?

Link to the url

I've written so far:

<?php
include "simple_html_dom.php";
$link = "https://www.farfetch.com//bd/shopping/men/gucci-rhyton-web-print-leather-sneaker-item-12964878.aspx"; 

function get_content($url)
    {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('User-Agent: Mozilla/5.0',));
        curl_setopt($ch, CURLOPT_BINARYTRANSFER, 1);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        $htmlContent = curl_exec($ch);
        curl_close($ch);
        $dom = new simple_html_dom();
        $dom->load($htmlContent);
        $itemTitle = $dom->find('#bannerComponents-Container [itemprop="name"]', 0)->plaintext;
        echo "{$itemTitle}";
    }
get_content($link);
?>

Btw, the selector I've used within the script is flawless.

To clear up the confusion, I've copied a chunk of HTML from the page source. It is neither generated dynamically nor encrypted by JavaScript, so I don't see any reason why curl shouldn't be able to handle it:

<div class="cdb2b6" id="bannerComponents-Container">
    <p class="_41db0e _527bd9 eda00d" data-tstid="merchandiseTag">New Season</p>
    <div class="_1c3e57">
        <h1 class="_61cb2e" itemProp="brand" itemscope="" itemType="http://schema.org/Brand">
            <a href="/bd/shopping/men/gucci/items.aspx" class="fd9e8e e484bf _4a941d f140b0" data-trk="pp_infobrd" data-tstid="cardInfo-title" itemProp="url" aria-label="Gucci">
                <span itemProp="name">Gucci</span>
            </a>
        </h1>
    </div>
</div>

Postscript: It's rather pathetic that I had to show a real-life example from another language to prove that the name Gucci is not dynamically generated, as a few comments and an answer have already claimed it is.

The following script is written in Python (using the requests module, which can't handle dynamic content):

import requests
from bs4 import BeautifulSoup

url = "https://www.farfetch.com//bd/shopping/men/gucci-rhyton-web-print-leather-sneaker-item-12964878.aspx"

with requests.Session() as s:
    s.headers["User-Agent"] = "Mozilla/5.0"
    res = s.get(url)
    soup = BeautifulSoup(res.text,"lxml")
    item = soup.select_one('#bannerComponents-Container [itemprop="name"]').text
    print(item)

Output it produces:

Gucci

Now it's clear that the content I'm looking for is static.

Please see the image below to identify the title, which I've marked with a pencil.

[image: screenshot of the product page with the title marked]

Irrespirable answered 19/9, 2018 at 13:13 Comment(4)
Does $itemTitle = $dom->find('#bannerComponents-Container [itemprop="name"]', 0); return an object? - Fourinhand
What is the structure of the input HTML? Please include that in the question. - Tade
Try [itemProp=name] (capital P for some reason). - Preter
Please check out the edit. I've added some material to make things clearer. - Irrespirable

The main difference between your successful Python script and your PHP script is the use of a session. Your PHP script doesn't use cookies, and that triggers a different response from the server.

We have two options:

  1. Change the selector. As mentioned in Mark's answer, the item is still in the HTML, but in a different tag. We could get it with this selector:

    'a[itemprop="brand"]'
    
  2. Use cookies. We can get the same response as your Python script if we use CURLOPT_COOKIESESSION and a temporary file to write/read the cookies.

    function get_content($url) {
        $cookieFileh = tmpfile();
        $cookieFile=stream_get_meta_data($cookieFileh)['uri'];
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0');
        curl_setopt($ch, CURLOPT_COOKIESESSION, true);
        curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
        curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); 
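        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // insecure: disables certificate checks, see the note below
        curl_setopt($ch, CURLOPT_ENCODING, "gzip");
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_exec($ch); // first request: only collects the cookies
        $htmlContent = curl_exec($ch); // second request: sent with the cookies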
        curl_close($ch);
        fclose($cookieFileh); // thanks to tmpfile(), this also deletes the cookie file.
        $dom = new simple_html_dom();
        $dom->load($htmlContent);
        $itemTitle = $dom->find('#bannerComponents-Container [itemprop="name"]', 0)->plaintext;
        echo "{$itemTitle}";
    }
    
    $link = "https://www.farfetch.com/bd/shopping/men/gucci-rhyton-web-print-leather-sneaker-item-12964878.aspx"; 
    get_content($link);
    //Gucci
    

    This script performs two requests; the first request writes the cookies to file, the second reads and uses them.

    In this case the server returns a compressed response, so I've used CURLOPT_ENCODING to unzip the contents.

    Since you use headers only to set a user-agent, it's best to use the CURLOPT_USERAGENT option.

    I've set CURLOPT_SSL_VERIFYPEER to false because I haven't set a certificate, and without one cURL fails on HTTPS requests. If you can communicate with HTTPS sites, it's best not to use this option, for security reasons. If you can't, you could set a certificate with CURLOPT_CAINFO.
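
As a rough sketch of that safer alternative, the options could be collected in one array that keeps verification switched on and points cURL at a CA bundle. The helper name `build_curl_options` and the bundle path are hypothetical, and the sketch assumes a PEM bundle of trusted certificates has already been downloaded to that path.

```php
<?php
// Hypothetical helper: builds the option array for curl_setopt_array().
// $caBundle is an assumed path to a PEM bundle of trusted CA certificates.
function build_curl_options(string $url, string $cookieFile, string $caBundle): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_USERAGENT      => 'Mozilla/5.0',
        CURLOPT_COOKIEJAR      => $cookieFile,
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_ENCODING       => 'gzip',
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_SSL_VERIFYPEER => true,      // keep certificate verification on
        CURLOPT_SSL_VERIFYHOST => 2,         // also check the host name
        CURLOPT_CAINFO         => $caBundle, // trust anchors used for verification
    ];
}
```

It would be applied with `curl_setopt_array($ch, build_curl_options($url, $cookieFile, '/path/to/cacert.pem'))` in place of the individual `curl_setopt` calls above; the rest of the function stays the same.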

Bounteous answered 22/9, 2018 at 16:37 Comment(14)
You have always been a life saver to me. Your solution never goes astray. Thanks a lot. I have two small questions: 1. Your solution did fetch the right answer, along with this line: Notice: tempnam(): file created in the system's temporary directory in C:\xampp\htdocs\runcode\testfile.php on line 5. How can I get rid of that error? 2. Is there any way I can get the page source or HTML content programmatically using PHP, as we do in Python using res = requests.get(url) ; print(res.text)? - Irrespirable
That is not an error, it's a notice that informs us about the temporary file. You could suppress it with @, e.g. $cookieFile = @tempnam("/cookies", "CURLCOOKIE");. About your second question, you can just echo the response content, if that's what you mean, e.g. echo $htmlContent;. - Bounteous
Please check out this post to see what I meant by getting source code or HTML content. I deleted that post but have undeleted it now for you. Btw, when echoing $htmlContent; I can see that the script just renders that webpage the way a browser does, not the HTML source. You deserve the bounty. However, let's wait as long as it is alive. - Irrespirable
Ok, I think I understand what you mean. You're trying to print HTML code in a browser, right? The problem is that a web browser is designed to interpret HTML code, so you see only the text. Sorry, I can't help you with that. But let me search for an answer, and if I find anything useful I'll let you know. - Bounteous
I thought I was the only person having such an issue. Thankssssssss a trillion for the clarity. - Irrespirable
No, that's the normal behaviour. You can print the HTML code if you execute your script from the command line. If you're running a server and want to print HTML code on a web page, try htmlspecialchars. - Bounteous
No damn way. This is it. I have been searching for this for the last couple of weeks. Please feel free to post this comment as an answer so that I can accept it in my linked post. - Irrespirable
Sure, just give me some time, I'm a little busy at the moment. Or you can answer if you want, I'll definitely upvote a correct answer. - Bounteous
What happens if you don't have access to the /cookies folder? Use tmpfile. It's not at all uncommon for PHP scripts to not have write access to any folder except a special temp folder (as returned by sys_get_temp_dir()), but tmpfile() will take care of locating the folder for you, and take care of cleaning up the file when the handle is closed or the script terminates. - Caine
@Caine Thanks for the suggestion and edit. You're right, I should have used /tmp or %TEMP%. BTW what do you think about the CURLOPT_SSL_VERIFYPEER option? Is there anything better I can do (besides what I've mentioned in my answer) when running this script from terminal? - Bounteous
@Bounteous actually you should use neither /tmp nor %TEMP%; you should use the return value of sys_get_temp_dir(), like tempnam(sys_get_temp_dir(),"cookiefile"); (or better yet, use tmpfile()). As for CURLOPT_SSL_VERIFYPEER, you can get a certificate bundle at curl.haxx.se/docs/caextract.html, use that with CURLOPT_CAINFO, and enable CURLOPT_SSL_VERIFYPEER. - Caine
@Caine Yes, I agree that it's best to use sys_get_temp_dir(), which is more reliable, instead of hardcoding the path. About CURL, I'm aware of the CURLOPT_CAINFO option (I've mentioned it in my answer), but I was hoping for a method that doesn't require a certificate; however it's much better than disabling verification. - Bounteous
@ssnobody That wouldn't print the HTML tags on the page. But you could surround htmlspecialchars with pre tags, so it would preserve the original format. - Bounteous
Feel free to take a look at this url in your spare time @t.m.adam. - Irrespirable
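
The htmlspecialchars suggestion from the comments above, combined with the pre tags, might look something like the following. This is only a minimal sketch; the function name `html_as_text` is made up for illustration, and the short string stands in for the fetched `$htmlContent`.

```php
<?php
// Hypothetical helper: escapes markup so a browser displays it as text,
// and wraps it in <pre> to preserve the original line breaks and indentation.
function html_as_text(string $html): string
{
    return '<pre>' . htmlspecialchars($html, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Stand-in for the fetched page source.
echo html_as_text('<span itemprop="name">Gucci</span>');
// prints: <pre>&lt;span itemprop=&quot;name&quot;&gt;Gucci&lt;/span&gt;</pre>
```

When the script runs from the command line instead, plain `echo $htmlContent;` already shows the source, so the escaping is only needed for browser output.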

@t.m.adam already solved the problem. I just want to add that there's no good reason to use simple_html_dom today: it seems unmaintained (development stopped in 2014) and there are lots of unresolved bug reports. Most importantly, DOMDocument & DOMXPath can do just about everything simple_html_dom can, are maintained, and are an integrated part of PHP, which means there's nothing to include or bundle with your script. Parsing with DOMDocument & DOMXPath would look like:

$htmlContent = curl_exec($ch);
curl_close($ch);
fclose($cookieFileh); // thanks to tmpfile(), this also deletes the cookie file.
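$dom = new DOMDocument(); // loadHTML() can no longer be called statically in PHP 8
@$dom->loadHTML($htmlContent); // @ hides warnings about imperfect markup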
$xp=new DOMXPath($dom);
$itemTitle = $xp->query('//*[@id="bannerComponents-Container"]//*[@itemprop="name"]')->item(0)->textContent;
echo $itemTitle;
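
The notice in the question came from reading a property on a lookup that matched nothing, and `->item(0)` has the same pitfall, since it returns null when the node list is empty. A small guard avoids that. The sketch below is an illustration only: the helper name `first_text` is hypothetical, and the inline sample is a trimmed version of the markup quoted in the question rather than the live page.

```php
<?php
// Hypothetical helper: returns the text of the first node matching $xpath,
// or null when nothing matches (instead of raising a notice or error).
function first_text(string $html, string $xpath): ?string
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($dom))->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// Trimmed sample based on the markup quoted in the question.
$sample = '<div id="bannerComponents-Container"><h1><a href="#"><span itemprop="name">Gucci</span></a></h1></div>';

var_dump(first_text($sample, '//*[@id="bannerComponents-Container"]//*[@itemprop="name"]')); // string(5) "Gucci"
var_dump(first_text($sample, '//*[@id="missing"]'));                                          // NULL
```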
Caine answered 23/9, 2018 at 7:33 Comment(2)
If I wish to use a CSS selector instead of XPath with your approach above, what would the modified portion look like? Thanks for your invaluable input @hanshenrik. - Irrespirable
@asmitu sorry, afaik there is nothing built-in to PHP that supports CSS - but if you're using Symfony, they have this CSS-to-xpath converter, in which case you can run $converter = new Symfony\Component\CssSelector\CssSelectorConverter(); $itemTitle = $xp->query($converter->toXPath('#bannerComponents-Container [itemprop="name"]'))->item(0)->textContent; - Caine

Your selector does work in a browser, but the element it targets is not present in the page source that curl receives.

Try fetching the page with curl in a terminal and saving the result, and you'll see that the page structure is different from what you see in the browser.

This is true for most modern websites because they use JavaScript heavily, and curl does not run JavaScript for you.

I saved the curl results into a file, the brand info looks like this:

<a itemprop="brand" class="generic" data-tstid="Label_ItemBrand" href="/bd/shopping/men/gucci/items.aspx" dir="ltr">Gucci</a>
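
Since the server can apparently return either layout, one defensive option is to try both selectors in turn. The following is a sketch under that assumption, not a tested scraper for the live site: the helper name `find_brand` is hypothetical, and the sample string is just the anchor tag quoted above.

```php
<?php
// Hypothetical helper: tries each XPath expression in turn and returns
// the text of the first match, or null if none of them match.
function find_brand(string $html, array $xpaths): ?string
{
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence warnings about imperfect markup
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xp = new DOMXPath($dom);
    foreach ($xpaths as $xpath) {
        $nodes = $xp->query($xpath);
        if ($nodes !== false && $nodes->length > 0) {
            return trim($nodes->item(0)->textContent);
        }
    }
    return null;
}

$candidates = [
    '//*[@id="bannerComponents-Container"]//*[@itemprop="name"]', // layout seen in the browser
    '//a[@itemprop="brand"]',                                      // layout seen in the saved curl output
];

// Markup taken from the saved curl response quoted above.
$curlMarkup = '<a itemprop="brand" class="generic" data-tstid="Label_ItemBrand" href="/bd/shopping/men/gucci/items.aspx" dir="ltr">Gucci</a>';

echo find_brand($curlMarkup, $candidates) ?? 'not found'; // Gucci
```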

Nasopharynx answered 20/9, 2018 at 7:0 Comment(0)
