YQL: html table is no longer supported
I

4

18

I use YQL to fetch some HTML pages and read information out of them. Since today I get the return message "html table is no longer supported. See https://policies.yahoo.com/us/en/yahoo/terms/product-atos/yql/index.htm for YQL Terms of Use"

Example in the console: https://developer.yahoo.com/yql/console/#h=select+*+from+html+where+url%3D%22http%3A%2F%2Fwww.google.de%22

Did Yahoo stop this service? Does anybody know of an announcement from Yahoo? I am wondering whether this is simply a bug or whether they really stopped this service...

All documentation is still there (html scraping): https://developer.yahoo.com/yql/guide/yql-select-xpath.html , https://developer.yahoo.com/yql/

A while ago I posted in a YQL forum from Yahoo, but it no longer exists (or at least I cannot find it). How can you contact Yahoo to find out whether this service has really stopped?

Best regards, hebr3

Impotent answered 8/6, 2017 at 9:2 Comment(5)
Yes, it's not working for me either. They give us a link to the "YQL Terms of Use" page, but it is no help. It seems the YQL service is still operational but, as the error message states, the "HTML table" query is just not supported any more. So I'm trying to find another way to scrape an HTML table from a web page. Perhaps there is another YQL service out there that can help extract a table from a web page, or there is some alternative query in YQL I can try. I guess I will have to read the docs on YQL to find out.Jermayne
Same issue here. It broke my script, and it took some time to find out that this table is no longer supported. There are other public proxies (#15006000), but they all have limitations and can be blocked if there are too many requests, unlike Yahoo with its cache.Shark
@Jermayne the error is not due to HTML tables. It's related to the YQL table named "html". Think of YQL like any other query language -- information is stored in table structures. In regards to finding an alternative to YQL, that's not necessary. You just have to find an alternative YQL table. See my answerAlverson
I'm on GAE using YQL html table JSON output and going to refactor scraping using lxml. For not breaking the interface to existing code, it would be useful to have sample YQL output at hand, especially JSON, which was quite peculiar. The XML-to-JSON-transformation documentation is not a full spec (e.g. how did it handle mixed nodes?). Please share samples html vs. json, like this one.Anticlinal
Here's a Python gist that can be useful for refactoring a YQL html query returning JSON, by using the lxml module with XPATH query and converting the output to YQL's JSON format, to avoid breaking the interface to other code: https://gist.github.com/vicmortelmans/5ee79080249ed5e0a173bc9e6fd426b1Anticlinal
I
0

Thank you very much for your code.

It helped me to create my own script to read those pages which I need. I never programmed PHP before, but with your code and the wisdom of the internet I could change your script to my needs.

PHP

<?php
    // Allow any origin to call this proxy (CORS).
    header('Access-Control-Allow-Origin: *');

    // Only URLs that start with this prefix are fetched.
    $allowedPrefix = "https://www.xxxx.yy";
    $url = isset($_GET['url']) ? $_GET['url'] : '';
    // Compare exactly as many characters as the prefix contains.
    if (substr($url, 0, strlen($allowedPrefix)) != $allowedPrefix) {
        echo "Only https://www.xxxx.yy allowed!";
        return;
    }
    $xpathQuery = isset($_GET['xpath']) ? $_GET['xpath'] : '';

    // a stricter security check is needed, I only made a basic one
    // Fetches the page at $target_url and returns its body as a string.
    function check($target_url) {
        $check = curl_init();
        //curl_setopt($check, CURLOPT_HTTPHEADER, array("REMOTE_ADDR: $ip", "HTTP_X_FORWARDED_FOR: $ip"));
        //curl_setopt($check, CURLOPT_INTERFACE, "xxx.xxx.xxx.xxx");
        curl_setopt($check, CURLOPT_COOKIEJAR, 'cookiemon.txt');
        curl_setopt($check, CURLOPT_COOKIEFILE, 'cookiemon.txt');
        curl_setopt($check, CURLOPT_TIMEOUT, 40000); // CURLOPT_TIMEOUT is measured in seconds
        curl_setopt($check, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($check, CURLOPT_URL, $target_url);
        curl_setopt($check, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
        curl_setopt($check, CURLOPT_FOLLOWLOCATION, false);
        $tmp = curl_exec($check);
        curl_close($check);
        return $tmp;
    }

    // get html
    $html = check($url);
    $dom = new DOMDocument();
    @$dom->loadHTML($html); // @ suppresses warnings about malformed markup

    // apply xpath filter
    $xpath = new DOMXPath($dom);
    $elements = $xpath->query($xpathQuery);
    $temp_dom = new DOMDocument();
    foreach ($elements as $n) {
        $temp_dom->appendChild($temp_dom->importNode($n, true));
    }
    $renderedHtml = $temp_dom->saveHTML();

    // return html in json response
    // json structure:
    // {html: "xxxx"}
    $post_data = array(
        'html' => $renderedHtml
    );
    echo json_encode($post_data);
?>

Javascript

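$.ajax({
    url: "url of service",
    dataType: "json",
    data: {
        url: url,      // page to fetch; must start with the allowed prefix
        xpath: "//*"   // XPath filter applied by the PHP script
    },
    type: 'GET',
    success: function(data) {
        // data.html contains the HTML selected by the XPath query
    },
    error: function(jqXHR) {
        // the request failed
    }
});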
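
For readers who do not want to depend on jQuery, the same request can be sketched with the standard URL API. This is only an illustration under assumptions: buildProxyRequestUrl and extractHtml are hypothetical names, the service URL is a placeholder, and the response shape {html: "..."} is taken from the PHP script above.

```javascript
// Hypothetical helper: builds the GET URL for the PHP proxy shown above.
// serviceUrl is a placeholder for wherever the script is hosted.
function buildProxyRequestUrl(serviceUrl, targetUrl, xpath) {
  var u = new URL(serviceUrl);
  // searchParams takes care of percent-encoding both values.
  u.searchParams.set("url", targetUrl);
  u.searchParams.set("xpath", xpath);
  return u.toString();
}

// Hypothetical helper: pulls the "html" field out of the proxy's JSON reply.
function extractHtml(responseText) {
  var data = JSON.parse(responseText);
  if (typeof data.html !== "string") {
    throw new Error("response does not contain an html string");
  }
  return data.html;
}
```

With fetch available, the two pieces could be combined as fetch(buildProxyRequestUrl(service, page, "//div")).then(function (r) { return r.text(); }).then(extractHtml).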
Impotent answered 10/6, 2017 at 11:21 Comment(3)
This might not be a solution for everyone: with your own proxy, all requests reach the target site from your server, which may be undesirable for some tasks. The beauty of YQL was that you could access cached (sometimes not) versions of pages, and to the target site this looked like desirable search-indexing traffic. To imitate cached versions and reduce requests, you would have to store data, sometimes quite a lot of it, and that takes more than a one-screen script. So I don't consider it a general-purpose answer.Shark
I agree with SerrNovik. This solution is a shallow alternative to YQL, not a way to make YQL behave as requested. It's worth contributing, but not a suitable answer to the original question. Additionally, many developers use YQL to eliminate CORS from the equation. Your solution only works for documents on the same host.Alverson
Yes, you are all right. I also liked the YQL html table, but YQL stopped the service without any warning (at least I did not receive one) and therefore my service did not work anymore. From my point of view YQL was not reliable anymore and I needed a replacement.Impotent
A
18

It looks like Yahoo did indeed end their support of the html library as of 6/8/2017 (according to my error logs). There doesn't appear to be any official announcement of it yet.

Luckily, there is a YQL community library that can be used in place of the official html library with few changes to your codebase. See the htmlstring table in the YQL Console.

Change your YQL query to reference htmlstring instead of html and include the community environment in your REST query. For example:

/*/ Old code /*/

var site = "http://www.test.com/foo.html";

var yql = "select * from html where url='" + site + "' AND xpath='//div'";

var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json";

 

/*/ New code /*/

var site = "http://www.test.com/foo.html";

var yql = "select * from htmlstring where url='" + site + "' AND xpath='//div'";

var resturl = "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql) + "&format=json"
    + "&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys";
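
The string concatenation above can be wrapped in a small helper. What follows is a minimal sketch, not an official client: buildHtmlstringUrl is a hypothetical name, and it assumes the endpoint and env value shown above are still accepted by the service. Because the values end up inside single-quoted YQL literals, the sketch rejects inputs containing a single quote instead of guessing at YQL's escaping rules.

```javascript
// Hypothetical helper: builds the REST URL for a query against the
// community htmlstring table, using the endpoint and env value shown above.
function buildHtmlstringUrl(site, xpath) {
  // Both values are placed inside single-quoted YQL literals.
  if (site.includes("'") || xpath.includes("'")) {
    throw new Error("site and xpath must not contain single quotes");
  }
  var yql = "select * from htmlstring where url='" + site
    + "' AND xpath='" + xpath + "'";
  return "https://query.yahooapis.com/v1/public/yql?q="
    + encodeURIComponent(yql)
    + "&format=json"
    + "&env=" + encodeURIComponent("store://datatables.org/alltableswithkeys");
}
```

Calling buildHtmlstringUrl("http://www.test.com/foo.html", "//div") yields the same URL as the "New code" snippet, with the env value encoded by the function rather than by hand.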
Alverson answered 9/6, 2017 at 18:14 Comment(11)
Thank you very much for this hint. I use only the public version of YQL, for htmlstring I would have to use one with authentication. In any case I am done with Yahoo YQL - I had now several issues with their stability, availability, etc. (though it is a free service I would need reliability and this doesn't seem to exist). I did now set up my own server and use my own web service to get the html pages I need.Impotent
I'm able to use htmlstring without authentication. I wonder why you aren't. PS, if my answer is suitable, please consider marking it as the accepted answer.Alverson
@Alverson your answer is correct; the only thing is that Yahoo APIs have to be served over https and not httpInterdisciplinary
@user6589814 I'm able to hit the API over http. Are you receiving an error when you try it? Also, the html table is only provided as an example of an old query. My suggested solution is to use htmlstringAlverson
@Alverson I wouldn't do that. The reason is that you might spend your time building a new API or script while http happens to work for now. I'm sure they will take it down soon, since they were going to stop the whole package.Interdisciplinary
I'm missing your meaning. Are you saying you think they'll remove the htmlstring table as well? If so, I disagree because htmlstring is a community-provided table, not officially from Yahoo. So Yahoo has no duty to devote development time to supporting it, ergo they don't mind if it stays. Or are you saying you think they'll remove http access? Again, I disagree. No API that receives and serves publicly available data should require security. That's just overkill.Alverson
Hey @Alverson I'm getting this issue, when I'm using your given code : XMLHttpRequest cannot load query.yahooapis.com/v1/public/yql?q=select%20*%20from%20htmlstring%2…3D%27*%27&format=json&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'localhost:8080' is therefore not allowed access. The response had HTTP status code 999.Rising
I'd like to mention the existence of the json table, for those, like myself, who were using the html table to retrieve the JSON content returned by a URL (along with the callback parameter -- JSONP).Undermanned
htmlstring works only randomly: sometimes it works, sometimes it failsEntertaining
I am experiencing htmlstring working sometimes and not others. Seems to be about 50%/50%. Do we have a service solution that is more dependable?Mears
@Undermanned I tried "from json" but it failed. What is the name of this table?Relucent
Q
0

Even though YQL does not support the html table anymore, I've come to realize that instead of making one network call and parsing out the results it's possible to make several calls. For example, my call before would look like this:

select html from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

Which should give me the information shown below:

[image]

Now I'd have to use these two:

select title from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

select description from rss where url="http://w1.weather.gov/xml/current_obs/KFLL.rss"

.. to get what I want. I don't know why they would deprecate something like this without a fallback clearly listed but you should be able to get your data this way.
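
The per-field queries above follow one pattern, so they can be generated. This is a rough sketch only: buildRssFieldQueries is a hypothetical name, and it does nothing more than produce query strings in the shape shown above; whether the service accepts them is not something the sketch can check.

```javascript
// Hypothetical helper: returns one "select <field> from rss" query per field,
// in the same shape as the two queries above.
function buildRssFieldQueries(feedUrl, fields) {
  // The feed URL is placed inside a double-quoted YQL literal.
  if (feedUrl.includes('"')) {
    throw new Error("feedUrl must not contain double quotes");
  }
  return fields.map(function (field) {
    // Accept only plain identifier-like field names.
    if (!/^[A-Za-z_][A-Za-z0-9_]*$/.test(field)) {
      throw new Error("unexpected field name: " + field);
    }
    return "select " + field + ' from rss where url="' + feedUrl + '"';
  });
}
```

For the feed above, buildRssFieldQueries("http://w1.weather.gov/xml/current_obs/KFLL.rss", ["title", "description"]) returns the two query strings in order, and each one can then be sent as a separate call.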

Quadragesimal answered 27/6, 2017 at 16:5 Comment(0)
E
0

I recently built an open source tool called CloudQuery (source code) that provides functionality similar to YQL. It is able to turn most websites into an API with a few clicks.

Embay answered 25/3, 2019 at 3:5 Comment(0)
