Getting back urls while loading multiple urls with YQL

I'm using YQL to fetch a bunch of pages, some of which could be offline (obviously I don't know which ones). I'm using this query:

SELECT * FROM html WHERE url IN ("http://www.whooma.net", "http://www.dfdsfsdgsfagdffgd.com", "http://www.cnn.com")

The first and last URLs are real sites, while the second one obviously doesn't exist. Two results are returned, but the URL each one was loaded from doesn't appear anywhere in the response. How can I find out which HTML page belongs to which URL when not every page in the query loads?

Lientery answered 2/10, 2013 at 20:13 Comment(2)
I don't understand. "but the url from where they were loaded doesn't appear anywhere": do you expect the code above to do that? And what do you mean by "find out which html page belongs to which url"? - Biflagellate
I thought it was clear, sorry! Let me explain again. I request a bunch of URLs, some of them load, and the loaded pages are packed into an array and sent back to me. The problem is that I have no hint as to which requested URL each element of the array corresponds to. So instead of just the data, I need pairs of (url, data) back, so that I know which data belongs to which URL and which URLs weren't loaded. - Lientery

Unfortunately, I don't know of a way to get key => value pairs in the response, with the URL as the key and the HTML response as the value. But you can try the following query and see if it meets your use case:

select * from yql.query.multi where queries="select * from html where url='http://www.whooma.net';select * from feed where url='http://www.dfdsfsdgsfagdffgd.com';select * from html where url='http://www.cnn.com'"

Before firing the query, keep the URLs in an array, in the same order as the sub-queries, like so: ['http://www.whooma.net','http://www.dfdsfsdgsfagdffgd.com','http://www.cnn.com']. Call this array A. When you iterate over the response from the YQL query, the sub-query for the URL that does not exist returns null. A sample response from the above query:

<results>
  <results>
    // Response from select * from html where url='http://www.whooma.net'. This should be some html
  </results>
  <results>
    // Response from select * from feed where url='http://www.dfdsfsdgsfagdffgd.com'. This should be null.
  </results>
  <results>
    // select * from html where url='http://www.cnn.com'. This should also be some html
  </results>
</results>

In conclusion, you can iterate over array A and the YQL response in parallel. The first element of array A corresponds to the first inner results element of the YQL response, the second to the second, and so on. In other words, you are building a hashmap from two arrays of equal length, and any URL whose entry is null did not load. Let me know if anything is unclear.
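The pairing described above can be sketched in Python. This is only an illustration under assumptions: the helper names `build_multi_query` and `pair_urls_with_results` are made up for this example, every sub-query uses the html table, and the inner results are assumed to have already been extracted from the response into a list with one entry per sub-query (null mapped to `None`).

```python
from typing import Any, Optional


def build_multi_query(urls: list[str]) -> str:
    """Build a yql.query.multi statement with one html sub-query per URL.

    Hypothetical helper. The sub-queries are emitted in the same order as
    `urls`, which is what makes positional matching possible later.
    URLs containing quote characters are not escaped in this sketch.
    """
    sub_queries = ";".join(f"select * from html where url='{u}'" for u in urls)
    return f'select * from yql.query.multi where queries="{sub_queries}"'


def pair_urls_with_results(
    urls: list[str], results: list[Optional[Any]]
) -> dict[str, Optional[Any]]:
    """Map each requested URL to the inner result at the same position.

    Assumes `results` holds exactly one entry per sub-query, with None
    for a URL that could not be loaded.
    """
    if len(urls) != len(results):
        raise ValueError("expected exactly one result per requested URL")
    return dict(zip(urls, results))


# Array A from the answer, plus a stand-in for the parsed inner results.
urls = [
    "http://www.whooma.net",
    "http://www.dfdsfsdgsfagdffgd.com",
    "http://www.cnn.com",
]
results = [{"body": "whooma html"}, None, {"body": "cnn html"}]

paired = pair_urls_with_results(urls, results)
failed = [url for url, data in paired.items() if data is None]
```

The length check matters: positional matching is only valid if the response keeps a placeholder for every failed sub-query, so a mismatch is treated as an error rather than silently mispairing pages.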

Joerg answered 7/10, 2013 at 23:15 Comment(0)

You can figure out which urls aren't loading by using the YQL diagnostics flag. The diagnostics flag will cause the response to include a diagnostics property with a url array that indicates whether the corresponding servers were found. Presumably, once you eliminate the urls that didn't load, the result pages will match up with the remaining urls.
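A rough sketch of that filtering step in Python, in case it helps. The exact shape of the diagnostics entries is an assumption here: each entry is taken to be a dict whose `content` key holds the URL and which carries an `error` key when the fetch failed. Check a real response before relying on those names; `split_by_diagnostics` is a hypothetical helper.

```python
from typing import Any


def split_by_diagnostics(diagnostics: dict[str, Any]) -> tuple[list[str], list[str]]:
    """Split the URLs listed in a diagnostics object into (loaded, failed).

    Assumed entry shape (not confirmed by the answer above):
      {"content": "<url>"}                   for a URL that loaded
      {"content": "<url>", "error": "..."}   for a URL that failed
    """
    entries = diagnostics.get("url", [])
    # Tolerate a single entry delivered as a bare object instead of a list.
    if isinstance(entries, dict):
        entries = [entries]

    loaded: list[str] = []
    failed: list[str] = []
    for entry in entries:
        target = failed if "error" in entry else loaded
        target.append(entry.get("content"))
    return loaded, failed


# Stand-in for the diagnostics property of a parsed response.
diagnostics = {
    "url": [
        {"content": "http://www.whooma.net"},
        {"content": "http://www.dfdsfsdgsfagdffgd.com", "error": "host not found"},
        {"content": "http://www.cnn.com"},
    ]
}
loaded, failed = split_by_diagnostics(diagnostics)
```

With the failed URLs removed, `loaded` can then be matched position by position against the returned pages, provided the two lists come back in the same order.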

Unknowable answered 12/10, 2013 at 12:56 Comment(0)
