JSoup UserAgent, how to set it right?
I'm trying to parse the front page of Facebook with JSoup, but I always get the HTML for mobile devices and not the version for normal browsers (in my case Firefox 5.0).

I'm setting my User Agent like this:

doc = Jsoup.connect(url)
      .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0")
      .get();

Am I doing something wrong?

EDIT:

I just parsed http://whatsmyuseragent.com/ and it looks like the user agent is working. Now it's even more confusing to me why http://www.facebook.com/ returns a different version to JSoup than to my browser, since both send the same user agent.

I have now noticed this behaviour on some other sites too. If you could explain what the issue is, I would be more than happy.

Gannet answered 5/7, 2011 at 11:6 Comment(2)
I can't be the only one encountering this issue, or am I? - Gannet
Thank you, Markus. Adding only the user agent solved my issue. - Callis
E
60

You might try setting the referrer header as well:

doc = Jsoup.connect("https://www.facebook.com/")
      .userAgent("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6")
      .referrer("http://www.google.com")
      .get();
Ethiopian answered 22/8, 2011 at 11:16 Comment(4)
@Gili I meant the referrer. What is its role in this? - Butacaine
@silentbang, websites might check the Referer header to detect spider bots, so if you want to pretend to be a browser you'll need to set that value too. See en.wikipedia.org/wiki/HTTP_referer - Caddy
I tried this code for 'www.cnn.com', but it still returned the mobile version of the web content. - Groh
I found that if I set the user agent to a Windows or Mac browser, YouTube omits all the meta tags from the HTML. So I have to remove the user agent to retrieve the thumbnails, title, and description needed to preview these pages. - Liszt
M
39
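import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// execute() returns the raw response so it can be inspected before parsing
Connection.Response response = Jsoup.connect(location)
           .ignoreContentType(true)
           .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
           .referrer("http://www.google.com")
           .timeout(12000)   // milliseconds
           .followRedirects(true)
           .execute();

Document doc = response.parse();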

User Agent

Use a recent user agent string. Here's a list: http://www.useragentstring.com/pages/useragentstring.php

Timeout

Also don't forget to set a timeout, since a page sometimes takes longer to download than the default timeout allows.

Referer

Set the referer to Google.

Follow redirects

Follow redirects to reach the final page.

execute() instead of get()

Use execute() to get the Response object, which lets you check the content type and status code in case of an error.

Later you can parse the Response object to obtain the Document.

The full example is hosted on GitHub.

Meadows answered 29/11, 2013 at 11:41 Comment(4)
useragentstring.com seems to be broken now. - Ellsworth
Just for clarification, while the exact link in the answer is broken (useragentstring.com/pages/Firefox), the site itself is up (as of the time of writing of this comment): useragentstring.com - Wilsey
Thanks. Updated the link in the answer. - Meadows
It doesn't work for masterclass.com. Can you suggest anything? - Berserk
P
8

It's likely that Facebook is setting (and then expecting) certain cookies in its requests, and treats a request that lacks them as coming from a bot, a mobile user, a limited browser, or something else.

There are several questions about handling cookies with JSoup; however, you may find it simpler to use HttpURLConnection or Apache's HttpClient and then pass the result to JSoup. An excellent write-up on everything you need to know: Using java.net.URLConnection to fire and handle HTTP requests
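
As a rough sketch of the HttpURLConnection route (the answer itself does not spell this out), one option is to install a java.net.CookieManager so that cookies set by one response are sent back on later requests. The class name CookieSetup and the method installCookieManager are made up for illustration, and the ACCEPT_ALL policy is an assumption; handing the downloaded HTML to Jsoup is only indicated in a comment.

```java
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;

// Hypothetical helper, not part of Jsoup or of the answer above.
class CookieSetup {

    // Installs a process-wide cookie handler. HttpURLConnection consults it
    // on each request, so cookies from earlier responses are replayed.
    static CookieManager installCookieManager() {
        // null store = default in-memory store; ACCEPT_ALL is an assumption,
        // a stricter policy may suit real use better.
        CookieManager manager = new CookieManager(null, CookiePolicy.ACCEPT_ALL);
        CookieHandler.setDefault(manager);
        return manager;
    }

    public static void main(String[] args) {
        CookieManager manager = installCookieManager();
        // After some HttpURLConnection requests, inspect what was stored.
        System.out.println("Stored cookies: " + manager.getCookieStore().getCookies());
        // The response body could then be passed to Jsoup.parse(html, baseUri).
    }
}
```

The handler is global to the JVM, so it only needs to be installed once, before the first request.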

One useful way to debug the difference between your browser and JSoup is Chrome's network inspector. You can add headers from the browser to JSoup one at a time until you get the behavior you expect, then narrow down exactly which headers you need.
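
A minimal sketch of that header-by-header approach, using plain HttpURLConnection so every header is explicit. The class BrowserHeaders, the method open, and the Accept-Language value are illustrative assumptions; the real values should be copied from your browser's network inspector. No request is sent until the response is actually read.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for experimenting with request headers.
class BrowserHeaders {

    // Creates a connection and applies the given headers in order.
    static HttpURLConnection open(String url, Map<String, String> headers) {
        try {
            // openConnection() only creates the object; nothing is sent yet.
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            for (Map.Entry<String, String> h : headers.entrySet()) {
                conn.setRequestProperty(h.getKey(), h.getValue());
            }
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Add or remove entries one at a time while comparing responses.
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent",
                "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0");
        headers.put("Referer", "http://www.google.com");
        headers.put("Accept-Language", "en-US,en;q=0.5"); // assumed value

        HttpURLConnection conn = open("http://www.facebook.com/", headers);
        System.out.println(conn.getRequestProperties());
    }
}
```

With Jsoup itself, the same headers can be passed through Connection.header(name, value) instead.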

Patency answered 1/9, 2012 at 1:38 Comment(0)
B
1

I had the 403 problem, and setting .userAgent("Mozilla") worked for me (so it doesn't need to be super specific to work).

Bookish answered 3/11, 2016 at 14:13 Comment(0)
