403 Forbidden with Java but not web browser?

4

66

I am writing a small Java program to get the number of results for a given Google search term. For some reason, I get a 403 Forbidden in Java, but the same URL returns the right results in web browsers. Code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;


public class DataGetter {

    public static void main(String[] args) throws IOException {
        getResultAmount("test");
    }

    private static int getResultAmount(String query) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(new URL("https://www.google.com/search?q=" + query).openConnection()
                .getInputStream()));
        String line;
        String src = "";
        while ((line = r.readLine()) != null) {
            src += line;
        }
        System.out.println(src);
        return 1;
    }

}

And the error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/search?q=test
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at DataGetter.getResultAmount(DataGetter.java:15)
    at DataGetter.main(DataGetter.java:10)

Why is it doing this?

Gunn answered 2/12, 2012 at 15:27 Comment(4)
@Perception um... what's an SSL endpoint? (sorry I'm clueless about this kind of stuff) - Gunn
SSL (Secure Sockets Layer) is a method of ensuring security of data passed back and forth between a client and server. An SSL endpoint is a regular URL, but with https instead of http. Using SSL is more complicated than regular http because there needs to be handshaking between the client and server, which in your case is unnecessary, since you can just use the 'normal' http endpoint for Google (http://www.google.com/search) - Sanjuana
@Sanjuana if I use normal http:// the same thing happens - Gunn
Add the query you are working with to the question. - Sanjuana
129

You just need to set the User-Agent header for it to work:

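// Needs: java.net.URL, java.net.URLConnection, java.io.BufferedReader,
// java.io.InputStreamReader, java.nio.charset.StandardCharsets
URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
// Send a browser-like User-Agent instead of Java's default one
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();

BufferedReader r = new BufferedReader(new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));

// StringBuilder avoids the repeated string copies of src += line
StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
    sb.append(line);
}
// Keep the page source; the follow-up request below parses it
String response = sb.toString();
System.out.println(response);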

SSL was handled transparently for you, as your exception stack trace shows: the request went through HttpsURLConnectionImpl, and the 403 is an HTTP-level response from the server.

Getting the result count is not quite this simple, though. After this you have to pretend to be a browser by fetching the cookie and parsing the redirect token link out of the first response.

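// "response" must hold the page source of the first request, i.e. declare
// String response = sb.toString(); after the first read loop.
// "connection", "r", "sb" and "line" are reused from the snippet above.
// Needs: java.util.regex.Pattern, java.util.regex.Matcher
String cookie = connection.getHeaderField("Set-Cookie").split(";")[0];
// The first page carries a meta refresh tag of the form content="0;url=..."
Pattern pattern = Pattern.compile("content=\"0;url=(.*?)\"");
Matcher m = pattern.matcher(response);
if (m.find()) {
    String url = m.group(1);
    // Repeat the request against the redirect URL, this time sending the cookie back
    connection = new URL(url).openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.setRequestProperty("Cookie", cookie);
    connection.connect();
    r = new BufferedReader(new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8));
    sb = new StringBuilder();
    while ((line = r.readLine()) != null) {
        sb.append(line);
    }
    response = sb.toString();
    // The count appears as: <div id="resultStats">About 2,930,000,000 results</div>
    pattern = Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");
    m = pattern.matcher(response);
    if (m.find()) {
        // Strip the thousands separators before parsing
        long amount = Long.parseLong(m.group(1).replace(",", ""));
        return amount;
    }
}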

Running the full code I get 2930000000L as a result.
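
The two parsing steps above can be tried without touching the network. The following is a minimal sketch, assuming the two markup patterns from the snippet (the meta refresh tag and the resultStats div) still describe the page, which may no longer be true. The class name, method names, and sample HTML are made up for illustration.

```java
import java.util.Optional;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ResultStatsParser {

    // Assumed form of the redirect tag: content="0;url=..."
    private static final Pattern REDIRECT =
            Pattern.compile("content=\"0;url=(.*?)\"");

    // Assumed form of the counter: <div id="resultStats">About 1,234 results</div>
    private static final Pattern RESULT_STATS =
            Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");

    // Returns the redirect URL from the first page, if the tag is present.
    public static Optional<String> extractRedirectUrl(String html) {
        Matcher m = REDIRECT.matcher(html);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Returns the result count from the second page, if the div is present.
    public static OptionalLong parseResultCount(String html) {
        Matcher m = RESULT_STATS.matcher(html);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Remove thousands separators before parsing
        return OptionalLong.of(Long.parseLong(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        // Hand-written sample pages, not real Google output
        String first = "<meta http-equiv=\"refresh\" content=\"0;url=https://www.google.com/search?q=test&sei=abc\">";
        String second = "<html><div id=\"resultStats\">About 2,930,000,000 results</div></html>";
        System.out.println(extractRedirectUrl(first));
        System.out.println(parseResultCount(second));
    }
}
```

Returning an empty Optional instead of throwing makes it easy to notice when the markup has changed and the patterns no longer match.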

Countryandwestern answered 2/12, 2012 at 16:51 Comment(9)
Dude, I owe you a keg of beer, this is such a perfect solution to my problem! Can Google restrict/throttle your results using this method? - Aguilar
@gudthing throttling is IP-based, so it's not about the method but whether you change your IP :-) - Countryandwestern
I see! A simple router restart (for WAN change) will solve the problem :). Thanks again!! - Aguilar
connection.connect(); will throw the exception "already connected" - Basically
@Countryandwestern What should the variable response contain? - Ethban
The full code link is dead. Can it be re-hosted on a service without expirations? - Peters
This is the thing that made my day: now I know why my HTTP URL was not working when calling a web API. This is very useful for me on Android 9 and Android 10. connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"); - Saintsimon
For Amazon CloudFront: httpcon.addRequestProperty("Accept-Encoding", "gzip, deflate, br"); - Phenomenal
Not working in 2023, please check stackoverflow.com/questions/77227173/… - Wrought
6

For me it worked by adding the header: "Accept": "*/*"

Kinsella answered 27/6, 2018 at 10:23 Comment(2)
This worked for me, but I'm not sure why it works. Can you explain more about it please? - Magnet
Not working in 2023, please check stackoverflow.com/questions/77227173/… - Wrought
3

You probably aren't setting the correct headers. Use LiveHttpHeaders (or equivalent) in the browser to see what headers the browser is sending, then emulate them in your code.
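
A minimal sketch of that approach, assuming you have copied a handful of headers from the browser's network tools. The header values here are placeholders (the User-Agent is the one from the accepted answer), and the class and method names are invented for illustration. Note that openConnection() only creates the connection object; nothing is sent until connect() or getInputStream() is called, so the headers can be inspected offline.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserHeaders {

    // Placeholder values; replace them with what your own browser sends.
    public static Map<String, String> browserLikeHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        headers.put("Accept", "*/*");
        headers.put("Accept-Language", "en-US,en;q=0.8");
        return headers;
    }

    // Creates the connection and sets the headers, but does not send the request.
    public static URLConnection prepare(String url) {
        try {
            URLConnection connection = new URL(url).openConnection();
            for (Map.Entry<String, String> header : browserLikeHeaders().entrySet()) {
                connection.setRequestProperty(header.getKey(), header.getValue());
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = prepare("https://www.google.com/search?q=test");
        // Shows the headers that would be sent; still no network traffic at this point
        System.out.println(connection.getRequestProperties());
    }
}
```

Keeping the headers in one map makes it simple to add or remove a header while working out which one the server actually checks.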

Hyperboloid answered 2/12, 2012 at 15:30 Comment(4)
I tried "https://www.google.com/search?q=" + query + "&rlz=1C1RNNN_enUS371&aq=f&oq=" + query + "&sugexp=chrome,mod=6&sourceid=chrome&ie=UTF-8", still didn't work - Gunn
@PicklishDoorknob you added a query string parameter, you didn't change the headers. You can set headers with .setRequestProperty() on the URLConnection object - Countryandwestern
Here's an SO article that talks about adding request headers: stackoverflow.com/questions/480153/… - Hyperboloid
Can you please check stackoverflow.com/questions/77227173/… - Wrought
0

It's because the site uses SSL. Try using the Jersey HTTP Client. You will probably also have to learn a little about HTTPS and certificates, but I think Jersey can be set to ignore most of the details relating to the actual security.

Parbuckle answered 2/12, 2012 at 15:34 Comment(4)
No it isn't, it works just by emulating browser HTTP headers like @KevinDay said in his answer. - Countryandwestern
@Ben Brunk - there is a good lesson here - at the core, all of programming is built up of layer upon layer of additional abstraction. Understanding the low level is super useful. Using a higher level client like you describe might work - but only because it's making a low level call that you yourself could make if you choose to. I will never forget how illuminating it was for me to sit down and interact with a web server using a telnet client and crafting the HTTP request by hand. Cheerio! - Hyperboloid
Actually, I'm still not sure why that code worked because you typically have to add the site's public certificate to your local Java keystore in order to use SSL like that, even with URLConnection, so something doesn't add up about that URL. Also, what makes you think I never connected to a website using telnet? I do this for a living and I often forget there are a lot of people on this site who are students or hobby programmers. I just try to be helpful. - Parbuckle
If the site uses a certificate that has a trust chain to a CA that is included with Java in its cacerts truststore (located in jdk\jre\lib\security), then explicitly adding the site's certificate is not needed. - Lasonde
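
To see that point in practice, here is a small sketch that counts the CA certificates the JVM trusts by default. It assumes the runtime ships a populated default truststore (normally cacerts, unless overridden with system properties); the class and method names are invented for illustration.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

public class DefaultTrustStore {

    // Counts the CA certificates accepted by the JVM's default trust managers.
    public static int countTrustedIssuers() {
        try {
            TrustManagerFactory factory =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // A null KeyStore tells the factory to use the default truststore
            factory.init((KeyStore) null);
            int count = 0;
            for (TrustManager manager : factory.getTrustManagers()) {
                if (manager instanceof X509TrustManager) {
                    count += ((X509TrustManager) manager).getAcceptedIssuers().length;
                }
            }
            return count;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not load the default truststore", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Trusted CA certificates: " + countTrustedIssuers());
    }
}
```

Any site whose certificate chains up to one of these CAs can be reached over HTTPS with a plain URLConnection, with no keystore changes.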
