403 Forbidden with Java but not web browser?

Asked 2/12, 2012 at 15:27 Answered 27/6, 2018 at 10:23

I am writing a small Java program to get the number of results for a given Google search term. In Java the request fails with 403 Forbidden, but the same URL returns the right results in a web browser. Code:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;


public class DataGetter {

    public static void main(String[] args) throws IOException {
        getResultAmount("test");
    }

    private static int getResultAmount(String query) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(new URL("https://www.google.com/search?q=" + query).openConnection()
                .getInputStream()));
        String line;
        String src = "";
        while ((line = r.readLine()) != null) {
            src += line;
        }
        System.out.println(src);
        return 1;
    }

}

And the error:

Exception in thread "main" java.io.IOException: Server returned HTTP response code: 403 for URL: https://www.google.com/search?q=test
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
    at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(Unknown Source)
    at DataGetter.getResultAmount(DataGetter.java:15)
    at DataGetter.main(DataGetter.java:10)

Why is it doing this?

Gunn answered 2/12, 2012 at 15:27 Comment(4)

@Perception um... what's an SSL endpoint? (sorry I'm clueless about this kind of stuff) – Gunn 2/12, 2012 at 15:38

SSL (Secure Sockets Layer) is a method of securing data passed back and forth between a client and server. An SSL endpoint is a regular URL, but with https instead of http. Using SSL is more complicated than regular http because there needs to be handshaking between the client and server, which in your case is unnecessary, since you can just use the 'normal' http endpoint for Google (http://www.google.com/search) – Sanjuana 2/12, 2012 at 15:42

@Sanjuana if I use normal http:// the same thing happens – Gunn 2/12, 2012 at 15:54

Add the query you are working with to the question. – Sanjuana 2/12, 2012 at 15:58

129

You just need to set the User-Agent header for it to work:

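import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;

URLConnection connection = new URL("https://www.google.com/search?q=" + query).openConnection();
// Send a browser-like User-Agent instead of Java's default one
connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
connection.connect();

BufferedReader r  = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));

// Collect the whole response body (line breaks are dropped)
StringBuilder sb = new StringBuilder();
String line;
while ((line = r.readLine()) != null) {
    sb.append(line);
}
System.out.println(sb.toString());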

SSL was handled transparently for you, as the HttpsURLConnectionImpl frame in your exception stack trace shows.
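To illustrate the point about SSL, here is a minimal sketch that only creates the connection object for a URL and checks which handler the JDK picked. The class and method names (HttpsCheck, usesHttps) are made up for this example, and no request is sent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import javax.net.ssl.HttpsURLConnection;

// Hypothetical helper class, not part of the original answer.
class HttpsCheck {

    // Returns true if the JDK chose its HTTPS handler for this URL.
    // openConnection() only creates the object; nothing is sent over the network.
    static boolean usesHttps(String url) {
        try {
            URLConnection connection = new URL(url).openConnection();
            return connection instanceof HttpsURLConnection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(usesHttps("https://www.google.com/search?q=test"));
        System.out.println(usesHttps("http://www.google.com/search?q=test"));
    }
}
```

The https URL gets an HttpsURLConnection without any extra setup in your code, which is why the 403 has nothing to do with SSL.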

Getting the result count is not quite this simple, though. After this request you have to pretend to be a browser: fetch the cookie, then parse the redirect link (with its token) out of the response.

// Needs java.util.regex.Matcher and java.util.regex.Pattern.
// response is the body of the first request, collected in sb
String response = sb.toString();
// Keep only the name=value part of the cookie so it can be sent back
String cookie = connection.getHeaderField( "Set-Cookie").split(";")[0];
// Find the redirect target in the first page: content="0;url=..."
Pattern pattern = Pattern.compile("content=\"0;url=(.*?)\"");
Matcher m = pattern.matcher(response);
if( m.find() ) {
    String url = m.group(1);
    // Second request: same User-Agent, plus the cookie from the first response
    connection = new URL(url).openConnection();
    connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
    connection.setRequestProperty("Cookie", cookie );
    connection.connect();
    r  = new BufferedReader(new InputStreamReader(connection.getInputStream(), Charset.forName("UTF-8")));
    sb = new StringBuilder();
    while ((line = r.readLine()) != null) {
        sb.append(line);
    }
    response = sb.toString();
    // Pull the number out of the result statistics line
    pattern = Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");
    m = pattern.matcher(response);
    if( m.find() ) {
        long amount = Long.parseLong(m.group(1).replaceAll(",", ""));
        return amount;
    }

}

Running the full code I get 2930000000L as a result.
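The parsing step can be tried without touching the network. The sketch below reuses the resultStats pattern from the code above and runs it on a hand-written sample string, not a real Google response. ResultStatsParser and parseResultCount are names invented for this example, and the pattern only fits the markup Google served when this answer was written.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the regular expression comes from the answer.
class ResultStatsParser {

    private static final Pattern RESULT_STATS =
            Pattern.compile("<div id=\"resultStats\">About ([0-9,]+) results</div>");

    // Returns the result count if the page contains the expected markup,
    // or an empty OptionalLong if the pattern does not match.
    static OptionalLong parseResultCount(String html) {
        Matcher m = RESULT_STATS.matcher(html);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Strip the thousands separators before parsing
        return OptionalLong.of(Long.parseLong(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        // Hand-written sample input for illustration only
        String sample = "<html><body><div id=\"resultStats\">About 2,930,000,000 results</div></body></html>";
        System.out.println(parseResultCount(sample));
        System.out.println(parseResultCount("<html></html>"));
    }
}
```

Returning an empty value instead of throwing makes it easier to notice when the page markup has changed and the pattern no longer matches.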

Countryandwestern answered 2/12, 2012 at 16:51 Comment(9)

Dude, I owe you a keg of beer, this is such a perfect solution to my problem! Can Google restrict/throttle your results using this method? – Aguilar 28/3, 2015 at 21:25

@gudthing throttling is IP-based, so it's not about the method but whether you change your IP :-) – Countryandwestern 29/3, 2015 at 0:29

I see! A simple router restart (for WAN change) will solve the problem :). Thanks again!! – Aguilar 29/3, 2015 at 8:26

connection.connect(); will throw exception "already connected" – Basically 20/5, 2018 at 22:57

@Countryandwestern What should the variable response contain? – Ethban 19/6, 2018 at 13:43

The full code link is dead. Can it be re-hosted on a service without expirations? – Peters 14/3, 2019 at 6:28

This made my day: now I know why the HTTP URL was not working in my web API calls. This is very useful for me on Android 9 and Android 10. connection.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"); – Saintsimon 14/8, 2020 at 7:11

Amazon Cloudfront httpcon.addRequestProperty("Accept-Encoding", "gzip, deflate, br"); – Phenomenal 5/9, 2020 at 15:55

Not working in 2023, please check stackoverflow.com/questions/77227173/… – Wrought 4/10, 2023 at 6:24

For me it worked by adding the header: "Accept": "*/*"
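In Java that could look like the sketch below, which sets the header on a URLConnection and reads it back before anything is sent. AcceptHeaderExample and openWithAccept are made-up names, and whether this header alone is enough depends on the server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;

// Hypothetical example class, not from the original answer.
class AcceptHeaderExample {

    // Creates (but does not connect) a connection that will send "Accept: */*".
    static URLConnection openWithAccept(String url) {
        try {
            URLConnection connection = new URL(url).openConnection();
            connection.setRequestProperty("Accept", "*/*");
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = openWithAccept("https://www.google.com/search?q=test");
        // Shows the value that will be sent once the connection is used
        System.out.println(connection.getRequestProperty("Accept"));
    }
}
```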

Kinsella answered 27/6, 2018 at 10:23 Comment(2)

This worked for me, but I'm not sure why it works. Can you explain more about it please? – Magnet 22/5, 2021 at 21:25

Not working in 2023, please check stackoverflow.com/questions/77227173/… – Wrought 4/10, 2023 at 6:24

You probably aren't setting the correct headers. Use LiveHttpHeaders (or equivalent) in the browser to see what headers the browser is sending, then emulate them in your code.
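A rough sketch of emulating them: copy the headers you saw in the browser into a map, apply them to the URLConnection, and print what will be sent so you can compare it with the browser's list. The class name BrowserHeaders, its methods, and the Accept-Language value are assumptions for illustration; substitute the values your own browser sends.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration.
class BrowserHeaders {

    // Example header set. The Accept-Language value is an assumed placeholder.
    // Accept-Encoding is left out on purpose: asking for gzip would mean
    // decompressing the response body yourself.
    static Map<String, String> sampleHeaders() {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11");
        headers.put("Accept", "*/*");
        headers.put("Accept-Language", "en-US,en;q=0.8");
        return headers;
    }

    // Creates (but does not connect) a connection carrying the given headers.
    static URLConnection open(String url, Map<String, String> headers) {
        try {
            URLConnection connection = new URL(url).openConnection();
            for (Map.Entry<String, String> header : headers.entrySet()) {
                connection.setRequestProperty(header.getKey(), header.getValue());
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("https://www.google.com/search?q=test", sampleHeaders());
        // Dump the request headers set so far, before anything is sent
        for (Map.Entry<String, List<String>> header : connection.getRequestProperties().entrySet()) {
            System.out.println(header.getKey() + ": " + header.getValue());
        }
    }
}
```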

Hyperboloid answered 2/12, 2012 at 15:30 Comment(4)

I tried "https://www.google.com/search?q=" + query + "&rlz=1C1RNNN_enUS371&aq=f&oq=" + query + "&sugexp=chrome,mod=6&sourceid=chrome&ie=UTF-8", still didn't work – Gunn 2/12, 2012 at 15:32

@PicklishDoorknob you added a query string parameter, you didn't change the headers. You can set headers with .setRequestProperty() on the URLConnection object – Countryandwestern 2/12, 2012 at 16:28

Here's an SO article that talks about adding request headers: stackoverflow.com/questions/480153/… – Hyperboloid 2/12, 2012 at 19:58

can you please check stackoverflow.com/questions/77227173/… – Wrought 4/10, 2023 at 6:24

It's because the site uses SSL. Try using the Jersey HTTP Client. You will probably also have to learn a little about HTTPS and certificates, but I think Jersey can be set to ignore most of the details relating to the actual security.

Parbuckle answered 2/12, 2012 at 15:34 Comment(4)

No it isn't, it works just by emulating browser http headers like @KevinDay said in his answer. – Countryandwestern 2/12, 2012 at 16:24

@Ben Brunk - there is a good lesson here - at the core, all of programming is built up of layer upon layer of additional abstraction. Understanding the low level is super useful. Using a higher level client like you describe might work - but only because it's making a low level call that you yourself could make if you choose to. I will never forget how illuminating it was for me to sit down and interact with a web server using a telnet client and crafting the HTTP request by hand. Cheerio! – Hyperboloid 2/12, 2012 at 20:2

Actually, I'm still not sure why that code worked because you typically have to add the site's public certificate to your local Java keystore in order to use SSL like that, even with URLConnection, so something doesn't add up about that URL. Also, what makes you think I never connected to a website using telnet? I do this for a living and I often forget there are a lot of people on this site who are students or hobby programmers. I just try to be helpful. – Parbuckle 3/12, 2012 at 1:9

If the site uses a certificate that has a trust chain to a CA that is included with Java in its cacerts truststore (located in jdk\jre\lib\security) then explicitly adding the site's certificate is not needed. – Lasonde 10/2, 2017 at 16:16
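To see this for yourself, the following sketch asks the JDK for its default trust managers and counts the CA certificates they accept. Passing null to init() makes the JDK fall back to its default truststore, which is cacerts unless it has been overridden (for example through the javax.net.ssl.trustStore system property). DefaultTrustStore and countTrustedIssuers are names made up for this example.

```java
import java.security.GeneralSecurityException;
import java.security.KeyStore;
import javax.net.ssl.TrustManager;
import javax.net.ssl.TrustManagerFactory;
import javax.net.ssl.X509TrustManager;

// Hypothetical helper class for illustration.
class DefaultTrustStore {

    // Counts the CA certificates trusted by the JDK's default trust managers.
    static int countTrustedIssuers() {
        try {
            TrustManagerFactory factory =
                    TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
            // null means: use the JDK's default truststore
            factory.init((KeyStore) null);
            int count = 0;
            for (TrustManager manager : factory.getTrustManagers()) {
                if (manager instanceof X509TrustManager) {
                    count += ((X509TrustManager) manager).getAcceptedIssuers().length;
                }
            }
            return count;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Trusted CA certificates: " + countTrustedIssuers());
    }
}
```

On a standard JDK install the count is well above zero, which is why a site with a certificate from a well-known CA works with URLConnection out of the box.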
