How do I extract the target URL from a Google search result?

I am trying to extract URLs from Google search results. I use Indy IdHTTP to get HTML results from Google, and I use Achmad Z's code for getting the link hrefs from the page. How can I get the real link target for each URL instead of the one that goes through Google's redirector?


I tried the Delphi wrapper suggested in the answers, but I get an "Operand no applicable" error in this part of its code:

function ToUTF8Encode(str: string): string;
var
  b: Byte;
begin
  for b in BytesOf(UTF8Encode(str)) do
  begin
    Result := Format('%s%s%.2x', [Result, '%', b]);
  end;
end;

I use Delphi 7 with Indy 9.00.10. Would updating Indy help?

Chrissy answered 11/10, 2011 at 19:28 Comment(2)
OK, show us what you have tried and how those attempts failed. - Bergwall
Google does heavy browser sniffing and click counting; masking your homebrewed User-Agent as Opera might help. - Abhorrent

In my previous answer I tried to explain why you should use the Google Search API; in this one I'll provide an example that will hopefully work in your Delphi 7.

You need SuperObject (a JSON parser for Delphi); I used the latest version available at the time. You also need Indy; ideally, upgrade to the latest version if possible. I used the one shipped with Delphi 2009, but TIdHTTP.Get is such a basic method that it should also work in your 9.00.10 version.

Now you need a list box and a button on your form, the following piece of code, and a bit of luck (for compatibility :)

You can see how the request URL is built in, for instance, the DxGoogleSearchApi.pas mentioned before, but it is best to follow the Google Web Search API reference. DxGoogleSearchApi.pas also shows, for example, how to fetch several pages.

So take this as inspiration:

uses
  IdHTTP, IdURI, SuperObject;

const
  GSA_Version = '1.0';
  GSA_BaseURL = 'http://ajax.googleapis.com/ajax/services/search/';

procedure TForm1.GoogleSearch(const Text: string);
var
  I: Integer;
  RequestURL: string;
  HTTPObject: TIdHTTP;
  HTTPStream: TMemoryStream;
  JSONResult: ISuperObject;
  JSONResponse: ISuperObject;
begin
  RequestURL := TIdURI.URLEncode(GSA_BaseURL + 'web?v=' + GSA_Version + '&q=' + Text);

  HTTPObject := TIdHTTP.Create(nil);
  HTTPStream := TMemoryStream.Create;

  try
    HTTPObject.Get(RequestURL, HTTPStream);
    JSONResponse := TSuperObject.ParseStream(HTTPStream, True);

    if JSONResponse.I['responseStatus'] = 200 then
    begin
      ListBox1.Items.Add('Search time: ' + JSONResponse.S['responseData.cursor.searchResultTime']);
      ListBox1.Items.Add('Fetched count: ' + IntToStr(JSONResponse['responseData.results'].AsArray.Length));
      ListBox1.Items.Add('Total count: ' + JSONResponse.S['responseData.cursor.resultCount']);
      ListBox1.Items.Add('');

      for I := 0 to JSONResponse['responseData.results'].AsArray.Length - 1 do
      begin
        JSONResult := JSONResponse['responseData.results'].AsArray[I];
        ListBox1.Items.Add(JSONResult.S['unescapedUrl']);
      end;
    end;

  finally
    HTTPObject.Free;
    HTTPStream.Free;
  end;
end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  GoogleSearch('Delphi');
end;
Fabiola answered 12/10, 2011 at 22:48 Comment(10)
It seems I am doomed. Now I get an EidHttp error, HTTP 404 Not Found. My friend will bring me Delphi 2009; it seems like the only option for me. Not even Marco Cantu's WebFind code works for me [ marcocantu.com/code/md6htm/WebFind.htm ], and it was made for Delphi 7. There is a problem with procedure TFindWebThread.HtmlToList; and I just can't figure it out :( - Chrissy
@Danijel, HTTP 404 means that your request URL is wrong. It's easy to check: take the URL you're passing to the TIdHTTP.Get method and paste it into your web browser. You will get the same error. With a correct URL you'll get the JSON result, which your browser shows as plain-text object notation. - Fabiola
@Danijel, try passing, for instance, http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=Delphi to TIdHTTP.Get and you will get the proper result. - Fabiola
I get 404 and the pointer for errors is set to JSONResponse := TSuperObject.ParseStream(HTTPStream, True); - Chrissy
@Danijel, sorry, I don't get what you mean by the pointer for errors, but I think you mean the debugger breakpoint. Trust me, HTTP 404 Not Found means that you are passing a wrong URL, nothing else. Press F5 on the line with HTTPObject.Get to set a breakpoint, then select the word RequestURL and press Ctrl+F5. This adds a watch for the URL variable. Then run your application with F9 and invoke the search. The Watch List will show the real URL value you're passing to TIdHTTP.Get. Then right-click the watch and select Copy Watch Value. - Fabiola
This value is enclosed in '' characters, so delete them and paste the address into your browser. I bet you will get HTTP 404 Not Found :) - Fabiola
Hey TLama, you're the man. I got myself Delphi 2009 (my professor gave it to me), and I found that your code does a great job with the Google API. I'm very grateful to you, and when I finish the other, not so hard things in my program, I will add your name to the thanks list :) I also made a solution for Delphi 7 using PosEx to extract the links :) - Chrissy
@Danijel, I'm glad to help. Anyway, you don't need to add me to a thanks list; here on StackOverflow I would be glad if you accepted some of my answers :) - Fabiola
I wanted to do that but didn't have enough rep; now I have, and I accepted your answers. Thanks again :) - Chrissy
I think that library is obsolete now. I get "The Google Web Search API is no longer available. Please migrate to the Google Custom Search API". - Skip

If I understand correctly, you are trying to fetch Google search results using the TIdHTTP.Get method. If so, you should definitely look at a Google Search API implementation instead, because

  1. as far as I know, it is impossible to fetch the results this way: you have no access to the document inside the iframe that holds the search results, so a plain HTTP GET won't return any of them (or at least I haven't heard of a request that can do that)
  2. it's against Google's policies; you should use a proper Google Search API instead, for instance the Google SOAP Search API. Several types of Google Search APIs are available for various purposes

You can find a Delphi wrapper with demo code for the Google Search API here. I've tested it with Delphi 2009 on 64-bit Windows 7 and it works fine for me.

Fabiola answered 11/10, 2011 at 21:45 Comment(7)
I tried that, but I get an 'Operand no applicable' error in this part of the code: function ToUTF8Encode(str: string): string; var b: Byte; begin for b in BytesOf(UTF8Encode(str)) do begin Result := Format('%s%s%.2x', [Result, '%', b]); end; I use Delphi 7 with Indy 9.00.10. Maybe an Indy update will help? - Chrissy
@Danijel, sorry, that's because of the for b in BytesOf statement; for..in was not yet available in Delphi 7. As a hotfix, you may try to replace the original ToUTF8Encode function with the one from the updated post. I hope this is the last incompatibility. Anyway, you can still replace ToUTF8Encode with any function that encodes URLs; that's the only thing it is used for. - Fabiola
Thanks for helping. I tried that, and used sh_web and other search types, and I always get the error 'Access violation at address xxxx in module'. Thanks for your help. I don't want to bother you anymore; it seems there is no hope for me to find that code. I will keep searching, though. - Chrissy
Appealing to Google policy prohibitions seems questionable to me; got a formal statement? Also, iframe? - Abhorrent
@Premature Optimization, have you tried to get the Google search web page using TIdHTTP.Get? Maybe your suggestion about the User-Agent might work, but it's still overkill. - Fabiola
@TLama, yes I did. Parsing the DOM document is a rather challenging task, but otherwise I do not see any problems with this approach. - Abhorrent
@Premature Optimization, well, if it works it might be a solution, and a low-cost one now that Google is deprecating its API and moving to paid services, but I would say they won't be happy with it (and that's not even counting the inefficiency of downloading and parsing more data than you need). - Fabiola
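
The for..in incompatibility discussed in the comments above can be avoided with an indexed loop. What follows is only a minimal sketch, not the function from the updated post: it assumes the goal is simply to percent-encode every byte of the UTF-8 form of the string, as the original ToUTF8Encode does, and it also initialises Result, which the original leaves unset. The program name and the WideString parameter type are choices made for this illustration.

```pascal
program EncodeDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Percent-encodes every byte of the UTF-8 form of Str.
// Uses an indexed loop instead of for..in, which Delphi 7 does not support.
function ToUTF8Encode(const Str: WideString): string;
var
  I: Integer;
  Utf8: UTF8String;
begin
  Result := '';                 // start from an empty string
  Utf8 := UTF8Encode(Str);      // convert to UTF-8 bytes
  for I := 1 to Length(Utf8) do
    Result := Result + '%' + IntToHex(Ord(Utf8[I]), 2);
end;

begin
  WriteLn(ToUTF8Encode('Delphi'));  // prints %44%65%6C%70%68%69
end.
```

Like the original, this encodes every character, including letters and digits. That is valid in a URL but longer than necessary; an encoder that leaves unreserved characters alone would produce shorter requests.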

An answer to my own question; maybe it can help someone. Fetching the web page:

memo1.Lines.Text := idhttp1.Get('http://ajax.googleapis.com/aja...tart=1&rsz=large&q=max');

Extracting the URLs (PosEx is declared in the StrUtils unit):

function ExtractText(const Str, Delim1, Delim2: string; PosStart: integer; var PosEnd: integer): string;
var
  pos1, pos2: integer;
begin
  Result := '';
  pos1 := PosEx(Delim1, Str, PosStart);
  if pos1 > 0 then
  begin
    pos2 := PosEx(Delim2, Str, pos1 + Length(Delim1));
    if pos2 > 0 then
    begin
      PosEnd := pos2 + Length(Delim2);
      Result := Copy(Str, pos1 + Length(Delim1), pos2 - (pos1 + Length(Delim1)));
    end;
  end;
end;

And in the Button1 click handler just put:

procedure TForm1.Button1Click(Sender: TObject);
var Pos: integer;
    sText: string;
begin
  sText := ExtractText(Memo1.Lines.Text, '"url":"', '","visibleUrl"', 1, Pos);
  while sText <> '' do
  begin
    Memo2.Lines.Add(sText);
    sText := ExtractText(Memo1.Lines.Text, '"url":"', '","visibleUrl"', Pos, Pos);
  end;
end;

www.delphi.about.com has nice documentation on string manipulation; Zarko Gajic does a great job on that site :) NOTE: if Google changes its source, this will be useless.
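
The extraction loop above can be tried without a network connection or a form. The following is a minimal console sketch under stated assumptions: ExtractAll is a hypothetical helper that wraps the same loop, and the Sample string is made up for illustration and only imitates the "url"/"visibleUrl" layout that the delimiters above rely on; it is not real API output.

```pascal
program ExtractDemo;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, StrUtils;

const
  // Made-up sample text, not a real response.
  Sample =
    '{"results":[{"url":"http://example.com/a","visibleUrl":"example.com"},' +
    '{"url":"http://example.org/b","visibleUrl":"example.org"}]}';

// Same logic as ExtractText above, repeated so the sketch compiles on its own.
function ExtractText(const Str, Delim1, Delim2: string; PosStart: Integer;
  var PosEnd: Integer): string;
var
  Pos1, Pos2: Integer;
begin
  Result := '';
  Pos1 := PosEx(Delim1, Str, PosStart);
  if Pos1 > 0 then
  begin
    Pos2 := PosEx(Delim2, Str, Pos1 + Length(Delim1));
    if Pos2 > 0 then
    begin
      PosEnd := Pos2 + Length(Delim2);
      Result := Copy(Str, Pos1 + Length(Delim1), Pos2 - (Pos1 + Length(Delim1)));
    end;
  end;
end;

// Hypothetical helper: adds every text found between Delim1 and Delim2 to List.
procedure ExtractAll(const Source, Delim1, Delim2: string; List: TStrings);
var
  P: Integer;
  S: string;
begin
  P := 1;
  S := ExtractText(Source, Delim1, Delim2, P, P);
  while S <> '' do
  begin
    List.Add(S);
    S := ExtractText(Source, Delim1, Delim2, P, P);
  end;
end;

var
  Urls: TStringList;
  I: Integer;
begin
  Urls := TStringList.Create;
  try
    ExtractAll(Sample, '"url":"', '","visibleUrl"', Urls);
    for I := 0 to Urls.Count - 1 do
      WriteLn(Urls[I]);   // prints the two example URLs, one per line
  finally
    Urls.Free;
  end;
end.
```

Note that the loop stops at the first empty match, so an entry with an empty "url" value would end the scan early. For anything beyond a quick experiment, a JSON parser such as SuperObject is the more robust choice.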

Chrissy answered 18/10, 2011 at 12:28 Comment(0)
