Get links from a Google search in C#

I'm trying to program a simple Google search in C# that would run a query of my choice and retrieve the first 50 links. After thoroughly searching for a similar tool / correct API, I realized that most of them are obsolete. My first try was to create a simple HttpWebRequest and scan the received WebResponse for "href=", which turned out not to be rewarding at all (too much redundancy) and very frustrating. I do have a Google API key, but I'm not sure how to use it for this purpose, although I know there is a 1,000-query limit per day.

Gil

Coil answered 3/3, 2011 at 11:10 Comment(8)
I have a project which sends requests to Google and parses the responses back. We have to rewrite the parsing module several times per year in order to follow Google's markup changes. It sucks, though it usually takes only a couple of hours to fix the parsing code.Traditionalism
@Snowbear, are you using HtmlAgility pack for parsing?Oft
@Shiv, no, it's kind of a legacy part, which still uses regular expressions. Thanks for mentioning that; I'll look into it next time we rewrite that nightmare.Traditionalism
@Snowbear, could you send me yours as a starting point?Coil
@snowbear, yes, I think if you used HtmlAgility pack and searched for links, it wouldn't matter much (in this case), since the end result is that you're still looking for a link. Of course, if Google changes it to have, say, multiple links per result, you'll have to find a way to distinguish one link from the others for a given result item.Oft
@shiv, we were looking not only for links, but also for the search result header and description provided by Google. Also, for some results (YouTube videos, for example) Google provides specific HTML with embedded video or something like that, which makes it even worse.Traditionalism
@snowbear, yup. I hear you. Have you looked at using the API instead?Oft
@Shiv, yeah, looked, haven't tried, but a) it has a restricted number of queries per day, and we need to perform ~2000 searches; b) it doesn't seem to provide more than 50 results, and we were aiming for 200; c) I looked into the API (not the Google Custom API) last week, and it looks like Google is dropping support for it. The Custom API doesn't seem to work for me since it only works over a predefined set of sites.Traditionalism

If you're going this route you should use HtmlAgility pack for your parsing. However, a better approach would be to use Google's API. See this post: i need to know which of my url is indexed on google
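
To make the API route concrete, here is a minimal sketch of querying the Custom Search JSON API with HttpClient. The endpoint and the items/link JSON fields are Google's documented ones, but ApiKey and Cx are placeholders you'd replace with your own credentials, and the query string is just an example:

using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class CustomSearchSketch
{
    // Placeholders - substitute your own API key and search engine id (cx).
    const string ApiKey = "YOUR_API_KEY";
    const string Cx = "YOUR_CX";

    static async Task Main()
    {
        using HttpClient client = new HttpClient();
        string query = Uri.EscapeDataString("html agility pack");

        // The JSON API returns at most 10 items per call; page through
        // further results with the start parameter.
        string url = $"https://www.googleapis.com/customsearch/v1?key={ApiKey}&cx={Cx}&q={query}";
        string json = await client.GetStringAsync(url);

        using JsonDocument doc = JsonDocument.Parse(json);
        if (doc.RootElement.TryGetProperty("items", out JsonElement items))
        {
            foreach (JsonElement item in items.EnumerateArray())
            {
                // Each result item carries the target URL in its "link" field.
                Console.WriteLine(item.GetProperty("link").GetString());
            }
        }
    }
}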

As for some code using HtmlAgility pack, I have a post on my blog: Finding links on a Web page. A minimal sketch of the same idea follows.
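
In the same spirit as that post, this pulls every href off a page with HtmlAgility pack (the URL here is just an example):

using System;
using HtmlAgilityPack;

class LinkFinder
{
    static void Main()
    {
        // HtmlWeb downloads and parses the page in one step.
        HtmlWeb web = new HtmlWeb();
        HtmlAgilityPack.HtmlDocument doc = web.Load("https://example.com/");

        // SelectNodes returns null when no anchors match, so guard against it.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null) return;

        foreach (HtmlNode link in anchors)
        {
            Console.WriteLine(link.GetAttributeValue("href", string.Empty));
        }
    }
}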

Oft answered 3/3, 2011 at 11:13 Comment(0)

Here is working code. Obviously you will have to add the proper form and a few simple controls (a TextBox named txtKeyWords, a Button named btn1, and a ListBox named listBox1):

using HtmlAgilityPack;
using System;
using System.IO;
using System.Net;
using System.Text;
using System.Windows.Forms;

namespace Search
{
    public partial class Form1 : Form
    {

        public Form1()
        {
            InitializeComponent();
        }

        private void btn1_Click(object sender, EventArgs e)
        {
            listBox1.Items.Clear();

            // Escape the query so spaces and special characters survive the URL.
            string searchUrl = "https://www.google.com/search?q=" + Uri.EscapeDataString(txtKeyWords.Text.Trim());
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(searchUrl);

            string pageHtml;
            using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
            using (Stream resStream = response.GetResponseStream())
            using (StreamReader reader = new StreamReader(resStream, Encoding.UTF8))
            {
                // Read the whole page as UTF-8. Decoding raw byte chunks with
                // ASCII can split and mangle multi-byte characters.
                pageHtml = reader.ReadToEnd();
            }

            HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
            html.LoadHtml(pageHtml);
            HtmlNode doc = html.DocumentNode;

            // SelectNodes returns null when no anchors match, so guard against that.
            HtmlNodeCollection anchors = doc.SelectNodes("//a[@href]");
            if (anchors == null)
            {
                return;
            }

            foreach (HtmlNode link in anchors)
            {
                string hrefValue = link.GetAttributeValue("href", string.Empty);

                // Organic results are wrapped as "/url?q=<target>&sa=...".
                // Skip Google's own navigation links and accept both http
                // and https targets.
                if (!hrefValue.ToUpper().Contains("GOOGLE") && hrefValue.StartsWith("/url?q=http"))
                {
                    int index = hrefValue.IndexOf("&");
                    if (index > 0)
                    {
                        hrefValue = hrefValue.Substring(0, index);
                        listBox1.Items.Add(hrefValue.Replace("/url?q=", ""));
                    }
                }
            }
        }
    }
}
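
Since the question asks for the first 50 links: at the time this was written, Google accepted a num parameter (e.g. appending &num=50 to the search URL) to return more results per page, though nothing about this scraping approach is guaranteed to keep working as the markup changes (see the comments above).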
Gridiron answered 4/1, 2015 at 20:19 Comment(0)