Getting the number of results of a Google (or other) search programmatically

I am working on a little personal project. Ideally, I would like to be able to run a Google search programmatically and get the result count. (My goal is to compare the result counts of a large number (100,000+) of different phrases.)

Is there a free way to run a web search and compare the popularity of different texts, using Google, Bing, or anything else (the source is not really important)?

I tried Google, but it seems I can only make 10 free requests per day. Bing is more permissive (5,000 free requests per month).

Are there other tools or ways to get the result count for a particular sentence for free? Thanks in advance.
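For reference, here is a rough sketch of the kind of call I have in mind, using the Bing Web Search API with the requests library. The subscription key is a placeholder, and `totalEstimatedMatches` being the estimated-count field is my reading of the v7 response, so it needs verifying against the Bing documentation.

```python
import requests

# Placeholder key; replace with your own Bing Web Search subscription key.
SUBSCRIPTION_KEY = "YOUR_BING_KEY"
ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"  # v7 endpoint; older versions differ

def estimated_result_count(phrase: str) -> int:
    """Return Bing's estimated match count for an exact-phrase query."""
    response = requests.get(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        params={"q": f'"{phrase}"', "count": 1},  # quotes force exact-phrase matching
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    # totalEstimatedMatches is an estimate and can fluctuate between calls.
    return data.get("webPages", {}).get("totalEstimatedMatches", 0)

if __name__ == "__main__":
    print(estimated_result_count("to be or not to be"))
```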

Overrule asked 31/7, 2016 at 23:10
If your phrases are short or unique enough, Google makes 5-gram databases available (from July 2012). See storage.googleapis.com/books/ngrams/books/datasetsv2.html (Evocator)
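If those n-gram files cover the phrases, a rough sketch of tallying one shard could look like the following; it assumes the v2 files are gzipped TSV with ngram, year, match_count, and volume_count columns, which should be checked against the dataset page linked above.

```python
import gzip
from collections import defaultdict

# Hypothetical local copy of one shard of the 2012-07-01 English 5-gram dataset.
# Assumed line format: ngram<TAB>year<TAB>match_count<TAB>volume_count
NGRAM_FILE = "googlebooks-eng-all-5gram-20120701-aa.gz"

def total_counts(path: str) -> dict:
    """Sum match counts across years for every 5-gram in one shard."""
    totals = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            ngram, _year, match_count, _volumes = line.rstrip("\n").split("\t")
            totals[ngram] += int(match_count)
    return totals

# counts = total_counts(NGRAM_FILE)          # the shards are large; this takes a while
# print(counts.get("to be or not to", 0))
```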

There are several things you're going to need if you're seeking to create a simple search engine.

First of all, you should read and understand where the field of information retrieval started, with G. Salton's paper, or at least read the wiki page on the vector space model. It will require learning at least some undergraduate linear algebra; I suggest Gilbert Strang's MIT video lectures for this.
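As a toy illustration of the vector space model idea (plain TF-IDF with cosine similarity, not Salton's exact weighting scheme), something like this sketch:

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Build TF-IDF vectors for a list of tokenised documents (lists of words)."""
    doc_freq = Counter(word for doc in documents for word in set(doc))
    n_docs = len(documents)
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts of word -> weight)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["cats", "and", "dogs"]]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))
```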

You can then move on to the Brin/Page PageRank paper, which lays out the original concept behind the hyperlink matrix and quickly calculating eigenvectors for ranking, or read the wiki page.
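And a toy power-iteration PageRank, just to make the eigenvector idea concrete; the damping factor and the small link graph here are made up, and this is nothing like a production ranker:

```python
import numpy as np

def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict of node -> list of outgoing links."""
    nodes = sorted(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    # Column-stochastic link matrix: M[j, i] = 1/outdegree(i) if i links to j.
    matrix = np.zeros((n, n))
    for node, links in adjacency.items():
        if links:
            for target in links:
                matrix[index[target], index[node]] = 1.0 / len(links)
        else:
            matrix[:, index[node]] = 1.0 / n  # dangling node: spread rank evenly
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        rank = (1 - damping) / n + damping * matrix @ rank
    return dict(zip(nodes, rank))

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(links))
```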

You may also be interested in looking at the code for Apache Lucene.

To get into contemporary search algorithm techniques, you need calculus and regression analysis to learn machine learning and deep learning, as current Google search has moved away from PageRank and uses these instead. This is partially due to how link farming enabled people to artificially engineer search results, and to the huge amount of metadata that modern browsers and web servers allow to be collected.

EDIT:

For the web-crawler-only portion, I'd recommend WebSPHINX. I used it in my senior research in college in conjunction with Lucene.

Nittygritty answered 31/7, 2016 at 23:26
How exactly do PageRank, linear algebra, and machine learning help count the occurrences of a particular phrase on the Internet, which is what the OP is trying to do? (Blitz)
That's the only way to actually determine what he wants. In addition, the Salton vector space model was made to do exactly this task. PageRank is another means of calculating this. You can't understand either without linear algebra. (Nittygritty)
This task does not need any kind of ranking at all. All you need is to implement a crawler and a basic way to search for occurrences of a phrase in the crawl results; a minimal sketch of that approach follows this exchange. (Blitz)
The ranking part is just included in understanding the motivation of the algorithm. You'd still need the web crawler to create some data structure isomorphic to the hyperlink matrix or an object in the vector space. (Nittygritty)
The linear algebra would be necessary for figuring out a way to actually do this calculation in a reasonable amount of time. (Nittygritty)
And also compress it... I mean, unless you're planning on doing this search on some very small collection of domains. (Nittygritty)
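A minimal sketch of the crawl-and-count approach discussed in this exchange, using the requests and BeautifulSoup libraries rather than WebSPHINX or Lucene; robots.txt, rate limiting, deduplication beyond exact URLs, and scale are all ignored here:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def count_phrase(start_url: str, phrase: str, max_pages: int = 50) -> int:
    """Breadth-first crawl from start_url, counting occurrences of phrase in page text."""
    domain = urlparse(start_url).netloc
    seen, queue, hits = {start_url}, deque([start_url]), 0
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        hits += soup.get_text().lower().count(phrase.lower())
        # Stay on the same domain and avoid revisiting pages.
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
    return hits

print(count_phrase("https://example.com", "example domain"))
```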
Thanks for your response and for the good references. To clarify my question: I am not trying to re-create a search engine, I just want to use an existing one. My problem: I have a lot of sentences (from some books, for instance), and I would like to run a Google (or other) search to get the result count for each sentence. Is there a search engine with a free, or at least permissive, API that allows programmatic searches? (Overrule)
Yes. Lucene is good for that and very transparent. It is mostly an API. (Nittygritty)
Actually, Lucene doesn't fit my needs. I really want an API that I can query over the network that tells me how many times a string appears on the web. Lucene is a search engine; I would have to build a web crawler to be able to use it for my purpose. (Overrule)
@Overrule I've added the link to WebSPHINX. I used it in conjunction with Lucene for a college project. I had neglected to include it earlier because I thought it was integrated into Lucene, since it was an Apache project as well. (Nittygritty)
@JeffreyColeman The OP just wanted to parse the Google results page and get the text at the top of the page that says how many results there are. What you are explaining has no relation to what he wanted, and both your suggestions of Lucene and WebSPHINX are irrelevant. (Hugely)
@Hugely He mentioned "I tried Google, but it seems I can only make 10 free requests per day. Bing is more permissive (5,000 free requests per month)" and asked for "other tools". I took this to mean that simply parsing the results page was not an option. (Nittygritty)
I gave him what he needed to implement his own crawler and process the results. (Nittygritty)
@Hugely I am at a loss as to what techniques and APIs, other than what I provided, can handle the volume and scope of what he wants to achieve. (Nittygritty)
@JeffreyColeman The problem is that most of the ways of doing this are rogue ways, by which I mean they use a pool of many IPs and make the requests from those IPs in a way that tricks the Google API into thinking the requests come from different users. Some will do a normal web search and simply parse the search page source, adding a delay between requests to get around the API limits. But those are not solid methods, and I don't think they will scale to hundreds of thousands of words. (Hugely)
@Hugely That's a good point. I don't think the OP has a good idea of the scope of what they are asking for. There's a chance that doing something like that could violate the terms of service, but I can't speak to the specifics of the search APIs' terms. There are many other concerns with doing this for an actual commercial solution, which is what my answer is geared towards. Doing it via parsing and waiting on a timeout would likely need a cloud instance and a decent chunk of cash and time. It does seem like a really worthwhile project to find an independent solution, though. (Nittygritty)
@Hugely I think there are probably other legal concerns related to this as well; the OP's country might even have relevant laws. I'm really interested in seeing other recommendations because it's an area of interest of mine, but I literally wrote an 80-page paper on this in 2007 and know there's a mountain of concerns, including networking and SSL. (Nittygritty)
@Hugely As a college student, it took me about three months of solid research and about four months of coding, but I don't even have access to it anymore, and I believe my private college owns it. I also had access to some well-connected servers. (Nittygritty)
@Hugely I found something: commoncrawl.org seems to be the way to go, but I don't know yet how to integrate it programmatically. (Overrule)
@Overrule CommonCrawl is only practical if you are using AWS; it's a publicly accessible S3 bucket. If you plan on using AWS to host your solution, it is likely your best option. You'd also need to use Elastic MapReduce, though this is Hadoop-based and would have to be run on a per-job basis, so it would not be fast enough for web-based search on a word or phrase. (Nittygritty)
@Overrule It is also a static crawl from about two months ago per source, so your crawl data would be at least two months old. It will not perform a crawl on request. (Nittygritty)
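A hedged sketch of querying Common Crawl without a full AWS/Elastic MapReduce setup, via the public CDX index server. The crawl label, the JSON field names, and the data.commoncrawl.org range-request access are assumptions to verify against the Common Crawl documentation, and this only counts a phrase in pages whose URLs you already know:

```python
import gzip
import json

import requests

# Assumed crawl label; pick a current one from https://index.commoncrawl.org/
INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-30-index"
DATA_HOST = "https://data.commoncrawl.org/"

def phrase_count_for_url(url_pattern: str, phrase: str, limit: int = 5) -> int:
    """Look up captures of url_pattern in the Common Crawl index and count a phrase in them."""
    resp = requests.get(INDEX, params={"url": url_pattern, "output": "json"}, timeout=30)
    resp.raise_for_status()
    records = [json.loads(line) for line in resp.text.splitlines()][:limit]
    hits = 0
    for rec in records:
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        warc = requests.get(
            DATA_HOST + rec["filename"],
            headers={"Range": f"bytes={start}-{end}"},  # fetch just this WARC record
            timeout=30,
        )
        # Each record is assumed to be an independently gzipped WARC entry (headers + body).
        text = gzip.decompress(warc.content).decode("utf-8", errors="replace")
        hits += text.lower().count(phrase.lower())
    return hits

print(phrase_count_for_url("example.com/*", "example domain"))
```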
