Getting the number of results of a Google (or other) search programmatically

I am working on a little personal project. Ideally, I would like to be able to run a Google search programmatically and get the result count. (My goal is to compare the result counts of a large number (100,000+) of different phrases.)

Is there a free way to run a web search and compare the popularity of different texts, using Google, Bing, or anything else (the source is not really important)?

I tried Google, but it seems I can only make 10 free requests per day. Bing is more permissive (5,000 free requests per month).

Are there other tools or ways to get the result count for a particular sentence for free? Thanks in advance.
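For reference, here is a rough sketch of the kind of call I have in mind, using the Bing Web Search API with the requests library. The subscription key is a placeholder, and `totalEstimatedMatches` being the estimated-count field is my reading of the v7 response, so it needs verifying against the Bing documentation.

```python
import requests

# Placeholder key; replace with your own Bing Web Search subscription key.
SUBSCRIPTION_KEY = "YOUR_BING_KEY"
ENDPOINT = "https://api.bing.microsoft.com/v7.0/search"  # v7 endpoint; older versions differ

def estimated_result_count(phrase: str) -> int:
    """Return Bing's estimated match count for an exact-phrase query."""
    response = requests.get(
        ENDPOINT,
        headers={"Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY},
        params={"q": f'"{phrase}"', "count": 1},  # quotes force exact-phrase matching
        timeout=10,
    )
    response.raise_for_status()
    data = response.json()
    # totalEstimatedMatches is an estimate and can fluctuate between calls.
    return data.get("webPages", {}).get("totalEstimatedMatches", 0)

if __name__ == "__main__":
    print(estimated_result_count("to be or not to be"))
```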

Overrule asked 31/7, 2016 at 23:10
If your phrases are short or unique enough, Google makes 5-gram databases available (from July 2012). See storage.googleapis.com/books/ngrams/books/datasetsv2.html (Evocator)
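If those n-gram files cover the phrases, a rough sketch of tallying one shard could look like the following; it assumes the v2 files are gzipped TSV with ngram, year, match_count, and volume_count columns, which should be checked against the dataset page linked above.

```python
import gzip
from collections import defaultdict

# Hypothetical local copy of one shard of the 2012-07-01 English 5-gram dataset.
# Assumed line format: ngram<TAB>year<TAB>match_count<TAB>volume_count
NGRAM_FILE = "googlebooks-eng-all-5gram-20120701-aa.gz"

def total_counts(path: str) -> dict:
    """Sum match counts across years for every 5-gram in one shard."""
    totals = defaultdict(int)
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            ngram, _year, match_count, _volumes = line.rstrip("\n").split("\t")
            totals[ngram] += int(match_count)
    return totals

# counts = total_counts(NGRAM_FILE)          # the shards are large; this takes a while
# print(counts.get("to be or not to", 0))
```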

There are several things you're going to need if you're seeking to create a simple search engine.

First of all, you should read and understand where the field of information retrieval started, with G. Salton's paper, or at least read the wiki page on the vector space model. It will require learning at least some undergraduate linear algebra; I suggest Gilbert Strang's MIT video lectures for this.
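As a toy illustration of the vector space model idea (plain TF-IDF with cosine similarity, not Salton's exact weighting scheme), something like this sketch:

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Build TF-IDF vectors for a list of tokenised documents (lists of words)."""
    doc_freq = Counter(word for doc in documents for word in set(doc))
    n_docs = len(documents)
    vectors = []
    for doc in documents:
        tf = Counter(doc)
        vectors.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        })
    return vectors

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (dicts of word -> weight)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["cats", "and", "dogs"]]
vecs = tf_idf_vectors(docs)
print(cosine_similarity(vecs[0], vecs[1]))
```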

You can then move on to the Brin/Page PageRank paper, which lays out the original concept behind the hyperlink matrix and quickly calculating eigenvectors for ranking, or read the wiki page.
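And a toy power-iteration PageRank, just to make the eigenvector idea concrete; the damping factor and the small link graph here are made up, and this is nothing like a production ranker:

```python
import numpy as np

def pagerank(adjacency, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict of node -> list of outgoing links."""
    nodes = sorted(adjacency)
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    # Column-stochastic link matrix: M[j, i] = 1/outdegree(i) if i links to j.
    matrix = np.zeros((n, n))
    for node, links in adjacency.items():
        if links:
            for target in links:
                matrix[index[target], index[node]] = 1.0 / len(links)
        else:
            matrix[:, index[node]] = 1.0 / n  # dangling node: spread rank evenly
    rank = np.full(n, 1.0 / n)
    for _ in range(iterations):
        rank = (1 - damping) / n + damping * matrix @ rank
    return dict(zip(nodes, rank))

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(pagerank(links))
```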

You may also be interested in looking at the code for Apache Lucene.

To get into contemporary search algorithm techniques, you need calculus and regression analysis to learn machine learning and deep learning, as current Google search has moved away from PageRank and uses these instead. This is partially due to how link farming enabled people to artificially engineer search results, and to the huge amount of metadata that modern browsers and web servers allow to be collected.

EDIT:

For the web-crawler-only portion, I'd recommend WebSPHINX. I used it in my senior research in college in conjunction with Lucene.

Nittygritty answered 31/7, 2016 at 23:26
How exactly do PageRank, linear algebra, and machine learning help count the occurrences of a particular phrase on the Internet, which is what the OP is trying to do? (Blitz)
That's the only way to actually determine what he wants. In addition, the Salton vector space model was made to do exactly this task. PageRank is another means of calculating this. You can't understand either without linear algebra. (Nittygritty)
This task does not need any kind of ranking at all. All you need is to implement a crawler and a basic way to search for occurrences of a phrase in the crawl results; a minimal sketch of that approach follows this exchange. (Blitz)
The ranking part is just included in understanding the motivation of the algorithm. You'd still need the web crawler to create some data structure isomorphic to the hyperlink matrix or an object in the vector space. (Nittygritty)
The linear algebra would be necessary for figuring out a way to actually do this calculation in a reasonable amount of time. (Nittygritty)
And also compress it... I mean, unless you're planning on doing this search on some very small collection of domains. (Nittygritty)
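A minimal sketch of the crawl-and-count approach discussed in this exchange, using the requests and BeautifulSoup libraries rather than WebSPHINX or Lucene; robots.txt, rate limiting, deduplication beyond exact URLs, and scale are all ignored here:

```python
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def count_phrase(start_url: str, phrase: str, max_pages: int = 50) -> int:
    """Breadth-first crawl from start_url, counting occurrences of phrase in page text."""
    domain = urlparse(start_url).netloc
    seen, queue, hits = {start_url}, deque([start_url]), 0
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, timeout=5)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        hits += soup.get_text().lower().count(phrase.lower())
        # Stay on the same domain and avoid revisiting pages.
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if urlparse(target).netloc == domain and target not in seen:
                seen.add(target)
                queue.append(target)
    return hits

print(count_phrase("https://example.com", "example domain"))
```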
Thanks for your response and for the good references. To clarify my question: I am not trying to re-create a search engine, I just want to use an existing one. My problem: I have a lot of sentences (from some books, for instance), and I would like to run a Google (or other) search to get the result count for each sentence. Is there a search engine with a free, or at least permissive, API that allows programmatic searches? (Overrule)
Yes. Lucene is good for that and very transparent. It is mostly an API. (Nittygritty)
Actually, Lucene doesn't fit my needs. I really want an API that I can query over the network that tells me how many times a string appears on the web. Lucene is a search engine; I would have to build a web crawler to be able to use it for my purpose. (Overrule)
@Overrule I've added the link to WebSPHINX. I used it in conjunction with Lucene for a college project. I had neglected to include it earlier because I thought it was integrated into Lucene, since it was an Apache project as well. (Nittygritty)
@JeffreyColeman The OP just wanted to parse the Google results page and get the text at the top of the page that says how many results there are. What you are explaining has no relation to what he wanted, and both your suggestions of Lucene and WebSPHINX are irrelevant. (Hugely)
@Hugely He mentioned "I tried Google, but it seems I can only make 10 free requests per day. Bing is more permissive (5,000 free requests per month)" and asked for "other tools". I took this to mean that simply parsing the results page was not an option. (Nittygritty)
I gave him what he needed to implement his own crawler and process the results. (Nittygritty)
@Hugely I am at a loss as to what techniques and APIs, other than what I provided, can handle the volume and scope of what he wants to achieve. (Nittygritty)
@JeffreyColeman The problem is that most of the ways of doing this are rogue ways, by which I mean they use a pool of many IPs and make the requests from those IPs in a way that tricks the Google API into thinking the requests come from different users. Some will do a normal web search and simply parse the search page source, adding a delay between requests to get around the API limits. But those are not solid methods, and I don't think they will scale to hundreds of thousands of words. (Hugely)
@Hugely That's a good point. I don't think the OP has a good idea of the scope of what they are asking for. There's a chance that doing something like that could violate the terms of service, but I can't speak to the specifics of the search APIs' terms. There are many other concerns with doing this for an actual commercial solution, which is what my answer is geared towards. Doing it via parsing and waiting on a timeout would likely need a cloud instance and a decent chunk of cash and time. It does seem like a really worthwhile project to find an independent solution, though. (Nittygritty)
@Hugely I think there are probably other legal concerns related to this as well; the OP's country might even have relevant laws. I'm really interested in seeing other recommendations because it's an area of interest of mine, but I literally wrote an 80-page paper on this in 2007 and know there's a mountain of concerns, including networking and SSL. (Nittygritty)
@Hugely As a college student, it took me about three months of solid research and about four months of coding, but I don't even have access to it anymore, and I believe my private college owns it. I also had access to some well-connected servers. (Nittygritty)
@Hugely I found something: commoncrawl.org seems to be the way to go, but I don't know yet how to integrate it programmatically. (Overrule)
@Overrule CommonCrawl is only practical if you are using AWS; it's a publicly accessible S3 bucket. If you plan on using AWS to host your solution, it is likely your best option. You'd also need to use Elastic MapReduce, though this is Hadoop-based and would have to be run on a per-job basis, so it would not be fast enough for web-based search on a word or phrase. (Nittygritty)
@Overrule It is also a static crawl from about two months ago per source, so your crawl data would be at least two months old. It will not perform a crawl on request. (Nittygritty)
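A hedged sketch of querying Common Crawl without a full AWS/Elastic MapReduce setup, via the public CDX index server. The crawl label, the JSON field names, and the data.commoncrawl.org range-request access are assumptions to verify against the Common Crawl documentation, and this only counts a phrase in pages whose URLs you already know:

```python
import gzip
import json

import requests

# Assumed crawl label; pick a current one from https://index.commoncrawl.org/
INDEX = "https://index.commoncrawl.org/CC-MAIN-2016-30-index"
DATA_HOST = "https://data.commoncrawl.org/"

def phrase_count_for_url(url_pattern: str, phrase: str, limit: int = 5) -> int:
    """Look up captures of url_pattern in the Common Crawl index and count a phrase in them."""
    resp = requests.get(INDEX, params={"url": url_pattern, "output": "json"}, timeout=30)
    resp.raise_for_status()
    records = [json.loads(line) for line in resp.text.splitlines()][:limit]
    hits = 0
    for rec in records:
        start = int(rec["offset"])
        end = start + int(rec["length"]) - 1
        warc = requests.get(
            DATA_HOST + rec["filename"],
            headers={"Range": f"bytes={start}-{end}"},  # fetch just this WARC record
            timeout=30,
        )
        # Each record is assumed to be an independently gzipped WARC entry (headers + body).
        text = gzip.decompress(warc.content).decode("utf-8", errors="replace")
        hits += text.lower().count(phrase.lower())
    return hits

print(phrase_count_for_url("example.com/*", "example domain"))
```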
