Free database of Google word frequencies?

4

10

On the Stack Overflow podcast this week, Jeff mentioned that in 2004 he wrote a script which queried Google with 110,000 English words and collected a database containing the number of hits for each word. They use this on Stack Overflow, e.g. for the "Related" list on the right-hand side of each question page.

Since creating one of these today with a similar script would be difficult (as Joel mentioned, "at 30,000 words you get a knock at your door"), I was wondering if anyone knows of a more up-to-date, free database of Google word frequencies (e.g. for IT words whose frequencies have surely changed since then, such as jquery, ruby, azure, etc.).

Goldthread answered 4/12, 2008 at 9:20 Comment(1)
A link to the relevant podcast would be interesting to have. - Doughman
H
5

A quick Google search(!) turns up a few hits. This link looks promising:

But it's not targeted at IT words.

Heroic answered 4/12, 2008 at 9:26 Comment(0)
A
3

It may be late to answer this, but I can propose a different approach. Instead of getting the "number of hits" from Google, compute an approximation of it yourself: get a big collection of text pages (a corpus) and count the occurrences of each word in it. I have done this with Wikipedia. There is a dump of all wiki pages; you just need to write a parser to extract the text and count the words. The result is a list of far more than 110K words (at least 2M-3M). If you really need the numbers from Google search results, you can query Google for a sample of words and then normalize your computed values to match the Google values. I hope this helps.
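A minimal sketch of the counting and normalization steps, in Python. It assumes the page text has already been extracted from the dump; the function names, the tokenizing regex, and the single scale factor (ratio of sums over the sampled words) are illustrative choices, not the exact method I used.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of lowercase letters or digits.
WORD_RE = re.compile(r"[a-z0-9]+")

def count_words(texts):
    """Count word occurrences across an iterable of plain-text pages."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    return counts

def scale_to_reference(counts, reference_hits):
    """Scale corpus counts so they roughly match known hit counts.

    reference_hits maps a small sample of words to externally obtained
    hit counts. One global factor is used: the sum of the reference hits
    divided by the sum of the corpus counts for the same words.
    """
    shared = [w for w in reference_hits if counts.get(w, 0) > 0]
    if not shared:
        raise ValueError("no sampled word occurs in the corpus")
    scale = sum(reference_hits[w] for w in shared) / sum(counts[w] for w in shared)
    return {word: count * scale for word, count in counts.items()}
```

For example, `scale_to_reference(count_words(pages), {"ruby": 200, "jquery": 100})` returns an estimated hit count for every word in the hypothetical `pages`, not only the two sampled ones. A single factor is crude; fitting on a larger sample would likely give better estimates.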

Adduce answered 20/5, 2009 at 11:52 Comment(0)
T
1

According to Google, you may send 50,000 queries per day per IP. I don't really think it is illegal to split the list between your friends.

I had a similar problem with queries per day per IP, but we solved it with a totally different approach.

Tendentious answered 18/12, 2008 at 15:11 Comment(0)
D
0

You can split the list between your friends/colleagues, use sufficiently large timeouts so you don't exceed 50,000 requests per day per IP, and then merge the results. I'm not sure about the legality of this approach, but the probability of having Google people "knocking at your door" using this method is pretty low.
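A rough sketch of the bookkeeping only (splitting, pacing, merging), with no actual querying. The 50,000-per-day figure is the one quoted in this thread and is not verified here; the helper names are made up for illustration.

```python
SECONDS_PER_DAY = 86_400

def split_round_robin(words, n_workers):
    """Deal the word list out into n_workers roughly equal sublists."""
    return [words[i::n_workers] for i in range(n_workers)]

def min_delay_seconds(daily_limit=50_000):
    """Smallest pause between requests that stays within daily_limit.

    daily_limit defaults to the figure quoted in this thread (an assumption).
    """
    return SECONDS_PER_DAY / daily_limit

def merge_results(partials):
    """Combine per-worker {word: hits} dicts into one dict."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged
```

With the quoted limit, `min_delay_seconds()` works out to about 1.7 seconds between requests per IP, so each person's loop would sleep at least that long between queries.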

NOTE: edited according to data provided by Skuta

Diviner answered 18/12, 2008 at 15:9 Comment(0)
