Availability of a list with English words (including frequencies)? [closed]

I'm using Python to parse URLs into words. I am having some success, but I am trying to cut down on ambiguity. For example, I am given the following URL:

"abbeycarsuk.com"

and my algorithm outputs:

['abbey','car','suk'],['abbey','cars','uk']

Clearly the second parsing is the correct one, but the first is technically just as valid (apparently 'suk' is a word in the dictionary I am using).

What would help me out a lot is a wordlist that also contains the frequency/popularity of each word. I could work this into my algorithm, and then the second parsing would be chosen (since 'uk' is obviously more common than 'suk'). Does anyone know where I could find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not have enough words for me to use it successfully.
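To illustrate what I mean, here is a rough sketch of how I imagine using such a list; the frequency numbers and the `score` helper are made up for illustration:

```
def score(words, freq):
    # Average frequency of the words in a candidate split; unknown
    # words get a tiny default so they are heavily penalized.
    return sum(freq.get(w, 0.5) for w in words) / len(words)

freq = {'abbey': 120, 'car': 5000, 'cars': 3000, 'uk': 45000, 'suk': 2}
candidates = [['abbey', 'car', 'suk'], ['abbey', 'cars', 'uk']]

best = max(candidates, key=lambda words: score(words, freq))
print(best)  # ['abbey', 'cars', 'uk']
```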

Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself; however, if such a data set already exists, it would make my life a lot easier.
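That would amount to something like the following rough sketch of counting word frequencies over plain-text files (the file names are placeholders), which I would rather not have to build and maintain myself:

```
import re
from collections import Counter

freq = Counter()
for path in ['book1.txt', 'book2.txt']:   # e.g. Project Gutenberg downloads
    with open(path, encoding='utf-8') as f:
        freq.update(re.findall(r"[a-z']+", f.read().lower()))

print(freq.most_common(10))
```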

Handset answered 15/7, 2013 at 15:59 Comment(1)
You could use the free list on the site you mention (link here), and if a word isn't in that list, just assume its frequency is very low. Would that not be sufficient? – Orly

There is an extensive article on this very subject written by Peter Norvig (Google's head of research), which contains worked examples in Python and is fairly easy to understand. The article, along with the data used in the sample programs (some excerpts of Google ngram data), can be found here. The complete set of Google ngrams, for several languages, can be found here (free to download if you live in the east of the US).
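To give a flavour of the approach, here is a condensed sketch of the idea, not Norvig's actual code: pick the split whose words have the highest combined unigram probability. The counts below are placeholder numbers; a real run would load them from a frequency list such as the ngram excerpts.

```
from functools import lru_cache
from math import prod

counts = {'abbey': 120, 'car': 5000, 'cars': 3000, 'uk': 45000, 'suk': 2}
total = sum(counts.values())

def pword(word):
    # Unknown words get a tiny probability so they are only chosen
    # when no split into known words exists.
    return counts.get(word, 0.0001) / total

@lru_cache(maxsize=None)
def segment(text):
    if not text:
        return ()
    splits = ((text[:i],) + segment(text[i:]) for i in range(1, len(text) + 1))
    return max(splits, key=lambda words: prod(pword(w) for w in words))

print(segment('abbeycarsuk'))  # ('abbey', 'cars', 'uk')
```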

Unpen answered 15/7, 2013 at 16:40 Comment(0)

As you mention, "corpus" is the keyword to search for.

E.g., here is a nice list of resources:

http://www-nlp.stanford.edu/links/statnlp.html

(scroll down)

Nutrient answered 15/7, 2013 at 16:8 Comment(0)

http://ucrel.lancs.ac.uk/bncfreq/flists.html

This is perhaps the list you want. You could cut it down in size to improve performance if needed.

Here is a nice big list for you. More information available here.
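If you do trim it, something like the sketch below would work. Note that the word/frequency-per-line layout assumed here is a guess, so adjust the parsing to the file's actual format:

```
def load_freq(path, top_n=20000):
    # Assumed format: whitespace-separated columns with the word first
    # and an integer frequency last; adjust to the real file layout.
    freq = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[-1].isdigit():
                freq[parts[0].lower()] = int(parts[-1])
    # Keep only the top_n most frequent entries to reduce lookup cost.
    top = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)
```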

Orly answered 15/7, 2013 at 16:30 Comment(1)
<8000 words is a very sparse English vocabulary. – Deshawndesi

Have it search using a smaller dictionary first; a smaller dictionary will tend to contain only the more commonly used words. Then, if that fails, you could have it fall back to your more complete dictionary, which includes words like 'suk'.

You would then be able to skip the word frequency analysis entirely, though you would take a small hit in overhead by maintaining a second, smaller dictionary.
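A rough sketch of the two-pass idea, where `split_into_words` stands in for whatever splitting routine you already have:

```
def parse_name(name, small_dict, full_dict, split_into_words):
    # First pass: common words only. If the whole name splits cleanly,
    # we are done and never consider rare words like 'suk'.
    result = split_into_words(name, small_dict)
    if result:
        return result
    # Second pass: fall back to the complete dictionary.
    return split_into_words(name, full_dict)
```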

You might be able to use the list Orly linked in the comments as the smaller dictionary.

Edit: also, the site you mentioned does offer a free download of its list of the top 5,000 most frequently used words.

Sidonia answered 15/7, 2013 at 16:26 Comment(0)
