I'm using Python to parse URLs into words. I am having some success, but I am trying to cut down on ambiguity. For example, I am given the following URL
"abbeycarsuk.com"
and my algorithm outputs:
['abbey','car','suk'],['abbey','cars','uk']
Clearly the second parsing is the intended one, but the first is technically just as valid (apparently 'suk' is a word in the dictionary I am using).
What would help me out a lot is a wordlist that also contains the frequency/popularity of each word. I could work this into my algorithm, and the second parsing would then be chosen (since 'uk' is obviously more common than 'suk'). Does anyone know where I could find such a list? I found wordfrequency.info, but they charge for the data, and the free sample they offer does not have enough words for me to use it successfully.
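To give an idea of how I would use it, here is a rough sketch of the ranking step I have in mind (word_freq and its counts are made up for illustration; the score is just a sum of log frequencies, with a small floor for unseen words):

    import math

    # Hypothetical frequency table: word -> count in some large corpus.
    word_freq = {'abbey': 4520, 'car': 98231, 'cars': 41337, 'uk': 75200, 'suk': 12}

    def score(parse, freq):
        """Sum of log frequencies; unseen words fall back to a tiny count."""
        return sum(math.log(freq.get(word, 0.5)) for word in parse)

    parses = [['abbey', 'car', 'suk'], ['abbey', 'cars', 'uk']]
    best = max(parses, key=lambda p: score(p, word_freq))
    print(best)  # ['abbey', 'cars', 'uk'] with these made-up counts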
Alternatively, I suppose I could download a large corpus (Project Gutenberg?) and compute the frequency values myself, but if such a data set already exists, it would make my life a lot easier.
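If I do end up going the corpus route, I expect the counting itself would be straightforward, roughly along these lines ('corpus.txt' standing in for whatever Gutenberg texts I concatenate):

    import re
    from collections import Counter

    # Count lowercase word frequencies from a plain-text corpus file.
    with open('corpus.txt', encoding='utf-8') as f:
        counts = Counter(re.findall(r"[a-z]+", f.read().lower()))

    print(counts.most_common(10))  # the ten most frequent words

My worry is more about coverage and cleanup (headers, licensing boilerplate, etc.) than the counting, which is why a ready-made list would be preferable.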