Where can I obtain an English dictionary with structured data? [closed]
B

5

40

I would like to download an English dictionary -- not just a word list -- in a structured format such as TXT, XML, or SQL.

Specifically, I need phonetic pronunciation and parts of speech (definition is not required).

Surprisingly, I can't find this online anywhere. Wiktionary is available for download, but it is only the MediaWiki articles themselves. Crawling all articles and extracting the phonetics and parts of speech would be a huge exercise.

Is this available anywhere? I don't mind paying.

Edit: a few people have asked what I would like to do. My immediate need is just curiosity, for example "what are the most common two-syllable verbs?". Eventually my hope would be a tool that helps you find available domain names, and does so by pairing the correct parts of speech, with bonus points for phonetic matches.

Note: cross-posted on English Language and Usage.

Baines answered 25/9, 2010 at 15:51 Comment(3)
Please check the Excel file present here: freedownloadscenter.com/Themes/School_Themes/…Shipper
Good to note that if you do decide to crawl that it shouldn't be too hard. They have CSS classes set on the pronunciation: <span class="IPA">/stʌf/</span>Correspond
This is filed as phabricator.wikimedia.org/T38881Tatia
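For anyone who does go the crawling route suggested in the comment above, here is a minimal sketch of pulling the contents of `<span class="IPA">` elements out of a page using only the Python standard library. The class name `IPA` comes from the comment; the helper names `IPAExtractor` and `extract_ipa` are made up for illustration, and fetching the pages themselves is left out.

```python
from html.parser import HTMLParser


class IPAExtractor(HTMLParser):
    """Collect the text of every <span class="IPA"> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # > 0 while inside an IPA span
        self.transcriptions = []  # finished transcriptions
        self._buf = []            # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            # A nested span inside an IPA span: track it so the
            # matching </span> does not end the capture early.
            self.depth += 1
        elif "IPA" in classes:
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.transcriptions.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buf.append(data)


def extract_ipa(html):
    """Return the list of IPA strings found in an HTML document."""
    parser = IPAExtractor()
    parser.feed(html)
    parser.close()
    return parser.transcriptions


if __name__ == "__main__":
    print(extract_ipa('<p><span class="IPA">/stʌf/</span></p>'))
```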
P
17

Go to http://www.speech.cs.cmu.edu/cgi-bin/cmudict and you will find the download page for the pronunciation dictionary at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/

The latest version is currently cmudict.0.7a.

This is what I am currently using to implement the syllable counter for http://www.haikuvillage.com. It's in Ruby and I'd be happy to open source it for you if that helps.
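As a rough illustration of how a syllable counter can be built on this file (a Python sketch, not the Ruby code mentioned above): each CMUdict entry is a word followed by its ARPAbet phones, and every vowel phone ends in a stress digit (0, 1, or 2), so counting those phones gives the syllable count. The function names are hypothetical, and the sample entries are written in the dictionary's format for illustration rather than copied from the file.

```python
import re


def parse_cmudict(lines):
    """Map each lowercase word to a list of pronunciations.

    Each pronunciation is a list of ARPAbet phones. Alternate
    pronunciations such as "WORD(1)" are folded into the same word.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comments
            continue
        head, *phones = line.split()
        word = re.sub(r"\(\d+\)$", "", head).lower()
        entries.setdefault(word, []).append(phones)
    return entries


def count_syllables(phones):
    """Count vowel phones, i.e. phones that end in a stress digit."""
    return sum(1 for phone in phones if phone[-1].isdigit())


if __name__ == "__main__":
    # Illustrative entries in CMUdict format (assumed, not verified
    # against cmudict.0.7a).
    sample = [
        ";;; comment line",
        "HAIKU  HH AY1 K UW0",
        "STUFF  S T AH1 F",
    ]
    cmu = parse_cmudict(sample)
    for word, prons in cmu.items():
        print(word, count_syllables(prons[0]))
```

To use the real file, pass an open file object for the downloaded dictionary to `parse_cmudict`; its encoding may need to be specified when opening it.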

Pixie answered 30/9, 2010 at 8:11 Comment(3)
Cool! This is extremely helpful. Now I need parts of speech...Baines
haikuvillage.com is wonderful!Corley
This is a pretty old question and I have a short timeframe, but I'd be interested in source or an explanation for how you're converting ARPAbet phones to syllables if you're still open to sharing itHouseless
A
8

The Moby Part-of-Speech dictionary is in the public domain and has a highly structured format: http://icon.shef.ac.uk/Moby/mpos.html

It is a simple text file. Each line is one entry, with the word on the left and the part-of-speech value (verb, etc.) on the right, separated by ×.
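A minimal sketch of reading that format in Python, assuming each line has already been decoded to text and that the part after × is a run of single-letter codes. The function names are hypothetical, the sample lines are invented, and the set of letters treated as verbs (`V`, `t`, `i`) is an assumption that should be checked against the README shipped with the file.

```python
# Assumed code letters for verb senses; verify against the Moby README.
VERB_CODES = frozenset("Vti")


def parse_mpos_line(line, sep="×"):
    """Split one entry into (word, codes); return None if no separator."""
    line = line.rstrip("\r\n")
    if sep not in line:
        return None
    # rpartition splits on the last separator in the line.
    word, _, codes = line.rpartition(sep)
    return word, codes


def is_verb(codes):
    """True if any of the entry's code letters is an assumed verb code."""
    return any(code in VERB_CODES for code in codes)


if __name__ == "__main__":
    # Invented sample lines in the word×codes layout described above.
    for raw in ["run×ViN\n", "quick×A\n"]:
        word, codes = parse_mpos_line(raw)
        print(word, codes, is_verb(codes))
```

The separator is a parameter because the file's encoding is not stated here; if it decodes to a different character, pass that character as `sep`.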

Alien answered 7/8, 2013 at 16:41 Comment(2)
link is broken.Gantlet
I found the part-of-speech dictionary here: ai1.ai.uga.edu/ftplib/natural-language/moby . The file is called mpos.tar.ZGoering
M
6

WordNet is one of the best dictionaries I know. Perhaps you will find something there: https://wordnet.princeton.edu/related-projects

Multitude answered 29/9, 2010 at 14:14 Comment(2)
This looks promising. I wish the data wasn't in a custom format, but it looks extractable.Baines
It doesn't look like it contains info on pronunciation like the IPA or syllable info for a word. I could be wrong though.Yahairayahata
S
2

Portman, when I used the SpellChecker tool from DevExpress, I learned about the OpenOffice dictionaries, and I'm pretty sure they have a well-defined data structure. I recommend using them in combination with any free or paid text-to-speech tool.

Hope that helps,

Shipper answered 25/9, 2010 at 16:20 Comment(5)
he's looking for pronunciations and parts of speech, not just a list of words (which is what DevExpress and OpenOffice provide).Duffer
@Jess - DevExpress uses the OpenOffice word list, but also has a SpellChecker. I recommended that he use the standard .dic and .aff files to find the words, then a tool to handle the pronunciation.Shipper
the OpenOffice files are actually a subset of Aspell. They include only spelling. No parts of speech and no pronunciation.Baines
@Portman - Totally agree. My suggestion was to use them as a list of words to be "spoken" by any free text-to-speech tool. There are plenty of them on the internet ;)Shipper
I think he wants ACTUAL pronunciation that he can parse. It's not like he's going to listen to the TTS engine's pronunciation then write it down (and TTS engines usually aren't terribly good beyond the top 10,000 most common words).Duffer
H
1

This is not a direct answer to your question, but the Double Metaphone algorithm is very good at finding word or phrase matches for search engine application servers (such as Solr and others).

I cannot tell what your intended use of this is, so I can't tell if my suggestion is useful or not. If it is close to your intended use, the Wikipedia page about Double Metaphone has a listing of about a dozen implementations of it which may be worth exploring.

http://en.wikipedia.org/wiki/Double_Metaphone

Hydrogenize answered 27/9, 2010 at 18:57 Comment(0)
