Determine the difficulty of an English word
R

13

25

I am working on a word-based game. My word database contains around 10,000 English words (sorted alphabetically). I am planning to have 5 difficulty levels in the game. Level 1 shows the easiest words and Level 5 shows the most difficult words, relatively speaking.

I need to divide the 10,000-word list into 5 levels, from the easiest words to the most difficult ones. I am looking for a program to do this for me.

Can someone tell me if there is an algorithm or a method to quantitatively measure the difficulty of an English word?

I have some thoughts revolving around using "word length" and "word frequency" as factors, and coming up with a formula or something similar that accomplishes this.

Relativity answered 28/2, 2011 at 10:58 Comment(6)
You have to tell us more about what word difficulty means for you. – Swaggering
That really depends on what you mean by 'difficulty'. What does the player have to do with the word? Guess the spelling, the meaning, figure it out from an anagram? – Breakfront
"Commitment" is a difficult word for many men, would that be a good criterion? – Truong
Umm, I know there cannot be a universal way to declare a word easy or difficult, it is pretty much subjective. But on average you'd consider the word "ABEYANCE" more difficult than "ABNORMAL". Maybe we can base it on common usage frequency? – Relativity
@Breakfront The game is Jumbled Letters. The player has to arrange all the letters correctly, in order, to form the word. – Relativity
@Relativity Frequency of use would be a good measure in your case, provided that you can actually get that measure for all words (you could use the number of results returned by Google as a proxy, for instance). Otherwise you can approximate it with the length of the word and make your game learn from your players' mistakes (i.e. the longer it takes, or the more attempts your players need, to guess a word, the more weight you add to that word's "difficulty"). – Rubdown
P
13

Get a large corpus of texts (e.g. from the Gutenberg archives), do a straight frequency analysis, and eyeball the results. If they don't look satisfying, weight each text with its Flesch-Kincaid score and run the analysis again - words that show up frequently, but in "difficult" texts will get a score boost, which is what you want.

If all you have is 10000 words, though, it will probably be quicker to just do the frequency sorting as a first pass and then tweak the results by hand.
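The frequency-sorting first pass could look roughly like the sketch below. It assumes the corpus has already been loaded into a single string; the helper names `word_frequencies` and `split_into_levels` are made up for illustration, and the equal-sized bands are just one possible way to cut the ranking.

```python
import re
from collections import Counter


def word_frequencies(corpus_text):
    """Count how often each lowercase alphabetic token occurs in the corpus."""
    return Counter(re.findall(r"[a-z]+", corpus_text.lower()))


def split_into_levels(words, freq, levels=5):
    """Sort words from most to least frequent and cut the list into equal bands.

    Level 1 holds the most frequent (assumed easiest) words. Words missing
    from the corpus count as frequency 0 and land in the hardest level.
    """
    ranked = sorted(words, key=lambda w: (-freq[w.lower()], w))
    band = -(-len(ranked) // levels)  # ceiling division
    return {w: i // band + 1 for i, w in enumerate(ranked)}


# Toy corpus, for illustration only.
corpus = "the cat sat on the mat and the dog sat too"
freq = word_frequencies(corpus)
levels = split_into_levels(["the", "sat", "cat", "abeyance"], freq, levels=2)
```

With a real corpus you would pass the full 10,000-word list and `levels=5`, then adjust the borderline words by hand as suggested above.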

Pluvious answered 28/2, 2011 at 15:8 Comment(0)
E
7

I'm not understanding how frequency is being used... if you were to scan a newspaper, I'm sure you would see the word "thoroughly" mentioned much more frequently than the word "bop" or "moo" but that doesn't mean it's an easier word; on the contrary 'thoroughly' is one of the most disgustingly absurd spelling anomalies that gives grade school children nightmares...

Try explaining to a sane human being learning English as a second language the subtle difference between slaughter and laughter.

Elyseelysee answered 25/4, 2013 at 21:0 Comment(3)
Oh yeah! And then (on the pronunciation side) why sheath/sheathe, wreath/wreathe but not breath/breathe. – Diseur
Thanks for reminding us of this! This should probably be a comment, though, as it is not an answer to the question, strictly speaking. Cheers! – Cenacle
The infrequency or absence of certain words in a subset of a meaningful corpus is immaterial. What's germane is to have a sufficient sampling of materials to thoroughly draw from. – Kamp
S
3

I agree that frequency of use is the most likely metric; there are studies supporting a high correlation between word frequency and difficulty (correct responses on tests, etc.). Check out the English Lexicon Project at http://elexicon.wustl.edu/ for some 70k(?) frequency-rated words.

Suppose answered 4/3, 2012 at 3:50 Comment(1)
"There are studies" ← links? :-) – Cenacle
Z
3

Crowd-source the answer.

  • Create an online 'game' that lists 10 words at random.
  • Get the player to drag and drop them into order from easiest to hardest, and to tick a box to indicate whether they have ever heard of each word.
  • Apply a ranking algorithm (e.g. Elo) to the result of each experiment.
  • Repeat.

It might even be fun to play; you could get a language proficiency score at the end.
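One way the ranking step might work is sketched below. It treats every pair of words in a player's ordering as an Elo "match" won by the word judged harder. The function names, the starting rating of 1500 and the K-factor of 16 are assumptions for illustration, and the "ever heard of the word" tick is not modelled here.

```python
from itertools import combinations


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A 'beats' B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_from_ordering(ratings, ordering, k=16.0, start=1500.0):
    """Update ratings in place from one player's easiest-to-hardest ordering.

    Every pair in the ordering is treated as a match that the harder word
    wins, so a higher rating means 'judged harder'.
    """
    for easier, harder in combinations(ordering, 2):
        r_easy = ratings.setdefault(easier, start)
        r_hard = ratings.setdefault(harder, start)
        gain = k * (1.0 - expected_score(r_hard, r_easy))
        ratings[harder] = r_hard + gain
        ratings[easier] = r_easy - gain
    return ratings


ratings = {}
update_from_ordering(ratings, ["cat", "abnormal", "abeyance"])
update_from_ordering(ratings, ["cat", "abeyance", "abnormal"])
hardest_first = sorted(ratings, key=ratings.get, reverse=True)
```

After enough experiments, sorting by rating and cutting the list into five bands would give the levels.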

Zibet answered 24/6, 2015 at 1:22 Comment(0)
B
1

Difficulty is a pretty amorphous concept. If you've no clear idea of what you want, perhaps you could take a look at the Porter Stemming Algorithm (see for example the original paper). That contains a more advanced idea of 'length' by defining words as being of the form [C](VC){m}[V]; C means a block of consonants and V a block of vowels, and this definition says a word is an optional C followed by m VC blocks and finally an optional V. The m value is this advanced 'length'.
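As a rough illustration, the m value can be computed by counting how many times a vowel block is followed by a consonant block. The sketch below uses Porter's definition of a vowel (a, e, i, o, u, plus y when it follows a consonant); the function name `porter_measure` is made up for this example.

```python
def porter_measure(word):
    """Return m in Porter's [C](VC){m}[V] decomposition of a word.

    Following Porter's paper, a, e, i, o, u are vowels, and y is a vowel
    when it follows a consonant.
    """
    m = 0
    prev_is_vowel = False
    for i, ch in enumerate(word.lower()):
        if ch in "aeiou":
            is_vowel = True
        elif ch == "y":
            is_vowel = i > 0 and not prev_is_vowel
        else:
            is_vowel = False
        if prev_is_vowel and not is_vowel:
            m += 1  # a vowel block just ended in a consonant: one VC pair
        prev_is_vowel = is_vowel
    return m


print(porter_measure("tree"), porter_measure("trouble"), porter_measure("troubles"))
```

For the words discussed in the comments on this answer, this gives m = 2 for CONNECT and m = 3 for CONNECTIONS.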

Backspace answered 28/2, 2011 at 11:6 Comment(2)
This paper is about "an algorithm for suffix stripping". It will probably be useful as a first step, if you consider that the complexity of "CONNECTIONS" should be the same as the complexity of "CONNECT". It does not calculate the complexity of the unsuffixed word itself, though, so it can only be a first step. – Cenacle
What I suggested was using the m value as a rough measure of complexity, not taking the stem. CONNECTIONS and CONNECT do not have the same m value. – Backspace
R
1

Depending on the type of game, the definition of "difficult" will change. If your game involves typing quickly (ztype-style...), "difficult" will have a different meaning than in a game where you need to define a word's meaning.

That said, Scrabble has a way to measure how "difficult" a word is which is also quite easy algorithmically.
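A minimal sketch of that idea, using the letter values of the standard English Scrabble set and ignoring board multipliers and tile availability (the name `scrabble_score` is just for illustration):

```python
# Letter values from the standard English-language Scrabble set.
SCRABBLE_POINTS = {
    1: "aeioulnstr",
    2: "dg",
    3: "bcmp",
    4: "fhvwy",
    5: "k",
    8: "jx",
    10: "qz",
}
LETTER_VALUE = {ch: pts for pts, letters in SCRABBLE_POINTS.items() for ch in letters}


def scrabble_score(word):
    """Sum the tile values of a word; non-letters score 0."""
    return sum(LETTER_VALUE.get(ch, 0) for ch in word.lower())


print(scrabble_score("abeyance"), scrabble_score("abnormal"))
```

On the two example words from the question's comments this scores ABEYANCE (15) above ABNORMAL (12).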

Also you may look into defining "difficult" in terms of your game. You could beta test your game and classify words according to how "difficult" players find them in the context of your own game.

Rubdown answered 28/2, 2011 at 11:7 Comment(1)
The game I'm working on is Jumbled Words. The player has to arrange the letters in correct order to form the word. Yeah, I think a scoring system similar to Scrabble should work well. – Relativity
H
1

There are several factors that relate to word difficulty, including age of acquisition, imageability, concreteness, abstractness, number of syllables, and frequency (spoken and written). There are also psycholinguistic databases that let you search for words by at least some of these factors (just do a search for "psycholinguistic database").

Hennessy answered 3/11, 2015 at 19:15 Comment(1)
OP asked specifically for an algorithm, not for a database. – Cosmonautics
D
1

Word frequency is an obvious choice (of course not perfect). You can download the Google n-grams V2 dataset, which is licensed under the Creative Commons Attribution 3.0 Unported License.

Format: ngram TAB year TAB match_count TAB page_count TAB volume_count NEWLINE
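Assuming the tab-separated layout described above, per-word totals could be accumulated along the following lines. The function name, the optional `min_year` filter and the sample rows are invented for illustration; they are not real counts from the dataset.

```python
from collections import Counter


def total_match_counts(lines, min_year=None):
    """Sum match_count per 1-gram across years.

    Each line is expected to look like:
    ngram TAB year TAB match_count TAB page_count TAB volume_count
    """
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, year, match_count = fields[0], int(fields[1]), int(fields[2])
        if min_year is not None and year < min_year:
            continue
        totals[ngram.lower()] += match_count
    return totals


# Made-up rows in the stated format, for illustration only.
sample = [
    "abnormal\t1999\t120\t100\t50\n",
    "abnormal\t2000\t130\t110\t55\n",
    "abeyance\t2000\t7\t7\t5\n",
]
totals = total_match_counts(sample)
```

In practice `lines` would be an open file handle on one of the downloaded 1-gram files, so the data is streamed rather than loaded into memory.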


The corpus is described in Lin, Yuri, et al. "Syntactic annotations for the Google Books Ngram Corpus." Proceedings of the ACL 2012 System Demonstrations. Association for Computational Linguistics, 2012.

Downturn answered 7/1, 2016 at 18:30 Comment(0)
S
0

Word length is a good indicator. For word frequency, you would need data, as an algorithm obviously cannot determine it by itself. You could also use some sort of scoring like Scrabble does: each letter has a value, and the word's value is the sum of its letters' values. It would IMO be easier to find frequency data for each letter in your language.
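A sketch of such a letter-based score, with letter frequencies taken from the word list itself. Using the negative log of each letter's relative frequency (so that rare letters cost more) is an assumption made here, not something prescribed above; the function name is likewise illustrative.

```python
import math
from collections import Counter


def letter_rarity_scores(words):
    """Score each word by summing -log(relative frequency) of its letters."""
    counts = Counter(ch for w in words for ch in w.lower() if ch.isalpha())
    total = sum(counts.values())
    cost = {ch: -math.log(n / total) for ch, n in counts.items()}
    return {w: sum(cost.get(ch, 0.0) for ch in w.lower()) for w in words}


scores = letter_rarity_scores(["see", "tee", "zax"])
```

Because the score is a sum over letters, longer words automatically score higher, so word length is folded into the same number.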

Stochmal answered 28/2, 2011 at 11:10 Comment(1)
Yes, I could try to find the frequency of each letter within my word database, and then assign a score to each word by adding up the frequencies of each letter in the word. Thanks, I'll give it a try, might be worthwhile. – Relativity
K
0

In his article on spell correction Peter Norvig uses a dictionary to count the number of occurrences of each word (and thus determine their frequency).

You could use this as a stepping stone :)

Also, frequency should probably influence the difficulty more than length... you would have to beta-test the game for that.
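One way to let frequency dominate while still counting length is a weighted blend like the sketch below. The weights (0.8 and 0.2), the length cap of 15 and the add-one smoothing are all placeholder assumptions that beta-testing would need to tune, and the example counts are invented.

```python
import math


def difficulty(word, freq, total, w_freq=0.8, w_len=0.2, max_len=15):
    """Blend rarity and length into a score between 0 (easy) and 1 (hard).

    freq maps words to corpus counts; total is the corpus size (must be > 0).
    """
    # Add-one smoothing so unseen words get a finite, maximal rarity.
    rarity = -math.log((freq.get(word, 0) + 1) / (total + 1))
    rarity /= math.log(total + 1)  # scale so an unseen word scores 1.0
    length = min(len(word), max_len) / max_len
    return w_freq * rarity + w_len * length


# Invented counts, for illustration only.
freq = {"abnormal": 900, "abeyance": 12}
total = 1_000_000
harder = max(freq, key=lambda w: difficulty(w, freq, total))
```

Here both words have the same length, so the rarer one ends up with the higher score.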

Kelle answered 28/2, 2011 at 13:11 Comment(1)
The method described gets you a list of word frequencies indeed. It is good, but using the readily available list mentioned in Aaron Levitt's answer is much easier, and probably more reliable :-) – Cenacle
I
0

In addition to metrics such as Flesch-Kincaid, you could try an approach based on the Dale-Chall readability formula, using lists of words that are familiar to readers of a particular level of ability.

Implementations of many of the readability formulae contain code for estimating the number of syllables in a word, which may also be useful.
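A common heuristic of this kind counts groups of consecutive vowels and discounts a silent final "e". The sketch below is one simple variant, not the code of any particular readability library, and it will miscount irregular words (for example it gives 2 for "abeyance").

```python
import re


def estimate_syllables(word):
    """Rough syllable estimate: count vowel groups, discounting a final silent 'e'."""
    word = word.lower()
    groups = re.findall(r"[aeiouy]+", word)
    count = len(groups)
    # A trailing 'e' is usually silent ("make"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


print(estimate_syllables("abnormal"), estimate_syllables("thoroughly"))
```

For a game, an approximate count like this is probably enough to use as one input to a difficulty score.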

Implicate answered 16/2, 2012 at 16:32 Comment(3)
The question is about single words. Flesch-Kincaid is made for texts, not for single words: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words). For a single word the formula becomes 205.82 - 84.6*syllables, which is as useful as just counting syllables. – Cenacle
Dale-Chall requires you to already know whether a word is difficult or not. – Cenacle
The number of syllables is one of the parameters that make a word difficult indeed. This is StackOverflow so an algorithm would be welcome :-) Or could you link to any source file that contains the algorithm? Thanks! – Cenacle
R
0

I would guess that the grade at which a word is introduced into a typical student's vocabulary is a measure of difficulty. Next would be how many standard rule violations it has, meaning words whose spellings or pronunciations seem to violate the normal set of rules. Finally, the meaning can be a tough concept: for example, try explaining "abstract" to someone who has never heard the word.

Retroactive answered 24/6, 2015 at 1:41 Comment(1)
Lol there's probably already a compiled ratings list for this... just need to find it. – Retroactive
B
0

Without claiming to know anything about their algorithm, there is an API that returns a 1-10 scale word difficulty: TwinWord API

I have never used it, myself, though.

Blunger answered 27/10, 2019 at 0:4 Comment(0)