Detecting syllables in a word

Asked 1/1, 2009 at 17:8 Answered 6/6, 2021 at 16:50

Solved nlp spell-checking hyphenation

156

I need to find a fairly efficient way to detect syllables in a word. E.g.,

Invisible -> in-vi-sib-le

There are some syllabification rules that could be used:

V CV VC CVC CCV CCCV CVCC

where V is a vowel and C is a consonant. E.g.,

Pronunciation (5 syllables: Pro-nun-ci-a-tion; CCV-CVC-CV-V-CVC)
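To make the notation concrete, here is a minimal Python sketch that maps each letter of a word to V or C. It assumes the input is alphabetic and that a, e, i, o, u are the only vowels (an assumption; y and language-specific vowel letters would need their own handling). It works on letters rather than sounds, so "tion" comes out as CVVC.

```python
VOWELS = set("aeiou")  # assumption: plain English vowel letters only

def cv_pattern(word):
    """Return the consonant/vowel skeleton of a word, one symbol per letter."""
    return "".join("V" if ch in VOWELS else "C" for ch in word.lower())

print(cv_pattern("invisible"))      # VCCVCVCCV
print(cv_pattern("pronunciation"))  # CCVCVCCVVCVVC
```

The skeleton alone does not say where to cut; choosing among the V, CV, VC, CVC... templates at each position is the hard part.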

I've tried a few methods, among them regular expressions (which help only if you want to count syllables), hard-coded rule definitions (a brute-force approach that proved very inefficient), and finally a finite-state automaton (which did not produce anything useful).

The purpose of my application is to create a dictionary of all syllables in a given language. This dictionary will later be used for spell checking applications (using Bayesian classifiers) and text to speech synthesis.

I would appreciate any tips on alternative ways to solve this problem besides the approaches above.

I work in Java, but any tip in C/C++, C#, Python, Perl... would work for me.

Glarum answered 1/1, 2009 at 17:8 Comment(2)

Do you want the actual division points or just the number of syllables in a word? If the latter, consider looking up the words in a text-to-speech dictionary and counting the phonemes that encode vowel sounds. – Casabonne 24/8, 2012 at 22:8

The most efficient way (computation-wise; not storage-wise), I would guess would be just to have a Python dictionary with words as keys and the number of syllables as values. However, you'd still need a fallback for words that didn't make it in the dictionary. Let me know if you ever find such a dictionary! – Squadron 29/7, 2014 at 5:33

137

Read about the TeX approach to this problem for the purposes of hyphenation. Especially see Frank Liang's thesis dissertation Word Hy-phen-a-tion by Com-put-er. His algorithm is very accurate, and then includes a small exceptions dictionary for cases where the algorithm does not work.
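As a rough illustration of how Liang-style patterns work, here is a minimal Python sketch. Each pattern is a letter string with digits between letters; every matching pattern votes on each gap, the highest digit wins, and odd values allow a break. The pattern list below is a toy set that covers only the word "hyphenation" (real pattern files contain thousands of entries), and the function names and the `left_min`/`right_min` parameters are my own, chosen to mimic TeX's minimum fragment lengths.

```python
import re

# Toy pattern set, enough for one word only; not a real TeX pattern file.
TOY_PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

def parse_pattern(pattern):
    """Split 'hy3ph' into its letters 'hyph' and per-gap levels [0, 0, 3, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    levels = [0] * (len(letters) + 1)
    pos = 0
    for ch in pattern:
        if ch.isdigit():
            levels[pos] = int(ch)  # digit sits in the gap before letter `pos`
        else:
            pos += 1
    return letters, levels

def hyphenate(word, patterns, left_min=2, right_min=3):
    """Return the fragments of `word` split at gaps whose winning level is odd."""
    parsed = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."      # dots mark the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[i] = level of gap before padded[i]
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            levels = parsed.get(padded[start:end])
            if levels is not None:
                for offset, level in enumerate(levels):
                    scores[start + offset] = max(scores[start + offset], level)
    pieces, last = [], 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:         # gap before word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces

print(hyphenate("hyphenation", TOY_PATTERNS))  # ['hy', 'phen', 'ation']
```

The substring scan is quadratic in the word length; production implementations typically store the patterns in a trie instead.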

Dray answered 1/1, 2009 at 17:17 Comment(9)

I like that you've cited a thesis dissertation on the subject, it's a little hint to the original poster that this might not be an easy question. – Trahern 1/1, 2009 at 17:29

Yes, I am aware that this is not a simple question, although I haven't worked much on it. I did underestimate the problem though, I thought I would work on other parts of my app, and later return to this 'simple' problem. Silly me :) – Glarum 1/1, 2009 at 17:33

I read the dissertation and found it very helpful. The problem with the approach was that I did not have any patterns for the Albanian language, although I found some tools that could generate those patterns. Anyway, for my purpose I wrote a rule-based app, which solved the problem... – Glarum 3/1, 2009 at 1:20

... My approach is a bit slow (~20 sec on a 50K word file) but I think the results are reasonably accurate (I don't have any useful stats yet). – Glarum 3/1, 2009 at 1:24

I wrote up a quick post doing some tests of this approach including stats: allenporter.tumblr.com/post/9776954743/syllables -- While the hyphenation approach was promising, an ad-hoc approach of counting vowels seemed more accurate, since the hyphenation algorithm errs on the side of under-hyphenating. Definitely not a solved problem, as far as I can tell. – Tannertannery 4/9, 2011 at 4:35

@Tannertannery I read your webpage. According to your statistics, the hyphenation approach is not accurate. I also read two articles: eprints.soton.ac.uk/264285/1/MarchandAdsettDamper_ISCA07.pdf and web.cs.dal.ca/~adsett/publications/AdsMar_CompSyllMeth_2009.pdf . Do you know about the SbA method in those articles? They claim hyphenation is as high as about 95% correct. What is the big dictionary (1 m size) you used for evaluation, and where can I get it for such a test? – Zinkenite 14/4, 2012 at 16:41

Note that the TeX algorithm is for finding legitimate hyphenation points, which is not exactly the same as syllable divisions. It's true that hyphenation points fall on syllable divisions, but not all syllable divisions are valid hyphenation points. For example, hyphens aren't (usually) used within a letter or two of either end of a word. I also believe the TeX patterns were tuned to trade off false negatives for false positives (never put a hyphen where it doesn't belong, even if that means missing some legitimate hyphenation opportunities). – Casabonne 24/8, 2012 at 22:5

I don't believe hyphenation is the answer either. – Wanting 8/4, 2014 at 20:28

But Liang's hyphenation algorithm isn't equivalent to breaking into syllables. E.g. applying it to "hyphenation" returns "hy-phen-ation", but broken into syllables it should be "hy-phen-a-tion" (4 syllables, not 3). "Project" isn't hyphenated at all, but broken into syllables it would be "pro-ject" (2 syllables, not 1). There are many such cases. – Unhandy 27/3, 2020 at 20:6

I stumbled across this page looking for the same thing, and found a few implementations of the Liang paper here: https://github.com/mnater/hyphenator or the successor: https://github.com/mnater/Hyphenopoly

That is, unless you're the type that enjoys reading a 60-page thesis instead of adapting freely available code for a non-unique problem. :)

Incorporator answered 2/1, 2009 at 7:19 Comment(1)

agreed - much more convenient to just use an existing implementation – Syce 5/11, 2010 at 2:48

Here is a solution using NLTK:

from nltk.corpus import cmudict
d = cmudict.dict()
def nsyl(word):
  return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]]
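Note that the lookup above raises KeyError for words missing from the corpus. The following self-contained Python sketch shows the same counting idea with a fallback. It uses a tiny hand-written stand-in for `cmudict.dict()` so that it runs without NLTK; the two entries are illustrative rather than copied from the corpus, and the vowel-group fallback is my own assumption, not something NLTK provides.

```python
import re

# Hand-written stand-in for cmudict.dict(): word -> list of pronunciations,
# each a list of ARPAbet phonemes; vowel phonemes end in a stress digit (0, 1, 2).
MINI_DICT = {
    "cat": [["K", "AE1", "T"]],
    "invisible": [["IH0", "N", "V", "IH1", "Z", "AH0", "B", "AH0", "L"]],
}

def count_syllables(word, pron_dict=MINI_DICT):
    """Count syllables via the dictionary, else estimate from vowel-letter runs."""
    prons = pron_dict.get(word.lower())
    if prons:
        # One syllable per stress-marked (vowel) phoneme; first pronunciation only.
        return sum(1 for ph in prons[0] if ph[-1].isdigit())
    # Fallback heuristic (an assumption): each run of vowel letters is one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

print(count_syllables("invisible"))  # 4 (dictionary hit)
print(count_syllables("glarum"))     # 2 (fallback)
```

With NLTK installed, passing `cmudict.dict()` as `pron_dict` should work the same way, since it has the same word-to-pronunciations shape.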

Syce answered 5/11, 2010 at 2:52 Comment(4)

Hey, thanks. There was a tiny error; the function should be: def nsyl(word): return [len(list(y for y in x if y[-1].isdigit())) for x in d[word.lower()]] – Crosscrosslet 21/12, 2010 at 1:8

What would you suggest as a fallback for words that aren't in that corpus? – Allusive 18/6, 2011 at 0:18

@Pureferret cmudict is a pronouncing dictionary for north american english words. it splits words into phonemes, which are shorter than syllables (e.g. the word 'cat' is split into three phonemes: K - AE - T). but vowels also have a "stress marker": either 0, 1, or 2, depending on the pronunciation of the word (so AE in 'cat' becomes AE1). the code in the answer counts the stress markers and therefore the number of the vowels - which effectively gives the number of syllables (notice how in OP's examples each syllable has exactly one vowel). – Grenade 9/3, 2016 at 23:11

This returns the number of syllables, not the syllabification. – Marcus 14/5, 2017 at 15:34

I'm trying to tackle this problem for a program that will calculate the Flesch-Kincaid and Flesch reading scores of a block of text. My algorithm uses what I found on this website: http://www.howmanysyllables.com/howtocountsyllables.html and it gets reasonably close. It still has trouble with complicated words like invisible and hyphenation, but I've found it gets in the ballpark for my purposes.

It has the upside of being easy to implement. I found that a trailing "es" can be either syllabic or not. It's a gamble, but I decided to remove the "es" in my algorithm.

private int CountSyllables(string word)
    {
        char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
        string currentWord = word;
        int numVowels = 0;
        bool lastWasVowel = false;
        foreach (char wc in currentWord)
        {
            bool foundVowel = false;
            foreach (char v in vowels)
            {
                //don't count diphthongs
                if (v == wc && lastWasVowel)
                {
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
                else if (v == wc && !lastWasVowel)
                {
                    numVowels++;
                    foundVowel = true;
                    lastWasVowel = true;
                    break;
                }
            }

            //if full cycle and no vowel found, set lastWasVowel to false;
            if (!foundVowel)
                lastWasVowel = false;
        }
        // remove trailing "es"; it's usually silent
        if (currentWord.Length > 2 && 
            currentWord.Substring(currentWord.Length - 2) == "es")
            numVowels--;
        // remove silent e
        else if (currentWord.Length > 1 &&
            currentWord.Substring(currentWord.Length - 1) == "e")
            numVowels--;

        return numVowels;
    }

Deon answered 11/4, 2011 at 0:34 Comment(2)

For my simple scenario of finding syllables in proper names this seems to be initially working well enough. Thanks for putting it out here. – Pulsifer 8/3, 2018 at 13:16

It's a decent try, but even after some simple testing it does not seem very accurate. E.g. "anyone" returns 1 syllable instead of 3, "Minute" returns 3 instead of 2, and "Another" returns 2 instead of 3. – Lunsford 6/9, 2021 at 16:53

Why calculate it? Every online dictionary has this info. http://dictionary.reference.com/browse/invisible in·vis·i·ble

Vevina answered 20/2, 2010 at 2:44 Comment(3)

Maybe it has to work for words that don't appear in dictionaries, such as names? – Magically 13/9, 2010 at 19:13

@WouterLievens: I don't think names are anywhere near well-behaved enough for automatic syllable parsing. A syllable parser for English names would fail miserably on names of Welsh or Scottish origin, let alone names of Indian and Nigerian origins, yet you might find all of these in a single room somewhere in e.g. London. – Cochleate 14/1, 2012 at 14:6

One must keep in mind that it is not reasonable to expect better performance than a human could provide considering this is a purely heuristic approach to a sketchy domain. – Downright 4/9, 2015 at 20:40

This is a particularly difficult problem which is not completely solved by the LaTeX hyphenation algorithm. A good summary of some available methods and the challenges involved can be found in the paper Evaluating Automatic Syllabification Algorithms for English (Marchand, Adsett, and Damper 2007).

Faina answered 7/2, 2011 at 15:40 Comment(0)

I ran into this exact same issue a little while ago.

I ended up using the CMU Pronunciation Dictionary for quick and accurate lookups of most words. For words not in the dictionary, I fell back to a machine learning model that's ~98% accurate at predicting syllable counts.

I wrapped the whole thing up in an easy-to-use python module here: https://github.com/repp/big-phoney

Install: pip install big-phoney

Count Syllables:

from big_phoney import BigPhoney
phoney = BigPhoney()
phoney.count_syllables('triceratops')  # --> 4

If you're not using Python and you want to try the ML-model-based approach, I did a pretty detailed write up on how the syllable counting model works on Kaggle.

Caracal answered 2/7, 2018 at 19:56 Comment(1)

The OP is looking for syllabification with letters, not phonemes. – Schappe 31/8, 2023 at 4:27

Bumping @Tihamer and @joe-basirico. Very useful function, not perfect, but good for most small-to-medium projects. Joe, I have re-written an implementation of your code in Python:

def countSyllables(word):
    vowels = "aeiouy"
    numVowels = 0
    lastWasVowel = False
    for wc in word:
        foundVowel = False
        for v in vowels:
            if v == wc:
                if not lastWasVowel: numVowels+=1   #don't count diphthongs
                foundVowel = lastWasVowel = True
                break
        if not foundVowel:  #If full cycle and no vowel found, set lastWasVowel to false
            lastWasVowel = False
    if len(word) > 2 and word[-2:] == "es": #Remove es - it's "usually" silent (?)
        numVowels-=1
    elif len(word) > 1 and word[-1:] == "e":    #remove silent e
        numVowels-=1
    return numVowels

Hope someone finds this useful!

Underclay answered 14/10, 2015 at 6:18 Comment(0)

Today I found this Java implementation of Frank Liang's hyphenation algorithm with patterns for English or German, which works quite well and is available on Maven Central.

Caveat: It is important to remove the last lines of the .tex pattern files, because otherwise those files cannot be loaded with the current version on Maven Central.

To load and use the hyphenator, you can use the following Java code snippet. texTable is the name of the .tex file containing the needed patterns. Those files are available on the project's GitHub site.

 private Hyphenator createHyphenator(String texTable) {
        Hyphenator hyphenator = new Hyphenator();
        hyphenator.setErrorHandler(new ErrorHandler() {
            public void debug(String guard, String s) {
                logger.debug("{},{}", guard, s);
            }

            public void info(String s) {
                logger.info(s);
            }

            public void warning(String s) {
                logger.warn("WARNING: " + s);
            }

            public void error(String s) {
                logger.error("ERROR: " + s);
            }

            public void exception(String s, Exception e) {
                logger.error("EXCEPTION: " + s, e);
            }

            public boolean isDebugged(String guard) {
                return false;
            }
        });

        BufferedReader table = null;

        try {
            table = new BufferedReader(new InputStreamReader(Thread.currentThread().getContextClassLoader()
                    .getResourceAsStream((texTable)), Charset.forName("UTF-8")));
            hyphenator.loadTable(table);
        } catch (Utf8TexParser.TexParserException e) {
            logger.error("error loading hyphenation table: {}", e.getLocalizedMessage(), e);
            throw new RuntimeException("Failed to load hyphenation table", e);
        } finally {
            if (table != null) {
                try {
                    table.close();
                } catch (IOException e) {
                    logger.error("Closing hyphenation table failed", e);
                }
            }
        }

        return hyphenator;
    }

Afterwards the Hyphenator is ready to use. To detect syllables, the basic idea is to split the term at the provided hyphens.

    String hyphenedTerm = hyphenator.hyphenate(term);

    String hyphens[] = hyphenedTerm.split("\u00AD");

    int syllables = hyphens.length;

You need to split on "\u00AD", since the API does not return a normal "-".
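For readers outside Java, the same splitting step looks like this in Python. U+00AD is the Unicode SOFT HYPHEN, which is invisible in most renderers, so a hyphenated string can look unchanged when printed. The sample string is hand-made and stands in for a hyphenator's output; the function name is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # U+00AD SOFT HYPHEN

def split_on_soft_hyphens(hyphenated):
    """Split a soft-hyphenated string into its fragments, dropping empty pieces."""
    return [part for part in hyphenated.split(SOFT_HYPHEN) if part]

sample = "hy\u00adphen\u00adation"  # hand-made stand-in for hyphenator output
parts = split_on_soft_hyphens(sample)
print(parts)       # ['hy', 'phen', 'ation']
print(len(parts))  # 3
```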

This approach outperforms the answer of Joe Basirico, since it supports many different languages and detects German hyphenation more accurately.

Unbonnet answered 17/2, 2016 at 14:40 Comment(0)

Thanks Joe Basirico, for sharing your quick and dirty implementation in C#. I've used the big libraries, and they work, but they're usually a bit slow, and for quick projects, your method works fine.

Here is your code in Java, along with test cases:

public static int countSyllables(String word)
{
    char[] vowels = { 'a', 'e', 'i', 'o', 'u', 'y' };
    char[] currentWord = word.toCharArray();
    int numVowels = 0;
    boolean lastWasVowel = false;
    for (char wc : currentWord) {
        boolean foundVowel = false;
        for (char v : vowels)
        {
            //don't count diphthongs
            if ((v == wc) && lastWasVowel)
            {
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
            else if (v == wc && !lastWasVowel)
            {
                numVowels++;
                foundVowel = true;
                lastWasVowel = true;
                break;
            }
        }
        // If full cycle and no vowel found, set lastWasVowel to false;
        if (!foundVowel)
            lastWasVowel = false;
    }
    // Remove trailing "es"; it's usually silent.
    // endsWith compares contents; == on Strings only compares references.
    if (word.length() > 2 && word.endsWith("es"))
        numVowels--;
    // Remove silent trailing "e"
    else if (word.length() > 1 && word.endsWith("e"))
        numVowels--;
    return numVowels;
}

public static void main(String[] args) {
    String txt = "what";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "super";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Maryland";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "American";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "disenfranchized";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
    txt = "Sophia";
    System.out.println("txt="+txt+" countSyllables="+countSyllables(txt));
}

The result was as expected (it works well enough for Flesch-Kincaid):

txt=what countSyllables=1
txt=super countSyllables=2
txt=Maryland countSyllables=3
txt=American countSyllables=3
txt=disenfranchized countSyllables=5
txt=Sophia countSyllables=2

Glycerol answered 20/9, 2014 at 15:22 Comment(0)

Perl has the Lingua::Phonology::Syllable module. You might try that, or try looking into its algorithm. I saw a few other older modules there, too.

I don't understand why a regular expression gives you only a count of syllables. You should be able to get the syllables themselves using capture parentheses. Assuming you can construct a regular expression that works, that is.
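As a sketch of what such an expression could look like (in Python, for illustration), the pattern below matches one chunk at a time: optional leading consonants, a run of vowel letters, then either the trailing consonants at the end of the word or a single consonant when another consonant follows. The pattern and the function name are my own assumptions, and the rule is deliberately naive.

```python
import re

# consonants*, vowels+, then optionally: all trailing consonants at end of word,
# or exactly one consonant if the next letter is also a consonant.
SYLLABLE_RE = re.compile(
    r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$|[^aeiouy](?=[^aeiouy]))?"
)

def split_syllables(word):
    """Return the chunks matched left to right; an approximation only."""
    return SYLLABLE_RE.findall(word.lower())

print(split_syllables("invisible"))      # ['in', 'vi', 'sib', 'le']
print(split_syllables("pronunciation"))  # ['pro', 'nun', 'cia', 'tion']
```

The second result shows the limitation: "ci-a" is merged into one chunk because adjacent vowel letters are always treated as a single nucleus.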

Floyd answered 1/1, 2009 at 17:34 Comment(0)

Thank you @joe-basirico and @tihamer. I have ported @tihamer's code to Lua 5.1, 5.2 and luajit 2 (most likely will run on other versions of lua as well):

countsyllables.lua

function CountSyllables(word)
  local vowels = { 'a','e','i','o','u','y' }
  local numVowels = 0
  local lastWasVowel = false

  for i = 1, #word do
    local wc = string.sub(word,i,i)
    local foundVowel = false;
    for _,v in pairs(vowels) do
      if (v == string.lower(wc) and lastWasVowel) then
        foundVowel = true
        lastWasVowel = true
      elseif (v == string.lower(wc) and not lastWasVowel) then
        numVowels = numVowels + 1
        foundVowel = true
        lastWasVowel = true
      end
    end

    if not foundVowel then
      lastWasVowel = false
    end
  end

  if string.len(word) > 2 and
    string.sub(word,string.len(word) - 1) == "es" then
    numVowels = numVowels - 1
  elseif string.len(word) > 1 and
    string.sub(word,string.len(word)) == "e" then
    numVowels = numVowels - 1
  end

  return numVowels
end

And some fun tests to confirm it works (as much as it's supposed to):

countsyllables.tests.lua

require "countsyllables"

tests = {
  { word = "what", syll = 1 },
  { word = "super", syll = 2 },
  { word = "Maryland", syll = 3},
  { word = "American", syll = 4},
  { word = "disenfranchized", syll = 5},
  { word = "Sophia", syll = 2},
  { word = "End", syll = 1},
  { word = "I", syll = 1},
  { word = "release", syll = 2},
  { word = "same", syll = 1},
}

for _,test in pairs(tests) do
  local resultSyll = CountSyllables(test.word)
  assert(resultSyll == test.syll,
    "Word: "..test.word.."\n"..
    "Expected: "..test.syll.."\n"..
    "Result: "..resultSyll)
end

print("Tests passed.")

Mobile answered 9/9, 2015 at 21:46 Comment(2)

I added two more test cases, "End" and "I". The fix was to compare strings case-insensitively. Pinging @joe-basirico and tihamer in case they suffer from the same problem and would like to update their functions. – Mobile 9/9, 2015 at 22:15

@tihamer American is 4 syllables! – Mobile 9/9, 2015 at 22:17
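For comparison, here is a compact Python sketch of the same vowel-group heuristic with the case-insensitivity fix applied by lowercasing the input first. It is a restatement of the logic in this thread using a regular expression, not any poster's exact code.

```python
import re

def count_syllables(word):
    """Vowel-group heuristic from this thread, with the input lowercased first."""
    w = word.lower()
    count = len(re.findall(r"[aeiouy]+", w))  # each run of vowel letters counts once
    if len(w) > 2 and w.endswith("es"):
        count -= 1  # trailing "es" treated as silent
    elif len(w) > 1 and w.endswith("e"):
        count -= 1  # trailing "e" treated as silent
    return count

for w in ["American", "End", "I", "release", "same"]:
    print(w, count_syllables(w))
# American 4, End 1, I 1, release 2, same 1
```

One known weakness carried over from the original: a word such as "the" yields 0, so callers may want to clamp the result to at least 1.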

I could not find an adequate way to count syllables, so I designed a method myself.

You can view my method here: https://mcmap.net/q/152838/-java-writing-a-syllable-counter-based-on-specifications

I use a combination of a dictionary and algorithm method to count syllables.

You can view my library here: https://github.com/troywatson/Lawrence-Style-Checker

I just tested my algorithm and had a 99.4% strike rate!

Lawrence lawrence = new Lawrence();

System.out.println(lawrence.getSyllable("hyphenation"));
System.out.println(lawrence.getSyllable("computer"));

Output:

4
3

Jeremiahjeremias answered 25/9, 2015 at 15:44 Comment(3)

Generally, links to a tool or library should be accompanied by usage notes, a specific explanation of how the linked resource is applicable to the problem, or some sample code, or if possible all of the above. – Foreshore 25/9, 2015 at 16:7

See Syntax Highlighting. There is a help button (question mark) in the SO editor which will get you to the linked page. – Foreshore 25/9, 2015 at 18:6

The link is dead and the library does not seem to be available anymore. – Lunsford 6/9, 2021 at 16:49

After doing a lot of testing and trying out hyphenation packages as well, I wrote my own based on a number of examples. I also tried the pyhyphen and pyphen packages that interface with hyphenation dictionaries, but they produce the wrong number of syllables in many cases. The nltk package was simply too slow for this use case.

My implementation in Python is part of a class I wrote, and the syllable counting routine is pasted below. It over-estimates the number of syllables a bit as I still haven't found a good way to account for silent word endings.

The function returns the ratio of syllables per word as it is used for a Flesch-Kincaid readability score. The number doesn't have to be exact, just close enough for an estimate.

On my 7th generation i7 CPU, this function took 1.1-1.2 milliseconds for a 759 word sample text.

def _countSyllablesEN(self, theText):

    cleanText = ""
    for ch in theText.lower():  # lowercase first so capital letters are not discarded
        if ch in "abcdefghijklmnopqrstuvwxyz'’":
            cleanText += ch
        else:
            cleanText += " "

    asVow    = "aeiouy'’"
    dExep    = ("ei","ie","ua","ia","eo")
    theWords = cleanText.lower().split()
    allSylls = 0
    for inWord in theWords:
        nChar  = len(inWord)
        nSyll  = 0
        wasVow = False
        wasY   = False
        if nChar == 0:
            continue
        if inWord[0] in asVow:
            nSyll += 1
            wasVow = True
            wasY   = inWord[0] == "y"
        for c in range(1,nChar):
            isVow  = False
            if inWord[c] in asVow:
                nSyll += 1
                isVow = True
            if isVow and wasVow:
                nSyll -= 1
            if isVow and wasY:
                nSyll -= 1
            if inWord[c:c+2] in dExep:
                nSyll += 1
            wasVow = isVow
            wasY   = inWord[c] == "y"
        if inWord.endswith(("e")):
            nSyll -= 1
        if inWord.endswith(("le","ea","io")):
            nSyll += 1
        if nSyll < 1:
            nSyll = 1
        # print("%-15s: %d" % (inWord,nSyll))
        allSylls += nSyll

    return allSylls/len(theWords)

Posterior answered 23/9, 2018 at 13:26 Comment(0)

You can try Spacy Syllables. This works on Python 3.9:

Setup:

pip install spacy
pip install spacy_syllables
python -m spacy download en_core_web_md

Code:

import spacy
from spacy_syllables import SpacySyllables
nlp = spacy.load('en_core_web_md')
syllables = SpacySyllables(nlp)
nlp.add_pipe('syllables', after='tagger')


def spacy_syllablize(word):
    token = nlp(word)[0]
    return token._.syllables


for test_word in ["trampoline", "margaret", "invisible", "thought", "Pronunciation", "couldn't"]:
    print(f"{test_word} -> {spacy_syllablize(test_word)}")

Output:

trampoline -> ['tram', 'po', 'line']
margaret -> ['mar', 'garet']
invisible -> ['in', 'vis', 'i', 'ble']
thought -> ['thought']
Pronunciation -> ['pro', 'nun', 'ci', 'a', 'tion']
couldn't -> ['could']

Somniloquy answered 6/6, 2021 at 16:50 Comment(1)

SpacySyllables is pretty decent, just be aware that it's unfortunately not perfect. "eighty" returns ['eighty'] and "universal" returns ['uni', 'ver', 'sal']. This is due to the underlying library (Pyphen) having a default of 2 characters for the first and last syllables. – Iodoform 16/7, 2021 at 20:56

I am including a solution that works "okay" in R. Far from perfect.

countSyllablesInWord = function(words)
  {
  #word = "super";
  n.words = length(words);
  result = list();
  for(j in 1:n.words)
    {
    word = words[j];
    vowels = c("a","e","i","o","u","y");
    
    word.vec = strsplit(word,"")[[1]];
    word.vec;
    
    n.char = length(word.vec);
    
    is.vowel = is.element(tolower(word.vec), vowels);
    n.vowels = sum(is.vowel);
    
    
    # nontrivial problem 
    if(n.vowels <= 1)
      {
      syllables = 1;
      str = word;
      } else {
              # syllables = 0;
              previous = "C";
              # on average ? 
              str = "";
              n.hyphen = 0;
        
              for(i in 1:n.char)
                {
                my.char = word.vec[i];
                my.vowel = is.vowel[i];
                if(my.vowel)
                  {
                  if(previous == "C")
                    {
                    if(i == 1)
                      {
                      str = paste0(my.char, "-");
                      n.hyphen = 1 + n.hyphen;
                      } else {
                              if(i < n.char)
                                {
                                if(n.vowels > (n.hyphen + 1))
                                  {
                                  str = paste0(str, my.char, "-");
                                  n.hyphen = 1 + n.hyphen;
                                  } else {
                                           str = paste0(str, my.char);
                                          }
                                } else {
                                        str = paste0(str, my.char);
                                        }
                              }
                     # syllables = 1 + syllables;
                     previous = "V";
                    } else {  # "VV"
                          # assume what  ?  vowel team?
                          str = paste0(str, my.char);
                          }
            
                } else {
                            str = paste0(str, my.char);
                            previous = "C";
                            }
                #
                }
        
              syllables = 1 + n.hyphen;
              }
  
      result[[j]] = list("syllables" = syllables, "vowels" = n.vowels, "word" = str);
      }
  
  if(n.words == 1) { result[[1]]; } else { result; }
  }

Here are some results:

my.count = countSyllablesInWord(c("America", "beautiful", "spacious", "skies", "amber", "waves", "grain", "purple", "mountains", "majesty"));

my.count.df = data.frame(matrix(unlist(my.count), ncol=3, byrow=TRUE));
colnames(my.count.df) = names(my.count[[1]]);

my.count.df;

#    syllables vowels         word
# 1          4      4   A-me-ri-ca
# 2          4      5 be-auti-fu-l
# 3          3      4   spa-ci-ous
# 4          2      2       ski-es
# 5          2      2       a-mber
# 6          2      2       wa-ves
# 7          2      2       gra-in
# 8          2      2      pu-rple
# 9          3      4  mo-unta-ins
# 10         3      3    ma-je-sty

I didn't realize how big a "rabbit hole" this is; it seems so easy.


################ hackathon #######


# https://en.wikipedia.org/wiki/Gunning_fog_index
# THIS is a CLASSIFIER PROBLEM ...
# https://mcmap.net/q/151150/-detecting-syllables-in-a-word



# http://www.speech.cs.cmu.edu/cgi-bin/cmudict
# http://www.syllablecount.com/syllables/


  # https://enchantedlearning.com/consonantblends/index.shtml
  # start.digraphs = c("bl", "br", "ch", "cl", "cr", "dr", 
  #                   "fl", "fr", "gl", "gr", "pl", "pr",
  #                   "sc", "sh", "sk", "sl", "sm", "sn",
  #                   "sp", "st", "sw", "th", "tr", "tw",
  #                   "wh", "wr");
  # start.trigraphs = c("sch", "scr", "shr", "sph", "spl",
  #                     "spr", "squ", "str", "thr");
  # 
  # 
  # 
  # end.digraphs = c("ch","sh","th","ng","dge","tch");
  # 
  # ile
  # 
  # farmer
  # ar er
  # 
  # vowel teams ... beaver1
  # 
  # 
  # # "able"
  # # http://www.abcfastphonics.com/letter-blends/blend-cial.html
  # blends = c("augh", "ough", "tien", "ture", "tion", "cial", "cian", 
  #             "ck", "ct", "dge", "dis", "ed", "ex", "ful", 
  #             "gh", "ng", "ous", "kn", "ment", "mis", );
  # 
  # glue = c("ld", "st", "nd", "ld", "ng", "nk", 
  #           "lk", "lm", "lp", "lt", "ly", "mp", "nce", "nch", 
  #           "nse", "nt", "ph", "psy", "pt", "re", )
  # 
  # 
  # start.graphs = c("bl, br, ch, ck, cl, cr, dr, fl, fr, gh, gl, gr, ng, ph, pl, pr, qu, sc, sh, sk, sl, sm, sn, sp, st, sw, th, tr, tw, wh, wr");
  # 
  # # https://mantra4changeblog.wordpress.com/2017/05/01/consonant-digraphs/
  # digraphs.start = c("ch","sh","th","wh","ph","qu");
  # digraphs.end = c("ch","sh","th","ng","dge","tch");
  # # https://www.education.com/worksheet/article/beginning-consonant-blends/
  # blends.start = c("pl", "gr", "gl", "pr",
  #                 
  # blends.end = c("lk","nk","nt",
  # 
  # 
  # # https://sarahsnippets.com/wp-content/uploads/2019/07/ScreenShot2019-07-08at8.24.51PM-817x1024.png
  # # Monte     Mon-te
  # # Sophia    So-phi-a
  # # American  A-mer-i-can
  # 
  # n.vowels = 0;
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  # 
  # 
  # 
  # 
  # 
  # n.syll = 0;
  # str = "";
  # 
  # previous = "C"; # consonant vs "V" vowel
  # 
  # for(i in 1:n.char)
  #   {
  #   my.char = word.vec[i];
  #   
  #   my.vowel = is.element(tolower(my.char), vowels);
  #   if(my.vowel)
  #     {
  #     n.vowels = 1 + n.vowels;
  #     if(previous == "C")
  #       {
  #       if(i == 1)
  #         {
  #         str = paste0(my.char, "-");
  #         } else {
  #                 if(n.syll > 1)
  #                   {
  #                   str = paste0(str, "-", my.char);
  #                   } else {
  #                          str = paste0(str, my.char);
  #                         }
  #                 }
  #        n.syll = 1 + n.syll;
  #        previous = "V";
  #       } 
  #     
  #   } else {
  #               str = paste0(str, my.char);
  #               previous = "C";
  #               }
  #   #
  #   }
  # 
  # 
  # 
  # 
## https://jzimba.blogspot.com/2017/07/an-algorithm-for-counting-syllables.html
# AIDE   1
# IDEA   3
# IDEAS  2
# IDEE   2
# IDE   1
# AIDA   2
# PROUSTIAN 3
# CHRISTIAN 3
# CLICHE  1
# HALIDE  2
# TELEPHONE 3
# TELEPHONY 4
# DUE   1
# IDEAL  2
# DEE   1
# UREA  3
# VACUO  3
# SEANCE  1
# SAILED  1
# RIBBED  1
# MOPED  1
# BLESSED  1
# AGED  1
# TOTED  2
# WARRED  1
# UNDERFED 2
# JADED  2
# INBRED  2
# BRED  1
# RED   1
# STATES  1
# TASTES  1
# TESTES  1
# UTILIZES  4

And for good measure, a simple Flesch-Kincaid readability function. Here, syllables is the list of results returned from the first function.

Since my function is a bit biased towards more syllables, that will give an inflated readability score ... which for now is fine ... if the goal is to make text more readable, this is not the worst thing.

computeReadability = function(n.sentences, n.words, syllables=NULL)
  {
  n = length(syllables);
  n.syllables = 0;
  for(i in 1:n)
    {
    my.syllable = syllables[[i]];
    n.syllables = my.syllable$syllables + n.syllables;
    }
  # Flesch Reading Ease (FRE):
  FRE = 206.835 - 1.015 * (n.words/n.sentences) - 84.6 * (n.syllables/n.words);
  # Flesch-Kincaid Grade Level (FKGL):
  FKGL = 0.39 * (n.words/n.sentences) + 11.8 * (n.syllables/n.words) - 15.59; 
  # FKGL = -0.384236 * FRE - 20.7164 * (n.syllables/n.words) + 63.88355;
  # FKGL = -0.13948  * FRE + 0.24843 * (n.words/n.sentences) + 13.25934;
  
  list("FRE" = FRE, "FKGL" = FKGL); 
  }
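For readers not using R, the two formulas can be sketched in Python as follows. The function name and argument names are hypothetical; the constants are the ones in the R code above, and the function takes a total syllable count rather than a list of per-word results.

```python
def flesch_scores(n_sentences, n_words, n_syllables):
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for the given totals."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return {"FRE": fre, "FKGL": fkgl}

# 5 sentences, 100 words, 150 syllables
scores = flesch_scores(5, 100, 150)
print(round(scores["FRE"], 3), round(scores["FKGL"], 2))  # 59.635 9.91
```

Because the syllable term is subtracted in FRE and added in FKGL, over-counting syllables lowers the reading-ease score and raises the grade level.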

Jinny answered 19/11, 2020 at 1:48 Comment(0)

-2

I used jsoup to do this once. Here's a sample syllable parser:

public String[] syllables(String text){
        String url = "https://www.merriam-webster.com/dictionary/" + text;
        String relHref;
        try{
            Document doc = Jsoup.connect(url).get();
            Element link = doc.getElementsByClass("word-syllables").first();
            if(link == null){return new String[]{text};}
            relHref = link.html(); 
        }catch(IOException e){
            relHref = text;
        }
        String[] syl = relHref.split("·");
        return syl;
    }

Footing answered 9/1, 2018 at 16:9 Comment(1)

How is that a generic syllable parser? It looks like this code is only looking up syllables in a dictionary – Marabou 9/1, 2018 at 16:30
