Intelligent spell checking

I'm using NHunspell to check a string for spelling errors like so:

var words = content.Split(' ');
string[] incorrect;
using (var spellChecker = new Hunspell(affixFile, dictionaryFile))
{
    incorrect = words.Where(x => !spellChecker.Spell(x))
        .ToArray();
}

This generally works, but it has some problems. For example, if I'm checking the sentence "This is a (very good) example", it will report "(very" and "good)" as being misspelled. Or if the string contains a time such as "8:30", it will report that as a misspelled word. It also has problems with commas, etc.

Microsoft Word is smart enough to recognize a time, fraction, or comma-delimited list of words. It knows when not to use an English dictionary, and it knows when to ignore symbols. How can I get a similar, more intelligent spell check in my software? Are there any libraries that provide a little more intelligence?

EDIT: I don't want to force users to have Microsoft Word installed on their machine, so using COM interop is not an option.

Basilio answered 9/3, 2012 at 17:33
B
6

If your spell checker is really that stupid, you should pre-tokenize its input to get the words out and feed those one at a time (or as a string joined with spaces). I'm not familiar with C#/.NET, but in Python, you'd use a simple RE like \w+ for that:

>>> s = "This is a (very good) example"
>>> re.findall(r"\w+", s)
['This', 'is', 'a', 'very', 'good', 'example']

and I bet .NET has something very similar. In fact, according to the .NET docs, \w is supported, so you just need the .NET counterpart of re.findall (Regex.Matches, as far as I can tell).
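The \w+ idea can be pushed a little further to cover the times, fractions, and contractions mentioned in the question. What follows is only a rough sketch, not a definitive tokenizer: the function name words_to_check is made up for illustration, and the rule "skip any token that contains a digit" is an assumption about what should not reach the spell checker.

```python
import re

# Runs of word characters and apostrophes, so "don't" stays in one piece.
TOKEN_RE = re.compile(r"[\w']+")

def words_to_check(text):
    """Return the tokens that look like real words (hypothetical helper)."""
    words = []
    for token in TOKEN_RE.findall(text):
        # Drop quote marks wrapped around a word, e.g. 'sugar' -> sugar.
        token = token.strip("'")
        # Assumption: anything containing a digit (8:30, 1/2, 3rd) is not
        # something a dictionary lookup can judge, so it is skipped.
        if not token or any(ch.isdigit() for ch in token):
            continue
        words.append(token)
    return words
```

Each returned word can then be passed to the spell checker one at a time. Hyphenated words are split at the hyphen here, which may or may not be what you want.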

Burtburta answered 9/3, 2012 at 18:0
E
0
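using System.Text.RegularExpressions;
...
// Matches any occurrence of ( or ); inside a character class neither needs escaping.
string pattern = "[()]";
// Clean the words before passing them to spellChecker.Spell.
for (int i = 0; i < words.Length; i++)
{
    // Strings are immutable: Regex.Replace returns a new string, so assign it back.
    words[i] = Regex.Replace(words[i], pattern, String.Empty);
}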

More information about regex here. After reading this, I think Hunspell is one of the best choices :)

Eddaeddana answered 9/3, 2012 at 18:15
R
0

In C#, you could do something like this:

using System.Linq;
using System.Text.RegularExpressions;

public static class ExtensionHelper
{
    public static string[] GetWords(this string input)
    {
        MatchCollection matches = Regex.Matches(input, @"\b[\w']*\b");

        var words = from m in matches.Cast<Match>()
                    where !string.IsNullOrEmpty(m.Value)
                    select TrimSuffix(m.Value);

        return words.ToArray();
    }

    public static string TrimSuffix(this string word)
    {
        int apostropheLocation = word.IndexOf('\'');
        if (apostropheLocation != -1)
        {
            word = word.Substring(0, apostropheLocation);
        }

        return word;
    }
}

var numberOfMistakes = content.GetWords().Count(x => !hunspell.Spell(x));

Radmen answered 19/4, 2016 at 10:42
