How to add two-word patterns to be ignored by LanguageTool?

Asked 23/9, 2014 at 19:38 Answered 13/7, 2023 at 15:34

Situation:

As a workaround for the not-yet-implemented feature of adding a user dictionary of words to LanguageTool, I came up with this code snippet:

JLanguageTool langTool = new JLanguageTool(lang);
langTool.activateDefaultPatternRules();
String [] words={"word1", "word2"};
for (Rule rule : langTool.getAllActiveRules()) {
    // Only the German spell checker gets the extra words.
    if (rule.getId().equals("GERMAN_SPELLER_RULE") && rule instanceof SpellingCheckRule) {
        ((SpellingCheckRule) rule).addIgnoreTokens(Arrays.asList(words));
    }
}

which will nicely add the list of words specified by

String [] words={"word1", "word2"};

to the list of ignored words. But what about word combinations/two-word patterns like "Guest bathroom", "French word", "test application"? How could I get these ignored without modifying the original grammar file? I assume creating some user-defined rule could do the trick and might also be a more elegant solution than the above code snippet.

Question:

What would be a working approach for a user-dictionary workaround that ignores both single words and two-word phrases?

Sensitometer answered 23/9, 2014 at 19:38 Comment(0)

An ignore.txt file has been supported since version 2.9; see the "Spelling" bullet in CHANGES.txt.

Two-word phrases are not supported; see the check in the method loadWordsToBeIgnored in SpellingCheckRule.java. If you add one anyway, the check fails with a "RuntimeException: No space expected in ...".
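
Because entries containing a space are rejected, a user dictionary that mixes single words and phrases has to be split before the single words are handed to the ignore list. A minimal sketch of that split, assuming a hypothetical helper class UserDictionaryEntries (not part of LanguageTool; it uses only the Java standard library):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of LanguageTool. */
final class UserDictionaryEntries {
    /** Entries without whitespace; safe for a single-word ignore list. */
    final List<String> singleWords = new ArrayList<>();
    /** Entries with whitespace; these need separate handling. */
    final List<String> phrases = new ArrayList<>();

    static UserDictionaryEntries split(List<String> entries) {
        UserDictionaryEntries result = new UserDictionaryEntries();
        for (String entry : entries) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            if (trimmed.chars().anyMatch(Character::isWhitespace)) {
                // Normalise inner whitespace so "Guest  bathroom" equals "Guest bathroom".
                result.phrases.add(trimmed.replaceAll("\\s+", " "));
            } else {
                result.singleWords.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        UserDictionaryEntries entries =
                split(Arrays.asList("word1", "Guest bathroom", "word2"));
        System.out.println("single words: " + entries.singleWords);
        System.out.println("phrases: " + entries.phrases);
    }
}
```

The singleWords list could then be written to ignore.txt or passed to addIgnoreTokens, while the phrases list is kept aside for whatever multi-word handling is chosen.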

Echelon answered 24/4, 2015 at 8:35 Comment(2)

Could you please provide more details on how to integrate this? That would be very helpful. Thanks – Whiteheaded 1/8, 2017 at 6:25

Yes, I'd like to know how to use ignore.txt. Where do I put it? Can it be used with the command-line version? – Hanson 2/8, 2017 at 2:1

One approach might be to extend the existing SpellingCheckRule and override its match method, which produces the spelling matches for a sentence. The new method adds logic that treats two adjacent words as a single token. This might look something like the following:

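// Sketch only: SpellingCheckRule is abstract and leaves match() to its
// subclasses, so in practice extend the concrete speller rule for your
// language and adapt the constructor to the LanguageTool version in use.
class MultiWordSpellingCheckRule extends SpellingCheckRule {
    // Multi-word phrases such as "Guest bathroom" whose matches are dropped.
    private final Set<String> ignoredPhrases = new HashSet<>();

    MultiWordSpellingCheckRule (ResourceBundle messages, Language language, UserConfig userConfig, List<Language> altLanguages, IgnoreWordsSupplier ignoreWordsSupplier) {
        super(messages, language, userConfig, altLanguages, ignoreWordsSupplier);
    }

    @Override
    public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
        String text = sentence.getText();
        List<RuleMatch> kept = new ArrayList<>();
        // Keep only the matches that do not lie inside an ignored phrase.
        // Assumes match positions are relative to the sentence text.
        for (RuleMatch match : super.match(sentence)) {
            if (!isInsideIgnoredPhrase(text, match.getFromPos(), match.getToPos())) {
                kept.add(match);
            }
        }
        return kept.toArray(new RuleMatch[0]);
    }

    // True if [from, to) lies within some occurrence of an ignored phrase.
    private boolean isInsideIgnoredPhrase(String text, int from, int to) {
        for (String phrase : ignoredPhrases) {
            for (int start = text.indexOf(phrase); start >= 0; start = text.indexOf(phrase, start + 1)) {
                if (from >= start && to <= start + phrase.length()) {
                    return true;
                }
            }
        }
        return false;
    }

    @Override
    public void addIgnoreTokens(List<String> tokens) {
        List<String> singleWords = new ArrayList<>();
        for (String token : tokens) {
            // Phrases are handled here; single words go to the built-in list.
            if (token.contains(" ")) {
                ignoredPhrases.add(token);
            } else {
                singleWords.add(token);
            }
        }
        super.addIgnoreTokens(singleWords);
    }
}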

You can now add multi-word tokens to the list like so:

String [] words = {"word1", "word2", "Guest bathroom", "French word", "test application"};

Also note that in a real-world scenario you would need to deal with punctuation and other aspects of tokenization. The given code assumes only space-separated words.
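
One way to cope with punctuation is to work on character spans instead of space-split tokens. The sketch below is self-contained and does not use LanguageTool: Span is a hypothetical stand-in for the start and end offsets of a spelling match, and PhraseMatchFilter is an illustrative name. It drops every flagged span that lies inside an ignored phrase, but only when the phrase occurs as whole words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper, not part of LanguageTool. */
final class PhraseMatchFilter {
    /** Character range [from, to) flagged by a spell checker. */
    static final class Span {
        final int from;
        final int to;
        Span(int from, int to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Returns the flagged spans that are not covered by an ignored phrase. */
    static List<Span> dropSpansInsidePhrases(String text, List<Span> flagged, Set<String> phrases) {
        List<Span> kept = new ArrayList<>();
        for (Span span : flagged) {
            if (!isInsidePhrase(text, span, phrases)) {
                kept.add(span);
            }
        }
        return kept;
    }

    private static boolean isInsidePhrase(String text, Span span, Set<String> phrases) {
        for (String phrase : phrases) {
            if (phrase.isEmpty()) {
                continue; // an empty phrase would match everywhere
            }
            int start = text.indexOf(phrase);
            while (start >= 0) {
                int end = start + phrase.length();
                if (isWholeWords(text, start, end) && span.from >= start && span.to <= end) {
                    return true;
                }
                start = text.indexOf(phrase, start + 1);
            }
        }
        return false;
    }

    /** A phrase counts only if it is not glued to a letter or digit on either side. */
    private static boolean isWholeWords(String text, int start, int end) {
        boolean leftOk = start == 0 || !Character.isLetterOrDigit(text.charAt(start - 1));
        boolean rightOk = end == text.length() || !Character.isLetterOrDigit(text.charAt(end));
        return leftOk && rightOk;
    }

    public static void main(String[] args) {
        String text = "The Guest bathroom, please.";
        Set<String> phrases = new HashSet<>(Arrays.asList("Guest bathroom"));
        int from = text.indexOf("Guest");
        List<Span> flagged = Arrays.asList(new Span(from, from + "Guest".length()));
        System.out.println("kept spans: " + dropSpansInsidePhrases(text, flagged, phrases).size());
    }
}
```

The whole-word check keeps a phrase like "test application" from silencing a match inside "contest applications", and a trailing comma after "Guest bathroom" does not prevent the phrase from being recognised.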

Sensitometer answered 13/7, 2023 at 15:34 Comment(0)
