Use stop words in multiple languages with Lucene

I remove stop words from a String and then return the remaining words in their original upper/lower case, using Apache Lucene (8.6.3) and the following Java 8 code (this is a shortened version):

Path resources = Paths.get(stopWordFolder);
String stopWordsFile = "";

if(Files.exists(resources)) {
    //"stopWordsFile" is set here, depending on language
    // try-with-resources closes the token stream and the analyzer, even if an exception is thrown
    try (Analyzer analyzer = CustomAnalyzer.builder(resources)
            .withTokenizer("icu")
            .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", stopWordsFile,
                "format", "wordset")
            .build();
         TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text))) {

        CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
        ArrayList<String> remaining = new ArrayList<>();

        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            remaining.add(term.toString());
        }
        tokenStream.end(); // finish the stream before it is closed

        return remaining;
    } catch (IOException e) {
        // Handle exception
    }
}

Depending on the language I want to use, I set stopWordsFile to the name of a different ".txt" file of stop words. All of these files are encoded as UTF-8.
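To make the elided "stopWordsFile is set here" step concrete, here is a minimal sketch of a language-to-file lookup. The class name, the language codes and the file names are all hypothetical and only stand in for whatever files actually sit in the stop word folder.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

class StopWordFiles {
    // Hypothetical file names; replace them with the files in your stop word folder.
    private static final Map<String, String> FILES = new HashMap<>();
    static {
        FILES.put("en", "stopwords_en.txt");
        FILES.put("de", "stopwords_de.txt");
        FILES.put("fr", "stopwords_fr.txt");
    }

    // Returns the stop word file for a language code, or empty if none is configured.
    static Optional<String> fileFor(String languageCode) {
        if (languageCode == null) {
            return Optional.empty();
        }
        return Optional.ofNullable(FILES.get(languageCode.toLowerCase(Locale.ROOT)));
    }

    public static void main(String[] args) {
        System.out.println(fileFor("DE").orElse("<no stop word file>"));
        System.out.println(fileFor("xx").orElse("<no stop word file>"));
    }
}
```

Returning an Optional makes the "no file for this language" case explicit, so the analyzer is only built when a file name is actually available.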

In my other question I got help fixing my code to make exactly that work, and there was also a tip to include an ASCII folding filter.

While the above code works fine with all the languages I've tested so far, I'm still wondering: What is the use of the folding filter in my case, and do I actually need one? Is there anything else I should consider adding to make the analyzer work "better" with multiple languages (of course, only one language is used at a time)?
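For context on what "folding" means here, the sketch below approximates the effect with the JDK only: it decomposes characters and drops the combining marks. This is not Lucene's ASCIIFoldingFilter, and the class and method names are made up for illustration. The Lucene filter covers far more characters (for example, it maps "ß" to "ss", which this sketch does not).

```java
import java.text.Normalizer;

class FoldingDemo {
    // Rough stand-in for ASCII folding: decompose to NFD, then remove combining marks.
    static String fold(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // \u00e9 is "e with acute", \u00fc is "u with diaeresis"
        System.out.println(fold("Caf\u00e9")); // prints "Cafe"
        System.out.println(fold("f\u00fcr"));  // prints "fur"
    }
}
```

Note that a folded token no longer equals the original text, so a filter like this would change the words I get back, which matters because I want to return them in their original form.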

Gunyah answered 16/10, 2020 at 9:0 Comment(0)
