I remove stop words from a String and return the remaining words in their original upper/lower case, using Apache Lucene (8.6.3) and the following Java 8 code (this is a shortened version):
Path resources = Paths.get(stopWordFolder);
String stopWordsFile = "";
if (Files.exists(resources)) {
    // "stopWordsFile" is set here, depending on language
    try {
        Analyzer analyzer = CustomAnalyzer.builder(resources)
                .withTokenizer("icu")
                .addTokenFilter("stop",
                        "ignoreCase", "true",
                        "words", stopWordsFile,
                        "format", "wordset")
                .build();
        TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
        CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        ArrayList<String> remaining = new ArrayList<>();
        while (tokenStream.incrementToken()) {
            remaining.add(term.toString());
        }
        tokenStream.end(); // signal end of stream before closing, per the TokenStream workflow
        tokenStream.close();
        analyzer.close();
        return remaining;
    } catch (IOException e) {
        // Handle exception
    }
}
Depending on the language I want to use, I set stopWordsFile to the name of a different ".txt" file of stop words, all of which are encoded as UTF-8.
In my other question I got help fixing my code to make exactly that work, and there was also a tip to include an ASCII folding filter.
While the above code works fine with all the languages I've tested it with so far, I'm still wondering: what is the use of the folding filter in my case, do I actually need one, and is there anything else I should consider adding to make the analyzer work "better" with multiple languages (of which only one is used at a time)?
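To make clear what I mean by "folding": as far as I understand it, such a filter maps accented characters to their plain ASCII counterparts, so that a token and a stop word match even if only one of them carries the accent. The standalone sketch below only approximates that idea with java.text.Normalizer. It is not Lucene's filter, and the class name FoldingSketch and the method fold are made up purely for illustration.

```java
import java.text.Normalizer;

public class FoldingSketch {

    // Hypothetical helper (not part of Lucene): decompose each character
    // into base letter + combining marks (NFD), then drop the marks.
    static String fold(String token) {
        String decomposed = Normalizer.normalize(token, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // Unicode escapes are used so the source file encoding does not matter.
        System.out.println(fold("Caf\u00e9"));  // prints "Cafe"
        System.out.println(fold("\u00dcber"));  // prints "Uber"
    }
}
```

Characters without a canonical decomposition, such as "ß" or "ø", pass through this approximation unchanged, so it covers less than what I would expect from a real folding filter. Note also that the result loses the original accents, which may or may not be wanted when the remaining words are returned to the caller.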