Using default and custom stop words with Apache's Lucene (weird output)

I'm removing stop words from a String, using Apache's Lucene (8.6.3) and the following Java 8 code:

import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

...

private static final String CONTENTS = "contents";
final String text = "This is a short test! Bla!";
final List<String> stopWords = Arrays.asList("short","test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

try {
    Analyzer analyzer = new StandardAnalyzer(stopSet);
    TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while(tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }

    tokenStream.close();
    analyzer.close();
} catch (IOException e) {
    System.out.println("Exception:\n");
    e.printStackTrace();
}

This outputs the desired result:

[this] [is] [a] [bla]

Now I want to use both the default English stop set, which should also remove "this", "is" and "a" (according to the Lucene source on GitHub), AND the custom stop set above (the actual one I'm going to use is a lot longer), so I tried this:

Analyzer analyzer = new EnglishAnalyzer(stopSet);

The output is:

[thi] [is] [a] [bla]

Yes, the "s" in "this" is missing. What's causing this? The analyzer also didn't apply the default stop set.

The following changes remove both the default and the custom stop words:

Analyzer analyzer = new EnglishAnalyzer();
TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text));
tokenStream = new StopFilter(tokenStream, stopSet);

Question: What is the "right" way to do this? Is using the tokenStream within itself (see code above) going to cause problems?

Bonus question: How do I output the remaining words with the right upper/lower case, i.e. as they appear in the original text?

Baluchi asked 12/10, 2020 at 16:41 Comment(0)

I will tackle this in two parts:

  • stop-words
  • preserving original case

Handling the Combined Stop Words

To handle the combination of Lucene's English stop word list, plus your own custom list, you can create a merged list as follows:

import org.apache.lucene.analysis.en.EnglishAnalyzer;

...

final List<String> stopWords = Arrays.asList("short", "test");
final CharArraySet stopSet = new CharArraySet(stopWords, true);

CharArraySet enStopSet = EnglishAnalyzer.ENGLISH_STOP_WORDS_SET;
stopSet.addAll(enStopSet);

The above code simply takes the English stop words bundled with Lucene and merges them with your list.
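
For example, you can feed the merged set straight back into the StandardAnalyzer and token loop from your question (a minimal sketch; analyzer.close() is omitted for brevity):

Analyzer analyzer = new StandardAnalyzer(stopSet);

try (TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text))) {
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
    }

    tokenStream.end();
}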

That gives the following output:

[bla]

Handling Word Case

This is a bit more involved. As you have noticed, the StandardAnalyzer includes a step in which all words are converted to lower case - so we can't use that.
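
For reference, the chain that StandardAnalyzer builds internally looks roughly like this - a simplified sketch of its createComponents() method in Lucene 8.x, shown only to make the lower-casing step visible:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;

...

Analyzer roughlyStandard = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        StandardTokenizer source = new StandardTokenizer();
        TokenStream sink = new LowerCaseFilter(source); // every token is lower-cased here
        sink = new StopFilter(sink, stopSet);           // stop words are removed afterwards
        return new TokenStreamComponents(source, sink);
    }
};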

Also, if you want to maintain your own custom list of stop words, and if that list is of any size, I would recommend storing it in its own text file, rather than embedding the list into your code.

So, let's assume you have a file called stopwords.txt. In this file, there will be one word per line - and the file will already contain the merged list of your custom stop words and the official list of English stop words.

You will need to prepare this file manually yourself (i.e. the merging code from part 1 of this answer is no longer needed).

My test file is just this:

short
this
is
a
test
the
him
it

I also prefer to use the CustomAnalyzer for something like this, as it lets me build an analyzer very simply.

import org.apache.lucene.analysis.custom.CustomAnalyzer;

...

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

This does the following:

  1. It uses the "icu" tokenizer (org.apache.lucene.analysis.icu.segmentation.ICUTokenizer), which takes care of tokenizing on Unicode whitespace and handling punctuation.

  2. It applies the stopword list. Note the use of true for the ignoreCase attribute, and the reference to the stop-word file. The format of wordset means "one word per line" (there are other formats, also).

The key here is that there is nothing in the above chain which changes word case.
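
To see the case being preserved, you can collect the surviving tokens into a list, reusing the token loop from the question (java.util.ArrayList is assumed to be imported):

List<String> kept = new ArrayList<>();

try (TokenStream tokenStream = analyzer.tokenStream(CONTENTS, new StringReader(text))) {
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();

    while (tokenStream.incrementToken()) {
        kept.add(term.toString()); // original case survives: "Bla", not "bla"
    }

    tokenStream.end();
}

System.out.println(kept);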

So, now, using this new analyzer, the output is as follows:

[Bla]

Final Notes

Where do you put the stop list file? By default, Lucene expects to find it on the classpath of your application. So, for example, you can put it in the default package.

But remember that the file needs to be handled by your build process, so that it ends up alongside the application's class files (not left behind with the source code).
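
If you are unsure whether the build actually put the file on the classpath, a quick way to check is to look it up through the class loader (YourApp below is just a placeholder for any class in your application):

// Prints null if stopwords.txt did not make it onto the classpath:
System.out.println(YourApp.class.getClassLoader().getResource("stopwords.txt"));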

I mostly use Maven - and therefore I have this in my POM to ensure the ".txt" file gets deployed as needed:

    <build>  
        <resources>  
            <resource>  
                <directory>src/main/java</directory>  
                <excludes>  
                    <exclude>**/*.java</exclude>  
                </excludes>  
            </resource>  
        </resources>  
    </build> 

This tells Maven to copy files (except Java source files) to the build target - thus ensuring the text file gets copied.

Final note - I did not investigate why you were getting that truncated [thi] token. If I get a chance I will take a closer look.


Follow-Up Questions

After combining I have to use the StandardAnalyzer, right?

Yes, that is correct. The notes I provided in part 1 of the answer relate directly to the code in your question, and to the StandardAnalyzer you use.

I want to keep the stop word file on a specific path outside the classpath - how do I do that?

You can tell the CustomAnalyzer to look in a "resources" directory for the stop-words file. That directory can be anywhere on the file system (for easy maintenance, as you noted):

import java.nio.file.Path;
import java.nio.file.Paths;

...

Path resources = Paths.get("/path/to/resources/directory");

Analyzer analyzer = CustomAnalyzer.builder(resources)
        .withTokenizer("icu")
        .addTokenFilter("stop",
                "ignoreCase", "true",
                "words", "stopwords.txt",
                "format", "wordset")
        .build();

Instead of using .builder() we now use .builder(resources).

Anoxemia answered 12/10, 2020 at 18:16 Comment(18)
First of all: Thanks for your long post! I have 2 questions: 1. After combining I have to use the StandardAnalyzer, right? 2. About the CustomAnalyzer: I am indeed going to put all the stop words for a language in a txt file, but I want to keep it on a specific non-imported path, so I can easily change the words later (if necessary). Do I have to a) handle reading the file into a CharArraySet myself, or can I b) just give the analyzer the full path to the file, similar to what you did in the answer? If it's a), how do I tell the CustomAnalyzer to use those words? - Baluchi
I've updated the answer to provide some more notes. Hope that helps. - Anoxemia
Thanks again! I tested the latest CustomAnalyzer code and it gave me an exception because the "icu" resource was missing, so I added the lucene-analyzers-icu-8.6.3.jar file to the classpath. .withTokenizer("icu") still throws an exception though: Exception in thread "main" java.lang.NoClassDefFoundError: com/ibm/icu/text/BreakIterator. What library do I need for "icu"? - Baluchi
The first exception was: Exception in thread "main" java.lang.IllegalArgumentException: A SPI class of type org.apache.lucene.analysis.util.TokenizerFactory with name 'icu' does not exist. - and some more text after that. - Baluchi
You need ICU4J - you can grab the file from here, either using Maven or by downloading the jar file. Lucene 8.6.3 uses version 62.1 of that library. I strongly recommend using Maven (or Gradle, or similar), which would automatically take care of downloading all such transitive dependencies for you. - Anoxemia
Thanks! Tbh, I'm not a huge fan of Maven/Gradle/... as I never got them to work properly, at least adding them from scratch. I added ICU4J to the build path but still got the same exception, so I re-added lucene-analyzers-icu and now it works. Looks like you need both. I'm going to test the rest tomorrow before I vote/accept. - Baluchi
Side note: Regarding the unexpected transformation of this to thi: When you use the EnglishAnalyzer, the token stream you create automatically uses a Porter stemmer (see the JavaDoc for createComponents()). One of the stemming rules involves removing a trailing "s" from some words (and more, if the word is a plural: horses becomes hors, for example). - Anoxemia
Ah, I see, thanks. Unfortunately a new problem came up while testing the code with different languages. English works fine, but when I use my German stop word list, I get a java.nio.charset.MalformedInputException: Input length = 1 exception at .addTokenFilter("stop",. Apparently that's caused by the encoding of the file (because German uses umlauts), but how do I tell the CustomAnalyzer to use e.g. UTF-8 while reading the file? - Baluchi
OK, understood - a couple of notes about that: (1) I am not able to recreate your problem. I used Notepad++ and set the file encoding to UTF-8 for my stop-words file, including words like gebäude with umlauts. So it's probably not a question of telling Lucene to use UTF-8; it's probably a problem with the file itself not being UTF-8. - Anoxemia
Note (2): Adding extra languages can complicate things (regardless of point 1). For example, you may need to add an ASCII folding filter to your analyzer: .addTokenFilter("asciiFolding"). But these are different questions from your original question. You may be better off creating a brand new question to focus on these specific items (more people will see your questions that way). Hope this helped in the meantime. - Anoxemia
(1) Thanks for testing and the tip! I downloaded the stop word files here: Both the English one (which worked fine, though) and the German one were encoded with Windows-1252, while the Italian one was already UTF-8. I used TextPad to save both files in UTF-8 and now it's working. (2) Going to look at the extra filter tomorrow, thanks. - Baluchi
Excellent. Glad you made progress. English words encoded with Windows-1252 will work the same as UTF-8 because the code points for "a-z" and "A-Z" are the same in both encodings. But as soon as you have characters outside that range (such as anything with an accent), the encoding schemes become much more important, as you have seen. - Anoxemia
I see, thanks for the explanation. I created a new question for the multi-language thing. - Baluchi
Regarding the ASCII folding filter - I was thinking you needed to match accented versions of words with unaccented versions. But after reading your other question, I think that is wrong, and my suggestion about using that filter is irrelevant - sorry about that. It's very helpful if you want to match across all of these: eglise, Eglise, église, Église, for example. - Anoxemia
Ah. No, words should only be matched with the exact same version. The only exception is upper case vs. lower case, so I can get the right case for the end result. So there's nothing else I should do to make it work "better"? - Baluchi
I noticed that if a word is used multiple times in the input text, then it's also output multiple times. It's pretty easy to check for that myself, but I'm still wondering: is there anything already in place in Lucene to only get each word once (and maybe also include the number of times it's used in the text)? - Baluchi
I don't know - but I am sure Lucene can do that. That sounds like another great question you can ask. You can certainly investigate .addTokenFilter("removeDuplicates") - see the docs here. I have never used it, so I don't know if it does what you need. - Anoxemia
Thanks for the tip, going to check the docs and otherwise just do it myself. - Baluchi
