Lucene: exception - Query parser encountered <EOF> after "some word"

I am using the Lucene API to classify product reviews as positive, negative, or neutral, based on training data.

I keep an ArrayList of Review objects, "reviewList", which stores the attributes of each review collected while crawling the web pages.

The review attributes, which include "polarity" and "review content", are then indexed using the indexer. After that, I need to classify the remaining review objects based on the indexed ones. While doing so, there is one review object for which the query parser encounters an EOF in the "review content" and terminates.

The line causing the error is marked with a comment:

    IndexReader reader = IndexReader.open(FSDirectory.open(new File("index")));
    IndexSearcher searcher = new IndexSearcher(reader);
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    QueryParser parser = new QueryParser(Version.LUCENE_31, "Review", analyzer);

    int length = Crawler.reviewList.size();
    for (int i = 200; i < length; i++) {
        String true_class;
        double r_stars = Crawler.reviewList.get(i).getStars();

        if (r_stars < 2.0) {
            true_class = "-1";
        } else if (r_stars > 3.0) {
            true_class = "1";
        } else {
            true_class = "0";
        }

        String[] reviewTokens = Crawler.reviewList.get(i).getReview().split(" ");
        String parsedReview = "";

        int j;

        for (j = 0; j < reviewTokens.length; j++) {
            if (reviewTokens[j] != null) {
                if (!((reviewTokens[j].contains("-")) || (reviewTokens[j].contains("!")))) {
                    parsedReview += reviewTokens[j] + " ";
                }
            } else {
                break;
            }
        }

        Query query = parser.parse(parsedReview); // CAUSING ERROR!!

        TopScoreDocCollector results = TopScoreDocCollector.create(5, true);
        searcher.search(query, results);
        ScoreDoc[] hits = results.topDocs().scoreDocs;

I strip the tokens containing the characters that I thought were causing the error, and I also check whether the next string is null, but the error persists.

This is the error stack trace:

    Exception in thread "main" org.apache.lucene.queryParser.ParseException: Cannot parse 'I made the choice ... be all "thumbs ': Lexical error at line 1, column 938.  Encountered: <EOF> after : "\"thumbs "
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:216)
        at Sentiment_Analysis.Classification.classify(Classification.java:58)
        at Sentiment_Analysis.Main.main(Main.java:17)
    Caused by: org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 938.  Encountered: <EOF> after : "\"thumbs "
        at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1229)
        at org.apache.lucene.queryParser.QueryParser.jj_scan_token(QueryParser.java:1709)
        at org.apache.lucene.queryParser.QueryParser.jj_3R_2(QueryParser.java:1598)
        at org.apache.lucene.queryParser.QueryParser.jj_3_1(QueryParser.java:1605)
        at org.apache.lucene.queryParser.QueryParser.jj_2_1(QueryParser.java:1585)
        at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1280)
        at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1266)
        at org.apache.lucene.queryParser.QueryParser.Clause(QueryParser.java:1313)
        at org.apache.lucene.queryParser.QueryParser.Query(QueryParser.java:1266)
        at org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1226)
        at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:206)
        ... 2 more
    Java Result: 1

Please help me solve this problem. I have been banging my head against it for hours now!

Columbic answered 21/4, 2012 at 14:41 Comment(0)

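The parser treats the unmatched double quote before "thumbs" as the start of a phrase query and reaches the end of the input before finding the closing quote. You should escape the double quote and other special characters with QueryParser.escape: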

    Query query = parser.parse(QueryParser.escape(parsedReview));

As the QueryParser.escape Javadoc states:

Returns a String where those characters that QueryParser expects to be escaped are escaped by a preceding '\'.
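
To make the effect concrete, here is a minimal standalone sketch of what such escaping does, written in plain Java with no Lucene dependency. The class name `QueryEscapeSketch` and the exact list of special characters are assumptions for illustration (the set Lucene escapes depends on the version), so treat it as a demonstration rather than a replacement; real code should call QueryParser.escape.

```java
// Illustration only: a hypothetical stand-in that mimics the idea behind
// QueryParser.escape. It is NOT Lucene's implementation.
final class QueryEscapeSketch {

    // Characters treated as query syntax in this sketch. This list is an
    // assumption; the exact set depends on the Lucene version in use.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&";

    // Returns a copy of text with a backslash placed before every
    // character that appears in SPECIAL.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The tail of the review from the stack trace: one unmatched quote.
        String review = "be all \"thumbs ";
        System.out.println(escape(review));
    }
}
```

Once the quote is preceded by a backslash, the parser reads it as a literal character instead of the opening of a phrase, so it no longer runs off the end of the input looking for the closing quote. Escaping the whole string also covers "-" and "!", so dropping the tokens that contain them by hand is no longer needed.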

Airboat answered 21/4, 2012 at 14:45 Comment(6)

Thanks a ton! It was spot on.. :D – Columbic 21/4, 2012 at 16:1

For those using a more recent release (Lucene 4.6 for me), the escape function has been moved to the QueryParserUtil class. – Reclusion 24/1, 2014 at 11:32

I want to do this using the Solr library instead of the Lucene library. Any ideas? – Cuneo 2/4, 2015 at 6:0

@ChunliangLyu in Lucene 4.10.4 escape() is still in QueryParser (inherited from QueryParserBase), but there is also one in QueryParserUtil, as you mention. I wonder what the difference is? – Pooch 4/12, 2015 at 16:11

@Pooch Yes, you are right, QueryParser inherits the method from QueryParserBase. I have checked the implementations in QueryParserBase and QueryParserUtil in the current revision, and it turns out they are exactly the same. So there is no functional difference, perhaps only a tiny performance difference. – Reclusion 5/12, 2015 at 2:53

Is it considered a vulnerability if users can put in & parse arbitrary values that aren't escaped? – Dniester 22/10, 2017 at 6:31

I recognise this problem.

Declaring the GROUP BY before the WHERE declaration works fine in Teradata, but throws an error while parsing.

To fix, move the GROUP BY declaration after the WHERE declaration.

Keepsake answered 5/6, 2017 at 12:17 Comment(0)
