Lucene: give more weight the closer a term is to the beginning of the title

I understand how to boost fields either at index time or query time. However, how could I increase the score of matching a term closer to the beginning of a title?

Example:

Query = "lucene"

Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"

I would like the first document to score higher since "lucene" is closer to the beginning (ignoring term freq for now).

I see how to use the SpanQuery for specifying the proximity between terms, but I'm not sure how to use the information about the position in the field.

I am using Lucene 4.1 in Java.

Jordans answered 1/3, 2013 at 10:1 Comment(5)
In the inverted index the terms don't have positions and a single term appears many times in a document (field). I don't see an obvious solution. - Bulbiferous
possible duplicate of Lucens best way to do "starts-with" queries - Ticonderoga
@MarkoTopolnik You can store positions in lucene and know where a term is. Span queries rely on positions in fact. SpanFirstQuery seems a good fit here. - Ultraviolet
@Ultraviolet I see no evidence of scoring according to position. Can you enlighten? - Bulbiferous
@MarkoTopolnik Have a look at my answer, sorry for keeping you on hold a little ;) - Ultraviolet

I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on term positions, which are enabled by default when indexing in Lucene.

Let's test it independently: you just have to provide your SpanTermQuery and the maximum position where the term can be found (one in my example).

SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);

Given your two documents, this query finds only the first one, titled "Lucene: Homepage", provided the field was analyzed with the StandardAnalyzer.
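To make the position check concrete, here is a minimal plain-Java sketch of the idea, not Lucene code. It assumes a simplified tokenizer (lowercase, split on non-alphanumeric characters) standing in for the StandardAnalyzer, zero-based token positions, and a match rule of "the term ends at or before the given position". The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SpanFirstModel {

    // Simplified stand-in for an analyzer: lowercase and split on non-alphanumerics.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the term occurs at a zero-based position pos with pos + 1 <= end,
    // i.e. the single-term span [pos, pos + 1) ends no later than 'end'.
    public static boolean matchesSpanFirst(String text, String term, int end) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            if (tokens.get(pos).equals(term) && pos + 1 <= end) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // end = 1 accepts the term only as the very first token.
        System.out.println(matchesSpanFirst("Lucene: Homepage", "lucene", 1));
        System.out.println(matchesSpanFirst("I have a question about lucene?", "lucene", 1));
        // A larger end widens the accepted prefix of the field.
        System.out.println(matchesSpanFirst("I have a question about lucene?", "lucene", 6));
    }
}
```

In this model, raising the second argument is all it takes to accept matches further into the title, which mirrors how the maximum position is passed to SpanFirstQuery.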

Now we can combine the SpanFirstQuery with a normal text query so that the span query only influences the score. You can do that with a BooleanQuery, adding the span query as a SHOULD clause:

Term term = new Term("title", "lucene");
TermQuery termQuery = new TermQuery(term);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));

There are probably other ways to achieve the same result, perhaps with a CustomScoreQuery or custom scoring code, but this seems to me the easiest one.

The code I used to test it prints the following output (scores included), running the TermQuery alone first, then the SpanFirstQuery alone, and finally the combined BooleanQuery:

------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242

Here is the complete code:

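import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanFirstQuery;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// The class name is arbitrary; the original snippet showed only the methods.
public class SpanFirstQueryTest {

    public static void main(String[] args) throws Exception {

        Directory directory = FSDirectory.open(new File("data"));

        // Note: re-running appends the same documents to the existing index in "data".
        index(directory);

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Term term = new Term("title", "lucene");

        // 1. Plain term query: matches both documents.
        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);

        // 2. Span query: matches only when the term is the first token.
        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);

        // 3. Combined: the term is required, the span query only adds to the score.
        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
        search(indexSearcher, booleanQuery);

        indexReader.close();
    }

    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));

        IndexWriter writer = new IndexWriter(directory, config);

        // Index positions explicitly, since span queries depend on them.
        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);

        Document document = new Document();
        document.add(new Field("title", "I have a question about lucene", titleFieldType));
        writer.addDocument(document);

        document = new Document();
        document.add(new Field("title", "Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);

        writer.close();
    }

    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + topDocs.totalHits);

        // Print every stored field of each hit together with the hit's score.
        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() + " - score: " + hit.score);
            }
        }
    }
}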
Ultraviolet answered 1/3, 2013 at 20:4 Comment(6)
This only boosts the first position, right? So the scoring is really implemented by hand, there's no support from the SpanQuery. One could also add more clauses for positions further out, but it would become quite unwieldy, especially if two or more terms were scored this way. Let's say I don't call this an "obvious" solution :) - Bulbiferous
It boosts the first position if you say new SpanFirstQuery(spanTermQuery, 1). You can increase the maximum position as you prefer; you don't need to add other clauses. I don't know what you mean by "by hand": I haven't done any scoring manually, I just used what Lucene exposes out of the box. Obvious or not is personal. From my perspective it's easy since I didn't have to write custom code around similarity/scorers etc. - Ultraviolet
I used it by breaking up the query into tokens and then creating a SpanFirstQuery for each token with increasing max position (1 for the first token, 2 for the second). It significantly boosted my previous relevance score so I'd say it worked great. - Jordans
Is there a way to do this directly in HTTP requests, or in the Solr UI? - Cothurnus
Does it depend on the Analyzer we're using? I'm using KeywordAnalyzer in Lucene 4.0 and this just doesn't seem to work. Check this: #41447505 - Rummer
Can we boost based on the query text appearing in both fields? For example, can we boost a document if the query text exists in both the Title and Description fields? - Leventis
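
The per-token variant mentioned in the comments above (one SpanFirstQuery per query token, with maximum position 1 for the first token, 2 for the second, and so on) can be illustrated with a small plain-Java model. This is only a sketch of the matching rule, not Lucene code: the tokenizer is a simplified assumption, and the class and method names are hypothetical. It counts how many of those optional clauses a given title would satisfy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PerTokenSpanFirstModel {

    // Simplified stand-in for an analyzer: lowercase and split on non-alphanumerics.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Counts query tokens found inside their allowed title prefix:
    // the i-th query token (zero-based) must end no later than position i + 1.
    public static int matchedPrefixClauses(String query, String title) {
        List<String> queryTokens = tokenize(query);
        List<String> titleTokens = tokenize(title);
        int matched = 0;
        for (int i = 0; i < queryTokens.size(); i++) {
            int maxEnd = i + 1;
            int limit = Math.min(maxEnd, titleTokens.size());
            for (int pos = 0; pos < limit; pos++) {
                if (titleTokens.get(pos).equals(queryTokens.get(i))) {
                    matched++;
                    break;
                }
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        System.out.println(matchedPrefixClauses("lucene homepage", "Lucene: Homepage"));
        System.out.println(matchedPrefixClauses("lucene homepage", "Homepage of Lucene"));
        System.out.println(matchedPrefixClauses("lucene", "I have a question about lucene?"));
    }
}
```

In a real query each satisfied clause would contribute its own score as a SHOULD clause, so titles that start with the query tokens in order would tend to rank higher.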

From the book "Lucene In Action 2":

"Lucene provides a built-in query PayloadTermQuery, in the package org.apache.lucene.search.payloads. This query is just like SpanTermQuery in that it matches all documents containing the specified term and keeps track of the actual occurrences (spans) of the matches.

But then it goes further by enabling you to contribute a scoring factor based on the payloads that appear at each term’s occurrence. To do this, you’ll have to create your own Similarity class that defines the scorePayload method, like this"

public class BoostingSimilarity extends DefaultSimilarity {
    // Signature as quoted from the book, which targets Lucene 3.x.
    public float scorePayload(int docID, String fieldName,
                              int start, int end, byte[] payload,
                              int offset, int length) {
        ....
    }
}

"start" in the above code is nothing but start position of the payload. Payload is associated with the term. So the start-position also applies to the term (at least that's what I believe..)

By using the above code, but disregarding the payload, you will have access to the "start" position at the place of scoring and then you may boost the score based on that start value.

For example: new score = original score * (1.0f / start-position)
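
One caveat with that formula: if the start position is zero-based, as Lucene span positions normally are, a term at the very beginning of the field would cause a division by zero. A minimal sketch of a safer variant follows, in plain Java with no Lucene dependency. The class and method names are hypothetical, and the `start + 1` shift is an assumption added here, not part of the original suggestion.

```java
public class PositionBoost {

    // Boost factor that decays with distance from the start of the field.
    // 'start' is assumed to be a zero-based token position, so it is shifted
    // by one to keep the first token at factor 1.0 and avoid dividing by zero.
    public static float boost(int start) {
        if (start < 0) {
            throw new IllegalArgumentException("start must be >= 0");
        }
        return 1.0f / (start + 1);
    }

    // Applies the position factor to an already computed score.
    public static float rescore(float originalScore, int start) {
        return originalScore * boost(start);
    }

    public static void main(String[] args) {
        System.out.println(rescore(2.0f, 0)); // term is the first token
        System.out.println(rescore(2.0f, 3)); // term is the fourth token
    }
}
```

A 1/(position + 1) decay falls off quickly; a gentler curve, such as one based on a logarithm, may suit longer titles better.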

I hope this works; please post here if you find a more efficient solution.

Seawards answered 1/3, 2013 at 19:59 Comment(2)
This sounded good but I could not get it to work in Lucene 4.1 - Jordans
The scorePayload() function is not called if you don't have actual payloads. Same for the PayloadFunction parameter of PayloadTermQuery - Lin
