Lucene: best way to do "starts-with" queries
I want to be able to do the following types of queries:

The data to index consists of (let's say) music videos, where only the title is interesting. I simply want to index these and then create queries such that, whatever word or words the user types, the documents containing those words, in that order, at the beginning of the title are returned first, followed (in no particular order) by documents containing at least one of the searched words at any position in the title. All of this should be case insensitive.

Example:

For documents:

  • Video1Title = Sea is blue
  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever
  • Video4Title = Seaside Whatever

If I search "sea" I want to get

  • "Video1Title = Sea is blue"

first followed by all the other documents that contain "sea" in title, but not at the beginning.

If I search "Wild sea" I want to get

  • Video2Title = Wild sea
  • Video3Title = Wild sea Whatever

first followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as title prefix.

If I search "Seasi" I don't want to get anything (I'm not interested in keyword tokenization and prefix queries).
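The behaviour requested above can be pinned down with a small reference model. This is only a plain-Python sketch of the desired ordering, not Lucene code: the names tokenize and rank are made up for illustration, and lowercasing plus whitespace splitting is assumed as a stand-in for a real analyzer.

```python
def tokenize(text):
    # Stand-in for an analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def rank(titles, query):
    """Return titles that start with the query words first, then titles
    that contain at least one query word anywhere (whole words only)."""
    q = tokenize(query)
    if not q:
        return []
    prefix_hits, other_hits = [], []
    for title in titles:
        tokens = tokenize(title)
        if tokens[:len(q)] == q:
            prefix_hits.append(title)   # query words open the title, in order
        elif set(q) & set(tokens):
            other_hits.append(title)    # at least one whole-word match
    return prefix_hits + other_hits


TITLES = ["Sea is blue", "Wild sea", "Wild sea Whatever", "Seaside Whatever"]
```

With the four example titles, rank(TITLES, "sea") puts "Sea is blue" first, and rank(TITLES, "Seasi") returns nothing because only whole words are compared.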

Now, as far as I can see, there's no direct way to tell Lucene "find me documents where word1, word2, etc. are in positions 1, 2, etc."

There are "workarounds" to simulate that behaviour:

  • Index the field twice. In field1 you have the words tokenized (using perhaps StandardAnalyzer) and in field2 you have them all clumped up into one element (using KeywordAnalyzer). Then you search with something like:

    +field1:(word1 word2 word3) field2:word1\ word2\ word3*

effectively telling Lucene: "Documents must contain word1 or word2 or word3 in the title, and those whose title starts with 'word1 word2 word3' are better (get a higher score)."

  • Add a "lucene_start_token" to the beginning of the field when indexing them such that Video2Title = Wild sea is indexed as "title:lucene_start_token Wild sea" and so on for the rest

Then do a query such that:

+(title:sea) (title:"lucene_start_token sea")

This makes Lucene return all documents that contain my search word(s) in the title, and give a better score to those that matched "lucene_start_token + search words".
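Both workarounds come down to building a query string from the user's words. The following is a rough sketch of that step, assuming classic Lucene query parser syntax, assuming the indexed values were lowercased, and using hypothetical helper names (escape_term, two_field_query, start_token_query). The field names and the start token are the ones used above.

```python
import re

# Characters the classic query parser treats as syntax. Whitespace is
# escaped too, so a multi-word prefix stays a single term.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')


def escape_term(text):
    return _SPECIAL.sub(r'\\\1', text)


def _words(query):
    words = query.lower().split()
    if not words:
        raise ValueError("query must contain at least one word")
    return words


def two_field_query(query, tokenized_field="field1", keyword_field="field2"):
    # Required clause: any word in the tokenized field.
    # Optional clause: whole-title prefix match in the keyword field.
    words = _words(query)
    any_word = " ".join(escape_term(w) for w in words)
    prefix = escape_term(" ".join(words))
    return f"+{tokenized_field}:({any_word}) {keyword_field}:{prefix}*"


def start_token_query(query, field="title", start_token="lucene_start_token"):
    # Optional clause: phrase anchored on the artificial start token.
    words = [escape_term(w) for w in _words(query)]
    any_word = " ".join(words)
    phrase = " ".join([start_token] + words)
    return f'+{field}:({any_word}) {field}:"{phrase}"'
```

For the query "Wild sea", two_field_query produces +field1:(wild sea) field2:wild\ sea* and start_token_query produces +title:(wild sea) title:"lucene_start_token wild sea".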

My question is then: are there better ways to do this (maybe using PhraseQuery and term positions)? If not, which of the above is better performance-wise?

Wardle answered 21/2, 2013 at 15:17 Comment(0)
P
6

You can use Lucene payloads for that. They let you attach a custom boost to every term of the field value.

So, when you index your titles you can start using a boost factor of 3 (for example):

title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5

title: sea|3.0 creatures|2.5

Indexed this way, terms nearer the start of the title get a higher boost.

The main problem with this approach is that you have to tokenize the text yourself and add all the boost information "manually", because the analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).
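The "manual" preprocessing step could look roughly like this. It is a sketch under assumptions: the starting boost of 3.0 and the step of 0.5 are read off the examples above, the floor of 1.0 is an addition of this sketch, and the function name is hypothetical. The output is meant to be fed to an analyzer that splits term and payload on "|" (Lucene ships a delimited payload token filter for this, as far as I know).

```python
def with_position_boosts(title, start=3.0, step=0.5, floor=1.0, delimiter="|"):
    """Rewrite a title as 'term|boost' pairs, with the boost decreasing
    by `step` per position and never dropping below `floor`."""
    parts = []
    for position, token in enumerate(title.lower().split()):
        boost = max(floor, start - position * step)
        parts.append(f"{token}{delimiter}{boost:.1f}")
    return " ".join(parts)


if __name__ == "__main__":
    print(with_position_boosts("Wild creatures blue sea"))
    print(with_position_boosts("Sea creatures"))
```

Whitespace tokenization here has to agree with whatever tokenizer the payload-aware analyzer uses, otherwise the term|boost pairs will not line up with the indexed terms.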

Petrochemistry answered 26/2, 2013 at 0:0 Comment(2)
This is a great way to do this. It's true that it complicates the query somewhat compared to the other approaches, but I think it's the most efficient (no extra fields, and the heavy work is done at index time, not at query time). The only potential downside I can see is when you start with an already complex query. Balancing that with the new queries added by this approach might get tricky. – Wardle
Why not rely on positions and use a SpanFirstQuery, as suggested here? Does it fit your requirements entirely? – Fennie
B
1

What you could do is index the title and each token separately, e.g. the text "wild deep blue endless sea" would be indexed like:

title: wild deep blue endless sea
t1: wild
t2: deep
t3: blue
t4: endless
t5: sea

Then if someone queries "wild deep", the query would be rewritten into

title:"wild deep" OR (t1:wild AND t2:deep)

This way you will always find all matching documents (if they match title) but matching t1..tN tokens will score the relevant documents higher.
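As a rough illustration of both halves of this idea (building the per-position fields at index time and rewriting the query at search time), here is a plain-Python sketch. The function names are hypothetical, a dict stands in for a Lucene document, and lowercasing plus whitespace splitting is assumed in place of the real analyzer.

```python
def positional_fields(title):
    """Return the field/value pairs to index for one title:
    the full title plus one field per token position (t1, t2, ...)."""
    doc = {"title": title}
    for position, token in enumerate(title.lower().split(), start=1):
        doc[f"t{position}"] = token
    return doc


def rewrite_query(query):
    """Rewrite user input into: phrase match on title OR all words
    at their exact positions in the t1..tN fields."""
    words = query.lower().split()
    if not words:
        raise ValueError("query must contain at least one word")
    phrase = " ".join(words)
    positional = " AND ".join(
        f"t{position}:{word}" for position, word in enumerate(words, start=1)
    )
    return f'title:"{phrase}" OR ({positional})'
```

rewrite_query("wild deep") gives title:"wild deep" OR (t1:wild AND t2:deep), the same shape as the rewritten query above.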

Botulin answered 25/2, 2013 at 16:33 Comment(3)
Thanks, that's a good idea. And this way I can match whole words only if I want to (as opposed to my one extra field, which is keyword-analyzed and then searched with a prefix query ending in *, where I never know whether the last word matched completely or not). I just have to use the same analyzer(s) that I use on the title field to obtain all the tokens, and then re-index each one in a separate field. Right? :) – Wardle
Yes, you are correct - you have to use the same analyzer. I should probably have mentioned this in my answer. – Botulin
This is a great approach, better than those I found myself (mentioned in the question). You do end up with a lot of fields, and this can lead to problems if the index is already very big (the index is split more often and/or more heap memory is needed). However, for a manageable index size this approach is a good balance between simplicity of implementation and accuracy of results. – Wardle
P
0

One should create a KeywordField or StringField field and search it with a PrefixQuery.

About PrefixQuery:

A Query that matches documents containing terms with a specified prefix. A PrefixQuery is built by QueryParser for input like app*.

About KeywordField:

Field that indexes a per-document String or BytesRef into an inverted index for fast filtering, stores values in a columnar fashion using DocValuesType.SORTED_SET doc values for sorting and faceting, and optionally stores values as stored fields for top-hits retrieval. This field does not support scoring: queries produce constant scores. If you need more fine-grained control you can use StringField ...

About StringField:

A field that is indexed but not tokenized: the entire String value is indexed as a single token.
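To see what a prefix match over single-token titles does with the question's examples, here is a small plain-Python model. It is not Lucene API: a sorted list stands in for the term dictionary, prefix_matches is a made-up name, and the titles are assumed to be lowercased before indexing (an untokenized field is not lowercased for you).

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with `prefix`. In a sorted term list
    they form one contiguous run beginning at the insertion point."""
    hits = []
    for term in sorted_terms[bisect_left(sorted_terms, prefix):]:
        if not term.startswith(prefix):
            break
        hits.append(term)
    return hits


TITLES = ["Sea is blue", "Wild sea", "Wild sea Whatever", "Seaside Whatever"]
# Each whole title is a single term, lowercased at "index time".
TERMS = sorted(title.lower() for title in TITLES)
```

Note that prefix_matches(TERMS, "seasi") returns "seaside whatever", and "sea" matches it too, so on its own this matches partial words, which the question explicitly wanted to avoid.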

Phonate answered 25/10, 2023 at 13:9 Comment(0)
