Lucene Proximity Search for phrase with more than two words

Lucene's manual clearly explains what a proximity search means for a phrase with two words, such as the "jakarta apache"~10 example in http://lucene.apache.org/core/2_9_4/queryparsersyntax.html#Proximity Searches

However, what exactly does a search like "jakarta apache lucene"~10 do? Does it allow neighboring words to be at most 10 words apart, or does that limit apply to every pair of words?

Thanks!

Valid answered 28/8, 2014 at 21:19 Comment(0)

The slop (proximity) works like an edit distance (see PhraseQuery.setSlop): the terms can be reordered or have extra terms inserted between them. The slop is therefore a budget for the whole phrase rather than a limit per pair of words. For extra terms, it is the maximum number of terms that may be added across the whole query. That is:

"jakarta apache lucene"~3

Will match:

  • "jakarta lucene apache" (distance: 2)
  • "jakarta extra words here apache lucene" (distance: 3)
  • "jakarta some words apache separated lucene" (distance: 3)

But not:

  • "lucene jakarta apache" (distance: 4)
  • "jakarta too many extra words here apache lucene" (distance: 5)
  • "jakarta some words apache further separated lucene" (distance: 4)
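
To make the "extra terms" part of this concrete, here is a minimal sketch in plain Java. It is not Lucene code, and `InOrderSlop`, `extraWords` and `matches` are made-up names for illustration. It assumes whitespace-separated lowercase words and only counts the words sitting between the phrase terms when they appear in query order, so it covers the in-order examples above and deliberately does not model reordering.

```java
import java.util.Arrays;
import java.util.List;

class InOrderSlop {
    /**
     * Hypothetical helper (not a Lucene API): returns the smallest number of
     * extra words lying between the phrase terms when they occur in query
     * order in the text, or -1 if there is no in-order occurrence.
     */
    static int extraWords(String phrase, String text) {
        String[] terms = phrase.split("\\s+");
        List<String> words = Arrays.asList(text.split("\\s+"));
        int best = -1;
        // Try every occurrence of the first term as a starting point.
        for (int start = 0; start < words.size(); start++) {
            if (!words.get(start).equals(terms[0])) continue;
            int pos = start;
            boolean found = true;
            for (int i = 1; i < terms.length && found; i++) {
                // Take the nearest later occurrence of the next term.
                int next = -1;
                for (int j = pos + 1; j < words.size(); j++) {
                    if (words.get(j).equals(terms[i])) { next = j; break; }
                }
                if (next < 0) found = false; else pos = next;
            }
            if (found) {
                // Span covered by the match minus the slots the terms themselves use.
                int extra = (pos - start) - (terms.length - 1);
                if (best < 0 || extra < best) best = extra;
            }
        }
        return best;
    }

    /** True if an in-order occurrence exists within the given slop. */
    static boolean matches(String phrase, String text, int slop) {
        int extra = extraWords(phrase, text);
        return extra >= 0 && extra <= slop;
    }

    public static void main(String[] args) {
        String q = "jakarta apache lucene";
        System.out.println(extraWords(q, "jakarta extra words here apache lucene"));              // 3
        System.out.println(extraWords(q, "jakarta some words apache separated lucene"));          // 3
        System.out.println(extraWords(q, "jakarta too many extra words here apache lucene"));     // 5
        System.out.println(extraWords(q, "jakarta some words apache further separated lucene"));  // 4
    }
}
```

The point of the sketch is that the count is taken over the whole phrase: three extra words fit within a slop of 3 whether they all sit in one gap or are spread over both gaps.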

Some people have been confused by:

"lucene jakarta apache" (distance: 4)

The simple explanation is that swapping terms takes two edits, so:

  1. jakarta apache lucene (distance: 0)
  2. jakarta lucene apache (first swap, distance: 2)
  3. lucene jakarta apache (second swap, distance: 4)

The longer, but more accurate, explanation is that each edit moves one term by one position. The first move of a swap places two terms on top of each other, at the same position. With this in mind, it becomes clear why any set of three terms can be rearranged into any order with a distance no greater than 4:

  1. jakarta apache lucene (distance: 0)
  2. jakarta [apache,lucene] (distance: 1)
  3. [jakarta,apache,lucene] (all transposed at the same position, distance: 2)
  4. lucene [jakarta,apache] (distance: 3)
  5. lucene jakarta apache (distance: 4)
Churchwoman answered 28/8, 2014 at 23:4 Comment(5)
I am unable to find a working Java code example that searches for more than 2 terms. All the examples search for 2 words. If you have a code example, can you share it? - Camphorate
I extended the example from the following link by adding a 3rd term, but it is not working: javacodegeeks.com/2015/09/… - Camphorate
What is the distance between "one two three four" and "three four one two"? - Bulganin
The query words must be in the same order as in the document by default. For example, "jakarta apache"~10 will not match "download apache project jakarta" but will match "download jakarta from apache". You can turn off the proximity order requirement with parser.setInOrder(false). See the docs: lucene.apache.org/core/5_2_1/queryparser/org/apache/lucene/… - Lockman
If the query is "jakarta apache lucene"~3 and the text is 'jakarta jakarta apache lucene', will it match twice? - Whatever
