Why does CONTAINS find unequal text strings in JCR-SQL2?
Working with a JCR-SQL2 query, I noticed that the CONTAINS operator matches nodes that do not contain exactly the string given in the condition.

Example

The following query:

SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, 'my/search-expression')

finds not only nodes that contain the string my/search-expression, but also nodes with strings like my/another/search/expression.

Why does the query not find only the exact string provided? How could it be changed to narrow down the results?

This question is intended to be answered by myself, for knowledge sharing - but feel free to add your own answer or improve an existing one.

Detestable answered 3/1, 2018 at 12:37 Comment(0)
An execution plan for the example query reveals the root cause of the problem:

  [nt:base] as [s] /* lucene:lucene(/oak:index/lucene) +:fulltext:my +:fulltext:search +:fulltext:expression ft:("my/search-expression") where contains([s].[*], 'my/search-expression') */

The CONTAINS operator triggers a full text search. Non-word characters, like "/" or "-", are used as word delimiters. As a result, the query looks for all nodes that contain the words: "my", "search" and "expression".
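The splitting behaviour can be illustrated with a plain Java sketch. This is only an approximation of what the full-text analyzer does, not Oak's actual implementation:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizeDemo {
    // Approximate the analyzer: runs of non-word characters such as
    // '/' and '-' act as token delimiters.
    static List<String> tokens(String text) {
        return Arrays.asList(text.toLowerCase().split("\\W+"));
    }

    public static void main(String[] args) {
        System.out.println(tokens("my/search-expression"));
        // → [my, search, expression]
        System.out.println(tokens("my/another/search/expression"));
        // → [my, another, search, expression]
    }
}
```

Both strings tokenize to a set that contains "my", "search" and "expression", which is why both nodes match the unquoted CONTAINS condition.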

What can be done about it? There are several options.

1. Use double quotes

If you want to limit the results to phrases with the given words in exact order, and without any other words between them, put the search expression inside double quotes:

SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"')

Now, the execution plan is different:

[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where contains([s].[*], '"my/search-expression"') */

The query will now look for the whole phrase, not single words. However, it still ignores non-word characters, so phrases like "my search expression" or "my-search-expression" would also be found.
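A rough way to see why those variants still match: if runs of non-word characters are normalized to spaces before comparison, all three phrases collapse to the same token sequence. The sketch below is an approximation of the analyzer's effect, not Oak's code:

```java
import java.util.Locale;

public class PhraseDemo {
    // Replace runs of non-word characters with a single space, mimicking
    // how the analyzer reduces all three variants to the same phrase.
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("\\W+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("my/search-expression")); // my search expression
        System.out.println(normalize("my search expression")); // my search expression
        System.out.println(normalize("my-search-expression")); // my search expression
    }
}
```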

2. Use LIKE expression (not recommended)

If you want to find only the exact phrase, keeping non-word characters, you can use the LIKE expression:

SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%'

This is, however, much slower. I had to add another condition to avoid a timeout while explaining the execution plan. For this query:

SELECT * FROM [nt:base] AS s WHERE s.* LIKE '%my/search-expression%'  AND ISDESCENDANTNODE([/content/my/content]) 

the execution plan is:

[nt:base] as [s] /* traverse "/content/my/content//*" where ([s].[*] like '%my/search-expression%') and (isdescendantnode([s], [/content/my/content])) */

It would find only nodes with this phrase: "my/search-expression".

3. Use double quotes and refine the results

It is probably better to use the first approach (CONTAINS with double quotes) and refine the results afterwards, for example in application code if the query is run from an application.
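A minimal sketch of such refinement: keep only hits whose text really contains the exact phrase, non-word characters included. The strings below stand in for property values read from the result nodes; in real code they would come from the JCR query result.

```java
import java.util.List;
import java.util.stream.Collectors;

public class RefineDemo {
    // After the double-quoted CONTAINS query, drop hits that only matched
    // the tokenized phrase but not the exact string.
    static List<String> refine(List<String> hits, String exact) {
        return hits.stream()
                   .filter(text -> text.contains(exact))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> hits = List.of(
            "my/search-expression",  // exact match  - kept
            "my search expression",  // phrase match - dropped
            "my-search-expression"); // phrase match - dropped
        System.out.println(refine(hits, "my/search-expression"));
        // → [my/search-expression]
    }
}
```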

4. Mix CONTAINS and LIKE

Another option is to combine the full-text search and the LIKE expression with AND:

SELECT * FROM [nt:base] AS s WHERE CONTAINS(s.*, '"my/search-expression"') AND s.* LIKE '%my/search-expression%'

The execution plan is now:

[nt:base] as [s] /* lucene:lucene(/oak:index/lucene) :fulltext:"my search expression" ft:("my/search-expression") where (contains([s].[*], '"my/search-expression"')) and ([s].[*] like '%my/search-expression%') */

Now, it should be fast and strict at the same time: the full-text index narrows the candidates quickly, and the LIKE condition filters them down to exact matches.

Detestable answered 3/1, 2018 at 12:37 Comment(0)
I had the same problem.

Basically, you should define a different tokenizer for your Lucene index; in my case the "Whitespace" tokenizer was just fine.

With the Standard tokenizer, "my/search-expression" is split into three tokens: "my", "search", "expression". The Standard tokenizer uses certain special characters as delimiters.

That is why a search for "my/search-expression" can return 0 results.

Another example:

With the Whitespace tokenizer, "some-other my search/expression" is split into:

"some-other", "my", "search/expression"

A search for "some-other my" should then return results.
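The difference between the two tokenizers can be sketched in plain Java. Both methods are rough approximations (the real Lucene tokenizers do more), but they show which boundaries each one splits on:

```java
import java.util.Arrays;
import java.util.List;

public class TokenizerCompare {
    // Rough approximation of the Standard tokenizer: special characters
    // like '/' and '-' are also treated as delimiters.
    static List<String> standard(String s) {
        return Arrays.asList(s.split("[\\s/\\-]+"));
    }

    // Rough approximation of the Whitespace tokenizer: splits on
    // whitespace only, keeping '/' and '-' inside the tokens.
    static List<String> whitespace(String s) {
        return Arrays.asList(s.split("\\s+"));
    }

    public static void main(String[] args) {
        String text = "some-other my search/expression";
        System.out.println(standard(text));   // [some, other, my, search, expression]
        System.out.println(whitespace(text)); // [some-other, my, search/expression]
    }
}
```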

List of tokenizers

Lucene index example:

<yourLucene
    jcr:primaryType="oak:QueryIndexDefinition"
    type="lucene"
    async="async"
    evaluatePathRestrictions="{Boolean}true"
    includedPaths="[/somepath]"
    queryPaths="[/somepath]"
    compatVersion="{Long}2">
    <analyzers jcr:primaryType="nt:unstructured">
        <default jcr:primaryType="nt:unstructured">
            <tokenizer
                jcr:primaryType="nt:unstructured"
                name="Whitespace"/>
            <filters jcr:primaryType="nt:unstructured">
                <Standard jcr:primaryType="nt:unstructured"/>
                <LowerCase jcr:primaryType="nt:unstructured"/>
                <Stop jcr:primaryType="nt:unstructured"/>
            </filters>
        </default>
    </analyzers>
    <indexRules jcr:primaryType="nt:unstructured">
        <nt:unstructured jcr:primaryType="nt:unstructured">
            <properties jcr:primaryType="nt:unstructured">
                <someprop
                    jcr:primaryType="nt:unstructured"
                    name="someprop"
                    propertyIndex="{Boolean}true"
                    type="String"/>
            </properties>
        </nt:unstructured>
    </indexRules>
</yourLucene>
Selfinduced answered 8/9, 2020 at 10:54 Comment(3)
Interesting idea. It can be useful in some cases, I believe. Thanks! – Detestable
Just a note: I would not get 0 results for the "my/search-expression" query string with the standard Lucene index. I would get even more results than I want, as I wrote in the question. – Detestable
Oh, now I see. I had a case where "my/search-expression" returned 0 results. I guess it all depends on which tokenizer your index is using, or whether you have the option to define it at query time. But never mind, maybe it will help someone. I got a few extra gray hairs before figuring out what was happening. – Selfinduced
