MarkLogic cts:element-query false positives?

Given this document:

<items>
  <item><type>T1</type><value>V1</value></item>
  <item><type>T2</type><value>V2</value></item>
</items>

unsurprisingly, I find that this query returns the document when passed to cts:uris():

cts:and-query((
  cts:element-query(xs:QName('item'),
    cts:element-value-query(xs:QName('type'),'T1')
    ),
  cts:element-query(xs:QName('item'),
    cts:element-value-query(xs:QName('value'),'V2')
    )
  ))

but somewhat surprisingly (to me at least), I find that this one does too:

cts:element-query(xs:QName('item'),
  cts:and-query((
    cts:element-value-query(xs:QName('type'),'T1'),
    cts:element-value-query(xs:QName('value'),'V2')
    ))
  )

This doesn't seem right, as there is no single item with type=T1 and value=V2. To me this seems like a false positive.

Have I misunderstood how cts:element-query works? (I have to say that the documentation isn't particularly clear in this area).

Or is this a case where MarkLogic does its best to give me the result I expect, and with more or better indexes in place I would be less likely to get a false positive match?

Painkiller answered 23/5, 2016 at 18:35 Comment(0)

In addition to the answer by @wst: you only need to enable element value positions to get accurate results from an unfiltered search. Here is some code to show this:

xdmp:document-insert("/items.xml", <items>
  <item><type>T1</type><value>V1</value></item>
  <item><type>T2</type><value>V2</value></item>
</items>);

cts:search(collection(),
  cts:element-query(xs:QName('item'),
    cts:and-query((
      cts:element-value-query(xs:QName('type'),'T1'),
      cts:element-value-query(xs:QName('value'),'V2')
    ))
  ), 'unfiltered'
)

Without element value positions enabled this returns the test document. After enabling the positions, the query returns nothing.
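For reference, the setting can also be switched on from a script rather than the Admin UI. This is a minimal sketch using the Admin API; the database name "Documents" is an assumption, so substitute your own content database. Changing the setting will normally trigger a reindex, and the unfiltered result only changes once that has finished.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: Assumption: the content database is called "Documents". :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Turn on "element value positions" for that database. :)
let $config := admin:database-set-element-value-positions($config, $db, fn:true())
return admin:save-configuration($config)
```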

As said by @wst, cts:search() runs filtered by default, whereas cts:uris() (and, for instance, xdmp:estimate()) only run unfiltered.

HTH!

Voice answered 24/5, 2016 at 9:23 Comment(1)
This is much more in line with what I expected. Thanks. - Painkiller

Yes, I think this is a slight misunderstanding of how queries work. In cts:search, the default behavior is to enable the filtered option. In that mode MarkLogic first evaluates the query using only indexes, and then, once candidate documents have been selected, loads them into memory, inspects them, and filters out false positives. This is more time consuming, but more accurate.

cts:uris is a lexicon function, so queries passed to it will only resolve via indexes, and there is no option to filter false positives.
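To see the difference side by side, something like the following sketch can be run against the sample document from the question. It assumes the URI lexicon is enabled (cts:uris requires it) and that no position indexes have been turned on. The labels are only there to make the output readable.

```xquery
xquery version "1.0-ml";

(: The nested query from the question. :)
let $q :=
  cts:element-query(xs:QName('item'),
    cts:and-query((
      cts:element-value-query(xs:QName('type'), 'T1'),
      cts:element-value-query(xs:QName('value'), 'V2')
    ))
  )
return (
  (: Index resolution only; false positives are not removed. :)
  "cts:uris:", cts:uris((), (), $q),
  (: Same index resolution through cts:search. :)
  "unfiltered search:",
  for $doc in cts:search(fn:collection(), $q, 'unfiltered')
  return xdmp:node-uri($doc),
  (: Default filtered search; candidates are loaded and checked. :)
  "filtered search:",
  for $doc in cts:search(fn:collection(), $q)
  return xdmp:node-uri($doc)
)
```

With default index settings I would expect the first two to list /items.xml and the filtered search to list nothing, in line with what the question describes.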

The simple way to handle this query via indexes would be to change your schema such that documents are based on <item> instead of <items>. Then each item would have a separate index entry, and results would not be commingled before filtering.
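A rough sketch of that restructuring, assuming the source document lives at /items.xml (as in the other answer) and using a hypothetical /items/N.xml naming scheme for the new documents:

```xquery
xquery version "1.0-ml";

(: Write each <item> out as its own document.
   The "/items/" URI prefix is just an illustrative choice. :)
for $item at $i in fn:doc("/items.xml")/items/item
return xdmp:document-insert(fn:concat("/items/", $i, ".xml"), $item)
```

Once each item is its own document, a plain cts:and-query of the two cts:element-value-query terms can only match when both values occur in the same item, because the index resolves at the document (fragment) level. The original /items.xml would need to be deleted or kept out of the searched set, otherwise it still matches.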

Another way that doesn't involve updating documents is to wrap the queries you expect to occur in the same element in a cts:near-query. That would prevent a <type> in one <item> from matching with a <value> in a different <item>. I suggest reading the documentation because you may need to enable one or more position-based indexes for cts:near-query to be accurate.
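A sketch of what that could look like for the sample document. The distance of 1 and the 'ordered' option are assumptions tuned to this particular data, where <value> directly follows <type>, and the query is only expected to be accurate unfiltered if the relevant position indexes (word positions and element value positions) are enabled.

```xquery
xquery version "1.0-ml";

(: Assumption: position indexes are enabled on the database.
   The distance (1) is counted in words and fits the sample data only. :)
cts:search(fn:collection(),
  cts:element-query(xs:QName('item'),
    cts:near-query((
      cts:element-value-query(xs:QName('type'), 'T1'),
      cts:element-value-query(xs:QName('value'), 'V2')
    ), 1, 'ordered')
  ), 'unfiltered'
)
```

If <type> and <value> can be separated by other content, the distance has to grow accordingly, and at some point it stops distinguishing one <item> from the next.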

Proctology answered 23/5, 2016 at 19:8 Comment(4)
I understand the distinction between search and filter, and I did expect false positives sometimes; it's just that they are quite rare. What I had expected is that, because the two cts:element-value-query calls are within the same cts:element-query, their matches would have to be within the same instance of the element (named item), not merely within any old elements named item. The syntax does suggest that the two examples I give intend different things. I don't know if the cts:near-query approach is an answer in the general case; in actual fact type and value could be far apart. - Painkiller
@AndyKey In the first case that would only be true in a filtered search. The resolution of the index is only at a document level. The index doesn't "see" that those values are in different items, just that they return true for some items in a document. By enabling position indexes and using cts:near-query you can work around that. - Proctology
Accepted as answer. However, it does seem odd that a check is done that the type and value exist beneath the item (as opposed to merely returning all documents with type and value that match regardless of whether they are beneath an item), and yet no check that the matches occur beneath the same item. - Painkiller
@AndyKey This is a simplification, but one way to think about the index primitive is as key/value pairs where the key is the element QName, and insofar as it relates to your query the item doesn't have identity distinct from its name. Identity really only exists at the document/fragment level, which is why it doesn't differentiate between elements of the same name, even for a nested query like this. - Proctology
