XPath pulling more than one match

The (BaseX) error

I am running queries on a large dataset in BaseX, but one XQuery is crashing my programme with the error [XPTY0004] Item expected, sequence found: (attribute begin {"6"}, ...).

In my query I am trying to make sure that one element comes before another by comparing begin - an attribute that is present in XML - with number(). But whenever I try the most basic of XQueries (return matching nodes) on my dataset (e.g. with this online tool) I get an error that resembles the one that I had before:

[Error] SaxonCE.XSLT20Processor 14:08:39.692 SEVERE: XPathException in invokeTransform: A sequence of more than one item is not allowed as the first argument of number() ("6", "10")

So I am guessing that something is going on with the node's siblings, i.e. that there is more than one of these nodes and that it is unclear which one should be compared. Examples follow below.

Why does the order matter?

The XPath is used in a query engine for treebanks: linguistically annotated corpora. In some cases we want nodes to match in order, and sometimes it does not matter. As a simplistic example: sometimes we want to match something as specific as the concerned man, where the order article, adjective, noun matters. In other queries the order doesn't matter, and we also want to match phrases such as the time available, where article, adjective, and noun can appear in any order.

In other words, in the first case the order of the elements should be respected, in the second one it shouldn't. Here is a possible XPath representation of such a construction that holds an article, an adjective, and a noun.

node[@cat="np" and node[@pt="art"] and node[@pt="adj"] and node[@pt="n"]]

By default, XPath does not care about the order of these elements and does a greedy search, i.e. it will also match items such as the time available (art, n, adj). But I want to rewrite the above XPath to make sure that the order of the nodes is respected, so that a construction such as the time available (art, n, adj) is not matched but the concerned man (art, adj, n) is.

# Possible representation of *the time available*
<node id="0" begin="1" cat="np">
    <node id="1" begin="1" pt="art" text="the" />        
    <node id="2" begin="2" pt="n" text="time" />
    <node id="3" begin="3" pt="adj" text="available" />
</node>

# Possible representation of *the concerned man*
<node id="0" begin="1" cat="np">
    <node id="1" begin="1" pt="art" text="the" />        
    <node id="2" begin="2" pt="adj" text="concerned" />
    <node id="3" begin="3" pt="n" text="man" />
</node>

One way to go about it is to use a numeric comparison of the begin attribute that is available in the corpus. Its values are numerically ascending, so if we want to ensure that the order of the XPath is intact, we can require that the numeric value of begin on each child node of @cat="np" is less than that of the next one by using number(). But as I showed above, this leads to an error, one that would not occur with the simple example code that I just showed.

Another example.

<node id="0" begin="2">
    <node id="1" begin="2">
        <node id="2" begin="2"/>
        <node id="3" begin="3"/>
    </node>
    <node id="4" begin="5">
        <node id="5" begin="5"/>
    </node>
    <node id="6" begin="6"/>
</node>

This XPath should match:

/node/node[number(@begin) < number(../node/@begin)]

But when put through an XQuery processor you'd get the error described above. A sequence of more than one item is not allowed as the first argument of number() ("2", "5", ...).


I tried the solution provided by @Michael Kay, but the same issue seems to occur.

XQuery

for $node in node[every $n in node[position() lt last()] satisfies (number($n/@begin) lt number($n/following-sibling::node/@begin))]
return $node

Data

<node id="0" begin="2">
    <node id="1" begin="2">
        <node id="2" begin="2"/>
        <node id="3" begin="3"/>
    </node>
    <node id="4" begin="5">
        <node id="5" begin="5"/>
    </node>
    <node id="6" begin="6"/>
</node>

Error

SaxonCE.XSLT20Processor 14:48:49.809 SEVERE: XPathException in invokeTransform: A sequence of more than one item is not allowed as the first argument of number() ("5", "6")


Update April 19th, 2017

I bumped into some unexpected behaviour today, which makes the solution provided by @har07 no longer sufficient. I had wrongly assumed that the not() clause only had an effect on the nodes in the XPath (and not on all the nodes in the XML). In other words, when the not() clause is added to the topmost node of the XPath, all its children in the XML must have a fixed, sorted word order. (Now that I read it like this, it seems only normal.) However, what I actually want is that the word order is only enforced on the nodes specified in the XPath, and not on possible other nodes in the matching XML. Hopefully an example will make this clearer.

Let's say that I want to match the following XPath, a cat="np" that contains rel="det" pt="vnw" lemma="die" and at least two times rel="mod" pt="adj".

//node[@cat="np" and node[@rel="det" and @pt="vnw" and @lemma="die"] and count(node[@rel="mod" and @pt="adj"]) > 1]

but with the added requirement that the order of this XPath is followed, i.e.

//node[
    @cat="np" and 
    not(node[
        position() < last()
    ][number(@begin) > following-sibling::node/number(@begin)]) and 
    node[
        @rel="det" and 
        @pt="vnw" and 
        @lemma="die"
    ] and 
    count(node[
        @rel="mod" and 
        @pt="adj"
    ]) > 1
]

So rel="det" has to occur before the two rel="mod"s in XML. This works fine, and all matches are correct, but not all expected matches are found. The cause is that the not() line targets all the XML child nodes rather than only the nodes specified in the XPath. If any child node further down violates the not() rule, there is no match, even if that node is not specified in the XPath. The above XPath, for instance, will not match the following XML because inside cat="np" there is a node whose begin attribute is larger than that of its next sibling, which is not allowed by the not() rule.

<node begin="4" cat="np" id="8" rel="obj1">
    <node begin="4" id="9" pos="det" pt="vnw" rel="det" word="die" lemma="die" />
    <node begin="5" id="10" pos="adj" pt="adj" rel="mod" word="veelzijdige" />
    <node begin="6" id="11" pos="adj" pt="adj" rel="mod" word="getalenteerde" />
    <node begin="7" id="12" pos="noun" pt="n" rel="hd" word="figuren" />
    <node begin="8" id="31" index="1" rel="obj1" />
    <node begin="2" id="32" index="2" rel="obj2" />
</node>

However, I would like this cat="np" to match, and make the not() function less aggressive, i.e. only require that nodes specified in XPath (in this example rel="det" pt="vnw" lemma="die", and the two rel="mod" pt="adj" nodes) follow the order requirement where the begin attribute should be smaller than the next item of the XPath structure. Other items inside cat="np" that have not been specified in XPath are allowed to have an attribute that is larger than its next sibling.

Note that the last item of the XPath structure (which would match id="11" in the example XML) does not necessarily have to have a begin attribute that is lower than its following node in XML (which is not specified in the XPath).

As before, I am especially interested in how to solve this with a pure XPath option, but XQuery alternatives are also welcome. Preferably as a function that takes an XPath structure as input, and applies the 'word order' to its topmost node and all its descendants. Example code and usage with the XPath shown here as an example is encouraged.

Presley answered 8/3, 2017 at 13:56 Comment(4)
I think "that there are more than one of these nodes, and that it's unclear which should be compared" is right but we can't tell which node exactly you want to compare if there are several selected, so we can't tell you how to fix it. You will need to explain in plain text which condition between which nodes you want to compare to allow us to suggest fixes to your XQuery. - Iolaiolande
@MartinHonnen Please see my edit. - Presley
Why does the order of the nodes matter? What purpose does that serve? That might help give context to your situation. - Haemorrhage
@TonyAbrams The XPath is used in a query engine for treebanks: linguistically annotated corpora. In some cases we want nodes to match in order, and sometimes it does not matter. As a simplistic example: sometimes we want to match something as specific as the concerned man where the order article, adjective, noun matters. In other queries it doesn't matter, and we want to match phrases such as the time available as well, where the order of article, adjective, noun can be in any order. - Presley

Regarding the a-sequence-of-more-than-one-item-is-not-allowed exception you're facing, notice that XPath 2.0 and above, and XQuery, support a function call as a path step (.../number()). That means you can call number() on each individual node, passing a single begin attribute at a time, to avoid the exception :

/node/node[number(@begin) < ../node/number(@begin)]

However, the predicate expression used in the XPath above evaluates to true when there is at least one sibling node whose begin attribute value is greater than the begin attribute of the current node, which does not seem to be the desired behavior.
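
To make that "at least one" behaviour concrete, here is a minimal sketch in Python (standard library only, not an XPath 2.0 engine) that mimics how a general comparison such as < treats a sequence on its right-hand side. The helper name general_less_than is made up for this illustration, and the data is the second example from the question.

```python
import xml.etree.ElementTree as ET

XML = """
<node id="0" begin="2">
    <node id="1" begin="2">
        <node id="2" begin="2"/>
        <node id="3" begin="3"/>
    </node>
    <node id="4" begin="5">
        <node id="5" begin="5"/>
    </node>
    <node id="6" begin="6"/>
</node>
"""


def general_less_than(value, sequence):
    """Hypothetical helper: true if value is less than ANY item in sequence,
    which is how an XPath general comparison behaves with a sequence operand."""
    return any(value < other for other in sequence)


root = ET.fromstring(XML)
children = root.findall("node")  # the nodes selected by /node/node

# Rough equivalent of ../node/number(@begin): one number per sibling (self included).
begins = [float(child.get("begin")) for child in children]

# Rough equivalent of /node/node[number(@begin) < ../node/number(@begin)]
selected = [
    child.get("id")
    for child in children
    if general_less_than(float(child.get("begin")), begins)
]
print(selected)  # ['1', '4']
```

Nodes 1 and 4 are selected simply because some sibling has a larger begin value, so this predicate says nothing about whether the siblings are in ascending order.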

You can apply the same fix on the suggested XQuery, but apparently there was another similar problem due to lt being used to compare a value against a sequence of values (to be clear, I'm referring to the 2nd lt in the suggested XQuery). You can try the following, slightly modified, XQuery instead :

for $node in node[
    every $n in node[position() lt last()] 
    satisfies not($n/following-sibling::node[number(@begin) lt number($n/@begin)])
]
return $node

"One way to go about is to use a numeric comparison of the begin attribute that is available in the corpus. It is numerical ascending, so if we want to ensure the order of XPath is intact, we can say that the numeric value of each child node of @cat="np" should be less than the next by using number()."

If I understand this correctly, you can use the following XPath :

/node/node[
    not(
        node[position() < last()]
            [number(@begin) > following-sibling::node/number(@begin)]
    )
]

The XPath should return all 2nd level node elements where, for every child node except the last within the current 2nd level node, none of the following-sibling nodes has a numerically lower begin attribute value than that of the current child node.

Given the following sample XML :

<node id="0" begin="2">
    <node id="0" begin="1" cat="np">
        <node id="1" begin="1" pt="art" text="the" />
        <node id="2" begin="3" pt="n" text="time" />
        <node id="3" begin="2" pt="adj" text="available" />
    </node>
    <node id="0" begin="1" cat="np">
        <node id="1" begin="1" pt="art" text="the" />
        <node id="2" begin="2" pt="adj" text="concerned" />
        <node id="3" begin="3" pt="n" text="man" />
    </node>
</node>

Only the 2nd node would be selected, for it is the only 2nd level node whose child nodes have begin attribute values in ascending order :

<node id="0" begin="1" cat="np">
   <node id="1" begin="1" pt="art" text="the"/>
   <node id="2" begin="2" pt="adj" text="concerned"/>
   <node id="3" begin="3" pt="n" text="man"/>
</node>
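
The same check can be sketched outside XPath to see why only the second node qualifies. The following is an illustrative Python emulation (standard library only) of the not(...) predicate above, not a replacement for it; the function name children_in_order is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

XML = """
<node id="0" begin="2">
    <node id="0" begin="1" cat="np">
        <node id="1" begin="1" pt="art" text="the" />
        <node id="2" begin="3" pt="n" text="time" />
        <node id="3" begin="2" pt="adj" text="available" />
    </node>
    <node id="0" begin="1" cat="np">
        <node id="1" begin="1" pt="art" text="the" />
        <node id="2" begin="2" pt="adj" text="concerned" />
        <node id="3" begin="3" pt="n" text="man" />
    </node>
</node>
"""


def children_in_order(parent):
    """Hypothetical helper: true if no child has a begin value greater than
    the begin value of any of its following siblings."""
    begins = [float(child.get("begin")) for child in parent.findall("node")]
    return not any(
        begins[i] > later
        for i in range(len(begins))
        for later in begins[i + 1:]
    )


root = ET.fromstring(XML)

# Rough equivalent of /node/node[not(node[...][number(@begin) > following-sibling::node/number(@begin)])]
matches = [phrase for phrase in root.findall("node") if children_in_order(phrase)]
words = [[child.get("text") for child in phrase.findall("node")] for phrase in matches]
print(words)  # [['the', 'concerned', 'man']]
```

The first phrase is rejected because its second child (begin 3) is followed by a sibling with begin 2.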

Update April 19th, 2017 :

"...However, I would like this cat="np" to match, and make the not() function less aggressive, i.e. only require that nodes specified in XPath (in this example rel="det" pt="vnw" lemma="die", and the two rel="mod" pt="adj" nodes) follow the order requirement where the begin attribute should be smaller than the next item of the XPath structure."

Then we need to add another predicate to specify those nodes within the not(), that is where we check the attribute order requirement :

node[(@rel="det" and @pt="vnw" and @lemma="die") or (@rel="mod" and @pt="adj")]
    [position() < last()]
    [number(@begin) > 
         following-sibling::node[(@rel="det" and @pt="vnw" and @lemma="die") or (@rel="mod" and @pt="adj")]/number(@begin)
    ]

So the complete expression would be as follows :

//node[@cat="np" and 
    not(node[(@rel="det" and @pt="vnw" and @lemma="die") or (@rel="mod" and @pt="adj")]
            [position() < last()]
            [number(@begin) > 
                 following-sibling::node[
                    (@rel="det" and @pt="vnw" and @lemma="die") or (@rel="mod" and @pt="adj")
                 ]/number(@begin)
            ]
    ) 
    and node[@rel="det" and @pt="vnw" and @lemma="die"] 
    and count(node[@rel="mod" and @pt="adj"]) > 1
]
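
As a rough illustration of the difference between checking all children and checking only the specified ones, here is a Python sketch (standard library only) run against the XML from the question's update. The helper names is_det, is_mod, in_order and matches, and the restrict flag, are made up for this example; the logic mirrors the conditions of the two XPath expressions but is only an emulation.

```python
import xml.etree.ElementTree as ET

XML = """
<node begin="4" cat="np" id="8" rel="obj1">
    <node begin="4" id="9" pos="det" pt="vnw" rel="det" word="die" lemma="die" />
    <node begin="5" id="10" pos="adj" pt="adj" rel="mod" word="veelzijdige" />
    <node begin="6" id="11" pos="adj" pt="adj" rel="mod" word="getalenteerde" />
    <node begin="7" id="12" pos="noun" pt="n" rel="hd" word="figuren" />
    <node begin="8" id="31" index="1" rel="obj1" />
    <node begin="2" id="32" index="2" rel="obj2" />
</node>
"""


def is_det(n):
    # Mirrors node[@rel="det" and @pt="vnw" and @lemma="die"]
    return n.get("rel") == "det" and n.get("pt") == "vnw" and n.get("lemma") == "die"


def is_mod(n):
    # Mirrors node[@rel="mod" and @pt="adj"]
    return n.get("rel") == "mod" and n.get("pt") == "adj"


def in_order(nodes):
    """True if no node has a begin value greater than that of a later node."""
    begins = [float(n.get("begin")) for n in nodes]
    return not any(
        begins[i] > later
        for i in range(len(begins))
        for later in begins[i + 1:]
    )


def matches(phrase, restrict):
    """Hypothetical emulation: restrict=True checks order only on the specified nodes."""
    children = phrase.findall("node")
    checked = [n for n in children if is_det(n) or is_mod(n)] if restrict else children
    return (
        phrase.get("cat") == "np"
        and in_order(checked)
        and any(is_det(n) for n in children)
        and sum(is_mod(n) for n in children) > 1
    )


phrase = ET.fromstring(XML)
print(matches(phrase, restrict=False))  # False: id=31 (begin 8) precedes id=32 (begin 2)
print(matches(phrase, restrict=True))   # True: only det and mod nodes are checked (4, 5, 6)
```

With the extra predicate, the out-of-order pair id=31 and id=32 is never inspected, which is why the cat="np" node matches.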

Christoper answered 25/3, 2017 at 6:25 Comment(7)
It's not clear without sample XML + expected output, but, if I understand correctly, you can just apply the same predicate on .//node instead of just node, like : /node/node[ not( .//node[position() < last()] [number(@begin) > following-sibling::node/number(@begin)] ) ] - Christoper
By using .//node, the XPath checks the order of all descendant nodes instead of checking only the child nodes : xpatheval.apphb.com/r0xKVJ1V8 - Christoper
The expression in your comment evaluates to true if at least one of the nodes satisfies the predicate, but the requirement is that all nodes should satisfy the condition. That's why we checked for the opposite condition, so that we can tell whether any of the descendant nodes didn't satisfy the condition and conclude whether the ancestor node is 'ordered' based on that. - Christoper
Hi there! I've added a large update to my post, following the discovery that your XPath code makes sure that everything in the matching XML structure needs to follow this fixed word order. However, that is too aggressive an approach for what I want: I want only the specified nodes in XPath to follow this requirement. Please see my edited post! - Presley
The only reason the 2nd XPath in the 'Update April 19th, 2017' section didn't match the XML that followed it is the 3rd condition: node[@rel="det" and @pt="vnw" and @lemma="die"]. There is no lemma attribute in the XML, so I would be very confused if you really expect the XPath to match that XML (am I missing something?). If the last attribute check was changed to @word="die" it would've matched the XML : xpatheval.apphb.com/u94vAXG0x - Christoper
Seems like the new problem has nothing to do with the not() condition... - Christoper
You're right, I posted the wrong XML. However, the issue persists with the correct XML. Please see the edited example, which now has the correct XML. - Presley

The part of your question that I think I understand is this:

Let's say that I want to match XML where each direct child of the root has an attribute begin that is smaller than the next sibling.

<node id="0" begin="2">
    <node id="1" begin="2">
        <node id="2" begin="2"/>
        <node id="3" begin="3"/>
    </node>
    <node id="4" begin="5">
        <node id="5" begin="5"/>
    </node>
    <node id="6" begin="6"/>
</node>

This XPath should match:

/node/node[number(@begin) < number(../node/@begin)]

Now, it's fairly clear why that gives you an error. Within the predicate, .. selects the node with id=0, this has three child nodes (with ids 1, 4, and 6), and each of these has a @begin attribute, so number(../node/@begin) is selecting a sequence of three attributes.
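
For readers who want to see that sequence of three attributes, here is a small Python sketch (standard library only, not an XPath engine) that reproduces what ../node/@begin selects from the point of view of a child of the root. The helper strict_number is a made-up stand-in for the "at most one item" rule of number(), not a real API.

```python
import xml.etree.ElementTree as ET

XML = """
<node id="0" begin="2">
    <node id="1" begin="2">
        <node id="2" begin="2"/>
        <node id="3" begin="3"/>
    </node>
    <node id="4" begin="5">
        <node id="5" begin="5"/>
    </node>
    <node id="6" begin="6"/>
</node>
"""

root = ET.fromstring(XML)

# Rough equivalent of ../node/@begin seen from any child of the root:
# the begin attribute of EVERY child of the root (ids 1, 4 and 6).
sibling_begins = [child.get("begin") for child in root.findall("node")]
print(sibling_begins)  # ['2', '5', '6']


def strict_number(values):
    """Hypothetical stand-in for number(): accepts at most one item."""
    if len(values) > 1:
        raise TypeError(f"A sequence of more than one item is not allowed: {values}")
    return float(values[0]) if values else float("nan")


try:
    strict_number(sibling_begins)
except TypeError as err:
    print(err)  # reports the three-item sequence
```

Passing a single attribute at a time avoids the error; in XPath terms, that means narrowing the path so that it selects exactly one node before number() is applied.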

Your query doesn't seem in any way related to the prose requirement, namely

where each direct child of the root has an attribute begin that is smaller than the next sibling

The condition for that would be

node[every $n in node[position() lt last()] satisfies (number($n/@begin) lt number($n/following-sibling::node/@begin))]

Immune answered 8/3, 2017 at 16:49 Comment(4)
I'm afraid you're now moving into the part of the question that I didn't understand at all. But hopefully, by explaining why it fails in the simple case, you can extrapolate why it fails in a more complex case. - Immune
Sorry, I don't understand the question (sorry if I'm impatient, but I self-impose a limit of about 5 minutes for working on it). For example, around the middle there are four numbered examples labelled "matches - good", "matches - bad", "matches - good" and "does not match - good", and I have no idea what "good" and "bad" are supposed to mean. - Immune
@MichaelKay "good" = expected results that work and "bad" = unexpected results that mess things up ... at least that is what I got out of it - Haemorrhage
I have re-written the whole thing. I hope it makes things clearer. I also tried your last solution, but the same issue seems to happen. I have posted the input, data, and error at the end of the post. @TonyAbrams Indeed. However, perhaps it wasn't clear, so I removed it. I hope things are clearer now. - Presley

In terms of your recursive search request:

Using //node[@pt=("art", "adj", "n")]/ancestor::* (the sequence syntax requires XPath 2.0 or later) searches from the inner levels of your XML tree. In your sample XML this returns, for each matching node, every ancestor element up to the top level.

For more info: http://www.w3.org/TR/xpath-30/

Trakas answered 31/3, 2017 at 3:18 Comment(0)
