Most readable way in XPath to write "is value X a member of sequence S"?

Asked 14/3, 2013 at 19:29 Answered 22/6, 2013 at 18:16

XPath 2.0 has some new functions and syntax, relative to 1.0, that work with sequences. Some of these don't really add to what the language could already do in 1.0 (with node sets), but they make it easier to express the desired logic in ways that are more readable. This increases the chances of the programmer getting the code correct -- and keeping it that way. For example,

- empty(s) is equivalent to not(s), but its intent is much clearer when you want to test whether a sequence is empty.
  - Correction: the effective boolean value of a sequence is in general more complicated than that. E.g. empty((0)) != not((0)). This applies to exists(s) vs. s in a boolean context as well. However, there are domains of s where empty(s) is equivalent to not(s), so the two could be used interchangeably within those domains. But this goes to show that the use of empty() can make a non-trivial difference in making code easier to understand.
- Similarly, exists(s) is equivalent to boolean(s), which already existed in XPath 1.0 (or just s in a boolean context), but again is much clearer about the intent.
- Quantified expressions; e.g. "some $x in expression satisfies test($x)" would be equivalent to boolean(expression[test(.)]) (although the new syntax is more flexible, in that you don't need to worry about losing the context item because you have the variable to refer to it by).
- Similarly, "every $x in expression satisfies test($x)" would be equivalent to not(expression[not(test(.))]) but is more readable.
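To make the correction above concrete, here is a small illustration of where empty() and exists() diverge from the effective boolean value. Each line is a separate XPath 2.0 expression; the literal operands are chosen purely for illustration, and the results in the comments follow from the rule that a single numeric zero or zero-length string has an effective boolean value of false.

```xpath
(: A one-item sequence containing 0 is not empty, but its effective boolean value is false. :)
empty((0))       (: false :)
not((0))         (: true  :)
exists((0))      (: true  :)
boolean((0))     (: false :)

(: The same divergence occurs for a zero-length string. :)
exists((''))     (: true  :)
boolean((''))    (: false :)
```

For sequences of nodes the two always agree, since the effective boolean value of a sequence whose first item is a node is true.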

These functions and syntax were evidently added at no small cost, solely to serve the goal of writing XPath that is easier to map to how humans think. This implies, as experienced developers know, that understandable code is significantly superior to code that is difficult to understand.

Given all that ... what would be a clear and readable way to write an XPath test expression that asks

Does value X occur in sequence S?

Some ways to do it: (Note: I used X and S notation here to indicate the value and the sequence, but I don't mean to imply that these subexpressions are element name tests, nor that they are simple expressions. They could be complicated.)

1. X = S: This would be one of the most unreadable, since it requires the reader to
   - think about which of X and S are sequences vs. single values
   - understand general comparisons, whose semantics are not obvious from the syntax

   However, one advantage of this form is that it allows us to put the topic (X) before the comment ("is a member of S"), which, I think, helps in readability.
   See also CMS's good point about readability, when the syntax or names make the "cardinality" of X and S obvious.
2. index-of(S, X): This one is clear about what's intended as a value and what as a sequence (if you remember the order of arguments to index-of()). But it expresses more than we need to: it asks for the index, when all we really want to know is whether X occurs in S. This is somewhat misleading to the reader. An experienced developer will figure out what's intended, with some effort and with understanding of the context. But the more we rely on context to understand the intent of each line, the more understanding the code becomes a circular (spiral) and potentially Sisyphean task! Also, since index-of() is designed to return a list of all the indexes of occurrences of X, it could be more expensive than necessary: a smart processor, in order to evaluate X = S, wouldn't necessarily have to find all the contents of S, nor enumerate them in order; but for index-of(S, X), correct order would have to be determined, and all contents of S must be compared to X. One other drawback of using index-of() is that it's limited to using eq for comparison; you can't, for example, use it to ask whether a node is identical to any node in a given sequence.
   - Correction: This form, used as a conditional test, can result in a runtime error: Effective boolean value is not defined for a sequence of two or more items starting with a numeric value. (But at least we won't get wrong boolean values, since index-of() can't return a zero.) If S can have multiple instances of X, this is another good reason to prefer form 3 or 6.
3. exists(index-of(S, X)): makes the intent clearer, and would help the processor eliminate the performance penalty if the processor is smart enough.
4. some $m in S satisfies $m eq X: This one is very clear, and matches our intent exactly. It seems long-winded compared to 1, and that in itself can reduce readability. But maybe that's an acceptable price for clarity. Keep in mind that X and S could potentially be complex expressions themselves -- they're not necessarily just variable references. An advantage is that since the eq operator is explicit, you can replace it with is or any other comparison operator.
5. S[. eq X]: clearer than 1, but shares the semantic drawbacks of 2: it computes all members of S that are equal to X. Actually, this could return a false negative (incorrect effective boolean value), if X is falsy. E.g. (0, 1)[. eq 0] returns 0 which is falsy, even though 0 occurs in (0, 1).
6. exists(S[. eq X]): Clearer than 1, 2, 3, and 5. Not as clear as 4, but shorter. Avoids the drawbacks of 5 (or at least most of them, depending on the processor smarts).
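For side-by-side comparison, here are the six forms above written out with literal operands. This is only a sketch: 'b' stands in for X and ('a', 'b', 'c') for S, both chosen for illustration, and each line is a separate XPath 2.0 expression.

```xpath
(: Form 1: general comparison, true if any item of S equals X :)
'b' = ('a', 'b', 'c')                              (: true :)

(: Form 2: returns the positions of X in S, not a boolean :)
index-of(('a', 'b', 'c'), 'b')                     (: 2 :)

(: Form 3: wraps form 2 so the result is a boolean :)
exists(index-of(('a', 'b', 'c'), 'b'))             (: true :)

(: Form 4: quantified expression with an explicit value comparison :)
some $m in ('a', 'b', 'c') satisfies $m eq 'b'     (: true :)

(: Form 5: returns the matching items of S, not a boolean :)
('a', 'b', 'c')[. eq 'b']                          (: 'b' :)

(: Form 6: wraps form 5 so the result is a boolean :)
exists(('a', 'b', 'c')[. eq 'b'])                  (: true :)
```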

I'm kind of leaning toward the last one, at this point: exists(S[. eq X])
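The two pitfalls that push me toward the exists() wrappers can be reproduced with small literal sequences. The operands below are illustrative only, and the error code is what I'd expect a conforming processor to report for an undefined effective boolean value.

```xpath
(: Form 2 used as a test: index-of() returns (1, 3) here, and the effective
   boolean value of two or more items starting with a number is an error (FORG0006). :)
if (index-of((1, 2, 1), 1)) then 'found' else 'missing'

(: Form 3 avoids the error. :)
if (exists(index-of((1, 2, 1), 1))) then 'found' else 'missing'    (: 'found' :)

(: Form 5 used as a test: the predicate selects the single item 0,
   whose effective boolean value is false, so we get a false negative. :)
if ((0, 1)[. eq 0]) then 'found' else 'missing'                    (: 'missing' :)

(: Form 6 gives the intended answer. :)
if (exists((0, 1)[. eq 0])) then 'found' else 'missing'            (: 'found' :)
```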

What about you... As a developer coming to a complex, unfamiliar XSLT or XQuery or other program that uses XPath 2.0, and wanting to figure out what that program is doing, which would you find easiest to read?

Apologies for the long question. Thanks for reading this far.

Edit: I changed = to eq wherever possible in the above discussion, to make it easier to see where a "value comparison" (as opposed to a general comparison) was intended.

Wyrick answered 14/3, 2013 at 19:29 Comment(5)

Lest someone think I'm against the use of existential comparisons ... I'm not. I use them frequently, for short or routine tasks. But today I have a long XPath expression that isn't working the way I expect, and I can't figure out why. So I'm breaking everything down to make it as clear as possible. P.S. @DimitreNovatchev I'd like to hear your perspective on this. :-) – Wyrick 14/3, 2013 at 20:40

If possible, I would use the functx library for XQuery or XSLT; the code is then a fairly readable functx:is-node-in-sequence($X, $Y) – Tit 14/3, 2013 at 20:48

@evilotto: Thanks... I hadn't heard of that library. So we can use it in XSLT, even though it's written in XQuery? – Wyrick 14/3, 2013 at 21:35

There's an XSLT version also - xsltfunctions.com (both linked from functx.com) – Tit 15/3, 2013 at 1:16

@evil: I'd like to have this as an answer so that I can upvote it and so that others can easily find it. – Wyrick 15/3, 2013 at 18:12

For what it's worth, if names or context make clear that X is a singleton, I'm happy to use your first form, X = S -- for example when I want to check an attribute value against a set of possible values:

<xsl:when test="@type = ('A', 'A+', 'A-', 'B+')" />

<xsl:when test="@type = $magic-types"/>

If I think there is a risk of confusion, then I like your sixth formulation. The less frequently I have to remember the rules for calculating an effective boolean value, the less frequently I make a mistake with them.

Carthy answered 22/6, 2013 at 18:16 Comment(0)

I prefer this one:

count(distinct-values($seq)) eq count(distinct-values(($x, $seq)))

When $x is itself a sequence, this expression implements the (value-based) subset-of relation between two sets of values that are represented as sequences. This implementation of subset-of has just linear time complexity -- vs. many other ways of expressing this, which have O(N^2) time complexity.

To summarize, the question whether a single value belongs to a set of values is a special case of the question whether one set of values is a subset of another. If we have a good implementation of the latter, we can simply use it for answering the former.
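As a quick illustration of how the counting test behaves, here it is with literal sequences substituted for $seq and $x (the values are arbitrary examples). Adding $x to $seq leaves the number of distinct values unchanged exactly when every value in $x already occurs in $seq.

```xpath
(: Membership: 'b' occurs in the sequence, so both counts are 3. :)
count(distinct-values(('a', 'b', 'c')))
  eq count(distinct-values(('b', ('a', 'b', 'c'))))           (: true :)

(: Non-membership: 'z' adds a fourth distinct value, so 3 eq 4 fails. :)
count(distinct-values(('a', 'b', 'c')))
  eq count(distinct-values(('z', ('a', 'b', 'c'))))           (: false :)

(: Subset: every value of ('a', 'c') already occurs in the sequence. :)
count(distinct-values(('a', 'b', 'c')))
  eq count(distinct-values((('a', 'c'), ('a', 'b', 'c'))))    (: true :)
```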

Procrustes answered 15/3, 2013 at 4:13 Comment(8)

Thanks, Dimitre. I'm curious, why does distinct-values() have linear time complexity? I would expect it to be O(n^2). I guess if it's implemented with a hashtable... – Wyrick 15/3, 2013 at 18:10

@LarsH, Any hash-table based implementation gives us this efficient implementation. A binary search tree implementation is also efficient (O(N*log(N))), but not as efficient as the hashtable-based one. Can you notice a similarity with a well-known node-set subset test? :) – Procrustes 15/3, 2013 at 19:10

I suspect you're referring to the count($nodeset) = count($nodeset | $x) of XPath 1.0. :-) Yeah, that and the above work well, but I can't say I find them highly readable. For XPath 1.0 there was little choice, but I don't know why we'd want to use this in 2.0, other than nostalgia. I see your point about efficiency for the subset-of relation, but for the member-of relation there are more readable expressions that are also O(N). – Wyrick 15/3, 2013 at 19:44

@LarsH, In fact, for the "member-of" relation there is an O(1) implementation -- if the hasn-set corresponding to distinct-values($seq) has already been calculated. A good optimizer would do this. – Procrustes 15/3, 2013 at 19:58

I'm not familiar with hasn-set (or is it hash set?) Can you elaborate? – Wyrick 15/3, 2013 at 20:14

let us continue this discussion in chat – Wyrick 15/3, 2013 at 20:26

It is HashSet<T> and is a standard .NET class: msdn.microsoft.com/en-us/library/bb359438.aspx – Procrustes 15/3, 2013 at 21:4

Speaking about efficiency, very few people know that there is a very efficient implementation of substring -- and the same algorithm can be adapted for sub-sequence (not subset). – Procrustes 15/3, 2013 at 23:9

The functx library has a nice implementation of this function, so you can use

functx:is-node-in-sequence($X, $Y)

(this particular function can be found at http://www.xqueryfunctions.com/xq/functx_is-node-in-sequence.html)

The whole functx library is available for both XQuery (http://www.xqueryfunctions.com/) and XSLT (http://www.xsltfunctions.com/)

MarkLogic ships the functx library with their core product; other vendors may as well.
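For atomic values rather than nodes, functx also offers is-value-in-sequence(). A minimal sketch of a call, assuming the functx prefix is bound to the library's namespace (http://www.functx.com), the library has been imported into your XQuery or XSLT, and the argument order is value first, then sequence, as I understand its signature:

```xpath
(: Assumes the functx library is imported and the prefix is bound. :)
(: Asks whether the string 'bob' occurs in the sequence; this should be true. :)
functx:is-value-in-sequence('bob', ('alice', 'bob'))
```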

Tit answered 15/3, 2013 at 18:24 Comment(3)

evil otto, Lars wants to test the membership of any item -- not only a node -- to a sequence. The cited function doesn't do that -- it is limited to membership of a node to a sequence of nodes. – Procrustes 15/3, 2013 at 23:4

There is also functx:is-value-in-sequence() - the essence of my suggestion is that functx provides nicely-named versions of many such functions. – Tit 18/3, 2013 at 16:31

Yes, Lars was asking about the second function. It might be useful if you could edit the answer and replace the function with is-value-in-sequence(). Anyway, I see that the implementation is quite inefficient. – Procrustes 18/3, 2013 at 18:24

Another possibility, when you want to know whether node X occurs in sequence S, is

exists((X) intersect S)

I think that's pretty readable, and concise. But it only works when X and the values in S are nodes; if you try to ask

exists(('bob') intersect ('alice', 'bob'))

you'll get a runtime error. In the program I'm working on now, I need to compare strings, so this isn't an option.

As Dimitre notes, the occurrence of a node in a sequence is a question of identity, not of value comparison.
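To illustrate the identity vs. value distinction, suppose (purely as an assumption for this example) the context document is <list><item>a</item><item>a</item></list>. The two item elements have equal values but are distinct nodes:

```xpath
(: Value comparison: both items atomize to 'a'. :)
/list/item[1] = /list/item[2]                            (: true :)

(: Identity comparison: they are two different nodes. :)
/list/item[1] is /list/item[2]                           (: false :)

(: Identity-based membership: the first item is one of the nodes in /list/item. :)
exists(/list/item[1] intersect /list/item)               (: true :)

(: The same test written as a quantified expression with "is". :)
some $m in /list/item satisfies $m is /list/item[1]      (: true :)
```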

Wyrick answered 15/3, 2013 at 21:9 Comment(8)

Lars, There is an important difference: This uses identity-based equality -- not value-based equality. Both are useful, but quite different. – Procrustes 15/3, 2013 at 23:6

@Dimitre: good point; and isn't it true that identity relations in XPath apply only to nodes? I.e. there is no other data type for which identity-based equality has any meaning. You could ask whether strings $a and $b have the same type and the same string value, but cannot ask whether they are the same string object. – Wyrick 16/3, 2013 at 2:2

Lars, exactly. "value types" have value-based equality. "Reference types" have identity-based equality. – Procrustes 16/3, 2013 at 4:51

@DimitreNovatchev: I would have said that reference types (like nodes) can also have value-based equality, as in X = S. But maybe it's more accurate to say that if you take the value of a node (e.g. the implicit string(X) in X = S) that you can do value-based equality. – Wyrick 30/5, 2013 at 15:11

[I would have said that reference types (like nodes) can also have value-based equality]. Only if one views nodes as containers, and nothing more. Nodes that are "value-equal" in general aren't "identity-equal" and can have very different important properties -- such as parent, preceding and following nodes, document node, etc. – Procrustes 30/5, 2013 at 15:23

@DimitreNovatchev: Sure. But value-based comparison of nodes (viewing them merely as containers, for a given purpose) is a common and useful operation, as long as the user understands what is actually being compared. That's why XPath has had the powerful general comparison = operator for node-sets since 1.0. – Wyrick 30/5, 2013 at 15:27

Sure, but in case one is designing a general capability such as Saxon's saxon:memo-function="yes", then the only safe implementation is using identity-based equality for nodes. Of course, if an additional attribute is used saying: "node-equality='value-based'" then it would be possible to implement both types of equality. – Procrustes 30/5, 2013 at 15:42

@DimitreNovatchev: Agreed that if you're designing a general capability where the spec just says "the same arguments", then you're only safe if you interpret that as strictly as possible. – Wyrick 30/5, 2013 at 17:35
