Regular Expressions in MarkLogic's XQuery

I am trying to run an XQuery that uses fn:matches with a regular expression, but the MarkLogic implementation of XQuery does not seem to allow hexadecimal character representations. The following gives me an "Invalid regular expression" error.

(: Find text containing non-ISO-Latin characters :)
let $regex := '[^\x00-\xFF]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

However, this one does not give the error.

let $regex := '[^a-zA-Z0-9]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

Is there a way to use the hexadecimal character representation, or an alternative that would give me the same result, in MarkLogic's implementation of XQuery?

Robinson answered 1/5, 2015 at 18:48 Comment(4)
Can you try the following code and let us know if it runs without error: let $regex := '[^\x00\xFF]' If it runs, it means you have a problem with the range. If it doesn't run, then MarkLogic regex would appear not to accept hexadecimal matches. - Oyler
Thanks. It does indeed run: let $regex := '[^\x00-\xFF]' return $regex does not return an error. - Robinson
The problem is the hex characters in a range, then. Every regex engine has different escaping rules when you're using a character set (e.g., some engines require \[a-z\], others might need [\x{00}]). It'll be hard to test without an actual MarkLogic console in front of me. - Oyler
Can you use the [[:ascii:]] class in MarkLogic regex? In your first example, you are essentially trying to match any ASCII character. - Oyler

XQuery can use numeric character references in strings, in much the same way that XML and HTML can:

decimal: "&#10;" hex: "&#x0a;" (or just "&#xa;")

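However, you can't represent some characters this way: those below "&#x09;", for instance. That is why the range in the example below starts at "&#x09;" rather than "&#x00;".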

There's no regex type in XQuery (you just use a string as a regex), so you can use character references in your regular expressions:

fn:matches("a", "[^&#x09;-&#xFF;]")

(: => xs:boolean("false") :)
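
Applied to the query from the question, this could look like the following sketch. It reuses the collection and element names from the question; the namespace URI bound to myns is a placeholder, not something stated in the question, and the range starts at "&#x09;" because lower code points may be rejected as character references.

```xquery
(: Sketch: the question's query with XML character references instead of \x escapes. :)
(: The namespace URI below is a placeholder; substitute the real one. :)
declare namespace myns = "http://example.com/myns";

(: Matches any character outside the range U+0009 to U+00FF. :)
let $regex := "[^&#x09;-&#xFF;]"
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>
```

The character references are expanded when the string literal is parsed, so the regex engine sees the literal characters and needs no hexadecimal escape syntax of its own.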

Update: here's the XQuery 1.0 spec on character references: http://www.w3.org/TR/xquery/#dt-character-reference.

Based on some brief testing, I think MarkLogic enforces XML 1.1 character reference rules: http://www.w3.org/TR/xml11/#charsets

For posterity, here are the XML 1.0 rules: http://www.w3.org/TR/REC-xml/#charsets
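
If the character-reference restrictions get in the way, one possible alternative (not tested against MarkLogic here) is to skip the regex and compare code points directly with the standard fn:string-to-codepoints function. As before, the namespace URI is a placeholder, and the threshold of 255 is an assumption that mirrors the upper bound \xFF from the question.

```xquery
(: Sketch: select elements whose text contains any code point above U+00FF. :)
(: The namespace URI below is a placeholder; substitute the real one. :)
declare namespace myns = "http://example.com/myns";

let $results :=
    fn:collection('mydocs')//myns:myelem[
        (: true if at least one character lies above the Latin-1 range :)
        some $cp in fn:string-to-codepoints(fn:string(.)) satisfies $cp gt 255
    ]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>
```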

Bowler answered 1/5, 2015 at 19:44 Comment(0)

Well, it seems MarkLogic's implementation of XQuery wants Unicode. As it turned out, even very small ranges in hex (e.g., [^x00-x0F]) threw the "Invalid regular expression" error, but Unicode notation did not. The following gives me results.

let $regex := '[^U0000-U00FF]'
let $results := fn:collection('mydocs')//myns:myelem[fn:matches(., $regex)]
let $count := fn:count($results)

return
    <figures count="{$count}">
        { $results }
    </figures>

I think that the mere assignment of let $regex := '[^\x00-\xFF]' did not throw the error because the value was treated as a plain string, and never compiled as a regex, when I tried return $regex.

Robinson answered 1/5, 2015 at 20:3 Comment(1)
That regex is not matching Unicode characters by hexadecimal code point; it's matching anything but U00, 0-U, and 00FF (i.e., those ranges are interpreted as literal characters). - Bowler
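
A quick way to check the point made in this comment is to run the regex against two plain ASCII strings. This is only a sketch based on standard XQuery regex semantics, where the class lists the literal characters U, 0 and F plus the range 0-U; the first call should return false and the second true, even though neither string contains a character outside Latin-1.

```xquery
(: Sketch: both test strings are plain ASCII, yet the results should differ. :)
let $regex := '[^U0000-U00FF]'
return (
    (: "A" falls inside the literal range 0-U, so the negated class rejects it :)
    fn:matches("A", $regex),
    (: "a" is outside every listed character and range, so the negated class accepts it :)
    fn:matches("a", $regex)
)
```

Because lowercase letters already satisfy the negated class, the query in this answer returns elements for reasons unrelated to non-Latin-1 content.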
