How can I remove duplicate nodes in XQuery?

I have an XML document I generate on the fly, and I need a function to eliminate any duplicate nodes from it.

My function looks like:

declare function local:start2() {
    let $data := local:scan_books()
    return <books>{$data}</books>
};

Sample output is:

<books>
  <book>
    <title>XML in 24 hours</title>
    <author>Some Guy</author>  
  </book>
  <book>
    <title>XML in 24 hours</title>
    <author>Some Guy</author>  
  </book>
</books>

I want just one entry for each book inside my books root tag, and there are other tags in there too, such as pamphlet, that also need duplicates removed. Any ideas?


Updated following comments. By unique nodes, I mean remove multiple occurrences of nodes that have the exact same content and structure.

Overprize answered 13/3, 2009 at 20:3 Comment(0)

A simpler and more direct one-liner XPath solution:

Just use the following XPath expression:

  /*/book
        [index-of(/*/book/title, 
                  title
                 )
                  [1]
        ]

When applied, for example, on the following XML document:

<books>
    <book>
        <title>XML in 24 hours</title>
        <author>Some Guy</author>
    </book>
    <book>
        <title>Food in Seattle</title>
        <author>Some Guy2</author>
    </book>
    <book>
        <title>XML in 24 hours</title>
        <author>Some Guy</author>
    </book>
    <book>
        <title>Food in Seattle</title>
        <author>Some Guy2</author>
    </book>
    <book>
        <title>How to solve XPath Problems</title>
        <author>Me</author>
    </book>
</books>

the above XPath expression correctly selects the following nodes:

<book>
    <title>XML in 24 hours</title>
    <author>Some Guy</author>
</book>
<book>
    <title>Food in Seattle</title>
    <author>Some Guy2</author>
</book>
<book>
    <title>How to solve XPath Problems</title>
    <author>Me</author>
</book>

The explanation is simple: for every book, select only the occurrence whose index among all books is the same as the first index of its title among all titles.
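
Spelled out as an equivalent FLWOR (a sketch; the one-liner works because a predicate that evaluates to a single number is compared against the context position):

for $b at $p in /*/book
where index-of(/*/book/title, $b/title)[1] = $p
return $b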

Machos answered 20/3, 2009 at 14:1 Comment(6)
Hey Dimitre, thanks for the answer; but if I understand correctly, it depends on all the elements having the same structure, which is built into the query - for example, it would treat two nodes as the same if they had the same title but different authors...Overprize
@Brabster It is not at all clear from your question how the test for inequality/uniqueness should be defined. If you define it, it will help you to find a simpler solutionMachos
This does not seem to work with XPath 1.0, can we get a working XPath 1.0 solution?Chloechloette
@Abarax: This question is tagged "xquery". XQuery is a superset of XPath 2.0; no one was asking for an XPath 1.0 answer. This operation is generally known as grouping, and in the general case it cannot be expressed with a single XPath expression; in the specific cases where such a single XPath 1.0 expression exists, it may be inefficient. This is why XSLT 1.0 is typically used for efficient grouping -- e.g. the Muenchian grouping method.Machos
@DimitreNovatchev no worries, makes sense. Would this same solution work for multiple fields that define uniqueness? i.e. title and author defines uniquenessChloechloette
@Abarax: A similar expression will work -- one needs to specify the concatenation of the fields.Machos
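
For example, a sketch of a variant keyed on both title and author, using "|" as an ad-hoc separator that is assumed not to occur in the data:

let $keys := for $x in /*/book
             return concat($x/title, '|', $x/author)
for $b at $p in /*/book
where index-of($keys, $keys[$p])[1] = $p
return $b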

You can use the built-in distinct-values() function...
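
A minimal sketch of one way to apply it here, assuming uniqueness is decided by the title alone (distinct-values() compares atomized values, not whole elements):

let $books := /books/book
for $title in distinct-values($books/title)
return $books[title = $title][1]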

Amann answered 11/3, 2010 at 5:44 Comment(1)
How can you use that?Gatha

A solution inspired by functional programming. It is extensible in that you can replace the "=" comparison with your own boolean local:compare($element1, $element2) function. The function below has worst-case quadratic complexity in the length of the list; you could get O(n log n) complexity by sorting the list beforehand and comparing each item only with its immediate successor (see the sketch after the code).

To the best of my knowledge, the fn:distinct-values function does not let you supply a custom comparison function (and there is no standard fn:distinct-elements that would).

declare function local:deduplicate($list) {
  if (fn:empty($list)) then ()
  else
    let $head := $list[1],
        $tail := $list[position() > 1]
    return
      (: if $head occurs again in $tail, drop it here and keep the later
         occurrence; otherwise emit it and deduplicate the rest :)
      if (fn:exists($tail[ . = $head ])) then local:deduplicate($tail)
      else ($head, local:deduplicate($tail))
};

let $list := (1,2,3,4,1,2,1) return local:deduplicate($list)
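
A sketch of the sort-first idea mentioned above, assuming the items can be ordered and compared by their atomized values (note that the result comes back in sorted order rather than input order):

declare function local:deduplicate-sorted($list) {
  let $sorted :=
    (for $item in $list
     order by $item
     return $item)
  for $item at $pos in $sorted
  where $pos = 1 or not($item = $sorted[$pos - 1])
  return $item
};

local:deduplicate-sorted((1,2,3,4,1,2,1))  (: yields 1, 2, 3, 4 :)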
Swish answered 1/7, 2010 at 9:31 Comment(1)
This solution seems to work. Could you please explain the line "fn:exists($tail[ . = $head ])"? I have modified this to be "$head = $tail" and it works.Chloechloette

I solved my problem by implementing a recursive uniqueness search function, based solely on the text content of my document for uniqueness matching.

declare function ssd:unique-elements($list, $rules, $unique) {
    (: $list holds the elements already seen (compared by their text content),
       $rules the elements still to process, and $unique the unique elements
       accumulated so far :)
    let $element := subsequence($rules, 1, 1)
    let $return :=
        if ($element) then
            if (index-of($list, $element) >= 1) then
                (: already seen: skip this element and process the rest :)
                ssd:unique-elements(insert-before($element, 1, $list), subsequence($rules, 2), $unique)
            else
                (: first occurrence: keep it and recurse on the rest :)
                <test>
                    <unique>{$element}</unique>
                    {ssd:unique-elements(insert-before($element, 1, $list), subsequence($rules, 2), insert-before($element, 1, $unique))/*}
                </test>
        else ()
    return $return
};

Called as follows:

declare function ssd:start2() {
    let $data := ()
    let $sift-this := 
       <test>
           <data>123</data>
           <data>456</data>
           <data>123</data>
           <data>456</data>
           <more-data>456</more-data>
       </test>
    return ssd:unique-elements($data, $sift-this/*, ())/*/*
};

ssd:start2()

output:

<?xml version="1.0" encoding="UTF-8"?>
<data>123</data>
<data>456</data>

I guess if you need slightly different equivalence matching, you can alter the matching in the algorithm accordingly. Should get you started at any rate.

Overprize answered 13/3, 2009 at 22:50 Comment(0)

What about fn:distinct-values?

Judaica answered 29/5, 2010 at 19:2 Comment(0)

You can use this functx function: functx:distinct-deep

No need to reinvent the wheel.
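
A minimal usage sketch; the module location below is an assumption, so point the import at wherever your copy of the FunctX library lives:

import module namespace functx = "http://www.functx.com"
  at "http://www.xqueryfunctions.com/xq/functx-1.0-nodoc-2007-01.xq";

<books>{ functx:distinct-deep(/books/book) }</books>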

Eton answered 27/11, 2014 at 3:8 Comment(0)

To remove duplicates I usually use a helper function. In your case it will look like this:

declare function local:remove-duplicates($items as item()*)
as item()*
{
  (: group by the atomized value of each item (requires the XQuery 3.0
     group by clause), then return the first occurrence from each group :)
  for $i in $items
  group by $i
  return $items[index-of($items, $i)[1]]
};

declare function local:start2() {
    let $data := local:scan_books()
    return <books>{local:remove-duplicates($data)}</books>
};
Ottie answered 13/4, 2017 at 9:56 Comment(0)