XML to CSV conversion using XQuery

I have an XML file and I need to convert it into CSV using XQuery. Consider a simple XML structure:

books[book]
book[@isbn, title, description]

eg:

<books>
    <book isbn="1590593049">
        <title>Extending Flash MX 2004</title>
        <description>
        Using javascript alongwith actionscript 3.0 and mxml.</description>
    </book>
    <book isbn="0132149184">
        <title>Java Software Solutions</title>
        <description>
            Complete book full of case studies on business solutions and design concepts while building mission critical
            business applications.
        </description>
    </book>

How can I convert it to CSV format using XQuery? The CSV will be opened in Microsoft Excel, so it should be delimited by the comma (,) character, and special characters should be escaped.

Profile answered 30/12, 2012 at 6:46 Comment(0)

A pure XPath 2.0 expression:

for $b in /*/book
    return
      concat(escape-html-uri(string-join(($b/@isbn,
                                          $b/title,
                                          $b/description
                                          )
                                           /normalize-space(),
                                        ",")
                             ),
             codepoints-to-string(10))

XSLT 2.0-based verification:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

 <xsl:template match="/">
  <xsl:sequence select=
   "for $b in /*/book
       return
         concat(escape-html-uri(string-join(($b/@isbn,
                                             $b/title,
                                             $b/description
                                             )
                                              /normalize-space(),
                                           ',')
                                ),
                codepoints-to-string(10))"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the provided XML document (corrected to be well-formed):

<books>
    <book isbn="1590593049">
        <title>Extending Flash MX 2004</title>
        <description>
        Using javascript alongwith actionscript 3.0 and mxml.</description>
    </book>
    <book isbn="0132149184">
        <title>Java Software Solutions</title>
        <description>
            Complete book full of case studies on business solutions and design concepts while building mission critical
            business applications.
        </description>
    </book>
</books>

the wanted, correct result is produced:

1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.
 0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.

Update:

In a comment the OP has requested that any in-text comma be surrounded by quotes and that (after that) any quote be replaced by two quotes, and, finally, that if the whole result contains a quote, it must be surrounded by quotes (one on each side).

Here is a pure XPath 2.0 expression that produces this:

for $b in /*/book,
    $q in codepoints-to-string(34),
    $NL in codepoints-to-string(10),
    $isbn in normalize-space(replace($b/@isbn, ',', concat($q,',',$q))),
    $t in normalize-space(replace($b/title, ',', concat($q,',',$q))),
    $d in normalize-space(replace($b/description, ',', concat($q,',',$q))),
    $res in
     escape-html-uri(string-join(($isbn,$t,$d), ',')),
    $res2 in replace($res, $q, concat($q,$q))
   return
    if(contains($res2, $q))
       then concat($q, $res2, $q, $NL)
       else concat($res2, $NL)

When this XPath expression is evaluated against this (extended with a new test-case) XML document:

<books>
    <book isbn="1590593049">
        <title>Extending Flash MX 2004</title>
        <description>
        Using javascript alongwith actionscript 3.0 and mxml.</description>
    </book>
    <book isbn="0132149184">
        <title>Java Software Solutions</title>
        <description>
            Complete book full of case studies on business solutions and design concepts while building mission critical
            business applications.
        </description>
    </book>
    <book isbn="XX1234567">
        <title>Quotes and comma</title>
        <description>
            Hello, World from "Ms-Excel"
        </description>
    </book>
</books>

the wanted, correct result is produced:

1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.
0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.
"XX1234567,Quotes and comma,Hello"","" World from ""Ms-Excel"""
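Note that the expression above quotes the whole record rather than each field, so a CSV reader such as Excel would likely treat the third line as a single cell. What follows is a minimal sketch of per-field quoting, still in XPath 2.0, assuming the convention described in the comments on this answer: a field is wrapped in double quotes only when it contains a comma or a quote, and any embedded quote is doubled. It is one possible variant, not the original author's expression, and it drops escape-html-uri() on the assumption that the text should be emitted unchanged.

```xquery
for $b in /*/book,
    $q in codepoints-to-string(34),    (: the double-quote character :)
    $NL in codepoints-to-string(10)    (: line feed :)
return
  concat(
    string-join(
      for $f in ($b/@isbn, $b/title, $b/description)
      return
        (: bind the whitespace-normalized field value to $v :)
        for $v in normalize-space($f)
        return
          if (contains($v, ',') or contains($v, $q))
          then concat($q, replace($v, $q, concat($q, $q)), $q)
          else $v,
      ','),
    $NL)
```

Against the three-book document above, the first two records come out unchanged and the third becomes:

XX1234567,Quotes and comma,"Hello, World from ""Ms-Excel"""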

Roybn answered 30/12, 2012 at 17:47 Comment(5)

but if it were pure XPath 2 it would not interpret the '&#xa;' as a line break – Motorcycle 30/12, 2012 at 19:30

@BeniBela, Your question isn't clear -- I updated my answer with an XSLT transformation that uses the same XPath expression. In case this wasn't a legal XPath expression, there would be an error, but the transformation works without any problem. – Roybn 30/12, 2012 at 19:52

Well, if you include it in XSLT it is no longer pure XPath. Then the XML parser replaces the &#xa;. If you just run the pure XPath without XSLT you get:

1590593049,Extending Flash MX 2004,Using javascript alongwith actionscript 3.0 and mxml.&#xa; 0132149184,Java Software Solutions,Complete book full of case studies on business solutions and design concepts while building mission critical business applications.&#xa;

in the example – Motorcycle 30/12, 2012 at 21:26

@BeniBela, You might be right. I wonder why XQuery processors recognize the character reference... Of course, to correct the noticed issue, one would simply replace &#xa; with codepoints-to-string(10) – Roybn 30/12, 2012 at 21:58

@DimitreNovatchev, the above snippet you posted works great in the normal case, but in CSV, fields are separated by a comma (,). If the text itself contains a comma, Excel will interpret that comma as a separator even though it is part of the text, for example part of the description cell or title cell. A typical approach is to escape commas by wrapping the field in double quotes (") and to escape double quotes by doubling them: Hello, World from "Ms-Excel" becomes "Hello, World from ""Ms-Excel""". How can I escape this way? Thanks for your great help. – Profile 31/12, 2012 at 7:15

Assuming your XML is in the variable $books, you could create a CSV file with each book node on a new line using this:

declare function local:my-replace($input) {
  for $i in $input
  return '"' || replace($i, '"', '""') || '"'
};
for $book in $books//book
return string-join(local:my-replace(($book/@isbn, $book/title, $book/description)), ",") || '&#xa;'

string-join concatenates the different strings, and the local function my-replace quotes and escapes each value in the sequence according to your specification.
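For completeness, here is a hedged sketch of the same idea as a standalone XQuery 3.0 main module. The file name books.xml and the function name local:quote are hypothetical, and the serialization option assumes a processor that supports XQuery 3.0 output declarations. Unlike the snippet above, it also applies normalize-space() so that the line breaks inside description do not end up in the CSV cells.

```xquery
xquery version "3.0";

declare namespace output = "http://www.w3.org/2010/xslt-xquery-serialization";

(: Emit plain text rather than XML; assumes XQuery 3.0 serialization support. :)
declare option output:method "text";

(: Hypothetical input location; point this at the actual document. :)
declare variable $books := doc("books.xml");

(: Wrap every value in double quotes, doubling any embedded quote. :)
declare function local:quote($values as item()*) as xs:string* {
  for $v in $values
  return '"' || replace(normalize-space($v), '"', '""') || '"'
};

(: One CSV record per book, records separated by a line feed. :)
string-join(
  for $book in $books//book
  return string-join(
    local:quote(($book/@isbn, $book/title, $book/description)), ","),
  "&#xa;"
)
```

Quoting every field unconditionally keeps the function simple and is still accepted by Excel; quoting only the fields that contain a comma or a quote is an equally valid choice.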

Whitebook answered 30/12, 2012 at 10:26 Comment(2)

the above snippet you posted works great in the normal case, but in CSV, fields are separated by a comma (,). If the text itself contains a comma, Excel will interpret that comma as a separator even though it is part of the text, for example part of the description cell or title cell. A typical approach is to escape commas by wrapping the field in double quotes (") and to escape double quotes by doubling them: Hello, World from "Ms-Excel" becomes "Hello, World from ""Ms-Excel""". How can I escape this way? Thanks for your great help. – Profile 31/12, 2012 at 7:16

I just edited my answer to reflect the specification you added. I just created a user-defined function my-replace, which adds a " at the beginning and the end of every element and replaces " with "". – Whitebook 1/1, 2013 at 11:38
