How to save newlines in XML attribute?
A

4

64

I need to save content containing newlines in some XML attributes, not in text nodes. The method should be chosen so that I can decode it in XSLT 1.0/EXSLT/XSLT 2.0.

What is the best encoding method?

Please suggest/give some ideas.

Archival answered 5/1, 2010 at 5:45 Comment(3)
possible duplicate of Are line breaks in XML attribute values valid?Hallway
made an example for a similar question: https://mcmap.net/q/205313/-preserving-attribute-whitespaceBanking
related: stackoverflow.com/questions/260436 - related: stackoverflow.com/questions/449627 - related: stackoverflow.com/questions/1289524Banking
S
80

With a compliant DOM API there is nothing you need to do. Simply save actual newline characters to the attribute; the API will encode them correctly on its own (see the Canonical XML spec, section 5.2).

If you do your own encoding (i.e. replacing \n with &#xA; before saving the attribute value), the API will encode your input again, resulting in &amp;#xA; in the XML file.

Bottom line is, the string value is saved verbatim. You get out what you put in, no need to interfere.
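To make this concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree as one example of a serializer that escapes newlines in attribute values (behaviour of current Python 3 releases). The element name item and the helper names are made up for illustration; other APIs may differ in detail.

```python
import xml.etree.ElementTree as ET


def serialize_attribute(value: str) -> str:
    """Store `value` in an attribute and return the serialized XML text."""
    elem = ET.Element("item")  # hypothetical element name
    elem.set("attribute", value)
    return ET.tostring(elem, encoding="unicode")


def round_trip(value: str) -> str:
    """Serialize the attribute, parse the result, and read the value back."""
    return ET.fromstring(serialize_attribute(value)).get("attribute")


# Case 1: hand the API a real newline character and let it do the encoding.
raw = "line 1\nline 2"
raw_xml = serialize_attribute(raw)  # the newline is written as &#10;
print(raw_xml)
print(round_trip(raw) == raw)  # the newline survives the round trip

# Case 2: encode by hand first. The ampersand gets escaped again, so the
# file contains &amp;#xA; and the parsed value is the literal text "&#xA;".
pre_encoded = "line 1&#xA;line 2"
pre_xml = serialize_attribute(pre_encoded)
print(pre_xml)
print(repr(round_trip(pre_encoded)))
```

In the first case you get out what you put in. In the second case you also get out what you put in, which is the six characters &#xA; rather than a newline.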

However… some implementations are not compliant. For example, they will encode & characters in attribute values, but forget about newline characters or tabs. This puts you in a losing position since you can't simply replace newlines with &#xA; beforehand.

These implementations will save newline characters unencoded, like this:

<xml attribute="line 1
line 2" />

Upon parsing such a document, each literal newline in an attribute is normalized into a space (again, in accordance with the spec), and thus the newlines are lost.

Saving (and retaining!) newlines in attributes is impossible in these implementations.
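The loss on parsing is easy to reproduce. A small sketch, again assuming Python's standard xml.etree.ElementTree as the parser and a made-up element name item:

```python
import xml.etree.ElementTree as ET

# A document in which the newline was written into the attribute unencoded.
unencoded = '<item attribute="line 1\nline 2" />'
parsed_value = ET.fromstring(unencoded).get("attribute")
print(repr(parsed_value))  # 'line 1 line 2': the newline became a space

# Literal tabs are normalized the same way.
tab_value = ET.fromstring('<item attribute="a\tb" />').get("attribute")
print(repr(tab_value))  # 'a b'
```

Once the value has been parsed there is no way to tell where the newline used to be.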

Skater answered 6/1, 2010 at 10:40 Comment(20)
Something I ran into: XML uses Unix-style newlines (LF). So if you want to store Windows-style newlines (CR+LF), you'll either need to convert the newlines after reading from your attribute, or escape the newlines somehow. Source: w3schools.com/xml/xml_syntax.aspParticularize
@Joe: Where do you take the info from that XML uses Unix-style newlines? As far as I can see, the spec does not restrict that.Skater
@Skater Scroll down to the bottom of that link. Look for the heading "XML Stores New Line as LF". I noticed this in practice too--both the XmlWriter in C# and in a 3rd party component strips out the CR characters (leaving just LFs, like Unix).Particularize
@Joe: Sorry, I don't give w3schools a lot of credibility. If it was in the spec, that would be a different matter.Skater
@Tomalak: Hmm, ok that's fair then. I saw the effects before I even looked it up. Here it is from the spec: w3.org/TR/xml/#sec-line-ends -- quoted "To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character."Particularize
@Joe: Ah, I see. Thanks for pointing this out. However, that's a slightly different issue. An attribute like a="&#xD;&#xA;" will not be affected by this rule - it does not contain actual CR or LF characters, only their references. After parsing, a CRLF sequence will be in the attribute value. And if you save a CRLF to an attribute value it should be serialized as &#xD;&#xA; again, unless I'm misinterpreting it.Skater
@Tomalak: That's what was interesting. When we used the 3rd party component (this was our first attempt to keep CRLF), it actually did remove the &#xD; entity. I couldn't tell you whether that's part of the spec or an extra step taken though.Particularize
@Skater The framework (System.Xml) implementation is not compliant. A possible fix is { var a = elem.Attributes[0]; a.InnerText = s; a.InnerXml = a.InnerXml.Replace("\r\n", "&#10;"); }, though it'll cost time, and gets a little more complicated if you need to handle any non-windows newlines.Morula
The .NET Framework's XmlWriter can be made to behave correctly and (reasonably) sensibly using the NewLineHandling property (by setting it to Entitize). Unfortunately, preservation of newlines is impossible in the XML DOM as implemented in Firefox - a 2002 bug - while Chrome's implementation does the right thing.Alkaline
@Skater could you please help me with this question #39039916?Saltatory
@Sandeepan No, sorry. Bash escaping issues are not my strong suit. I strongly recommend not trying to embed XML into bash scripts. Use actual XML files and XML-aware tools (xmlstarlet, xsltproc, maybe even xmlsh).Skater
@Skater no issues :) However, mine is an XML script which has bash embedded, i.e. opposite of what you are saying.Saltatory
@Sandeepan Ah, all right. You can't put actual newline characters into an attribute value, as my answer above explains. (Note that the first sentence talks about an API. Editing an XML file with a text editor is not the same thing.) My recommendation would be to write multi-line values into an element instead.Skater
@Skater I am not sure what you mean by writing multi-line values. I checked #450127 and it seems having line breaks in XML attribute values is valid.Saltatory
@Sandeepan Of course they are. Just not as literal characters. Just like < is valid in an attribute. Just not as a literal character. Read my answer again. It's all in there, really. :)Skater
Looks like the Java XMLStreamWriter (at least the internal com.sun.xml one) is in the category of "impossible to do": #8331864Keg
@Skater what about w3.org/TR/xml/#AVNormalize? My read of that section says that any implementation that preserves newlines in attributes is non-compliant.Jigaboo
@DanAlbert That's what I say as well. You can't output >literal< newlines to the serialized representation of an attribute (i.e. "to the XML code") in a compliant implementation. But you can output >representations< of newlines (i.e. character entities such as &#xA;). Preserving the newline is fine. Preserving it unencoded is not.Skater
Ah, I misunderstood your answer to mean that only happened in non-compliant implementations. Thanks for clarifying!Jigaboo
@DanAlbert Of course you technically can put actual newlines into attribute values when serializing the document to XML (or when editing XML by hand), but when the document is parsed later, literal newlines found in an attribute value will be normalized into spaces, and you will lose them (see similar thread). Some implementations make that mistake - Python's own ElementTree for example did it for the longest time. I had filed a bug in 2009, and they've fixed it eventually.Skater
M
49

You can use the character reference &#10; to represent a newline in an XML attribute. &#13; can be used to represent a carriage return. A Windows-style CRLF can be represented as &#13;&#10;.

This is legal XML syntax. See XML spec for more details.
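A quick way to check this is to parse such an attribute and look at what comes back. The sketch below assumes Python's standard xml.etree.ElementTree; the element and attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Line breaks written as character references inside attribute values.
doc = '<item lf="a&#10;b" crlf="a&#13;&#10;b" />'
elem = ET.fromstring(doc)

lf_value = elem.get("lf")
crlf_value = elem.get("crlf")

print(repr(lf_value))    # 'a\nb'
print(repr(crlf_value))  # 'a\r\nb'
```

The parser turns the references back into real LF and CR characters, so the calling code never sees the encoded form.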

Mantra answered 5/1, 2010 at 5:48 Comment(7)
Is it a valid XML Character??Interrex
I guess I have to use some encoding instead of an entity, as getAttribute won't work with a string containing a newline. Do you have any idea? Will an entity solve the getAttribute problem?Archival
@Chathuranga Chandrasekara: Yes. It's valid XML. I updated my answer to include a link to the XML spec where these symbols are mentioned.Mantra
@Tommy: What programming language/API are you using? What is this getAttribute() method you speak of?Mantra
@Asaph: JavaScript. Client side: JavaScript. Server side: PHP (XSLT 1.0/EXSLT), Tomcat (XSLT 2.0, Saxon 8).Archival
@Tommy: Are you sure getAttribute won't decode &#10; and convert it to a newline? It should work. Did you test it?Mantra
@Mantra could you please help me with this question #39039916?Saltatory
P
0

A crude answer can be:

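using System.IO;
using System.Xml;

XmlDocument xDoc = new XmlDocument();
xDoc.Load(@"Agenda.xml");
// do stuff with the XML
// set attribute values containing "\r\n" (you need both characters to make a new line)
// Replace the encoded line breaks with literal ones and put each element on its own line.
// Note: a parser will normalize these literal newlines to spaces when the file is read again.
string a = xDoc.InnerXml.Replace("&#xD;", "\r").Replace("&#xA;", "\n").Replace("><", ">\r\n<");
// Write the string back to the file; the using block flushes and disposes the writer.
using (StreamWriter sDoc = new StreamWriter(@"Agenda.xml"))
{
    sDoc.Write(a);
}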

As you can see, this simply treats the document as a string.

Prepuce answered 3/11, 2011 at 10:55 Comment(0)
P
0

A slightly different approach that has been helpful in some situations:

Placeholders and Find & Replace.

Before parsing, replace the line breaks with your own custom line break marker/placeholder. After parsing, string-replace the marker with whatever line break character is effective, whether that's \n, &#10;, \u2028, or any of the various other line break characters out there.

This is useful when parsers like jQuery's $.parseXML() strip the unencoded line breaks. For example, you could use {LBREAK} as your line break marker, insert it while the data is still raw text, and replace it after the text has been parsed into an XML object. String.prototype.replaceAll() is helpful here.

So here is a rough code concept with jQuery and replaceAll (I have not tested this code, but it shows the concept):

function onXMLHandleLineBreaks(_result) {
    var lineBreaksThatGetLost = ['&#10;', '&#xD;']; // adjust to your data
    var lineBreakCharacterThatWorks = '\n';         // what to put back after parsing
    var marker = '{mylinebreakmarker}';
    var rawXMLText = String(_result); // hold as text only until line breaks are ready
    lineBreaksThatGetLost.forEach(function (lineBreak) {
        rawXMLText = rawXMLText.replaceAll(lineBreak, marker); // placemark the line breaks
    });
    var xmlObj = $.parseXML(rawXMLText); // to XML document
    $(xmlObj).find('*').each(function () { // add the line breaks back into every attribute value
        $.each(this.attributes, function () {
            this.value = this.value.replaceAll(marker, lineBreakCharacterThatWorks);
        });
    });
    console.log('xml with linebreaks that work: ' + new XMLSerializer().serializeToString(xmlObj));
    return xmlObj;
}

And of course you could adjust which line break characters work or don't work for your data, and you could loop over a whole set of line break characters that don't work and replace each of them in turn.

Perlaperle answered 18/1, 2019 at 20:54 Comment(0)
