Validating XML with XSDs ... but still allow extensibility

Maybe it's me, but it appears that if you have an XSD

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="GivenName" />
                <xs:element name="SurName" />
            </xs:sequence>
            <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
        </xs:complexType>
    </xs:element>
</xs:schema>

that defines the schema for this document

<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
    <GivenName></GivenName>
    <SurName></SurName>
</User>

It would fail to validate if you added another element, say EmailAddress, and mixed up the order

<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
    <SurName></SurName>
    <EmailAddress></EmailAddress>
    <GivenName></GivenName>
</User>

I don't want to add EmailAddress to the document and have it be marked optional.

I just want an XSD that validates the bare minimum requirements that the document must meet.

Is there a way to do this?

EDIT:

marc_s pointed out below that you can use xs:any inside of xs:sequence to allow more elements. Unfortunately, you still have to maintain the order of elements.

Alternatively, I can use xs:all which doesn't enforce the order of elements, but alas, doesn't allow me to place xs:any inside of it.

Aldrin answered 27/7, 2010 at 20:46 Comment(1)
Some good answers and discussion, but I have to go with Abel's as it was so detailed and also explained WHY what I was looking for didn't exist.Aldrin

Your issue has a resolution, but it will not be pretty. Here's why:

Violation of the deterministic content model rule

You've touched on the very soul of W3C XML Schema. What you are asking for (variable order and variable unknown elements) violates the hardest, yet most basic principle of XSD, the rule of Non-Ambiguity, or, more formally, the Unique Particle Attribution Constraint:

A content model must be formed such that during validation [..] each item in the sequence can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

In normal English: when an XML document is validated and the XSD processor encounters <SurName>, it must be able to validate it without first checking whether it is followed by <GivenName>, i.e., no looking forward. In your scenario, this is not possible. This rule exists to allow implementations through Finite State Machines, which should make implementations rather trivial and fast.
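
To make the rule concrete, here is a minimal sketch of a content model that an XSD 1.0 processor should reject. It assumes a schema without a target namespace, as in the question. When the validator meets <SurName>, it cannot tell whether the element belongs to the optional wildcard or to the SurName declaration without looking ahead:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <xs:sequence>
                <!-- Optional wildcard: with the default namespace="##any" it also matches <SurName> -->
                <xs:any minOccurs="0" processContents="lax" />
                <!-- Ambiguous: an incoming <SurName> could be attributed to either particle -->
                <xs:element name="SurName" />
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>
```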

This is one of the most-debated issues and is a heritage of SGML and DTD (content models must be deterministic) and of XML, which defines, by default, that the order of elements is important (thus, trying the opposite, making the order unimportant, is hard).

As marc_s already suggested, Relax NG is an alternative that allows for non-deterministic content models. But what can you do if you're stuck with W3C XML Schema?

Non-working semi-valid solutions

You've already noticed that xs:all is very restrictive. The reason is simple: the same determinism rule applies, and that's why xs:any, min/maxOccurs larger than one, and sequences are not allowed inside it.
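
If your validator supports XSD 1.1, these restrictions are relaxed: xs:all may contain a wildcard, and an element declaration takes precedence over a wildcard that also matches. That is an assumption about your toolchain: the Microsoft processor mentioned below implements XSD 1.0, and 1.1 needs a processor such as Saxon-EE or an XSD 1.1 build of Xerces. A sketch under that assumption, reusing the schema from the question:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <!-- XSD 1.1 only: children may appear in any order -->
            <xs:all>
                <xs:element name="GivenName" />
                <xs:element name="SurName" />
                <!-- XSD 1.1 only: wildcard inside xs:all; notQName keeps it from
                     matching the sibling elements declared above -->
                <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax"
                        notQName="##definedSibling" />
            </xs:all>
            <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
        </xs:complexType>
    </xs:element>
</xs:schema>
```

An XSD 1.0 processor should reject this schema outright, so it only helps if every consumer of the schema can run a 1.1 validator.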

Also, you may have tried all sorts of combinations of choice, sequence and any. The error that the Microsoft XSD processor throws when encountering such an invalid situation is:

Error: Multiple definition of element 'http://example.com/Chad:SurName' causes the content model to become ambiguous. A content model must be formed such that during validation of an element information item sequence, the particle contained directly, indirectly or implicitly therein with which to attempt to validate each item in the sequence in turn can be uniquely determined without examining the content or attributes of that item, and without any information about the items in the remainder of the sequence.

In O'Reilly's XML Schema (yes, the book has its flaws) this is excellently explained. Fortunately, parts of the book are available online. I highly recommend you read through section 7.4.1.3 about the Unique Particle Attribution Rule; their explanations and examples are much clearer than I can ever get them.

One working solution

In most cases it is possible to go from a non-deterministic design to a deterministic design. This usually doesn't look pretty, but it's a solution if you have to stick with W3C XML Schema and/or if you absolutely must allow non-strict rules for your XML. The nightmare with your situation is that you want to enforce one thing (2 predefined elements) and at the same time want to have it very loose (order doesn't matter and anything can go between, before and after). If I don't try to give you good advice but just take you directly to a solution, it will look as follows:

<xs:element name="User">
    <xs:complexType>
        <xs:sequence>
            <xs:any minOccurs="0" processContents="lax" namespace="##other" />
            <xs:choice>
                <xs:sequence>                        
                    <xs:element name="GivenName" />
                    <xs:any minOccurs="0" processContents="lax" namespace="##other" />
                    <xs:element name="SurName" />
                </xs:sequence>
                <xs:sequence>
                    <xs:element name="SurName" />
                    <xs:any minOccurs="0" processContents="lax" namespace="##other" />
                    <xs:element name="GivenName" />
                </xs:sequence>
            </xs:choice>
            <xs:any minOccurs="0" processContents="lax" namespace="##any" />
        </xs:sequence>
        <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>
</xs:element>

The code above actually just works. But there are a few caveats. The first is xs:any with ##other as its namespace. You cannot use ##any, except for the last one, because that would allow elements like GivenName to be used in its place, and that means that the definition of User becomes ambiguous.
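
As an illustration, here is an instance that should validate against this content model. It assumes the fragment lives in a schema with targetNamespace="http://example.com/Chad" (inferred from the error message above) and elementFormDefault="qualified"; the ext namespace URI is hypothetical. Because ##other excludes the target namespace, the extra element has to come from a different namespace:

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User ID="1" xmlns="http://example.com/Chad" xmlns:ext="http://example.com/ext">
    <SurName></SurName>
    <!-- matched by the ##other wildcard between the two required elements -->
    <ext:EmailAddress></ext:EmailAddress>
    <GivenName></GivenName>
</User>
```

Each wildcard in the schema has the default maxOccurs="1", so every slot admits at most one extra element. Adding maxOccurs="unbounded" to the wildcards should lift that limit without making the model ambiguous, since ##other never overlaps with the declared elements.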

The second caveat is that if you want to use this trick with more than two or three elements, you'll have to write down all combinations. A maintenance nightmare. That's why I came up with the following:

A suggested solution, a variant of a Variable Content Container

Change your definition. This has the advantage of being clearer to your readers or users, and of being easier to maintain. A whole string of solutions is explained on XFront here, a less readable link you may have already seen from the post from Oleg. It's an excellent read, but most of it does not take into account that you have a minimum requirement of two elements inside the variable content container.

The current best-practice approach for your situation (which happens more often than you may imagine) is to split your data between the required and non-required fields. You can add an element <Required>, or do the opposite, add an element <ExtendedInfo> (or call it Properties, or OptionalData). This looks as follows:

<xs:element name="User2">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="GivenName" />
            <xs:element name="SurName" />
            <xs:element name="ExtendedInfo" minOccurs="0">
                <xs:complexType>
                    <xs:sequence>
                        <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" namespace="##any" />
                    </xs:sequence>
                </xs:complexType>
            </xs:element>
        </xs:sequence>
    </xs:complexType>
</xs:element>
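
A document matching this definition could look like the sketch below, assuming the fragment sits in a schema without a target namespace, like the one in the question. The User2 fragment declares no ID attribute, so the instance omits it; PhoneNumber and EmailAddress are made-up extension elements:

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User2>
    <GivenName></GivenName>
    <SurName></SurName>
    <ExtendedInfo>
        <!-- anything goes here, in any order; lax processing skips undeclared elements -->
        <PhoneNumber></PhoneNumber>
        <EmailAddress></EmailAddress>
    </ExtendedInfo>
</User2>
```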

This may seem less than ideal at the moment, but let it grow a bit. Having an ordered set of fixed elements isn't that big a deal. You're not the only one who'll be complaining about this apparent deficiency of W3C XML Schema, but as I said earlier, if you have to use it, you'll have to live with its limitations, or accept the burden of developing around these limitations at a higher cost of ownership.

Alternative solution

I'm sure you know this already, but the order of attributes is by default undetermined. If all your content is of simple types, you can alternatively choose to make a more abundant use of attributes.
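
A sketch of what that could look like for the User example. The xs:string types and the xs:anyAttribute wildcard are assumptions added for illustration; they are not part of the original schema:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <!-- the bare minimum: three required attributes, in any order -->
            <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
            <xs:attribute name="GivenName" type="xs:string" use="required" />
            <xs:attribute name="SurName" type="xs:string" use="required" />
            <!-- extensibility: any further attribute is accepted -->
            <xs:anyAttribute processContents="lax" />
        </xs:complexType>
    </xs:element>
</xs:schema>
```

An instance with the attributes shuffled and one extra attribute should then validate:

```xml
<User SurName="" EmailAddress="" ID="1" GivenName="" />
```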

A final word

Whatever approach you take, you will lose a lot of verifiability of your data. It's often better to allow content providers to add content types, but only when it can be verified. This you can do by switching from lax to strict processing and by making the types themselves stricter. But being too strict isn't good either; the right balance will depend on your ability to judge the use-cases that you're up against and weighing that against the trade-offs of certain implementation strategies.

Geochronology answered 5/8, 2010 at 0:38 Comment(3)
@John, you're welcome. Now I only hope it also helps Chad with his problem ;-)Geochronology
Wow...good answer. It definitely explains the reasoning behind it! The other idea I had was a combination XSLT/XSD, whereby I run the input XML through an XSLT that keeps only the elements I wish to validate (min requirements), then validate the result against the XSD. If it passes, allow the original XML through. But I don't think it will be performant enough.Aldrin
@Chad: if you consider using XSLT + XPath, consider switching from W3C XML Schema to Schematron. It's also an ISO-standard XML Schema Language and is made to work well with XSLT, XPath and a bit of regular expressions. Schematron works the other way around: it is rule based: you define rules for tree patterns as opposed to a grammar with XSD. If you have some experience with XSLT, it should be easy to adopt its only 6 (!) elements.Geochronology
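
For reference, a minimal Schematron sketch of the "bare minimum" rules from the question. It assumes ISO Schematron and a User document without a namespace; the assertion messages are illustrative:

```xml
<?xml version="1.0" encoding="utf-8"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
    <sch:pattern>
        <!-- rules apply to the root User element; order and extra children are ignored -->
        <sch:rule context="/User">
            <sch:assert test="@ID">User must have an ID attribute.</sch:assert>
            <sch:assert test="count(GivenName) = 1">User must contain exactly one GivenName.</sch:assert>
            <sch:assert test="count(SurName) = 1">User must contain exactly one SurName.</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
```

Because the rules only assert what must be present, any additional elements pass unnoticed. Note that this sketch does not check that ID is an unsigned byte.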

After reading marc_s's answer and your discussion in the comments, I decided to add a little.

It seems to me there is no perfect solution to your problem, Chad. There are some approaches for implementing an extensible content model in XSD, but all the implementations known to me have some restrictions. Because you didn't describe the environment where you plan to use the extensible XSD, I can only recommend some links that will probably help you choose an approach that can be implemented in your environment:

  1. http://www.xfront.com/ExtensibleContentModels.html (or http://www.xfront.com/ExtensibleContentModels.pdf) and http://www.xfront.com/VariableContentContainers.html
  2. http://www.xml.com/lpt/a/993 (or http://www.xml.com/pub/a/2002/07/03/schema_design.html)
  3. http://msdn.microsoft.com/en-us/library/ms950793.aspx
Alyse answered 1/8, 2010 at 20:33 Comment(1)
+1 for linking to the xfront website, still a classic treatment of the subject.Geochronology

You should be able to extend your schema with the <xs:any> element for extensibility - see W3Schools for details.

<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="User">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="GivenName" />
                <xs:element name="SurName" />
                <xs:any minOccurs="0" maxOccurs="unbounded" processContents="lax" />
            </xs:sequence>
            <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
        </xs:complexType>
    </xs:element>
</xs:schema>

When you add the processContents="lax" then the .NET XML validation should succeed on it.
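
With that schema, extra elements are accepted only after the two declared ones. A sketch of an instance that should validate (EmailAddress is taken from the question):

```xml
<?xml version="1.0" encoding="utf-8" ?>
<User ID="1">
    <GivenName></GivenName>
    <SurName></SurName>
    <!-- matched by the trailing xs:any; lax processing accepts the undeclared element -->
    <EmailAddress></EmailAddress>
</User>
```

The reordered document from the question (SurName, EmailAddress, GivenName) would still be rejected, because xs:sequence fixes the position of the declared elements.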

See MSDN docs on xs:any for more details.

Update: if you require more flexibility and less stringent validation, you might want to look at other methods of defining schemas for your XML - something like RelaxNG. XML Schema is - on purpose - rather strict about its rules, so maybe that's just the wrong tool for this job at hand.

Marx answered 27/7, 2010 at 20:55 Comment(5)
While that almost works, the order becomes important. GivenName and SurName must be the first and second elements respectively...at least I think that's the effect of xs:sequence. I could change it to xs:all. But then I can't use xs:any...Aldrin
which tells me that XSD is inherently broken. XML is supposed to be eXtensible. Order of elements isn't supposed to matter. So first, why can you define a sequence? Second, why can't we write extensible XSDs?Aldrin
I realize 'a sequence is a sequence', I think it's bad that it exists, as it violates the basic principle of XML, extensibility. Aside from that, why can't I define an XSD that is truly extensible. Using xs:sequence and xs:any sort of works, but I have to ensure that the sequenced elements are first, and appear in order... which, shouldn't matter given the nature of XML.Aldrin
@Chad: Maybe, if you really need this kind of flexibility, XML schema is just the wrong tool for the job. Have you ever looked at RelaxNG ?? relaxng.orgMarx
I need to define a minimum level of conformance. Nowhere did I say "anything, anywhere, anytime". I need to define a minimum document, and allow it to be extensible. If you need to define a "sequence" of elements in your XML, I would say you should define it IN the element with a sequence attribute. Relying on the order of the elements in the document feels like an affront to the essence of XML.Aldrin

Well, you can always use DTD :-) except that DTD also prescribes ordering. Validation with "unordered" grammar is terribly expensive. You could play with xsd:choice and min and max occurs but it's probably going to balk as well. You could also write XSD extensions / derived schemas.
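
As a sketch of the derived-schema route: the anonymous type from the question is given a name so that it can be extended. The type names UserType and ExtendedUserType are hypothetical, and the schema is assumed to have no target namespace:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xs:schema attributeFormDefault="unqualified" elementFormDefault="qualified" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <!-- the minimum every User must satisfy -->
    <xs:complexType name="UserType">
        <xs:sequence>
            <xs:element name="GivenName" />
            <xs:element name="SurName" />
        </xs:sequence>
        <xs:attribute name="ID" type="xs:unsignedByte" use="required" />
    </xs:complexType>

    <!-- a derived type that appends further elements to the base sequence -->
    <xs:complexType name="ExtendedUserType">
        <xs:complexContent>
            <xs:extension base="UserType">
                <xs:sequence>
                    <xs:element name="EmailAddress" minOccurs="0" />
                </xs:sequence>
            </xs:extension>
        </xs:complexContent>
    </xs:complexType>

    <xs:element name="User" type="ExtendedUserType" />
</xs:schema>
```

This keeps the base definition untouched, but it does not address ordering: extension elements can only be appended after the base elements.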

The way you posed the problem, it looks like you don't really want XSD at all. You can just load the document and then validate whatever minimum you want with XPaths, but just protesting against XSD, so many years after it became an omnipresent standard, is really, really not going to get you anywhere.

Gumdrop answered 4/8, 2010 at 14:15 Comment(5)
The problem with loading, then validating with XPaths is it ends up in code, and is tough to change. I'm not protesting against XSD, I use them a fair bit, but it never really occurred to me until this problem that they missed the mark. IMHO, if you have a data format whose best selling feature is its extensibility, but don't allow for truly extensible definitions in structure...you failed. Or maybe I'm just crazy.Aldrin
@Chad / @zb_z: if it helps: Tim Bray and other really well known names at W3C also consider that they missed the mark. I've tried to explain this issue, which also applies to DTD by the way, which is called the Unique particle Constraint, see above (or below) ;-)Geochronology
Just listing your options if you really, really insist on arbitrary tag re-arrangements. You can keep XPaths in a config file. The key point is that there's a huge perf price to pay for that luxury. That's the core reason why [&] was kicked out of DTD for XML and why it got very limited support in XSD. Have you seen how both DTD and XSD checkers sometimes complain that the grammar is nondeterministic? That's another way to view it: the parser wouldn't be able to decide the next step based on the current and next token; it would need infinite lookahead and would have to check/enumerate permutations.Gumdrop
Those are indeed the arguments used against non-deterministic schemas. But meanwhile (actually, already at the time) it has been proven that it's not as difficult as it seems, and the performance drop proved negligible. Both Schematron and Relax NG have shown that non-determinism is not a problem. Whether it's a good design of your schema is a whole other story, of course.Geochronology
Any perf results and implementations without Haskell or backtracking of other kind? XSD validation is expected to be 0 lookahead which has some important perf, stability and streaming consequences. Like that it gets trusted for automatic code generation and serialization. One needs guaranteed worst case complexity for that. If my memory serves me Relax NG doesn't have cardinality constraints which XSD does (much more expensive to validate with interleaving) and is by and large not 0 lookahead. Let's not forget that with 1 lookahead one can parse C code and still no NFA.Gumdrop

RelaxNG will solve this problem succinctly, if you can use it. Determinism isn't a requirement for schemas. You can translate an RNG or RNC schema into XSD, but it will approximate in this case. Whether that's good enough for your use is up to you.

The RNC schema for this case is:

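start = User
User = element User {
   attribute ID { xsd:unsignedByte },
   ( element GivenName { text } &
     element SurName { text } &
     element * - (SurName | GivenName) { anyContent }* )
}

# any mix of attributes, text and nested elements
anyContent = (attribute * { text } | text | any)*
any = element * { anyContent }

The any rule matches any well-formed element, whatever its attributes and content. So this will require the User element to contain GivenName and SurName elements containing text in any order, and allow any number of other elements containing pretty much anything. The trailing * on the wildcard element matters: without it, exactly one extra element would be required.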
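
If your tooling prefers the XML syntax of RELAX NG, the same idea can be sketched as follows. This is a hand-written rendering, not the output of a converter, in which the wildcard element is optional and repeatable; the define names anyContent and any are arbitrary:

```xml
<?xml version="1.0" encoding="utf-8"?>
<grammar xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
    <start><ref name="User"/></start>
    <define name="User">
        <element name="User">
            <attribute name="ID"><data type="unsignedByte"/></attribute>
            <!-- interleave: the children may appear in any order -->
            <interleave>
                <element name="GivenName"><text/></element>
                <element name="SurName"><text/></element>
                <zeroOrMore>
                    <!-- any element except the two required ones -->
                    <element>
                        <anyName>
                            <except>
                                <name>SurName</name>
                                <name>GivenName</name>
                            </except>
                        </anyName>
                        <ref name="anyContent"/>
                    </element>
                </zeroOrMore>
            </interleave>
        </element>
    </define>
    <!-- any mix of attributes, text and nested elements -->
    <define name="anyContent">
        <zeroOrMore>
            <choice>
                <attribute><anyName/><text/></attribute>
                <text/>
                <ref name="any"/>
            </choice>
        </zeroOrMore>
    </define>
    <define name="any">
        <element><anyName/><ref name="anyContent"/></element>
    </define>
</grammar>
```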

Jolynnjon answered 12/8, 2013 at 19:2 Comment(0)
