Inconsistent XSD validation of nested elements using `<xs:any>`

I'm working on a tool to help a user author XHTML-ish documents which are similar in nature to JSP files. The documents are XML and can contain any well-formed tags in the XHTML namespace, interwoven with elements from my product's namespace. Among other things, the tool validates the input using XSD.

Example input:

<?xml version="1.0"?>
<markup>
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <c:section>
      <c:paragraph>
        <span>This is a test!</span>
        <a href="http://www.google.com/">click here for more!</a>
      </c:paragraph>
    </c:section>
  </html>
</markup>

My problem is that the XSD validation behaves inconsistently depending on how deeply I nest elements. What I want is for all elements in the https://my_tag_lib.example.com/ namespace to be checked against the schema, while any elements in the http://www.w3.org/1999/xhtml namespace are liberally tolerated. I would rather not list every permitted HTML element in my XSD, since users may want to use obscure elements only available in certain browsers. Instead I'd just like to whitelist any element belonging to that namespace using <xs:any>.

What I'm discovering is that under some circumstances, elements which belong to the my_tag_lib namespace but don't appear in the schema are passing validation, while other elements which do appear in the schema can be made to fail by giving them invalid attributes.

So:

* valid elements are validated against the XSD schema
* invalid elements are skipped by the validator?

For example, this passes validation:

<?xml version="1.0"?>
<markup>
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <c:section>
      <div>
        <c:my-invalid-element>This is a test</c:my-invalid-element>
      </div>
    </c:section>
  </html>
</markup>

But then this fails validation:

<?xml version="1.0"?>
<markup>
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <c:section>
      <div>
        <c:paragraph my-invalid-attr="true">This is a test</c:paragraph>
      </div>
    </c:section>
  </html>
</markup>

Why are the attributes being validated against the schema for recognized elements, while unrecognized elements are seemingly not getting sanitized at all? What's the logic here? I've been using xmllint to do the validation:

xmllint --schema markup.xsd example.xml

Here are my XSD files:

File: markup.xsd

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <xs:import namespace="http://www.w3.org/1999/xhtml" schemaLocation="html.xsd" />
  <xs:element name="markup">
    <xs:complexType mixed="true">
      <xs:sequence>
        <xs:element ref="xhtml:html" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

File: html.xsd

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="http://www.w3.org/1999/xhtml">
  <xs:import namespace="https://my_tag_lib.example.com/" schemaLocation="my_tag_lib.xsd" />
  <xs:element name="html">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
        <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>

File: my_tag_lib.xsd

<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" targetNamespace="https://my_tag_lib.example.com/">
  <xs:element name="section">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
        <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
      </xs:choice>
    </xs:complexType>
  </xs:element>
  <xs:element name="paragraph">
    <xs:complexType mixed="true">
      <xs:choice minOccurs="0" maxOccurs="unbounded">
        <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
        <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
      </xs:choice>
    </xs:complexType>
  </xs:element>
</xs:schema>
Flashlight answered 1/4, 2014 at 1:57 Comment(0)

The div element is not declared, so nothing in your schema prevents it from accepting undeclared content, and the paragraph element does not allow my-invalid-attr.

Perhaps some examples might make that clearer.

If the element is declared (such as html, section, paragraph) and its contents are from the taglib namespace (which you declared as having processContents="strict"), they will be treated as strict. That means that attributes or child elements have to be declared. This should fail validation:

<html>
    <c:my-invalid-element>This is a test</c:my-invalid-element>
</html>

So will this:

<c:section>
    <c:my-invalid-element>This is a test</c:my-invalid-element>
</c:section>

this:

<div>
    <c:paragraph>
         <c:my-invalid-element>This is a test</c:my-invalid-element>
    </c:paragraph>
</div>

And this (since attributes are validated against the element's declared type as well):

<c:paragraph my-invalid-attr="true">This is a test</c:paragraph>

But if the element is not declared (such as div), it will match the xs:any declaration. There is no declaration restricting the contents of div, so it allows anything. So this should pass validation:

<div>
    <c:my-invalid-element>This is a test</c:my-invalid-element>
</div>

And since c:my-invalid-element is also not declared, it will allow any content or attributes. This is valid:

<div>
    <c:my-invalid-element invalid-attribute="hi"> <!-- VALID -->
        <c:invalid></c:invalid>
        <html></html>
    </c:my-invalid-element>
</div>

But if you place the invalid element inside the html element, it will fail:

<div>
    <c:my-invalid-element invalid-attribute="hi">
        <html><c:invalid></c:invalid></html>  <!-- NOT VALID -->
    </c:my-invalid-element>
</div>

The same would happen if you use an undeclared attribute inside a declared element (which will not match xs:any) no matter how deep your nesting:

<div>
    <c:my-invalid-element invalid-attribute="hi"> <!-- VALID -->
        <c:invalid>
            <b> 
                <c:section bad-attribute="boo"></c:section> <!-- FAILS! -->
 ...
Edacity answered 1/4, 2014 at 17:44 Comment(0)

What you're missing is an understanding of the context-determined declaration.

First, have a look at this little experiment.

<?xml version="1.0"?>
<markup>
    <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
        <c:section>
            <div>
                <html>
                    <c:my-invalid-element>This is a test</c:my-invalid-element>
                </html>
            </div>
        </c:section>
    </html>
</markup>

This is the same as your valid example, except that now I've changed the context in which c:my-invalid-element is being assessed from "lax" to "strict". This is done by interjecting the html element, which now forces all the elements in your tag namespace to be strict. As you can easily confirm, the above is invalid.

This tells you (without reading the documentation) that in your examples, the determined context must have been "lax" as opposed to your expectation, which is "strict".

Why is the context lax? div is processed laxly (it matches the wildcard, but no declaration exists for it), hence its children are assessed laxly too. Recall what lax means: validate the element if a declaration for it can be found, and don't worry if one can't. In the first case, no declaration for c:my-invalid-element was found, so it is let through: all good. In the invalid sample, a declaration for c:paragraph can be found, hence it must be valid with respect to that declaration: not good, because of the unexpected attribute.
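
To make the chain concrete, here is a sketch that combines your two samples into one document and annotates how I would expect each element to be assessed against the schemas you posted. The comments are my reading of the rules, not validator output; only the c:paragraph attribute should be reported.

```xml
<?xml version="1.0"?>
<markup>
  <!-- html: declared in html.xsd, validated against its type -->
  <html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="https://my_tag_lib.example.com/">
    <!-- c:section: matches the strict wildcard of html, declaration found, validated -->
    <c:section>
      <!-- div: matches the lax wildcard of c:section, no declaration found,
           so it is accepted and its content is assessed laxly -->
      <div>
        <!-- lax context, no declaration found: accepted -->
        <c:my-invalid-element>This is a test</c:my-invalid-element>
        <!-- lax context, but a declaration IS found, so the element is validated
             and the undeclared attribute is an error -->
        <c:paragraph my-invalid-attr="true">This is a test</c:paragraph>
      </div>
    </c:section>
  </html>
</markup>
```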

Terracotta answered 3/4, 2014 at 17:56 Comment(1)
So there's no way to coerce all elements in one namespace to be strict - it's all based on the current context? - Flashlight
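
One possible workaround, offered as a sketch rather than something from the answers: declare the XHTML containers you expect to wrap your tags, so that the lax wildcard finds a declaration for them and the strict wildcard applies again inside. The type name container, the div declaration, and the xs:anyAttribute are all assumptions added for illustration; none of them are in the original html.xsd.

```xml
<?xml version="1.0" encoding="ISO-8859-1" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:xhtml="http://www.w3.org/1999/xhtml"
           targetNamespace="http://www.w3.org/1999/xhtml">
  <xs:import namespace="https://my_tag_lib.example.com/" schemaLocation="my_tag_lib.xsd" />

  <!-- Hypothetical shared type: the same content model the question uses for html -->
  <xs:complexType name="container" mixed="true">
    <xs:choice minOccurs="0" maxOccurs="unbounded">
      <xs:any processContents="lax" namespace="http://www.w3.org/1999/xhtml" />
      <xs:any processContents="strict" namespace="https://my_tag_lib.example.com/" />
    </xs:choice>
    <!-- Assumption: tolerate arbitrary attributes (class, id, ...) on declared containers -->
    <xs:anyAttribute processContents="lax" namespace="##any" />
  </xs:complexType>

  <xs:element name="html" type="xhtml:container" />
  <!-- Because div is now declared, its taglib children are assessed strictly -->
  <xs:element name="div" type="xhtml:container" />
</xs:schema>
```

This only covers the elements you declare. Any XHTML element left undeclared (span, b, and so on) still matches the lax wildcard without a declaration, and everything below it is assessed laxly again.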