Using XSD schema validation for XPath queries

Asked 24/9, 2019 at 11:27 Answered 5/12, 2019 at 2:55

I'm using the following code to create a DOMDocument and validate it against an external xsd file.

<?php

$xmlPath = "/xml/some/file.xml";
$xsdPath = "/xsd/some/schema.xsd";
    
$doc = new \DOMDocument();
$doc->loadXML(file_get_contents($xmlPath), LIBXML_NOBLANKS);

if (!$doc->schemaValidate($xsdPath)) {
    throw new InvalidXmlFileException();
}

Update 2 (rewritten question)

This works fine, meaning that if the XML doesn't match the definitions of XSD it will throw a meaningful exception.

Now I want to retrieve information from the DOMDocument using XPath. That works fine as well; however, from this point on the DOMDocument is completely detached from the XSD! For example, if I have a DOMNode I cannot tell whether it is of a simpleType or a complexType. I can check whether the node has child nodes (hasChildNodes()), but that is not the same thing. The XSD also holds a lot more information (such as the minimum and maximum number of occurrences).

The question really is: do I have to query the XSD myself, or is there a programmatic way of asking these kinds of questions, e.g. is this DOMNode a complex or a simple type?

In another post it was suggested "to process the schema using a real schema processor, and then use its API to ask questions about the contents of the schema". Does XPath have an API to retrieve information from the XSD, or is there a different convenient way to do this with DOMDocument?

For the record, the original question

Now, I wanted to proceed to parse information from the DOMDocument using XPath. To increase the integrity of the data I'm storing in a database, and to give meaningful error messages to the client, I wanted to constantly use the schema information to validate the queries. I.e. I wanted to validate fetched childNodes against the allowed child nodes defined in the XSD. I wanted to do that by using XPath on the XSD document.

However, I stumbled across this post. It basically says that doing this yourself is a rather quirky approach, and that you should instead use a real schema processor and use its API to make the queries. If I understand that correctly, I am using a real schema processor with schemaValidate, but what is meant by using its API?

I kind of guessed already I'm not using the schema in a correct way, but I have no idea how to research a proper usage.

The question

If I use schemaValidate on the DOMDocument, is that a one-time validation (true or false), or does the schema stay tied to the DOMDocument afterwards? More precisely, can I also use the validation when adding nodes, or can I use it to select the nodes I'm interested in, as suggested by the referenced SO post?

Update

The question was rated unclear, so I want to try again. Say I would like to add a node or edit a node value. Can I use the schema provided in the xsd so that I can validate the user input? Originally, in order to do that I wanted to query the xsd manually with another XPath instance to get the specs for a certain node. But as suggested in the linked article this is not best practice. So the question would be, does the DOM lib offer any API to make such a validation?

Maybe I'm overthinking it. Maybe I just add the node and run the validation again and see where/why it breaks? In that case, the answer of the custom error handling would be correct. Can you confirm?

Procrastinate asked 24/9, 2019 at 11:27 Comment(0)

Your question is not very clear, but it sounds like you want to get detailed reporting about any schema validation failures. While DOMDocument::schemaValidate() only returns a boolean, you can use the libxml error functions to get more detailed information.

We can start with your original code, only changing one thing at the top:

<?php
// without this, errors are echoed directly to screen and/or log
libxml_use_internal_errors(true);
$xmlPath = "file.xml";
$xsdPath = "schema.xsd";

$doc = new \DOMDocument();
$doc->loadXML(file_get_contents($xmlPath), LIBXML_NOBLANKS);

if (!$doc->schemaValidate($xsdPath)) {
    throw new InvalidXmlFileException();
}

And then we can make the interesting stuff happen in the exception which is presumably (based on the code you've provided) caught somewhere higher up in the code.

<?php

class InvalidXmlFileException extends \Exception
{
    private $errors = [];

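    public function __construct()
    {
        // Give the exception a message of its own
        parent::__construct("XML document failed schema validation");

        // Collect everything libxml buffered during validation
        foreach (libxml_get_errors() as $err) {
            $this->errors[] = self::formatXmlError($err);
        }
        libxml_clear_errors();
    }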

    /**
     * Return an array of error messages
     *
     * @return array
     */
    public function getXmlErrors(): array
    {
        return $this->errors;
    }

    /**
     * Return a human-readable error message from a libxml error object
     *
     * @return string
     */
    private static function formatXmlError(\LibXMLError $error): string
    {
        $return = "";
        switch ($error->level) {
        case \LIBXML_ERR_WARNING:
            $return .= "Warning $error->code: ";
            break;
        case \LIBXML_ERR_ERROR:
            $return .= "Error $error->code: ";
            break;
        case \LIBXML_ERR_FATAL:
            $return .= "Fatal Error $error->code: ";
            break;
        }

        $return .= trim($error->message) .
               "\n  Line: $error->line" .
               "\n  Column: $error->column";

        if ($error->file) {
            $return .= "\n  File: $error->file";
        }

        return $return;
    }
}

So now when you catch your exception you can just iterate over $e->getXmlErrors():

try {
    // do stuff
} catch (InvalidXmlFileException $e) {
    foreach ($e->getXmlErrors() as $err) {
        echo "$err\n";
    }
}

For the formatXmlError function I just copied an example from the PHP documentation that turns the error into something human-readable, but there is no reason you couldn't return structured data or whatever else you like.
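As a rough sketch of the "structured data" idea, and of the "add the node and revalidate" approach raised in the question's update, something like the following could work. The inline schema, the sample document, and the collectSchemaErrors() helper are all made up for illustration; only schemaValidateSource() and the libxml_* functions are part of PHP itself.

```php
<?php
// Hypothetical inline schema and document, used only to keep the sketch self-contained.
$xsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string" maxOccurs="2"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
XSD;

$xml = '<order><item>a</item><item>b</item></order>';

/**
 * Hypothetical helper: validate $doc against $xsd and return the libxml
 * errors as arrays. An empty array means the document is valid.
 */
function collectSchemaErrors(\DOMDocument $doc, string $xsd): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc->schemaValidateSource($xsd);

    $errors = [];
    foreach (libxml_get_errors() as $err) {
        $errors[] = [
            'level'   => $err->level,
            'code'    => $err->code,
            'message' => trim($err->message),
            'line'    => $err->line,
        ];
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $errors;
}

$doc = new \DOMDocument();
$doc->loadXML($xml);
$before = collectSchemaErrors($doc, $xsd);

// Add a third <item>, which exceeds maxOccurs="2", then validate again.
$doc->documentElement->appendChild($doc->createElement('item', 'c'));
$after = collectSchemaErrors($doc, $xsd);
```

Here $before should be empty and $after should contain at least one entry describing the unexpected element. If the edit has to be rejected, one option is to apply it to a clone of the document first and only keep it when the error list comes back empty.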

Tournedos answered 24/9, 2019 at 23:0 Comment(3)

Thanks, the custom error handling is also a topic I'm interested in. But my original question is different. How can I help to clarify? – Procrastinate 25/9, 2019 at 7:22

Update makes things clearer (for me at least!) You want to validate any additions to the document before they happen. Personally, I don’t think there’s anything wrong with just adding the node and revalidating the whole thing. PHP doesn’t have a dedicated schema parser but if you wanted one, this looks promising: github.com/goetas-webservices/xsd-reader/blob/master/README.md – Tournedos 25/9, 2019 at 13:57

More problems arose and I have re-written the question. I didn't want to open a new question because the root cause of the question remains the same. I would like to kindly ask you to review it @Tournedos – Procrastinate 26/9, 2019 at 14:50

I think what you're looking for is the PSVI (Post-Schema-Validation Infoset); see this answer for some references.

Another option would be to use XPath 2.0, which has operators for checking schema types.

I don't know whether there are PHP libraries that let you access the PSVI or run XPath 2.0 queries; in Java there is Xerces for the PSVI and Saxon for XPath 2.0.

For example, with Xerces it is possible to cast a DOM Element to a Xerces ElementPSVI in order to get the schema information for that element.

A word of warning: using XPath on the schema (as you were doing) will only work for simple cases. The XML of the schema file is very different from the actual schema model (the assembled schema), which is a graph of components with properties. Those properties are indeed calculated from the XML declarations in the schema file, but by very complex rules that are almost impossible to recreate with XPath.
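To make that limitation concrete, here is a minimal PHP sketch (PHP being the language of the question) that queries a schema file with DOMXPath. The schema and the hasInlineComplexType() helper are invented for the example, and the check is deliberately naive: it only looks for an xs:complexType written directly inside the element declaration.

```php
<?php
// Hypothetical schema used only for illustration.
$xsd = <<<'XSD'
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="order">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="item" type="xs:string" maxOccurs="unbounded"/>
        <xs:element name="address" type="AddressType"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:complexType name="AddressType">
    <xs:sequence>
      <xs:element name="city" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
XSD;

$schema = new \DOMDocument();
$schema->loadXML($xsd);

$xpath = new \DOMXPath($schema);
$xpath->registerNamespace('xs', 'http://www.w3.org/2001/XMLSchema');

/**
 * Naive, hypothetical check: does the declaration of element $name contain
 * an inline xs:complexType? Named type references are NOT followed, and
 * $name is interpolated as-is, so it must not contain double quotes.
 */
function hasInlineComplexType(\DOMXPath $xpath, string $name): bool
{
    $query = sprintf('count(//xs:element[@name="%s"]/xs:complexType) > 0', $name);

    return (bool) $xpath->evaluate($query);
}

// The inline definition is found.
$orderIsComplex = hasInlineComplexType($xpath, 'order');

// Reported as "not complex", although AddressType is a complex type.
$addressIsComplex = hasInlineComplexType($xpath, 'address');
```

The check already gives the wrong answer for address because its type is referenced by name. Handling that, plus imports, includes, element references, substitution groups, and derived types, is where hand-written XPath over the schema file stops being practical.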

So you need at least the PSVI, or the ability to run XPath 2.0 queries. In my experience, though, getting decent validation feedback for application users out of an XML schema is difficult.

What are you trying to achieve?

Oil answered 5/12, 2019 at 2:55 Comment(0)
