Speeding up XML schema validations of a batch of XML files against the same XML schema (XSD)
I would like to speed up the process of validating a batch of XML files against the same single XML schema (XSD). The only restriction is that I am in a PHP environment.

My current problem is that the schema I would like to validate against includes the fairly complex XHTML schema of 2755 lines (http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd). Even for very simple data this takes a long time (around 30 seconds per validation). As I have thousands of XML files in my batch, this doesn't really scale well.

For validating the XML file I use both of these methods, from the standard php-xml libraries.

  • DOMDocument::schemaValidate
  • DOMDocument::schemaValidateSource

I am thinking that the PHP implementation fetches the XHTML schema via HTTP and builds some internal representation (possibly a DOMDocument) and that this is thrown away when the validation is completed. I was thinking that some option for the XML-libs might change this behaviour to cache something in this process for reuse.

I've built a simple test setup which illustrates my problem:

test-schema.xsd

<xs:schema attributeFormDefault="unqualified"
    elementFormDefault="qualified"
    targetNamespace="http://myschema.example.com/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:myschema="http://myschema.example.com/"
    xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <xs:import
        schemaLocation="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
        namespace="http://www.w3.org/1999/xhtml">
    </xs:import>
    <xs:element name="Root">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="MyHTMLElement">
                    <xs:complexType>
                        <xs:complexContent>
                            <xs:extension base="xhtml:Flow"></xs:extension>
                        </xs:complexContent>
                    </xs:complexType>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
</xs:schema>

test-data.xml

<?xml version="1.0" encoding="UTF-8"?>
<Root xmlns="http://myschema.example.com/" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xml="http://www.w3.org/XML/1998/namespace" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://myschema.example.com/ test-schema.xsd ">
  <MyHTMLElement>
    <xhtml:p>This is an XHTML paragraph!</xhtml:p>
  </MyHTMLElement>
</Root>

schematest.php

<?php
$data_dom = new DOMDocument();
$data_dom->load('test-data.xml');

// Multiple validations using the schemaValidate method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidate: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidate('test-schema.xsd')) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

// Loading schema into a string.
$schema_source = file_get_contents('test-schema.xsd');

// Multiple validations using the schemaValidateSource method.
for ($attempt = 1; $attempt <= 3; $attempt++) {
    $start = time();
    echo "schemaValidateSource: Attempt #$attempt returns ";
    if (!$data_dom->schemaValidateSource($schema_source)) {
        echo "Invalid!";
    } else {
        echo "Valid!";
    }
    $end = time();
    echo " in " . ($end-$start) . " seconds.\n";
}

Running this schematest.php file produces the following output:

schemaValidate: Attempt #1 returns Valid! in 30 seconds.
schemaValidate: Attempt #2 returns Valid! in 30 seconds.
schemaValidate: Attempt #3 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 32 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 30 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 30 seconds.

Any help and suggestions on how to solve this issue are very welcome!

Kainite answered 13/12, 2012 at 17:23 Comment(1)
Please make a local copy of that W3C schema.Jean

You can safely subtract 30 seconds from the timing values as overhead.

Remote requests to the W3C servers are delayed on purpose, because most libraries do not cache the documents even though the HTTP headers ask them to. The W3C explains this itself:

The W3C servers are slow to return DTDs. Is the delay intentional?

Yes. Due to various software systems downloading DTDs from our site millions of times a day (despite the caching directives of our servers), we have started to serve DTDs and schema (DTD, XSD, ENT, MOD, etc.) from our site with an artificial delay. Our goals in doing so are to bring more attention to our ongoing issues with excessive DTD traffic, and to protect the stability and response time of the rest of our site. We recommend HTTP caching or catalog files to improve performance.

W3.org tries to keep requests low. That is understandable. PHP's DOMDocument is based on libxml, and libxml allows you to set an external entity loader. The whole "Catalog support" section of libxml is interesting in this case.

To solve the issue in question, set up a catalog.xml file:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
    <system systemId="http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd"
            uri="xhtml1-transitional.xsd"/>
    <system systemId="http://www.w3.org/2001/xml.xsd"
            uri="xml.xsd"/>
</catalog>

Save a copy of the two .xsd files, under the names given in that catalog file, next to the catalog (relative paths as well as absolute file:///... paths work if you prefer a different directory).

Then ensure your system's environment variable XML_CATALOG_FILES is set to the filename of the catalog.xml file. When everything is set up, the validation just runs through:

schemaValidate: Attempt #1 returns Valid! in 0 seconds.
schemaValidate: Attempt #2 returns Valid! in 0 seconds.
schemaValidate: Attempt #3 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #1 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #2 returns Valid! in 0 seconds.
schemaValidateSource: Attempt #3 returns Valid! in 0 seconds.

If it still takes long, that is a sign that the environment variable does not point to the right location. I have covered setting the variable, as well as some edge cases such as filenames containing spaces, in a blog post.
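As a rough sketch of what setting the variable from within PHP could look like: the helper name catalogUri() and the location of catalog.xml next to the script are assumptions made for illustration. It relies on libxml reading XML_CATALOG_FILES when it first initializes its catalogs, so the call has to happen before any document is loaded or validated. Because the variable holds a space-separated list, spaces in the path are percent-encoded here.

```php
<?php
// Hypothetical helper: turn a local path into a file:/// URI that is safe
// to put into the space-separated XML_CATALOG_FILES list.
function catalogUri(string $path): string
{
    // Normalize Windows backslashes and percent-encode spaces.
    $normalized = str_replace(['\\', ' '], ['/', '%20'], $path);

    return 'file:///' . ltrim($normalized, '/');
}

// Assumption: catalog.xml lives next to this script.
$catalog = __DIR__ . '/catalog.xml';

// Must run before libxml touches its catalogs for the first time.
putenv('XML_CATALOG_FILES=' . catalogUri($catalog));
```

If validation is still slow after this, printing getenv('XML_CATALOG_FILES') and checking that the file exists at that location is a cheap first diagnostic.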

Alternatively it is possible to create a simple external entity loader callback function that uses a URL => file mapping for the local file-system in form of an array:

$mapping = [
     'http://www.w3.org/2002/08/xhtml/xhtml1-transitional.xsd'
         => 'schema/xhtml1-transitional.xsd',

     'http://www.w3.org/2001/xml.xsd'                          
         => 'schema/xml.xsd',
];

As this shows, I've placed a verbatim copy of these two XSD files into a subdirectory called schema. The next step is to make use of libxml_set_external_entity_loader to activate the callback function with the mapping. Files that exist on disk already are preferred and loaded directly. If the routine encounters a non-file that has no mapping, a RuntimeException will be thrown with a detailed message:

libxml_set_external_entity_loader(
    function ($public, $system, $context) use ($mapping) {

        if (is_file($system)) {
            return $system;
        }

        if (isset($mapping[$system])) {
            return __DIR__ . '/' . $mapping[$system];
        }

        $message = sprintf(
            "Failed to load external entity: Public: %s; System: %s; Context: %s",
            var_export($public, 1), var_export($system, 1),
            strtr(var_export($context, 1), [" (\n  " => '(', "\n " => '', "\n" => ''])
        );

        throw new RuntimeException($message);
    }
);

After setting this external entity loader, the delay caused by the remote requests is gone.
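Once the remote lookups are resolved locally, validating a whole batch is a plain loop. The following is only a sketch: validateBatch() and its directory layout are hypothetical names, not part of the original test setup. Note that each schemaValidate() call still parses the schema again, so the gain comes from avoiding the network, not from reusing a compiled schema.

```php
<?php
// Hypothetical helper: validate every *.xml file in $dir against one schema
// and return a map of filename => bool.
function validateBatch(string $dir, string $schemaFile): array
{
    // Collect libxml errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $results = [];

    foreach (glob($dir . '/*.xml') ?: [] as $file) {
        $dom = new DOMDocument();
        // A file that fails to load counts as invalid.
        $results[$file] = $dom->load($file) && $dom->schemaValidate($schemaFile);
        // Drop the collected errors so they do not pile up across files.
        libxml_clear_errors();
    }

    // Restore the previous error handling mode.
    libxml_use_internal_errors($previous);

    return $results;
}
```

Calling libxml_get_errors() before libxml_clear_errors() would be the place to record why a particular file failed.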

And that's it. See Gist. Take care: this external entity loader was written for loading the XML file to validate from disk and for "resolving" the XSD URIs to local filenames. Other kinds of operations (e.g. DTD-based validation) might need some code changes or extensions. The XML catalog is preferable, since it also works with other tools.

Fasciation answered 13/12, 2012 at 18:0 Comment(6)
Thank you so much! I thought this was a parsing issue :) But thinking back, 30 seconds sounds too round to be a random artefact. Thanks a bunch!Hit
@creen: I again edited the answer, it now shows how to set the external entity loader which does the translation to local files on the fly. I would say that is the preferred way instead of editing the local copies.Fasciation
The external entity loader is nice, but do note that using libxml's catalog support accomplishes essentially the same thing without new PHP code.Damales
@C. M. Sperberg-McQueen: Yes, I was testing on windows today and I could not find much information where the catalog is located on windows with the PHP installation. I wanted to use the catalog firsthand because it looked more straight forward to me. Do you know more? Probably also in conjunction with PHP?Fasciation
@Fasciation Very nice ... That will fix a lot of problems for me in the future :)Hit
I have got the catalog.xml files to work. No need for the mapping function then. I streamline the code a bit and will edit the answer or do a blog post. It's not really complicated, I had some hurdles on windows but figured out all edge-points, still reviewing relative paths. Works great so far!Fasciation

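As an alternative to @hakre's answer: download the external resource (DTD or XSD) on the first request and use the downloaded copy afterwards:

libxml_set_external_entity_loader(
    function ($public, $system, $context) {
        // Local files are used as they are.
        if (is_file($system)) {
            return $system;
        }
        // Deterministic cache path per URL. tempnam() is not suitable here:
        // it creates a new, empty file on every call.
        $cached_file = sys_get_temp_dir() . '/' . md5($system);
        if (is_file($cached_file)) {
            return $cached_file;
        }
        // First request: download once, then reuse the local copy.
        if (!copy($system, $cached_file)) {
            return null;
        }
        return $cached_file;
    }
);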
Sussman answered 16/3, 2015 at 16:9 Comment(0)
