Check for malicious XML before allowing DTD loading?
Asked Answered
I

2

2

Since libxml 2.9, loading external entities has been disabled when parsing XML, to prevent XXE attacks.

In that case, to be able to load a DTD file when parsing the XML with PHP's DOMDocument, LIBXML_DTDLOAD must be specified.

What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD?

One approach I can think of (as shown in the example code below) would be to keep entity loading disabled, parse the XML file once, check that the DOCTYPE declaration is as expected, then parse the XML again with entity loading enabled. Would that be sufficient?

<?php

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article/>
XML;

// entity loading disabled

libxml_disable_entity_loader();

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD); // PHP Warning:  DOMDocument::load(): I/O warning : failed to load external entity

print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd

// entity loading enabled

libxml_disable_entity_loader(false);

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);

print $doc->doctype->systemId; // http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd
Ibis answered 2/7, 2014 at 8:35 Comment(5)
Sufficient for what? Could you please make this a concrete programming question? -- Possible duplicate of: Clarifications on XXE vulnerabilities throughout PHP versions (answer pending)Catamenia
The question, as stated, is "What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD?"Ibis
(which implies): Are those DTDs actually loaded if you use LIBXML_DTDLOAD? Have you tested? Also please provide example data and code in your question. We need an example here for reproduction and clarity - at least if you want a sufficient and clear answer. According to my tests in the linked question, those aren't loaded regardless of your setting. But I'm not sure about the stability of that test.Catamenia
I've added example code to the question.Ibis
+1 for the snappy example :)Catamenia
C
2

What would be a good way to verify that only the expected DTD will be loaded, before enabling LIBXML_DTDLOAD?

If you want to filter (whitelist) expected DTDs, you can do so by disabling all others by returning NULL from your own callable that has been set as external entity loader via libxml_set_external_entity_loader.

That is, you would use the LIBXML_DTDLOAD flag and then resolve to a resource handle in your function in case the DTD is white-listed. In case not, you return said NULL.

<?php
/**
 * @link https://mcmap.net/q/1778051/-check-for-malicious-xml-before-allowing-dtd-loading/367456
 */

$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd">
<article/>
XML;

/* own entity loader */
libxml_set_external_entity_loader(function() {
  var_dump(func_get_args()); // just for demonstrating purposes
  return NULL;
});

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);

echo "----\n";

/* restore default entity loader */    
libxml_set_external_entity_loader(NULL);

$doc = new DOMDocument;
$doc->loadXML($xml, LIBXML_DTDLOAD);

Example output:

array(3) {
  [0]=>
  string(66) "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN"
  [1]=>
  string(66) "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd"
  [2]=>
  array(4) {
    ["directory"]=>
    string(1) "/"
    ["intSubName"]=>
    string(7) "article"
    ["extSubURI"]=>
    string(66) "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd"
    ["extSubSystem"]=>
    string(66) "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN"
  }
}

Warning: DOMDocument::loadXML(): Failed to load external entity "-//NLM//DTD JATS (Z39.96) Journal Publishing DTD v1.0 20120330//EN" in Entity, line: 2 in /in/jemmH on line 18
----

Warning: DOMDocument::loadXML(): php_network_getaddresses: getaddrinfo failed: Name or service not known in /in/jemmH on line 25

Warning: DOMDocument::loadXML(http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd): failed to open stream: php_network_getaddresses: getaddrinfo failed: Name or service not known in /in/jemmH on line 25

Notice: DOMDocument::loadXML(): failed to load external entity "http://jats.nlm.nih.gov/publishing/1.0/JATS-journalpublishing1.dtd" in Entity, line: 2 in /in/jemmH on line 25
Catamenia answered 7/7, 2014 at 21:35 Comment(2)
Thanks, that seems like a good approach. Do you know how to restore the default external entity loader after the XML has been parsed?Ibis
There is no need to restore the external entit loader normally, as you have alread taken care of the loading, as you mimicked the default behavior already for the whitelisted DTDs. However if you really want to restore the default handler, call libxml_set_external_entity_loader(NULL);. An example is here: 3v4l.org/jemmHCatamenia
B
-1

Your approach is seems good but in order to improve performence you may want to do White Listing on DOCTYPE declarations, if it pass then you can parse it with entity loading enabled.

Berg answered 2/7, 2014 at 14:26 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.