Is Scala/Java not respecting w3 "excess dtd traffic" specs?
M

9

16

I'm new to Scala, so I may be off base on this, but I want to know whether the problem is in my code. Given the Scala file httpparse, simplified to:

object Http {
   import java.io.InputStream;
   import java.net.URL;

   def request(urlString:String): (Boolean, InputStream) =
      try {
         val url = new URL(urlString)
         val body = url.openStream
         (true, body)
      }
      catch {
         case ex:Exception => (false, null)
      }
}

object HTTPParse extends Application {
   import scala.xml._;
   import java.net._;

   def fetchAndParseURL(URL:String) = {
      val (true, body) = Http request(URL)
      val xml = XML.load(body) // <-- Error happens here in .load() method
      "True"
   }
}

Which is run with (URL doesn't matter, this is a joke example):

scala> HTTPParse.fetchAndParseURL("http://stackoverflow.com")

The result invariably:

   java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/strict.dtd
    at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1187)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
    at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEnti...

I've seen the Stack Overflow thread on this with respect to Java, as well as the W3C's System Team Blog entry about not trying to access this DTD via the web. I've also isolated the error to the XML.load() method, which is a Scala library method as far as I can tell.

My question: How can I fix this? Is this a byproduct of my code (cribbed from Raphael Ferreira's post), a byproduct of something Java-specific that I need to address as in the previous thread, or something Scala-specific? Where is this call happening, and is it a bug or a feature? ("Is it me? It's her, right?")

Melosa answered 8/7, 2009 at 5:44 Comment(6)
You've already got the answer, but I want to comment on the W3C blog entry: per the XML spec, if you use a SYSTEM identifier then the parser has to be able to retrieve that content: xml.com/axml/target.html#dt-sysid -- even if it hasn't changed in years. I linked to the annotated spec (a creation of Tim Bray, one of the original spec editors) because it has some nice commentary on SYSTEM versus PUBLIC identifiers.Rhubarb
@kdgregory, but this doesn't count, because the content mettadore is trying to retrieve isn't XML.Giaour
read more carefully: "URL doesn't matter, this is a joke example"; find any site that produces valid XHTML and you'll have the same issueRhubarb
@Rhubarb The URL obviously does matter, because if you look at the error message, it's trying to retrieve the HTML 4 DTD. So the real page is also not XML. The difference is also significant in practice: it's likely that the XHTML DTD would be successfully retrieved, given that (as others have pointed out) it's required by the XML spec (but not the HTML spec).Giaour
:shrug: it does the same thing when you try to open a URLConnection to "w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"; clearly they're looking at user-agent and blocking the Java APIRhubarb
I don't think they're blocking the Java API yet, though w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic did state that they might start at that level. I think that they basically block as much as they can, though, given that blog post. It looks like the issue might not be Scala/Java/XML related so much as a specific case of inappropriate use of the XHTML DTD URL.Melosa
M
2

It works. After some detective work, the details as best I can figure them:

Trying to parse a developmental RESTful interface, I build the parser and get the above (rather, a similar) error. I try various parameters to change the XML output, but get the same error. I try to connect to an XML document I quickly whip up (cribbed stupidly from the interface itself) and get the same error. Then I try to connect to anything, just for kicks, and get the same (again, likely only similar) error.

I started questioning whether it was an error with the sources or the program, so I started searching around, and it looks like an ongoing issue- with many Google and SO hits on the same topic. This, unfortunately, made me focus on the upstream (language) aspects of the error, rather than troubleshoot more downstream at the sources themselves.

Fast forward and the parser suddenly works on the original XML output. I confirmed that some additional work had been done server side (just a crazy coincidence?). I don't have the earlier XML, but I suspect that it is related to the document identifiers being changed.

Now, the parser works fine on the RESTful interface, as well as on any well-formatted XML I can throw at it. It also fails on all XHTML DTDs I've tried (e.g. www.w3.org). This is contrary to what @SeanReilly expects, but seems to jibe with what the W3C states.

I'm still new to Scala, so I can't determine whether I have a special or a typical case. Nor can I be assured that this problem won't recur for me in another form down the line. It does seem that pulling XHTML will continue to cause this error unless one uses a solution similar to those suggested by @GClaramunt and @J-16 SDiZ. I'm not really qualified to know if this is a problem with the language or with my implementation of a solution (likely the latter).

For the immediate timeframe, I suspect that the best solution would've been for me to ensure that it was possible to parse that XML source, rather than see that others have had the same error and assume there was a functional problem with the language.

Hope this helps others.

Melosa answered 8/7, 2009 at 19:45 Comment(4)
Wrote this as @Daniel posted, so missed it. Using his inline implementation of everyone's negation of doctype verification, the parser works on all XML and XHTML I can throw at it. Thanks all!Melosa
I'd recommend defining that "MyXML" inside a top-level object, and importing it. No need to create a new parser every time.Enrico
If you really want to help others, please report the bug. All this energy for stack overflow, yet nobody was bothered enough to open a ticket where someone might see it. lampsvn.epfl.ch/trac/scalaBohemianism
A bit late on the response to this, but @extempore, I wasn't filing a bug because I couldn't be sure that it was just my own stupidity, which it seems to have been. Since the XML parser was parsing valid XML, but breaking on broken XML and XHTML, it seems that it was functioning as expected, and that it was ME who was the bug.Melosa
D
11

I've bumped into the SAME issue, and I haven't found an elegant solution (I'm thinking of posting the question to the Scala mailing list). Meanwhile, I found a workaround: implement your own SAXParserFactoryImpl so you can call setFeature on it (for features such as "http://apache.org/xml/features/disallow-doctype-decl"). The good thing is it doesn't require any change to the Scala code base (I agree that it should be fixed, though). First I'm extending the default parser factory:

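package mypackage;

import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;

// Extends the JDK's built-in factory so the features are set before any parser is created.
public class MyXMLParserFactory extends SAXParserFactoryImpl {
    public MyXMLParserFactory() throws SAXNotRecognizedException, SAXNotSupportedException, ParserConfigurationException {
        super();
        // Turn off validation and keep the parser from loading the DTD named in the DOCTYPE.
        super.setFeature("http://xml.org/sax/features/validation", false);
        super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false);
        super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    }
}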

Nothing special, I just want the chance to set the property.

(Note that this is plain Java code; you can most probably write the same in Scala too.)

And in your Scala code, you need to configure the JVM to use your new factory:

System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory");

Then you can call XML.load without validation.
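To see the effect of the load-external-dtd feature in isolation, here is a minimal Java sketch. It assumes the JDK's bundled Xerces-derived parser is the one returned by SAXParserFactory.newInstance(); other JAXP parsers may not recognize the feature. The class name NoDtdFetchDemo, the helper names, and the sample document are made up for illustration. The DOCTYPE points at a file that does not exist, so the parse can only succeed if the parser never tries to load the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names and the sample document are illustrative only.
class NoDtdFetchDemo {

    // Builds a non-validating factory that still accepts DOCTYPE declarations
    // but asks the parser not to load the external DTD they point to.
    // Assumption: the underlying parser recognizes this Xerces feature.
    static SAXParserFactory newFactory() throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setValidating(false);
        f.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return f;
    }

    // Parses the given XML text and returns the name of its root element.
    static String rootElementName(String xml) throws Exception {
        final String[] root = new String[1];
        SAXParser parser = newFactory().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (root[0] == null) {
                    root[0] = qName; // remember only the first element seen
                }
            }
        });
        return root[0];
    }

    public static void main(String[] args) throws Exception {
        // The DTD location deliberately does not exist.
        String doc = "<!DOCTYPE html SYSTEM \"file:///nonexistent/strict.dtd\">"
                + "<html><head><title>t</title></head><body/></html>";
        System.out.println(rootElementName(doc));
    }
}
```

Because the external DTD is skipped, entities declared only in that DTD (such as &nbsp; in XHTML) are not defined during the parse, so documents that rely on them may need the entity-resolver approach instead.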

Darciedarcy answered 8/7, 2009 at 15:39 Comment(3)
Your method is much simpler, I love this.Winchell
nice! Set the following features to 'false' as well: apache.org/xml/features/disallow-doctype-decl apache.org/xml/features/nonvalidating/load-dtd-grammar apache.org/xml/features/nonvalidating/load-external-dtdRuthie
For better taste (and useful for name refactoring): System.setProperty("javax.xml.parsers.SAXParserFactory", classOf[MyXMLParserFactory].getName)Sudan
E
7

Without addressing the problem for now, what do you expect to happen if the function request returns false below?

def fetchAndParseURL(URL:String) = {      
  val (true, body) = Http request(URL)

What will happen is that an exception will be thrown. You could rewrite it this way, though:

def fetchAndParseURL(URL:String) = (Http request(URL)) match {      
  case (true, body) =>      
    val xml = XML.load(body)
    "True"
  case _ => "False"
}

Now, to fix the XML parsing problem, we'll disable DTD loading in the parser, as suggested by others:

def fetchAndParseURL(URL:String) = (Http request(URL)) match {      
  case (true, body) =>
    val f = javax.xml.parsers.SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    val MyXML = XML.withSAXParser(f.newSAXParser())
    val xml = MyXML.load(body)
    "True"
  case _ => "False"
}

Now, I put that MyXML stuff inside fetchAndParseURL just to keep the structure of the example as unchanged as possible. For actual use, I'd separate it in a top-level object, and make "parser" into a def instead of val, to avoid problems with mutable parsers:

import scala.xml.Elem
import scala.xml.factory.XMLLoader
import javax.xml.parsers.SAXParser
object MyXML extends XMLLoader[Elem] {
  override def parser: SAXParser = {
    val f = javax.xml.parsers.SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    f.newSAXParser()
  }
}

Import the package it is defined in, and you are good to go.

Enrico answered 8/7, 2009 at 18:48 Comment(5)
Boy, I really need to be careful when posting code. I just whipped something in, sloppily removing things that didn't matter and ignoring most problems. The feedback is always helpful, but I should be much better about making sure my sample code ducks are in a row so that people don't spend their time trying to fix ancillary things when they don't matter. Sorry about that @Daniel and everyone else.Melosa
Just re-read that- didn't mean to sound snippy. Not saying it's unappreciated- I actually used your code- just that I should spend more time cleaning up, rather than make others spend it.Melosa
The thing is, I have seen people do that and expect it to work before in a slightly different setting. So I thought it better to address that, as it might happily work until... it doesn't. :)Enrico
I'm having trouble compiling this code, the method withSAXParser doesn't seem to exist on the XML object...?Hg
It seems both XMLLoader and the method withSAXParser are Scala 2.8 only. I never even considered the possibility they might not exist on 2.7, but there it is.Enrico
W
3

This is a Scala problem. Native Java has an option to disable loading the DTD:

f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

There is no equivalent in Scala.

If you want to fix it yourself, check scala/xml/parsing/FactoryAdapter.scala and put the line in:

  def loadXML(source: InputSource): Node = {
    // create parser
    val parser: SAXParser = try {
      val f = SAXParserFactory.newInstance()
      f.setNamespaceAware(false)
      // <-- insert the setFeature line here
      f.newSAXParser()
    } catch {
      case e: Exception =>
        Console.err.println("error: Unable to instantiate parser")
        throw e
    }
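What disallow-doctype-decl does can be checked directly against JAXP. The following Java sketch assumes the JDK's bundled Xerces-derived parser; the class name DisallowDoctypeDemo and the sample documents are hypothetical. With the feature set to true, that parser does not skip the DOCTYPE: it reports it as a fatal error, so a document that carries a DOCTYPE is rejected outright while a document without one parses normally.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and sample documents are illustrative only.
class DisallowDoctypeDemo {

    // Returns true if the text parses with DOCTYPE declarations disallowed,
    // false if the parser rejects it with a SAXException.
    static boolean parses(String xml) throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setNamespaceAware(false);
        // Assumption: the underlying parser recognizes this Xerces feature.
        f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
        try {
            f.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // A DOCTYPE in the input is reported as a fatal error and ends up here.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parses("<root><child/></root>"));
        System.out.println(parses("<!DOCTYPE root SYSTEM \"file:///nonexistent/x.dtd\"><root/>"));
    }
}
```

If the goal is to accept documents that have a DOCTYPE without fetching the DTD, the load-external-dtd feature or an entity resolver is the closer fit.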
Winchell answered 8/7, 2009 at 5:59 Comment(2)
With the DOM parser, this will cause an error if the DOCTYPE is present (and you have to set it as an attribute, not a feature). And is the Scala parser wrapper really so borked that it ignores namespaces?!? Are we in 1997?Rhubarb
@kdgregory: Scala wrapper is broken --> agree. Ignore namespace --> I guess you have never tried parsing the RSS/Atom out in the wild with namespace enabled :)Winchell
V
3

GClaramunt's solution worked wonders for me. My Scala conversion is as follows:

package mypackage
import org.xml.sax.{SAXNotRecognizedException, SAXNotSupportedException}
import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
import javax.xml.parsers.ParserConfigurationException

@throws(classOf[SAXNotRecognizedException])
@throws(classOf[SAXNotSupportedException])
@throws(classOf[ParserConfigurationException])
class MyXMLParserFactory extends SAXParserFactoryImpl() {
    super.setFeature("http://xml.org/sax/features/validation", false)
    super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
    super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false)
    super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
}

As mentioned in the original post, it is necessary to place the following line somewhere in your code:

System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory")
Vip answered 1/5, 2011 at 2:0 Comment(0)
G
1

There are two problems with what you are trying to do:

  • Scala's XML parser is trying to physically retrieve the DTD when it shouldn't. J-16 SDiZ seems to have some advice for this problem.
  • The Stack Overflow page you are trying to parse isn't XML. It's HTML 4 Strict.

The second problem isn't really possible to fix in your Scala code. Even once you get around the DTD problem, you'll find that the source just isn't valid XML (empty tags aren't closed properly, for example).

You have to either parse the page with something besides an XML parser, or investigate using a utility like tidy to convert the HTML to XML.

Giaour answered 8/7, 2009 at 5:58 Comment(1)
Thanks, That's a good point, and I do understand that (the URL was a joke). It doesn't actually seem to matter what URL I use, whether valid XML or not. This question is specific to the DTD issue and I tried to minimize the code to that amount just enough to reproduce the error.Melosa
S
0

My knowledge of Scala is pretty poor, but couldn't you use ConstructingParser instead?

  val xml = new java.io.File("xmlWithDtd.xml")
  val parser = scala.xml.parsing.ConstructingParser.fromFile(xml, true)
  val doc = parser.document()
  println(doc.docElem)
Shoshone answered 8/7, 2009 at 12:39 Comment(0)
E
0

For scala 2.7.7 I managed to do this with scala.xml.parsing.XhtmlParser

Empurple answered 15/12, 2009 at 12:35 Comment(1)
That's probably a better solution than an XML parser for some cases, because it would allow much looser definitions. Still, I did find that the XML parser DOES work in every case where there is valid XML to parse. It only seems to break when there is broken XML. I guess the choice of parser should probably be based on the use case (of course).Melosa
M
0

Setting Xerces switches only works if you are using Xerces. An entity resolver works for any JAXP parser.

There are more generalized entity resolvers out there, but this implementation does the trick when all I'm trying to do is parse valid XHTML.

http://code.google.com/p/java-xhtml-cache-dtds-entityresolver/

Shows how trivial it is to cache the DTDs and forgo the network traffic.

In any case, this is how I fix it. I always forget. I always get the error. I always go fetch this entity resolver. Then I'm back in business.
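For reference, the general shape of an entity resolver looks roughly like the Java sketch below. It is not the linked project's code: the class name LocalDtdResolverDemo and the helper method are hypothetical, and an empty source stands in for a locally cached DTD. A real resolver would map the public or system identifier to a DTD copy stored on disk or on the classpath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; names and the sample document are illustrative only.
class LocalDtdResolverDemo {

    // Parses the text and returns the system identifiers the parser asked
    // the resolver for, so the caller can see what would have been fetched.
    static List<String> parseAndListRequests(String xml) throws Exception {
        List<String> requested = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // Answer every external entity request locally instead of going to the network.
        // The empty source is a stand-in for a cached copy of the DTD.
        reader.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            return new InputSource(new StringReader(""));
        });
        DefaultHandler handler = new DefaultHandler();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return requested;
    }

    public static void main(String[] args) throws Exception {
        String doc = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><head><title>t</title></head><body/></html>";
        System.out.println(parseAndListRequests(doc));
    }
}
```

Returning real DTD content rather than an empty source keeps entity declarations such as &nbsp; available, which is the main advantage over switching DTD loading off.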

Mikael answered 2/1, 2010 at 22:57 Comment(0)
