Load HTML string into DOM tree with Javascript

I'm currently working with an automation framework that is pulling a webpage down for analysis, which is then presented as a string for processing. The Rhino Javascript engine is available to assist in parsing the returned web page.

It seems that if the string (which is a complete webpage) can be loaded in a DOM representation, it would provide a very nice interface for parsing and analyzing content.

Using only Javascript, is this a possible and/or feasible concept?

Edit:

I'll decompose the question for clarity. Say I have a string in JavaScript that contains HTML, like this:


var $mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';

Is it possible and realistic to load it somehow into a DOM object?

Filipino answered 4/2, 2011 at 22:8 Comment(1)
If I understood right, you can append an HTML string to the body of a document: document.body.innerHTML = "string" - Steffens
C
0

If you have a variable that contains HTML, you can load it into a DOM element, for example one selected by id:

var mywebpage = '<!DOCTYPE HTML PUB ...//snipped//... </body></html>';

var element = document.getElementById('dom-id');  // <-- the element you are loading it into

element.innerHTML = mywebpage;
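
One caveat: when the string is a complete page, assigning it to an element's innerHTML typically keeps only the markup that is valid inside that element, so the doctype and the html/head/body wrappers are dropped. In a browser, a minimal sketch that keeps the whole document could use the standard DOMParser API instead. The helper name parseHtmlString and its optional second parameter are illustrative assumptions, not part of the original answer, and this does not help under plain Rhino, where no DOMParser exists.

```javascript
// Hypothetical helper: parse a complete HTML page held in a string.
// ParserCtor defaults to the browser's DOMParser; it is a parameter only so
// that a missing DOM (for example under plain Rhino) is reported clearly.
function parseHtmlString(html, ParserCtor) {
  if (ParserCtor === undefined) {
    ParserCtor = (typeof DOMParser !== 'undefined') ? DOMParser : null;
  }
  if (typeof ParserCtor !== 'function') {
    throw new Error('No DOMParser available in this environment');
  }
  // 'text/html' selects the lenient HTML parser and returns a Document.
  return new ParserCtor().parseFromString(html, 'text/html');
}
```

In a browser, parseHtmlString(mywebpage).querySelectorAll('a') would then give the links in the page without touching the live document.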
Chubb answered 4/2, 2011 at 22:23 Comment(2)
That is a step in the right direction. Since I'm using Rhino, I'm uncertain whether I can actually access or possibly 'create' a DOM object. I will continue looking at this and update as I learn more. - Filipino
OK, well, as long as you have a string of HTML, it will load into whatever DOM element you select. - Chubb
F
1

I'm accepting JonDavidJohn's answer as it was useful in solving my problem, though I'm including this additional answer for others who may view this in the future.

It appears that while JavaScript running in a browser can load HTML strings into a DOM element, the DOM is not part of core ECMAScript, and as such is not available to scripts running under Rhino.
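
A script can check for this at run time rather than assume a DOM exists. The following is a minimal sketch; the function name hasDom and the choice to pass the global object in as a parameter are illustrative assumptions.

```javascript
// Hypothetical helper: report whether a global scope exposes a usable DOM.
// Core ECMAScript defines no `document`, so this is false under plain Rhino
// and true in a browser window.
function hasDom(scope) {
  return !!scope &&
    typeof scope.document === 'object' && scope.document !== null &&
    typeof scope.document.createElement === 'function';
}
```

At the top level of a non-module script, hasDom(this) passes the global object, so the same file can pick a DOM-based path in a browser and another parser under Rhino.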

As a side note worth mentioning, a good alternative that was implemented in Rhino 1.6 is E4X. While not a DOM implementation, it does provide for conceptually similar capabilities.

Filipino answered 10/2, 2011 at 18:43 Comment(0)
S
1

If the document is XHTML, you can parse it with any XML parser. E4X would probably do the job nicely, as would the built-in Java XML parsing interfaces.

The env.js library is designed to emulate the browser environment under Rhino, but I believe your document also needs to be compliant XHTML:

http://ejohn.org/blog/bringing-the-browser-to-the-server/

http://www.envjs.com/

If it's HTML, however, it's more difficult, as browsers are designed to be extremely lenient in how markup is parsed. See here for a list of HTML parsers in Java:

http://java-source.net/open-source/html-parsers

This is not an easy problem to solve. People have gone so far as to embed the Mozilla Gecko engine in Java via JNI in order to use its parsing capabilities.

I would recommend you look into the following pure-Java project:

http://lobobrowser.org/cobra.jsp

The goal of the Lobo project is to develop a pure-Java web browser. It's a pretty interesting project, and there's a lot there, but I believe you could use the parser standalone quite easily in your own application, as described in the following link:

http://lobobrowser.org/cobra/java-html-parser.jsp

Siamese answered 14/2, 2011 at 5:8 Comment(0)
