Can I load an entire HTML document into a document fragment in Internet Explorer?

Asked 19/9, 2011 at 17:15 Answered 25/1, 2018 at 14:41

Solved javascript html internet-explorer dom

Here's something I've been having a little bit of difficulty with. I have a local client-side script that needs to allow a user to fetch a remote web page and search that resulting page for forms. In order to do this (without regex), I need to parse the document into a fully traversable DOM object.

Some limitations I'd like to stress:

I don't want to use libraries (like jQuery). There's too much bloat for what I need to do here.
Under no circumstances should scripts from the remote page be executed (for security reasons).
DOM APIs, such as getElementsByTagName, need to be available.
It only needs to work in Internet Explorer, but in 7 at the very least.
Let's pretend I don't have access to a server. I do, but I can't use it for this.

What I've tried

Assuming I have a complete HTML document string (including DOCTYPE declaration) in the variable html, here's what I've tried so far:

var frag = document.createDocumentFragment(),
div  = frag.appendChild(document.createElement("div"));

div.outerHTML = html;
//-> results in an empty fragment

div.insertAdjacentHTML("afterEnd", html);
//-> HTML is not added to the fragment

div.innerHTML = html;
//-> Error (expected, but I tried it anyway)

var doc = new ActiveXObject("htmlfile");
doc.write(html);
doc.close();
//-> JavaScript executes

I've also tried extracting the <head> and <body> nodes from the HTML and adding them to an <html> element inside the fragment, still no luck.

Does anyone have any ideas?

Loera answered 19/9, 2011 at 17:15 Comment(5)

"I don't want to use libraries (like jQuery). There's too much bloat for what I need to do here": there's always the closure compiler: #1692361 – Meliamelic 19/9, 2011 at 17:25

@Juan Mendes: an iframe would execute the script and IE7 has no methods for sandboxing, except for the security attribute which doesn't guarantee that script won't run. – Loera 19/9, 2011 at 17:31

I'm just going to say the word HTAs (about which I know nothing), paste the following link and retreat. There's a good chance it's completely useless. msdn.microsoft.com/en-us/library/ms536496%28v=vs.85%29.aspx – Impetuous 23/9, 2011 at 17:26

Which versions of IE? I've run into rendering issues where Trident won't render something that's loaded into the innerHTML value on 6/7. This behaviour happens when you use an inappropriate DOM method to do something. – Quach 24/9, 2011 at 13:22

Very related: Parsing a HTML string using DOMParser with MIME-type text/html: JavaScript DOMParser access innerHTML and other properties. – Tonetic 12/2, 2012 at 18:53

Fiddle: http://jsfiddle.net/JFSKe/6/

DocumentFragment doesn't implement DOM methods. Using document.createElement in conjunction with innerHTML removes the <head> and <body> tags (even when the created element is a root element, <html>). Therefore, the solution should be sought elsewhere. I have created a cross-browser string-to-DOM function, which makes use of an invisible inline-frame.

All external resources and scripts will be disabled. See Explanation of the code for more information.

Code

/*
 @param String html    The string with HTML which has to be converted to a DOM object
 @param func callback  (optional) Callback(HTMLDocument doc, function destroy)
 @returns              undefined if callback exists, else: Object
                        HTMLDocument doc  DOM fetched from Parameter:html
                        function destroy  Removes HTMLDocument doc.         */
function string2dom(html, callback){
    /* Sanitise the string */
    html = sanitiseHTML(html); /*Defined at the bottom of the answer*/

    /* Create an IFrame */
    var iframe = document.createElement("iframe");
    iframe.style.display = "none";
    document.body.appendChild(iframe);

    var doc = iframe.contentDocument || iframe.contentWindow.document;
    doc.open();
    doc.write(html);
    doc.close();

    function destroy(){
        iframe.parentNode.removeChild(iframe);
    }
    if(callback) callback(doc, destroy);
    else return {"doc": doc, "destroy": destroy};
}

/* @name sanitiseHTML
   @param String html  A string representing HTML code
   @return String      A new string, fully stripped of external resources.
                       All "external" attributes (href, src) are prefixed by data- */

function sanitiseHTML(html){
    /* Adds a <!-\"'--> before every matched tag, so that unterminated quotes
        aren't preventing the browser from splitting a tag. Test case:
       '<input style="foo;b:url(0);><input onclick="<input type=button onclick="too() href=;>">' */
    var prefix = "<!--\"'-->";
    /*Attributes should not be prefixed by these characters. This list is not
     complete, but will be sufficient for this function.
      (see http://www.w3.org/TR/REC-xml/#NT-NameChar) */
    var att = "[^-a-z0-9:._]";
    var tag = "<[a-z]";
    var any = "(?:[^<>\"']*(?:\"[^\"]*\"|'[^']*'))*?[^<>]*";
    var etag = "(?:>|(?=<))";

    /*
      @name ae
      @description          Converts a given string in a sequence of the
                             original input and the HTML entity
      @param String string  String to convert
      */
    var entityEnd = "(?:;|(?!\\d))";
    var ents = {" ":"(?:\\s|&nbsp;?|&#0*32"+entityEnd+"|&#x0*20"+entityEnd+")",
                "(":"(?:\\(|&#0*40"+entityEnd+"|&#x0*28"+entityEnd+")",
                ")":"(?:\\)|&#0*41"+entityEnd+"|&#x0*29"+entityEnd+")",
                ".":"(?:\\.|&#0*46"+entityEnd+"|&#x0*2e"+entityEnd+")"};
                /*Placeholder to avoid tricky filter-circumventing methods*/
    var charMap = {};
    var s = ents[" "]+"*"; /* Short-hand space */
    /* Important: Must be pre- and postfixed by < and >. RE matches a whole tag! */
    function ae(string){
        var all_chars_lowercase = string.toLowerCase();
        if(ents[string]) return ents[string];
        var all_chars_uppercase = string.toUpperCase();
        var RE_res = "";
        for(var i=0; i<string.length; i++){
            var char_lowercase = all_chars_lowercase.charAt(i);
            if(charMap[char_lowercase]){
                RE_res += charMap[char_lowercase];
                continue;
            }
            var char_uppercase = all_chars_uppercase.charAt(i);
            var RE_sub = [char_lowercase];
            RE_sub.push("&#0*" + char_lowercase.charCodeAt(0) + entityEnd);
            RE_sub.push("&#x0*" + char_lowercase.charCodeAt(0).toString(16) + entityEnd);
            if(char_lowercase != char_uppercase){
                RE_sub.push("&#0*" + char_uppercase.charCodeAt(0) + entityEnd);   
                RE_sub.push("&#x0*" + char_uppercase.charCodeAt(0).toString(16) + entityEnd);
            }
            RE_sub = "(?:" + RE_sub.join("|") + ")";
            RE_res += (charMap[char_lowercase] = RE_sub);
        }
        return(ents[string] = RE_res);
    }
    /*
      @name by
      @description  second argument for the replace function.
      */
    function by(match, group1, group2){
        /* Adds a data-prefix before every external pointer */
        return group1 + "data-" + group2;
    }
    /*
      @name cr
      @description            Selects a HTML element and performs a
                                  search-and-replace on attributes
      @param String selector  HTML substring to match
      @param String attribute RegExp-escaped; HTML element attribute to match
      @param String marker    Optional RegExp-escaped; marks the prefix
      @param String delimiter Optional RegExp escaped; non-quote delimiters
      @param String end       Optional RegExp-escaped; forces the match to
                                  end before an occurrence of <end> when
                                  quotes are missing
     */
    function cr(selector, attribute, marker, delimiter, end){
        if(typeof selector == "string") selector = new RegExp(selector, "gi");
        marker = typeof marker == "string" ? marker : "\\s*=";
        delimiter = typeof delimiter == "string" ? delimiter : "";
        end = typeof end == "string" ? end : "";
        var is_end = end && "?";
        var re1 = new RegExp("("+att+")("+attribute+marker+"(?:\\s*\"[^\""+delimiter+"]*\"|\\s*'[^'"+delimiter+"]*'|[^\\s"+delimiter+"]+"+is_end+")"+end+")", "gi");
        html = html.replace(selector, function(match){
            return prefix + match.replace(re1, by);
        });
    }
    /* 
      @name cri
      @description            Selects an attribute of a HTML element, and
                               performs a search-and-replace on certain values
      @param String selector  HTML element to match
      @param String attribute RegExp-escaped; HTML element attribute to match
      @param String front     RegExp-escaped; attribute value, prefix to match
      @param String flags     Optional RegExp flags, default "gi"
      @param String delimiter Optional RegExp-escaped; non-quote delimiters
      @param String end       Optional RegExp-escaped; forces the match to
                                  end before an occurrence of <end> when
                                  quotes are missing
     */
    function cri(selector, attribute, front, flags, delimiter, end){
        if(typeof selector == "string") selector = new RegExp(selector, "gi");
        flags = typeof flags == "string" ? flags : "gi";
         var re1 = new RegExp("("+att+attribute+"\\s*=)((?:\\s*\"[^\"]*\"|\\s*'[^']*'|[^\\s>]+))", "gi");

        end = typeof end == "string" ? end + ")" : ")";
        var at1 = new RegExp('(")('+front+'[^"]+")', flags);
        var at2 = new RegExp("(')("+front+"[^']+')", flags);
        var at3 = new RegExp("()("+front+'(?:"[^"]+"|\'[^\']+\'|(?:(?!'+delimiter+').)+)'+end, flags);

        var handleAttr = function(match, g1, g2){
            if(g2.charAt(0) == '"') return g1+g2.replace(at1, by);
            if(g2.charAt(0) == "'") return g1+g2.replace(at2, by);
            return g1+g2.replace(at3, by);
        };
        html = html.replace(selector, function(match){
             return prefix + match.replace(re1, handleAttr);
        });
    }

    /* <meta http-equiv=refresh content="  ; url= " > */
    html = html.replace(new RegExp("<meta"+any+att+"http-equiv\\s*=\\s*(?:\""+ae("refresh")+"\""+any+etag+"|'"+ae("refresh")+"'"+any+etag+"|"+ae("refresh")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "gi"), "<!-- meta http-equiv=refresh stripped-->");

    /* Stripping all scripts */
    html = html.replace(new RegExp("<script"+any+">\\s*//\\s*<\\[CDATA\\[[\\S\\s]*?]]>\\s*</script[^>]*>", "gi"), "<!--CDATA script-->");
    html = html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->");
    cr(tag+any+att+"on[-a-z0-9:_.]+="+any+etag, "on[-a-z0-9:_.]+"); /* Event listeners */

    cr(tag+any+att+"href\\s*="+any+etag, "href"); /* Linked elements */
    cr(tag+any+att+"src\\s*="+any+etag, "src"); /* Embedded elements */

    cr("<object"+any+att+"data\\s*="+any+etag, "data"); /* <object data= > */
    cr("<applet"+any+att+"codebase\\s*="+any+etag, "codebase"); /* <applet codebase= > */

    /* <param name=movie value= >*/
    cr("<param"+any+att+"name\\s*=\\s*(?:\""+ae("movie")+"\""+any+etag+"|'"+ae("movie")+"'"+any+etag+"|"+ae("movie")+"(?:"+ae(" ")+any+etag+"|"+etag+"))", "value");

    /* <style> and < style=  > url()*/
    cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "url", "\\s*\\(\\s*", "", "\\s*\\)");
    cri(tag+any+att+"style\\s*="+any+etag, "style", ae("url")+s+ae("(")+s, 0, s+ae(")"), ae(")"));

    /* IE7- CSS expression() */
    cr(/<style[^>]*>(?:[^"']*(?:"[^"]*"|'[^']*'))*?[^'"]*(?:<\/style|$)/gi, "expression", "\\s*\\(\\s*", "", "\\s*\\)");
    cri(tag+any+att+"style\\s*="+any+etag, "style", ae("expression")+s+ae("(")+s, 0, s+ae(")"), ae(")"));
    return html.replace(new RegExp("(?:"+prefix+")+", "g"), prefix);
}

Explanation of the code

The sanitiseHTML function is based on my replace_all_rel_by_abs function (see this answer). The sanitiseHTML function is completely rewritten though, in order to achieve maximum efficiency and reliability.

Additionally, a new set of RegExps is added to remove all scripts and event handlers (including CSS expression(), IE7-). To make sure that all tags are parsed as expected, the adjusted tags are prefixed by <!--"'-->. This prefix is necessary to correctly parse nested "event handlers" in conjunction with unterminated quotes: <a id="><input onclick="<div onmousemove=evil()>">.

These RegExps are dynamically created using an internal function cr/cri (Create Replace [Inline]). These functions accept a list of arguments, and create and execute an advanced RE replacement. To make sure that HTML entities aren't breaking a RegExp (refresh in <meta http-equiv=refresh> could be written in various ways), the dynamically created RegExps are partially constructed by function ae (Any Entity).
The actual replacements are done by function by (replace by). In this implementation, by adds data- before all matched attributes.
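To illustrate the idea behind ae, here is a simplified sketch rather than the exact implementation above. The name entityTolerant is hypothetical, and the input is assumed to consist of plain ASCII letters only. It builds a pattern source in which every letter may also be written as a decimal or hexadecimal character reference:

```javascript
// Simplified sketch of the "Any Entity" idea. entityTolerant is a
// hypothetical name; only plain ASCII letters are assumed as input.
function entityTolerant(word) {
    // A character reference ends with ";" or is followed by a non-digit.
    var entityEnd = "(?:;|(?!\\d))";
    var source = "";
    for (var i = 0; i < word.length; i++) {
        var lower = word.charAt(i).toLowerCase();
        var upper = lower.toUpperCase();
        var alternatives = [
            lower,
            "&#0*" + lower.charCodeAt(0) + entityEnd,
            "&#x0*" + lower.charCodeAt(0).toString(16) + entityEnd
        ];
        if (upper !== lower) {
            alternatives.push("&#0*" + upper.charCodeAt(0) + entityEnd);
            alternatives.push("&#x0*" + upper.charCodeAt(0).toString(16) + entityEnd);
        }
        source += "(?:" + alternatives.join("|") + ")";
    }
    return source;
}

// Matches "refresh" however its letters are spelled.
var refreshRE = new RegExp("^" + entityTolerant("refresh") + "$", "i");
```

With this pattern, refreshRE accepts the plain word as well as spellings such as "&#114;efresh" or "&#x72;efresh", which is why the real filter cannot simply look for the literal string.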

All <script>//<[CDATA[ .. //]]></script> occurrences are stripped. This step is necessary, because CDATA sections allow </script> strings inside the code. After this replacement has been executed, it's safe to go to the next replacement:
The remaining <script>...</script> tags are removed.
The <meta http-equiv=refresh .. > tag is removed
All event listeners and external pointers/attributes (href, src, url()) are prefixed by data-, as described previously.
An IFrame object is created. IFrames are less likely to leak memory (contrary to the htmlfile ActiveXObject). The IFrame is made invisible and appended to the document, so that the DOM can be accessed. document.write() is used to write the HTML to the IFrame. document.open() and document.close() are used to empty the previous contents of the document, so that the generated document is an exact copy of the given html string.
If a callback function has been specified, the function will be called with two arguments. The first argument is a reference to the generated document object. The second argument is a function, which destroys the generated DOM tree when called. This function should be called when you don't need the tree any more.
If the callback function isn't specified, the function returns an object consisting of two properties (doc and destroy), which behave the same as the previously mentioned arguments.
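The ordering of the first two steps can be sketched in isolation. This is a reduced illustration, not the full sanitiser: stripScripts is a hypothetical helper name, the any pattern is simplified to [^>]*, and the optional "!" in the CDATA marker is an addition of this sketch so that the standard <![CDATA[ spelling is matched as well.

```javascript
// Reduced sketch of the two script-stripping steps (hypothetical helper).
function stripScripts(html) {
    // Step 1: scripts whose body is wrapped in a commented CDATA section.
    // These go first, because their body may contain the text "</script>".
    // Assumption: "!?" is added here to also accept "<![CDATA[".
    html = html.replace(
        /<script[^>]*>\s*\/\/\s*<!?\[CDATA\[[\S\s]*?\]\]>\s*<\/script[^>]*>/gi,
        "<!--CDATA script-->"
    );
    // Step 2: every remaining <script>...</script> pair.
    return html.replace(/<script[\S\s]+?<\/script\s*>/gi, "<!--Non-CDATA script-->");
}
```

If step 2 ran first, a CDATA script containing "</script>" inside a string would be cut off at that inner "</script>", leaving the rest of the script body behind as text.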

Additional notes

Setting the designMode property to "On" will stop a frame from executing scripts (not supported in Chrome). If you have to preserve the <script> tags for a specific reason, you can set designMode = "On" on the frame's document (doc in string2dom) instead of using the script stripping feature.
I wasn't able to find a reliable source for the htmlfile ActiveXObject. According to this source, htmlfile is slower than IFrames, and more susceptible to memory leaks.
All affected attributes (href, src, ...) are prefixed by data-. An example of getting/changing these attributes is shown for data-href:
elem.getAttribute("data-href") and elem.setAttribute("data-href", "...")
elem.dataset.href and elem.dataset.href = "...".
External resources have been disabled. As a result, the page may look completely different:
~~<link rel="stylesheet" href="main.css" />~~ No external styles
~~<script>document.body.bgColor="red";</script>~~ No scripted styles
<img src="128x128.png" /> No images: the size of the element may be completely different.
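For the data- prefixed attributes mentioned in the notes above, a small accessor can hide the renaming. This is only a sketch: readPointer is a hypothetical helper, and it assumes that getAttribute returns null for a missing attribute, which old IE versions do not guarantee for every attribute.

```javascript
// readPointer is a hypothetical helper. It returns the original value of an
// attribute that sanitiseHTML renamed (for example href -> data-href) and
// falls back to the unprefixed attribute when no prefixed one exists.
// Assumption: getAttribute returns null for attributes that are not set.
function readPointer(elem, name) {
    var value = elem.getAttribute("data-" + name);
    return value !== null ? value : elem.getAttribute(name);
}
```

Only getAttribute is used here, so the helper does not depend on elem.dataset being available.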

Examples

sanitiseHTML(html)
Paste this bookmarklet in the location bar. It will offer an option to inject a textarea, showing the sanitised HTML string.

javascript:void(function(){var s=document.createElement("script");s.src="http://rob.lekensteyn.nl/html-sanitizer.js";document.body.appendChild(s)})();

Code examples - string2dom(html):

string2dom("<html><head><title>Test</title></head></html>", function(doc, destroy){
    alert(doc.title); /* Alert: "Test" */
    destroy();
});

var test = string2dom("<div id='secret'></div>");
alert(test.doc.getElementById("secret").tagName); /* Alert: "DIV" */
test.destroy();
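Applied to the goal stated in the question (searching a fetched page for forms), the returned document can be queried as in the following sketch. collectForms is a hypothetical helper; it only assumes that its argument offers getElementsByTagName and that elements offer getAttribute. As far as the patterns in sanitiseHTML go, the action attribute is not among the renamed ones, so it is read directly.

```javascript
// collectForms is a hypothetical helper that lists the forms in a document.
function collectForms(doc) {
    var forms = doc.getElementsByTagName("form");
    var result = [];
    for (var i = 0; i < forms.length; i++) {
        result.push({
            action: forms[i].getAttribute("action"),
            // Default to "get" when no method attribute is present.
            method: (forms[i].getAttribute("method") || "get").toLowerCase()
        });
    }
    return result;
}

// Usage together with string2dom (browser only):
//   var parsed = string2dom(html);
//   var forms = collectForms(parsed.doc);
//   parsed.destroy();
```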

Notable references

SO: JS RE to change all relative to absolute URLs - Function sanitiseHTML(html) is based on my previously created replace_all_rel_by_abs(html) function.
Elements - Embedded content - A full list of standard embedded elements
Elements - Previous HTML elements - An additional list of (deprecated) elements (such as <applet>)
The htmlfile ActiveX object - "Slower than iframe sandboxes. Leaks memory if not managed"

Tonetic answered 24/9, 2011 at 13:4 Comment(18)

+1 from me, too. I had already come to the same conclusions regarding fragments (after extensive researching and testing). The interesting part is setting designMode to on to prevent script execution. Anyway, thanks a lot... this is more the kind of answer I was after. The only real shame is the many potential holes so I need to give this a bit more thought. – Loera 24/9, 2011 at 16:42

See point 3 (+ the corresponding replace function) and the first two references at the end. If you're absolutely certain that a certain tag (<applet>?) won't appear, there's no need to implement it. If you don't have to keep the embedded elements for a specific goal, removing them through a RE is easy. Eg.: .replace(/<object[\S\s]+?<\/object\s*>/gi, ""). Some embedded objects may have an omitted close tag. In that case, use: .replace(/<embed[^>]+>[\S\s]*?<\/embed\s*>/gi, "").replace(/<embed[^>]*>/gi, ""). – Tonetic 24/9, 2011 at 17:13

@AndyE I'm currently at an advanced stage of developing a function to reliably parse external sources. Does it matter if (example) the src attribute has to be referenced through data-src instead of src? I could modify native Element objects, although they're not very reliable. – Tonetic 28/9, 2011 at 21:33

+1 Great answer. Also, would it be valuable to fix <stylesheet> and/or <style>? They could potentially have expressions or -moz-behaviors. – Mentor 29/9, 2011 at 2:8

@lime Adding filters for behavior, -moz-binding or expression() will rarely lead to the desired results, because external stylesheets cannot be validated (unless all resources are going through a proxy) JS-based proxy: see the third number at my profile's list. – Tonetic 29/9, 2011 at 7:5

@RobW: I guess it doesn't really matter. -moz-binding isn't applicable to me and designMode should stop any expressions from executing. – Loera 29/9, 2011 at 8:47

@AndyE What's your answer regarding "Does it matter if (example) the src attribute has to be referenced through data-src instead of src?"? (4 comments before this one). – Tonetic 29/9, 2011 at 8:51

@Rob: no, I don't think that matters either. – Loera 29/9, 2011 at 8:53

@RobW I was referring to changing the href to data-href and removing <style>s altogether. Yeah, proxying the <stylesheet>s would be really annoying ;) – Mentor 29/9, 2011 at 15:19

@Rob - jsfiddle.net/JFSKe/2 is a trivial attack on your sanitizer. I know of at least one other trivial way to defeat it and I'm not even vaguely an XSS expert. – Stevestevedore 29/9, 2011 at 20:12

I have updated my answer, and merged my solution for this question with the previous string2dom function. @Stevestevedore Check my new code ;) – Tonetic 29/9, 2011 at 21:42

@RobW: a quick scan over your code indicates that, if present, the <base> tag isn't honoured for those relative URLs. – Loera 29/9, 2011 at 22:7

@Rob - Your code doesn't seem to be sanitizing on* attributes correctly at all now. This input "<html><head><title>Test</title></head><body onload='alert(\"XSS\")'></html>" displays an "XSS" alert. I strongly recommend that you build yourself a very thorough test suite. – Stevestevedore 30/9, 2011 at 0:7

@Stevestevedore Change on to on[-a-z0-9:_.]+. The event listener selector should now correctly be replaced. @AndyE, <base> tags are selected through the href RE. – Tonetic 30/9, 2011 at 8:16

@Rob - Try this: "<html><head><title>Test</title></head><body onx='>' onload='alert(\"XSS\")'></html>" – Stevestevedore 30/9, 2011 at 12:11

@Stevestevedore Thanks. I've "just" rewritten my sanitiseHTML function. – Tonetic 30/9, 2011 at 16:31

@BoltClock Regarding your edit. I have reverted your edit, because the white-space between the list items is intended. It separates the blocks into different sections. – Tonetic 23/10, 2011 at 15:51

I used the pure iframe trick and was surprised to discover that it wasn't always synchronous in Chrome 46! When parsing a complex page, I actually had to wait (setTimeout) before the <body> element appeared inside the DOM. Presumably the page wanted to load something over the network before it could fully realise. However in Firefox it was synchronous. – Kalagher 16/11, 2015 at 1:15

Not sure why you're messing with documentFragments, you can just set the HTML text as the innerHTML of a new div element. Then you can use that div element for getElementsByTagName etc without adding the div to DOM:

var htmlText= '<html><head><title>Test</title></head><body><div id="test_ele1">this is test_ele1 content</div><div id="test_ele2">this is test_ele content2</div></body></html>';

var d = document.createElement('div');
d.innerHTML = htmlText;

console.log(d.getElementsByTagName('div'));

If you're really married to the idea of a documentFragment, you can use this code, but you'll still have to wrap it in a div to get the DOM functions you're after:

function makeDocumentFragment(htmlText) {
    var range = document.createRange();
    var frag = range.createContextualFragment(htmlText);
    var d = document.createElement('div');
    d.appendChild(frag);
    return d;
}

Skyeskyhigh answered 19/9, 2011 at 19:6 Comment(2)

This strips out the <head> element before appending to the newly created div. I know I didn't specify that I need stuff from the head too, but I do (specifically <link> elements). I'm messing with document fragments as it seems like the most likely method to work if this is possible. createContextualFragment doesn't help me, it's not supported in IE. – Loera 19/9, 2011 at 19:43

I researched this quite a bit - without access to stuff like developer.mozilla.org/En/DOM/DOMImplementation.createDocument and without using an iFrame, there really isn't another way to do this strictly client-side. Wasn't sure about the support for Range/createContextualFragment in IE 7, but after I got to looking at the results I realized that it isn't any different than just plunking the HTML into a new div element. Since document fragments don't have the DOM functions you want and divs cannot validly contain HTML/BODY, I am not sure what option you have. – Skyeskyhigh 19/9, 2011 at 19:46

I'm not sure if IE supports document.implementation.createHTMLDocument, but if it does, use this algorithm (adapted from my DOMParser HTML extension). Note that the DOCTYPE will not be preserved:

var
      doc = document.implementation.createHTMLDocument("")
    , doc_elt = doc.documentElement
    , first_elt
;
doc_elt.innerHTML = your_html_here;
first_elt = doc_elt.firstElementChild;
if ( // are we dealing with an entire document or a fragment?
       doc_elt.childElementCount === 1
    && first_elt.tagName.toLowerCase() === "html"
) {
    doc.replaceChild(first_elt, doc_elt);
}

// doc is an HTML document
// you can now reference stuff like doc.title, etc.

Empirin answered 25/9, 2011 at 19:35 Comment(1)

IE 9 supports it but IE 8 and lower don't, unfortunately. – Loera 25/9, 2011 at 23:10
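Given that support differs between IE versions, a feature test can decide between this approach and another one, such as the iframe technique. A minimal sketch, with canCreateHTMLDocument as a hypothetical helper name:

```javascript
// Hypothetical helper: reports whether the given document object exposes
// document.implementation.createHTMLDocument.
function canCreateHTMLDocument(doc) {
    return !!(doc && doc.implementation &&
              typeof doc.implementation.createHTMLDocument === "function");
}

// Usage (browser only):
//   if (canCreateHTMLDocument(document)) { /* use the algorithm above */ }
//   else { /* fall back to another approach */ }
```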

Assuming the HTML is valid XML too, you may use loadXML().

Marvismarwin answered 19/9, 2011 at 18:33 Comment(1)

I can't assume that, unfortunately. The HTML loaded could (in theory) be from any site on the web. – Loera 19/9, 2011 at 18:34

DocumentFragment doesn't support getElementsByTagName -- that's only supported by Document.

You may need to use a library like jsdom, which provides an implementation of the DOM and through which you can search using getElementsByTagName and other DOM APIs. And you can set it to not execute scripts. Yes, it's 'heavy' and I don't know if it works in IE 7.

Aeneid answered 23/9, 2011 at 20:51 Comment(2)

Interesting... IE supports getElementsByTagName for document fragments (which is what I based that point on in my question). – Loera 23/9, 2011 at 21:19

Odd, but I guess I shouldn't be surprised that IE doesn't follow the spec. Here's a discussion that implies that createDocumentFragment on IE actually creates a Document rather than DocumentFragment, which would explain why it supports getElementsByTagName. – Aeneid 23/9, 2011 at 21:26

Just wandered across this page, am a bit late to be of any use :) but the following should help anyone with a similar problem in future... however IE7/8 should really be ignored by now and there are much better methods supported by the more modern browsers.

The following works across nearly everything I've tested - the only two downsides are:

I've added bespoke getElementById and getElementsByName functions to the root div element, so these won't appear as expected further down the tree (unless the code is modified to cater for this).
The doctype will be ignored - however I don't think this will make much difference, as my experience is that the doctype won't affect how the DOM is structured, just how it is rendered (which obviously won't happen with this method).

Basically the system relies on the fact that <tag> and <namespace:tag> are treated differently by user agents. As has been found, certain special tags cannot exist within a div element, and so they are removed. Namespaced elements can be placed anywhere (unless there is a DTD stating otherwise). Whilst these namespaced tags won't actually behave as the real tags in question, we are only really using them for their structural position in the document, so this doesn't cause a problem.

Markup and code are as follows:

<!DOCTYPE html>
<html>
<head>
<script>

  /// function for parsing HTML source to a dom structure
  /// Tested in Mac OSX, Win 7, Win XP with FF, IE 7/8/9, 
  /// Chrome, Safari & Opera.
  function parseHTML(src){

    /// create a random div, this will be our root
    var div = document.createElement('div'),
        /// specify our namespace prefix
        ns = 'faux:',
        /// state which tags we will treat as "special"
        stn = ['html','head','body','title'],
        /// the reg exp for replacing the special tags
        re = new RegExp('<(/?)('+stn.join('|')+')([^>]*)?>','gi'),
        /// remember the getElementsByTagName function before we override it
        gtn = div.getElementsByTagName;

    /// a quick function to namespace certain tag names
    var nspace = function(tn){
      if ( stn.indexOf ) {
        return stn.indexOf(tn) != -1 ? ns + tn : tn;
      }
      else {
        return ('|'+stn.join('|')+'|').indexOf('|'+tn+'|') != -1 ? ns + tn : tn;
      }
    };

    /// search and replace our source so that special tags are namespaced
    /// &nbsp; required for IE7/8 to render tags before first text found
    /// <faux:check /> tag added so we can test how namespaces work
    src = '&nbsp;<'+ns+'check />' + src.replace(re,'<$1'+ns+'$2$3>');
    /// inject to the div
    div.innerHTML = src;
    /// quick test to see how we support namespaces in TagName searches
    if ( !div.getElementsByTagName(ns+'check').length ) {
      ns = '';
    }

    /// create our replacement getByName and getById functions
    var createGetElementByAttr = function(attr, collect){
      var func = function(a,w){
        var i,c,e,f,l,o; w = w||[];
        if ( this.nodeType == 1 ) {
          if ( this.getAttribute(attr) == a ) {
            if ( collect ) {
              w.push(this);
            }
            else {
              return this;
            }
          }
        }
        else {
          return false;
        }
        if ( (c = this.childNodes) && (l = c.length) ) {
          for( i=0; i<l; i++ ){
            if( (e = c[i]) && (e.nodeType == 1) ) {
              if ( (f = func.call( e, a, w )) && !collect ) {
                return f;
              }
            }
          }
        }
        return (w.length?w:false);
      }
      return func;
    }

    /// apply these replacement functions to the div container, obviously 
    /// you could add these to prototypes for browsers that support element 
    /// constructors. For other browsers you could step each element and 
    /// apply the functions through-out the node tree... however this would  
    /// be quite messy, far better just to always call from the root node - 
    /// or use div.getElementsByTagName.call( localElement, 'tag' );
    div.getElementsByTagName = function(t){return gtn.call(this,nspace(t));}
    div.getElementsByName    = createGetElementByAttr('name', true);
    div.getElementById       = createGetElementByAttr('id', false);

    /// return the final element
    return div;
  }

  window.onload = function(){

    /// parse the HTML source into a node tree
    var dom = parseHTML( document.getElementById('source').innerHTML );

    /// test some look ups :)
    var a = dom.getElementsByTagName('head'),
        b = dom.getElementsByTagName('title'),
        c = dom.getElementsByTagName('script'),
        d = dom.getElementById('body');

    /// alert the result
    alert(a[0].innerHTML);
    alert(b[0].innerHTML);
    alert(c[0].innerHTML);
    alert(d.innerHTML);

  }
</script>
</head>
<body>
  <xmp id="source">
    <!DOCTYPE html>
    <html>
    <head>
      <!-- Comment //-->
      <meta charset="utf-8">
      <meta name="robots" content="index, follow">
      <title>An example</title>
      <link href="test.css" />
      <script>alert('of parsing..');</script>
    </head>
    <body id="body">
      <b>in a similar way to createDocumentFragment</b>
    </body>
    </html>
  </xmp>
</body>
</html>
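The renaming step that parseHTML relies on can be looked at on its own. The sketch below is a variation, not the exact expression used above: namespaceSpecialTags is a hypothetical helper, and the (?=[\s>/]) lookahead is an addition so that a tag such as <header> is not mistaken for <head>.

```javascript
// Sketch of the renaming step only (hypothetical helper).
// Assumption: ns contains no "$" characters, since it is placed in the
// replacement string as-is.
function namespaceSpecialTags(src, ns) {
    var special = ["html", "head", "body", "title"];
    // The lookahead requires the tag name to end right after the match.
    var re = new RegExp("<(/?)(" + special.join("|") + ")(?=[\\s>/])([^>]*)>", "gi");
    return src.replace(re, "<$1" + ns + "$2$3>");
}
```

Because the renamed tags are unknown to the HTML parser, they survive being assigned through innerHTML on a div, which is what keeps the head/body structure searchable.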

Vegetative answered 13/10, 2012 at 1:8 Comment(0)

To use full HTML DOM abilities without triggering requests, without having to deal with incompatibilities:

var doc = document.cloneNode();
if (!doc.documentElement) {
    doc.appendChild(doc.createElement('html'));
    doc.documentElement.appendChild(doc.createElement('head'));
    doc.documentElement.appendChild(doc.createElement('body'));
}

All set! doc is an HTML document, but it is not online.

Smallwood answered 25/1, 2018 at 14:41 Comment(0)
