How do I get the original innerHTML source without the Javascript generated contents?
G

9

24

Is there any way to get the original HTML source, without the changes made by JavaScript that has already run? For example, given this markup:

<div id="test">
    <script type="text/javascript">document.write("hello");</script>
</div>

If I then run:

alert(document.getElementById('test').innerHTML);

it shows:

<script type="text/javascript">document.write("hello");</script>hello

In simple terms, I would like the alert to show only:

<script type="text/javascript">document.write("hello");</script>

without the final hello (the result of the processed script).

Graviton answered 9/12, 2010 at 11:14 Comment(9)
In which browser did you test this? In FF4b7 and Chrome 8 I get <script type="text/javascript">document.write("hello");</script>hello – Picaroon
@Marcel: IE7 and IE8 (also IE6) – Graviton
@Marcel: I updated the question, I forgot a piece. Sorry for that. – Graviton
And I fear you don't know in advance what text is added, do you? – Picaroon
@Marcel: what do you mean? The text added in the example is hello because it's created by the document.write("hello"). I'm looking for a general-purpose solution that does not depend on the code inside the DIV, something that always returns the original source code without the modifications made by the JavaScript engine. – Graviton
Yeah, that's what I feared. But when elements are added to the DOM, there's no way to distinguish between original markup and dynamically added elements/nodes (unless you mark them as such), at least not as far as I know. – Picaroon
Why do you need to do this? I'm sure there's a workaround to whatever you're trying to do if you tell us what that is. – Architrave
@musicfreak: let's say you have a simple CMS where the end user can change the innerHTML of DIVs on the page using JavaScript, and then, when the user saves the page, the innerHTML of each DIV is sent to the server to be stored in the DB. When the innerHTML contains <script>, it gets corrupted and is saved to the DB that way. – Graviton
It's a bit of a hack but why not just download the current url using AJAX? You should get the original source with a couple of caveats (POST data would be ignored and anything random or time-dependent might be different) – Ploce
T
6

I don't think there's a simple solution to just "grab original source" as it'll have to be something that's supplied by the browser. But, if you are only interested in doing this for a section of the page, then I have a workaround for you.

You can wrap the section of interest inside a "frozen" script:

<script id="frozen" type="text/x-frozen-html">

I made up the type attribute; because the browser does not recognise it as a script type, it neither executes nor renders anything inside the tag. You then add another script tag (real JavaScript this time) immediately after it: the "thawing" script. The thawing script looks up the frozen script by ID, reads the text inside it, and calls document.write to add the actual contents to the page. Whenever you need the original source, it is still available as text inside the frozen script.

And there you have it. The downside is that I wouldn't use this for the whole page... (SEO, syntax highlighting, performance...) but it's quite acceptable if you have a special requirement on part of a page.


Edit: Here is some sample code. Also, as @FlashXSFX correctly pointed out, any script tags within the frozen script will need to be escaped. So in this simple example, I'll make up a <x-script> tag for this purpose.

<script id="frozen" type="text/x-frozen-html">
   <div id="test">
      <x-script type="text/javascript">document.write("hello");</x-script>
   </div>
</script>
<script type="text/javascript">
   // Grab contents of frozen script and replace `x-script` with `script`
   function getSource() {
      return document.getElementById("frozen")
         .innerHTML.replace(/x-script/gi, "script");
   }
   // Write it to the document so it actually executes
   document.write(getSource());
</script>

Now whenever you need the source:

alert(getSource());

See the demo: http://jsbin.com/uyica3/edit

Terbecki answered 10/12, 2010 at 0:25 Comment(2)
Could you please show a short piece of code? I don't understand. – Graviton
I thought that this might actually work, so I gave it a go. The main problem I saw was when you try to put script tags inside the frozen tag (I used the original poster's snippets). You will need to do some escaping and some string replacing to get that to work. – Papism
C
4

A simple way is to fetch it from the server again. It will most probably be in the cache. Here is my solution using jQuery.get(). It takes the original URI of the page and loads the data with an Ajax call:

$.get(document.location.href, function(data,status,jq) {console.log(data);})

This prints the source as the server sent it, before any JavaScript has modified it. Note that the snippet does no error handling.

If you don't want to use jQuery to fetch the source, see the answers to the question "How to make an ajax call without jquery?".
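
A minimal sketch of the same idea without jQuery, assuming a browser that supports fetch. The function name fetchOriginalSource and the injectable fetchImpl parameter are my own additions for illustration, not part of any library. The caveats from the comments on the question still apply: POST data is not replayed, and random or time-dependent content may differ.

```javascript
// Hypothetical helper: re-download the page and return the markup as sent by the server.
// `fetchImpl` defaults to the global fetch; passing it in makes the function easy to test.
async function fetchOriginalSource(url, fetchImpl) {
    var doFetch = fetchImpl || fetch;
    var response = await doFetch(url);
    if (!response.ok) {
        throw new Error("Request failed with status " + response.status);
    }
    // The body is the raw HTML, before any script on the page has run
    return response.text();
}

// In a browser:
// fetchOriginalSource(document.location.href).then(function (html) { console.log(html); });
```

Unlike the one-liner above, this rejects on a non-2xx response, so the caller can decide what to do when the page cannot be fetched again.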

Coaler answered 8/8, 2014 at 0:3 Comment(1)
Excellent idea! I had an issue where scraping a site without a web browser was impossible, but at the same time the site was destroying some data (which I needed) after loading up. With this approach, the slow and inefficient loading is done once, whereas the actual reading of the site html is very efficiently done from the same browser session, so it solves two problems at once. – Shiloh
A
2

Could you send an Ajax request to the same page you're currently on and use the result as your original HTML? This is foolproof given the right conditions, since you are literally getting the original HTML document. However, this won't work if the page changes on every request (with dynamic content), or if, for whatever reason, you cannot make a request to that specific page.

Architrave answered 16/12, 2010 at 2:41 Comment(0)
B
1

Brute force approach

var orig = document.getElementById("test").innerHTML;
alert(orig.replace(/<\/script>[.\n\r]*.*/i,"</script>"));

EDIT:

This could be better

var orig = document.getElementById("test").innerHTML + "<<>>";
alert(orig.replace( /<\/script>[^(<<>>)]+<<>>/i, "<\/script>"));
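
A variant of the same brute-force idea, written as a standalone function so it can be tried outside a page. The name stripAfterScript is hypothetical, and the sketch assumes the generated content is everything that follows the first closing script tag in the element:

```javascript
// Hypothetical helper: drop everything that follows the first closing script tag.
// [\s\S] matches any character, including newlines and tags. Inside a character
// class a dot is a literal dot, so [.\n\r] does not mean "any character".
function stripAfterScript(html) {
    return html.replace(/(<\/script>)[\s\S]*$/i, "$1");
}

// In a browser:
// alert(stripAfterScript(document.getElementById("test").innerHTML));
```

It is still not a general solution: markup that originally followed the script inside the element is removed as well.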
Burgee answered 10/12, 2010 at 5:45 Comment(2)
Besides the fact that you forgot a slash (it should be replace(/<\/script>[.\n\r]*.*/i,"<\/script>")) and that I don't understand why you placed a dot inside the [.\n\r], it may still be a good attempt and a possible approach, so +1. It's still very specific, though: if I add a simple new line, document.write("hello\nchina");, your regex would remove only hello and leave china where it is. – Graviton
@Marco, thanks for correcting the regex. As I said it is a brute force approach (not an elegant/generic one). – Burgee
P
0

If you override document.write to add some identifiers at the beginning and end of everything written to the document by the script, you will be able to remove those writes with a regular expression.

Here's what I came up with:

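    <script type="text/javascript">
        // Keep a reference to the native document.write
        var docWrite = document.write;
        document.write = myDocWrite;

        // Wrap everything written in marker comments so it can be stripped out later
        function myDocWrite() {
            // document.write accepts several arguments and concatenates them
            var wrt = Array.prototype.join.call(arguments, '');
            docWrite.call(document, '<!--docwrite-->' + wrt + '<!--/docwrite-->');
        }
    </script>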

Added your example somewhere in the page after the initial script:

    <div id="test">
        <script type="text/javascript">     document.write("hello");</script>
    </div>

Then I used this to alert what was inside:

    // [\s\S]*? also matches newlines, which . does not
    var regEx = /<!--docwrite-->[\s\S]*?<!--\/docwrite-->/g;
    alert(document.getElementById('test').innerHTML.replace(regEx, ''));
Papism answered 10/12, 2010 at 21:4 Comment(1)
Please be more specific. Original post was asking how to use document.write, and still get the original source. – Papism
B
0

If you want the pristine document, you'll need to fetch it again. There's no way around that. If it weren't for the document.write() (or similar code that would run during the load process) you could load the original document's innerHTML into memory on load/domready, before you modify it.
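
The snapshot idea in the last sentence could be sketched like this. snapshotSources and the ids list are hypothetical names, and, as noted above, anything written by document.write during parsing is already part of the snapshot:

```javascript
// Hypothetical helper: record the innerHTML of the given elements before later scripts change them.
// `doc` is passed in (normally `document`) so the function does not depend on a global.
function snapshotSources(doc, ids) {
    var snapshots = {};
    for (var i = 0; i < ids.length; i++) {
        var el = doc.getElementById(ids[i]);
        if (el) {
            snapshots[ids[i]] = el.innerHTML;
        }
    }
    return snapshots;
}

// In a browser, take the snapshot once the DOM has been parsed:
// var originals;
// document.addEventListener("DOMContentLoaded", function () {
//     originals = snapshotSources(document, ["test"]);
// });
```

Because innerHTML returns a string, the snapshot is unaffected by later changes to the elements.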

Bunt answered 10/12, 2010 at 21:39 Comment(0)
T
0

I can't think of a solution that would work the way you're asking. JavaScript can only see the page through the DOM, which contains the result after the page has been processed, not the original markup.

The closest I can think of to achieve what you want is to use Ajax to download a fresh copy of the raw HTML for your page into a Javascript string, at which point since it's a string you can do whatever you like with it, including displaying it in an alert box.

Togetherness answered 10/12, 2010 at 22:0 Comment(0)
A
0

A trickier approach is to use a <style> tag as the template container, so that you no longer need to rename the inner tags to x-script.

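<style id="test" type="text/html+template">
    <script type="text/javascript">document.write("hello");</script>
</style>
<script type="text/javascript">
    // The browser treats the <style> body as raw text, so the inner script never runs
    console.log(document.getElementById('test').innerHTML);
</script>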

But I do not like this ugly solution.

Accountancy answered 2/1, 2018 at 2:21 Comment(0)
P
-1

I think you want to traverse the DOM nodes:

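function getScriptSources(id) {
    var childNodes = document.getElementById(id).childNodes, i, output = [];

    // Keep only the <script> children; text nodes added by document.write are skipped
    for (i = 0; i < childNodes.length; i++) {
        if (childNodes[i].nodeName == "SCRIPT") {
            output.push(childNodes[i].innerHTML);
        }
    }

    // Note: this returns the code inside each script, not the surrounding tags
    return output.join('');
}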
Picaroon answered 9/12, 2010 at 11:34 Comment(0)
