Find body tag in an ajax HTML response
I'm making an ajax call to fetch content and append this content like this:

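$(function(){
    var site = $('input').val();
    $.get('file.php', { site: site }, function(data){
        // collect every link found in the parsed response
        var mas = $(data).find('a');
        // .each() calls back with (index, element); `this` is the current element
        mas.each(function() {
            var divs = $(this).html();
            $('#result').append('' + divs + '');
        });
    }, 'html');
});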

The problem is that when I change 'a' to 'body' I get nothing (no error, just no HTML). I'm assuming body is a tag just like 'a' is? What am I doing wrong?

So this works for me:

 mas = $(data).find('a');

But this doesn't:

 mas = $(data).find('body');
Insectarium answered 20/1, 2013 at 9:34 Comment(7)
Please add a sample response you're getting from querying file.phpThrenody
@Threnody You mean my console log?Insectarium
It can be console.log(data) or anything that shows the complete string you received with the ajax call.Threnody
I just checked, with simplified code, and different pages, and can confirm I am experiencing the same issue. It works to select elements within the body but not to select the body itself.Chronological
@Threnody I'm not sure, but I think it has to be a URL (from input.val). This could be any URL.Insectarium
yes, so please add the sample response to your post. Boaz's tip might help in your case (when body is the first/root tag in your response), but to be sure, we need to know what the response looks like.Threnody
@Threnody maybe this: mywebsite.com/file.php?site=http%3A%2F%2Fnu.nl I'm not sure how this can help..Insectarium
H
12

Parsing the returned HTML through a jQuery object (i.e. $(data)) in order to get the body tag is doomed to fail, I'm afraid.

The reason is that the returned data is a string (try console.log(typeof(data))). Now, according to the jQuery documentation, when creating a jQuery object from a string containing complex HTML markup, tags such as body are likely to get stripped. This happens because, in order to create the object, the markup is parsed through the browser's innerHTML mechanism inside a newly created element, and an element's content cannot contain additional html, head or body tags.

Relevant quote from the documentation:

If a string is passed as the parameter to $(), jQuery examines the string to see if it looks like HTML.

[...] If the HTML is more complex than a single tag without attributes, as it is in the above example, the actual creation of the elements is handled by the browser's innerHTML mechanism. In most cases, jQuery creates a new element and sets the innerHTML property of the element to the HTML snippet that was passed in. When the parameter has a single tag (with optional closing tag or quick-closing), such as $( "<img />" ) or $( "<img>" ), $( "<a></a>" ) or $( "<a>" ), jQuery creates the element using the native JavaScript createElement() function.

When passing in complex HTML, some browsers may not generate a DOM that exactly replicates the HTML source provided. As mentioned, jQuery uses the browser's .innerHTML property to parse the passed HTML and insert it into the current document. During this process, some browsers filter out certain elements such as <html>, <title>, or <head> elements. As a result, the elements inserted may not be representative of the original string passed.

Holloway answered 20/1, 2013 at 9:39 Comment(3)
If you find a relevant workaround, post it as an answer as well.Holloway
I disagree that it's doomed to fail! The solution that I've posted to this answer works perfectly and is as convenient as anything else in jquery.Biron
@GershomMaes The issue raised by the OP is about directly parsing the returned HTML string. Your solution, while being a neat trick, works around this issue by indirectly parsing the HTML string as an XML document first. This does not negate the fact that directly parsing the HTML strips the body tag.Holloway
R
13

I ended up with this simple solution:

var body = data.substring(data.indexOf("<body>")+6,data.indexOf("</body>"));
$('body').html(body);

Works also with head or any other tag.

(A solution with xml parsing would be nicer but with an invalid XML response you have to do some "string parsing".)
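
The substring approach above assumes a bare <body> opening tag. A minimal sketch of the same string-parsing idea that also tolerates attributes on the opening tag could look like the following. The helper name extractTagContent is hypothetical (it is not a jQuery function), it assumes tagName is a plain tag name with no regex metacharacters, and it will still be fooled by a ">" inside an attribute value or by nested tags of the same name.

```javascript
// Hypothetical helper, not part of jQuery: returns the markup between the
// first <tagName ...> and the next </tagName>, or null if either is missing.
function extractTagContent(html, tagName) {
    // Opening tag: the name followed directly by ">" or by whitespace and
    // attributes. This keeps "head" from matching "<header>".
    var openRe = new RegExp('<' + tagName + '(?:\\s[^>]*)?>', 'i');
    var closeRe = new RegExp('</' + tagName + '\\s*>', 'i');

    var open = openRe.exec(html);
    if (!open) return null;

    // Search for the closing tag only after the end of the opening tag.
    var rest = html.slice(open.index + open[0].length);
    var close = closeRe.exec(rest);
    if (!close) return null;

    return rest.slice(0, close.index);
}

var sample = "<html><body lang=EN-GB style='tab-interval:36.0pt'><p>Hi</p></body></html>";
console.log(extractTagContent(sample, 'body')); // → <p>Hi</p>
```

In the browser the result could then be used the same way as before, e.g. $('body').html(extractTagContent(data, 'body')), after checking for null.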

Rocky answered 2/2, 2016 at 15:44 Comment(1)
That won't work if the body tag has anything extra like you get from MS Word, e.g. <body lang=EN-GB style='tab-interval:36.0pt'>.Corduroy
C
6

I experimented a little and have narrowed down the cause to a point. Pending a real answer, which I would be interested in, here is a hack that helps illustrate the issue:

$.get('/',function(d){
    // replace the `HTML` tags with `NOTHTML` tags
    // and the `BODY` tags with `NOTBODY` tags
    // (String.prototype.replace takes two arguments: pattern and replacement)
    d = d.replace(/(<\/?)html( .+?)?>/gi,'$1NOTHTML$2>');
    d = d.replace(/(<\/?)body( .+?)?>/gi,'$1NOTBODY$2>');
    // select the `notbody` tag and log for testing
    console.log($(d).find('notbody').html());
});

Edit: further experimentation

It seems it is possible if you load the content into an iframe, then you can access the frame content through some dom object hierarchy...

// get a page using AJAX
// (note: the response `d` is not used below; the iframe loads the URL itself)
$.get('/',function(d){

    // create a temporary `iframe`, make it hidden, and attach to the DOM
    var frame = $('<iframe id="frame" src="/" style="display: none;"></iframe>').appendTo('body');

    // wait until the frame has loaded its content
    // (.on('load', ...) replaces the .load(fn) event shorthand, which jQuery 3 removed)
    frame.on('load', function(){

        // grab the HTML from the body, using the raw DOM node (frame[0])
        // and more specifically, its `contentDocument` property
        var html = $('body',frame[0].contentDocument).html();

        // check the HTML
        console.log(html);

        // remove the temporary iframe
        frame.remove();

    });
});

Edit: more research

It seems that contentDocument is the standards-compliant way to get hold of the window.document object of an iframe, but of course IE doesn't really care for standards, so this is how to get a reference to the iframe's window.document.body object in a cross-browser way...

var iframeDoc = iframe.contentDocument || iframe.contentWindow.document;
var iframeBody = iframeDoc.body;
// or for extra caution, to support even more obsolete browsers
// var iframeBody = iframeDoc.getElementsByTagName("body")[0]

See: contentDocument for an iframe

Chronological answered 20/1, 2013 at 10:0 Comment(9)
Additionally, it does not seem to make any difference what syntax you use for the selector, as it seems to be a restriction in the jQuery core, so $('body',d) has the same results as $(d).find('body').Chronological
Hi, thanks for sticking around. However I want to use my code for any given website, as we know some websites do not support iframes..Insectarium
Maybe it doesn't work in the 'jquery environment' and I would have to resort to plain javascript. I have been trying variations with document.getElementsByTagName("body")[0]; with no luck so far.Insectarium
I think the problem is, that you can't add another HTML, HEAD or BODY to the DOM. If you try to set the .innerHTML of a DIV tag to include any of these forbidden elements, it simply won't add them - which is why I expect jQuery is not able to then select them.Chronological
@Insectarium could you explain to me what websites don't support iframes? I had thought they were pretty much universally supported these days.Chronological
I mean the embedding of website in iframe(google), but maybe I misunderstood your answer..Insectarium
@Insectarium my answer uses a hidden iframe, only to temporarily load content, so it can be accessed and manipulated as a cohesive document. It allows you to create the HTML, HEAD and BODY DOM elements. None of it is shown on the screen, and it is destroyed as soon as the DOM access and manipulation is finished, and the result stored in a variable. The existing page does not need to be changed, so the only issue with iFrames is if they are not supported at all. I am genuinely interested in a good solution for this problem, as it seems like a very basic and useful function. Keep this thread going!Chronological
Hi, I found an answer at: #7840389 But I can't seem to implement in my code...I would appreciate if you could help me outInsectarium
@Insectarium it seems to be pretty much the same solution as my first proposal, using regex to strip out the offending tags, which is not ideal, because regular expressions are not able to handle fringe cases, and malformed HTML - a parser would be required for that. Most front-end parsers are based on using the browser's built in parser, which is the problem, because it won't let you add already existing BODY tags. This is what led me to my second proposal, which is to load the HTML into an iFrame, allowing the BODY tag to be added. This might not be ideal, because it depends on iFrames.Chronological
B
4

I FIGURED OUT SOMETHING WONDERFUL (I think!)

Got your html as a string?

var results = //probably an ajax response

Here's a jquery object that will work exactly like the elements currently attached to the DOM:

var superConvenient = $($.parseXML(results)).children('html');

Nothing will be stripped from superConvenient! You can do stuff like superConvenient.find('body') or even

superConvenient.find('head > script');

superConvenient works exactly like the jquery elements everyone is used to!!!!

NOTE

In this case the string results needs to be valid XML because it is fed to jQuery's parseXML method. A common feature of an HTML response may be a <!DOCTYPE> declaration, which would invalidate the document in this sense. <!DOCTYPE> declarations may need to be stripped before using this approach! Also watch out for features such as <!--[if IE 8]>...<![endif]-->, tags without closing tags, e.g.:

<ul>
    <li>content...
    <li>content...
    <li>content...
</ul>

... and any other features of HTML that will be interpreted leniently by browsers, but will crash the XML parser.
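
If the <!DOCTYPE> declaration turns out to be what trips the parser, a rough sketch for removing it before calling parseXML might look like this. The helper name stripDoctype is made up for illustration, and it only handles a simple leading declaration (no internal subset in square brackets). It does nothing about unclosed tags or the other HTML features listed above.

```javascript
// Hypothetical helper: removes a leading <!DOCTYPE ...> declaration
// (matched case-insensitively) together with the whitespace around it.
function stripDoctype(html) {
    return html.replace(/^\s*<!DOCTYPE[^>]*>\s*/i, '');
}

var results = '<!DOCTYPE html>\n<html><head></head><body><p>Hi</p></body></html>';
console.log(stripDoctype(results));
// → <html><head></head><body><p>Hi</p></body></html>

// In a browser with jQuery loaded, the cleaned string would then be passed on:
// var superConvenient = $($.parseXML(stripDoctype(results))).children('html');
```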

Biron answered 26/5, 2014 at 19:41 Comment(8)
Great! I'm glad that anyone's getting some use out of this since I was personally browbeaten by the time I stumbled across this solution :)Biron
+1 Though there's an obvious overhead, since the HTML string is being parsed twice, instead of once. With large HTML documents this might be costly.Holloway
The jQuery XML parser says the html starting with '<!DOCTYPE HTML> <!--[if lt IE 7]><html ...' is invalid. I tried to remove the DOCTYPE bit to be sure it starts with a HTML tag without much success. Still +1 for the neat idea.Churlish
I hadn't thought of that, and it certainly makes sense to me that the DOCTYPE could break the parser - although I would imagine that if you only take the component of the results including and beyond the "<html" token that this method ought to still work.Biron
I used the same code but got an error Uncaught Error: Invalid XML: <head>Kickoff
Are you sure that the response you are parsing is syntactically valid? JQuery's xml parser won't be able to handle malformed html (or xml)Biron
I also got this error at the <html> tag, I don't understand why. However if there were an error for a <br/> I would understand...Leblanc
Hmm I would print out the html string you're working with. Then copy/paste it into an online xml validator - that should give good feedback as to where the xml syntax error is!Biron
T
2

Regex solution that worked for me:

var head = res.match(/<head.*?>.*?<\/head.*?>/s);
var body = res.match(/<body.*?>.*?<\/body.*?>/s);

Detailed explanation: https://regex101.com/r/kFkNeI/1
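
One thing worth noting when using these patterns: match() returns an array (or null when nothing matches), and element 0 of that array is the whole element including its opening and closing tags. A small sketch of handling that result is below. The helper name firstMatchOrNull and the sample string are made up for illustration, and the s (dotAll) flag needs an ES2018-capable engine.

```javascript
// Hypothetical helper: returns the full matched text (tags included),
// or null when the pattern does not match at all.
function firstMatchOrNull(res, pattern) {
    var m = res.match(pattern);
    return m ? m[0] : null;
}

var res = '<html><head><title>Demo</title></head>\n<body class="x"><p>Hi</p></body></html>';

var head = firstMatchOrNull(res, /<head.*?>.*?<\/head.*?>/s);
var body = firstMatchOrNull(res, /<body.*?>.*?<\/body.*?>/s);

console.log(head); // → <head><title>Demo</title></head>
console.log(body); // → <body class="x"><p>Hi</p></body>

// A response that is only a fragment has no body element to find:
console.log(firstMatchOrNull('<p>fragment</p>', /<body.*?>.*?<\/body.*?>/s)); // → null
```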

Trellas answered 11/8, 2019 at 12:0 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.