DOMDocument::loadHTML(): warning - htmlParseEntityRef: no name in Entity
C

9

28

I have found several similar questions, but so far, none have been able to help me.

I am trying to output the 'src' of all images in a block of HTML, so I'm using DOMDocument(). This method is actually working, but I'm getting a warning on some pages, and I can't figure out why. Some posts suggested suppressing the warning, but I'd much rather find out why the warning is being generated.

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: no name in Entity, line: 10

One example of post->post_content that is generating the error is -

On Wednesday 21st November specialist rights of way solicitor Jonathan Cheal of Dyne Drewett will be speaking at the Annual Briefing for Rural Practice Surveyors and Agricultural Valuers in Petersfield.
<br>
Jonathan is one of many speakers during the day and he is specifically addressing issues of public rights of way and village greens.
<br>
Other speakers include:-
<br>
<ul>
<li>James Atrrill, Chairman of the Agricultural Valuers Associates of Hants, Wilts and Dorset;</li>
<li>Martin Lowry, Chairman of the RICS Countryside Policies Panel;</li>
<li>Angus Burnett, Director at Martin & Company;</li>
<li>Esther Smith, Partner at Thomas Eggar;</li>
<li>Jeremy Barrell, Barrell Tree Consultancy;</li>
<li>Robin Satow, Chairman of the RICS Surrey Local Association;</li>
<li>James Cooper, Stnsted Oark Foundation;</li>
<li>Fenella Collins, Head of Planning at the CLA; and</li>
<li>Tom Bodley, Partner at Batcheller Monkhouse</li>
</ul>

I can post some more examples of what post->post_content contains if that would be helpful?

I have allowed access to a development site temporarily, so you can see some examples [Note - links no longer accessible as question has been answered] -

Any tips on how to resolve this? Thanks.

$dom = new DOMDocument();
$dom->loadHTML(apply_filters('the_content', $post->post_content)); // Have tried stripping all tags but <img>, still generates warning
$nodes = $dom->getElementsByTagName('img');
foreach($nodes as $img) :
    $images[] = $img->getAttribute('src');
endforeach;
Chamorro answered 1/2, 2013 at 14:25 Comment(10)
Showing the line that caused the error would definitely make debugging it easier. - Marvelous
??? The warning is on DOMDocument::loadHTML();, so the line causing the error is $dom->loadHTML(apply_filters('the_content', $post->post_content)); - Chamorro
Line 10 of the content you're parsing... - Marvelous
OK, with you. In one case, it's James Cooper, Stnsted Oark Foundation;. I did think it could be the ; causing the issue, but removing them all (there were several before) didn't help. - Chamorro
"I can post some example of what post->post_content contains if that would be helpful?". Yeah, definitely! Not an example though, I want the exact HTML that is generating the error. - Croom
Have updated for you. Thanks. - Chamorro
@DavidGard My best guess then is that there is an unescaped ampersand (&) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. &copy;). When it gets to ;, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text. - Marvelous
OK, that makes sense. And there is an & on line 10, from the looks of it. Will do some testing to fix and see what occurs... Thanks. - Chamorro
Beautiful, that was indeed the problem. I will accept as soon as you post it as an answer. Thanks for the help. - Chamorro
You might want to phrase the question in the form of a question. Better Jeopardy payback. - Tantalic
C
42

The correct answer comes from a comment by @lonesomeday:

My best guess then is that there is an unescaped ampersand (&) somewhere in the HTML. This will make the parser think we're in an entity reference (e.g. &copy;). When it gets to ;, it thinks the entity is over. It then realises what it has doesn't conform to an entity, so it sends out a warning and returns the content as plain text.
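A minimal sketch of the behaviour described above, assuming the dom extension is available. The helper name countParseMessages and the two sample strings are made up for illustration. It parses the same list item once with a bare & and once with &amp;, and reports how many messages libxml recorded for each. The exact count for the raw string may depend on the libxml2 version PHP was built against, so no fixed output is shown.

```php
<?php
// Hypothetical helper: count the messages libxml records while parsing a fragment.
function countParseMessages(string $html): int
{
    // Buffer parser messages internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $count = count(libxml_get_errors());

    // Leave the error buffer and the global setting as we found them.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $count;
}

// Sample input modelled on the list item from the question.
$raw     = '<ul><li>Director at Martin & Company;</li></ul>';
$escaped = '<ul><li>Director at Martin &amp; Company;</li></ul>';

printf(
    "raw: %d message(s), escaped: %d message(s)\n",
    countParseMessages($raw),
    countParseMessages($escaped)
);
```

If the raw string reports a message and the escaped one does not, the bare ampersand is the cause.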

Chamorro answered 12/2, 2013 at 12:3 Comment(4)
So how do I fix it? I can't call htmlentities on the whole HTML string. - Kleiman
@Kleiman I know this is many years later, but I just stumbled into this same issue. The simplest option I found was just to do a string replace, str_replace(' & ', ' &amp; ', $string), as htmlentities and htmlspecialchars caused the < and > of the HTML tags to be converted. Now I am 100% sure there is a better way to do this, but that sorted what I needed on a simple one-off parse job. - Rapallo
@Rapallo a little more restrictive: preg_replace("/&(?!\S+;)/", "&amp;", $string). - Airfoil
This saved my day. I was struggling and later found that content generated by a user included & in a name, and that was the source of all the errors. Thanks - Annelieseannelise
G
27

As mentioned here:

Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity,

you can use:

libxml_use_internal_errors(true);

see http://php.net/manual/en/function.libxml-use-internal-errors.php
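A rough sketch of how this setting could be applied to the image extraction from the question. The $html string is a hypothetical stand-in for the filtered post content. Rather than discarding the messages, it keeps them in $messages so they can still be inspected, then restores the previous setting.

```php
<?php
// Hypothetical content standing in for the filtered post content.
$html = '<p>Martin & Company</p><img src="/a.jpg"><img src="/b.png">';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Keep the messages for inspection, then empty libxml's buffer.
$messages = [];
foreach (libxml_get_errors() as $error) {
    $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
}
libxml_clear_errors();

// Restore whatever error handling was active before.
libxml_use_internal_errors($previous);

$images = [];
foreach ($dom->getElementsByTagName('img') as $img) {
    $images[] = $img->getAttribute('src');
}

print_r($images);
print_r($messages);
```

This hides the warning but does not repair the markup, so logging $messages somewhere is one way to keep track of which posts contain a bare ampersand.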

Gid answered 10/11, 2014 at 22:6 Comment(3)
Loading the HTML like this also helped me: @$dom->loadHTML($html); - Leuko
This fixed my problem - Anastassia
Great, again Stack Overflow saved me ;) - Gormand
K
4

There is an unescaped "&" somewhere in the HTML; replace it with "&amp;". Here is my solution:

 $html = preg_replace('/&(?!amp)/', '&amp;', $html);

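It replaces each bare ampersand with "&amp;", while any existing "&amp;" stays the same. Note that the pattern only skips "&amp": other entities such as "&lt;" or "&nbsp;" would be double-escaped.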
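One possible variation, sketched under the assumption that any & already followed by a well-formed named, decimal, or hexadecimal reference should be left alone. The function name escapeBareAmpersands is hypothetical. The pattern is a heuristic rather than a real HTML tokenizer, so text like "AT&T;" would still be treated as an entity.

```php
<?php
// Hypothetical helper: escape "&" unless it already starts a character reference
// of the form &name; &#123; or &#x1F;.
function escapeBareAmpersands(string $html): string
{
    return preg_replace(
        '/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)/',
        '&amp;',
        $html
    );
}

echo escapeBareAmpersands('Martin & Company &amp; Sons &lt;b&gt; &#169; AT&T'), "\n";
// prints: Martin &amp; Company &amp; Sons &lt;b&gt; &#169; AT&amp;T
```

The result can then be passed to loadHTML() in place of the original string.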

Khosrow answered 26/7, 2022 at 15:45 Comment(0)
S
3

Check for a "&" character anywhere in your HTML code. I had this issue because of that.

Sideman answered 2/3, 2020 at 9:49 Comment(1)
And replace & with &amp; - Denominative
R
1

I don't have the reputation required to leave a comment above, but using htmlspecialchars solved this problem in my case:

$inputHTML = htmlspecialchars($post->post_content);
$dom = new DOMDocument();
$dom->loadHTML(apply_filters('the_content', $inputHTML)); // Have tried stripping all tags but <img>, still generates warning
$nodes = $dom->getElementsByTagName('img');
foreach($nodes as $img) :
    $images[] = $img->getAttribute('src');
endforeach;

For my purposes, I'm also using strip_tags($inputHTML, "<strong><em><br>"), so all image tags are stripped out as well - I'm not sure if this would be a problem otherwise.

Rhapsody answered 1/6, 2016 at 17:2 Comment(0)
J
0

I eventually solved this problem the right way, using tidy:

// Configuration
$config = array(
    'indent'         => true,
    'output-xhtml'   => true,
    'wrap'           => 200);

// Tidy to avoid errors during load html
$tidy = new tidy;
$tidy->parseString($bill->bill_text, $config, 'utf8');
$tidy->cleanRepair();

$domDocument = new DOMDocument();
$domDocument->loadHTML(mb_convert_encoding($tidy, 'HTML-ENTITIES', 'UTF-8'));
Jaleesa answered 1/9, 2019 at 20:33 Comment(2)
Welcome to Stack Overflow. Please explain how your code solves the problem. - Carri
I believe that the loadHTML method has trouble dealing with malformed HTML. Using tidy helped me solve this issue. - Jaleesa
B
0

For Laravel,

Use {{ }} instead of {!! !!}, since {{ }} escapes the output (including "&") while {!! !!} prints it raw.

I faced this and managed to solve it.

Brant answered 22/7, 2020 at 10:17 Comment(0)
I
0

I found there was an error in my table tags. There was an extra </td> that I removed and bingo.

Intervention answered 20/9, 2020 at 2:21 Comment(0)
B
-8

Just replace "&" with "and" in your string. Do that for all the other symbols.

Bobbette answered 6/2, 2014 at 8:46 Comment(1)
No, that's a terrible suggestion. The & is used for a specific purpose, and simply replacing it with "and" isn't appropriate in most cases. Company names are one obvious example. - Chamorro
