How to replace all XHTML/HTML line breaks (<br>) with new lines?

I am looking for the best br2nl function. I would like to replace all instances of <br> and <br /> with newlines (\n), much like the nl2br() function but in reverse.

I know there are several solutions in the PHP manual comments but I'm looking for feedback from the SO community on possible solutions.

Sarmatia answered 12/3, 2010 at 21:51 Comment(2)
Are you sure you want to replace the HTML/XHTML line break elements with physical line breaks? Because nl2br does not replace the physical line breaks but just adds HTML/XHTML line break elements.Harcourt
I'm not using this function to negate or recover a string that was returned from nl2br. I am using it to sanitize text in a legacy database (from a webapp that allowed html) before I import it into my database. I just said the opposite of nl2br because people generally know that function.Sarmatia
104

I would generally say "don't use regex to work with HTML", but in this case I would probably go with a regex, considering that <br> tags generally look like either:

  • <br>
  • or <br/>, with any number of spaces before the /


I suppose something like this would do the trick:

$html = 'this <br>is<br/>some<br />text <br    />!';
$nl = preg_replace('#<br\s*/?>#i', "\n", $html);
echo $nl;

A few notes on the pattern:

  • it starts with <br
  • followed by any number of whitespace characters: \s*
  • optionally, a /: /?
  • and, finally, a >
  • the match is case-insensitive (the i modifier after the closing # delimiter), as <BR> would be valid in HTML
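
For reuse, the pattern can be wrapped in a small helper. This is only a sketch: the name br2nl_regex and the optional $newline parameter are my own assumptions, not anything built into PHP.

```php
<?php
// Hypothetical helper name; PHP has no built-in br2nl().
function br2nl_regex(string $html, string $newline = "\n"): string
{
    // "<br", optional whitespace, optional "/", then ">".
    // The trailing "i" modifier makes the match case-insensitive.
    return preg_replace('#<br\s*/?>#i', $newline, $html);
}

$html = 'this <br>is<br/>some<br />text <br    />!';
echo br2nl_regex($html);
```

Note that the replacement has to be written in double quotes: "\n" is a newline character, whereas '\n' in single quotes is a literal backslash followed by the letter n.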
Cuticula answered 12/3, 2010 at 21:57 Comment(5)
To be very nit-picky =] : <input type="text" value="<br />"> is allowed in html (not xhtml). And in a CDATA section <br /> is "normal" text.Foremast
@Foremast : humph, true :-) ;; I was writing this using DOM, and when I finished, I saw you had posted the same kind of solution I would have proposed (except I used getElementsByTagName, not XPath), so I didn't post it -- maybe I should edit my answer, though, for the sake of completeness, as it's been accepted...Cuticula
But this solution is faster and uses less memory (if that matters). If you don't have completely arbitrary documents, I'd probably consider these edge cases acceptable.Foremast
Shouldn't the second argument be "\\n"? This is the only thing that works on my setup here.Customer
My HTML looks like <br style="color: rgb(34, 34, 34); font-family: &quot;Open Sans&quot;, &quot;Helvetica Neue&quot;, Helvetica, Arial, sans-serif; font-size: 15px;">Sorkin
8

You should use the PHP_EOL constant to get platform-independent newlines.

In my opinion, using non-regexp functions whenever possible makes the code more readable.

$newlineTags = array(
  '<br>',
  '<br/>',
  '<br />',
);
$html = str_replace($newlineTags, PHP_EOL, $html);

I am aware this solution has some flaws, but I still wanted to share my insights.
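
To make one of those flaws concrete: a fixed list only matches the exact spellings it contains. The sketch below uses str_ireplace(), the case-insensitive variant of str_replace(), so that <BR> is caught as well. The sample input is made up for illustration, and a tag with extra spaces still slips through.

```php
<?php
$newlineTags = array(
  '<br>',
  '<br/>',
  '<br />',
);
// Made-up sample input: upper case, a listed spelling, and an unlisted one.
$html = 'one<BR>two<br />three<br  />four';

// str_ireplace() ignores case, so <BR> is replaced too.
$text = str_ireplace($newlineTags, PHP_EOL, $html);

// "<br  />" (two spaces) is not in the list and is left untouched.
echo $text;
```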

Carbylamine answered 25/8, 2014 at 12:16 Comment(2)
And regular expressions usually require heavier computations.Ekaterinodar
@BenBITDesign Regarding your suggested edit, please note that it is absolutely not true that regex in general requires more computation. In fact, without having timed this specific case it’s quite likely that the PCRE engine can perform this replacement more efficiently than str_replace, especially when just-in-time compilation is enabled.Ingemar
2

If the document is well-formed (or at least well-formed-ish) you can use the DOM extension and xpath to find and replace all br elements by a \n text node.

$in = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>...</title></head><body>abc<br />def<p>ghi<br />jkl</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($in);
$xpath = new DOMXPath($doc);

// Collect the <br> elements first, then replace them in a second pass.
$toBeReplaced = array();
foreach($xpath->query('//br') as $node) {
    $toBeReplaced[] = $node;
}

// Swap each <br> for a copy of the "\n" text node.
$linebreak = $doc->createTextNode("\n");
foreach($toBeReplaced as $node) {
    $node->parentNode->replaceChild($linebreak->cloneNode(), $node);
}

echo $doc->saveHTML();

prints

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head><title>...</title></head>
<body>abc
def<p>ghi
jkl</p>
</body>
</html>

edit: shorter version with only one iteration

$in = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html><head><title>...</title></head><body>abc<br />def<p>ghi<br />jkl</p></body></html>';

$doc = new DOMDocument;
$doc->loadHTML($in);
$xpath = new DOMXPath($doc);

$linebreak = $doc->createTextNode("\n");
foreach($xpath->query('//br') as $node) {
  // Replace each <br> with a copy of the "\n" text node.
  $node->parentNode->replaceChild($linebreak->cloneNode(), $node);
}

echo $doc->saveHTML();
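
The same replacement should also be possible without XPath. The following is a rough sketch, assuming the DOM extension is available; the function name br2nl_dom is hypothetical. Because getElementsByTagName() returns a live list, the loop always takes the first remaining element instead of iterating with foreach.

```php
<?php
// Hypothetical helper name; a sketch, not a drop-in library function.
function br2nl_dom(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The returned DOMNodeList is live: it shrinks as <br> elements are replaced.
    $brs = $doc->getElementsByTagName('br');
    while ($brs->length > 0) {
        $br = $brs->item(0);
        $br->parentNode->replaceChild($doc->createTextNode("\n"), $br);
    }

    return $doc->saveHTML();
}

$in = '<html><head><title>...</title></head><body>abc<br />def<p>ghi<br />jkl</p></body></html>';
echo br2nl_dom($in);
```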
Foremast answered 12/3, 2010 at 22:13 Comment(4)
You don’t need to do two rounds. You can replace the nodes with the first foreach.Harcourt
That seems to be so ;-) For some (unknown) reason I remembered it to break the xpath iterator.Foremast
Shorter version doesn't add the $linebreak node. Anyway this is exactly what I needed, thanks.Solfa
The same with the linebreak replacement and without xpath: 3v4l.org/UiJ1m#v8.2.7Clavicembalo
1

From the nl2br comments:

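<?php
function br2nl($string){
  // Adapted: the original comment used eregi_replace(), which was removed
  // in PHP 7.0. preg_replace() with the "i" modifier is the PCRE equivalent.
  return preg_replace('#<br\s*/?\s*>#i', "\r\n", $string);
}
?>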
Ordinand answered 12/3, 2010 at 22:15 Comment(1)
The POSIX regular expression module has been deprecated. From the ereg_replace manual page: "This function has been DEPRECATED as of PHP 5.3.0 and REMOVED as of PHP 6.0.0. Relying on this feature is highly discouraged."Foremast
0

Thanks to @antti for the accepted answer, and to @konstantin-xflash-stratigenas, who pointed out that it misses tags with attributes, such as
<br style="color:#FFA;" />
I tried to write a better regex that covers those too:

$html = 'this <br>is<br/>some<br />text <br    />, <br//>, <br xyz/>!';

$html_nl = preg_replace('/<br[^>]*>/i', "\n", $html);

echo htmlspecialchars($html_nl);

Note that this still does not handle the defect pointed out by @volkerk.
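
One more caveat: because [^>]* accepts any characters after <br, the pattern also matches other tags whose names merely start with "br". A possible tightening, offered as a sketch, is to require a word boundary (\b) after the tag name. The <broken> tag in the sample input is made up purely to show the difference.

```php
<?php
// Made-up sample: a styled <br>, an upper-case <BR>, and an unrelated tag.
$html = 'a<br style="color:#FFA;" />b<BR>c <broken>d';

// \b demands a word boundary after "br", so <broken> is left alone.
$text = preg_replace('/<br\b[^>]*>/i', "\n", $html);

echo htmlspecialchars($text);
```

This still goes wrong when an attribute value itself contains a >, and it does not help with <br> appearing inside attribute values or CDATA sections.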

Roundish answered 5/3 at 11:36 Comment(0)
