How to replace different newline styles in PHP the smartest way?

I have some text that may contain different newline styles. I want to replace all newlines ('\r\n', '\n', '\r') with the same one (in this case \r\n).

What's the fastest way to do this? My current solution looks like this, which is pretty sucky:

    $sNicetext = str_replace("\r\n",'%%%%somthing%%%%', $sNicetext);
    $sNicetext = str_replace(array("\r","\n"),array("\r\n","\r\n"), $sNicetext);
    $sNicetext = str_replace('%%%%somthing%%%%',"\r\n", $sNicetext);

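The problem is that you can't do this with a single str_replace() call: it applies the search strings one after another, each to the result of the previous one, so the \r and \n inside an existing \r\n are replaced again and a single line break turns into several.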

Thank you for your help!

Surinam answered 20/10, 2011 at 13:26 Comment(3)
You can try PHP_EOL; I believe it holds the newline sequence for the current OS. - Sisneros
Where did you get that if you replace '\r\n' with '\r\n' that you get '\r\n\r\n'? - Nesbitt
@Nesbitt codepad.org/Qy6HpnLj - Timmytimocracy
G
105
    $string = preg_replace('~\R~u', "\r\n", $string);

If you don't want to replace all Unicode newlines but only CR, LF, and CRLF, use:

    $string = preg_replace('~(*BSR_ANYCRLF)\R~', "\r\n", $string);

\R matches any of these newline sequences, and u is the modifier that makes PCRE treat the pattern and the subject string as UTF-8.
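
As a rough illustration of the one-liner above, here is a minimal sketch that wraps it in a function and checks it against mixed input. The name `normalize_newlines` is made up for this example and is not part of PHP, and falling back to the unchanged input on failure is just one possible choice. It assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not a PHP built-in): convert every newline
// sequence matched by \R to a single target sequence.
function normalize_newlines(string $text, string $eol = "\r\n"): string
{
    // \R treats \r\n as one unit, so an existing CRLF is never doubled.
    // preg_replace() returns null on failure (for example, invalid UTF-8
    // input with the u modifier); this sketch then returns the input as-is.
    $result = preg_replace('~\R~u', $eol, $text);
    return $result ?? $text;
}

$mixed = "one\r\ntwo\nthree\rfour";
var_dump(normalize_newlines($mixed) === "one\r\ntwo\r\nthree\r\nfour"); // bool(true)
```

Passing "\n" or PHP_EOL as the second argument normalizes to that sequence instead.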


From the PCRE docs:

What \R matches

By default, the sequence \R in a pattern matches any Unicode newline sequence, whatever has been selected as the line ending sequence. If you specify

     --enable-bsr-anycrlf

the default is changed so that \R matches only CR, LF, or CRLF. Whatever is selected when PCRE is built can be overridden when the library functions are called.

and

Newline sequences

Outside a character class, by default, the escape sequence \R matches any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following:

    (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given below. This particular group matches either the two-character sequence CR followed by LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next line, U+0085). The two-character sequence is treated as a single unit that cannot be split.

In UTF-8 mode, two additional characters whose codepoints are greater than 255 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). Unicode character property support is not needed for these characters to be recognized.

It is possible to restrict \R to match only CR, LF, or CRLF (instead of the complete set of Unicode line endings) by setting the option PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. (BSR is an abbreviation for "backslash R".) This can be made the default when PCRE is built; if this is the case, the other behaviour can be requested via the PCRE_BSR_UNICODE option. It is also possible to specify these settings by starting a pattern string with one of the following sequences:

    (*BSR_ANYCRLF)   CR, LF, or CRLF only
    (*BSR_UNICODE)   any Unicode newline sequence

These override the default and the options given to pcre_compile() or pcre_compile2(), but they can be overridden by options given to pcre_exec() or pcre_dfa_exec(). Note that these special settings, which are not Perl-compatible, are recognized only at the very start of a pattern, and that they must be in upper case. If more than one of them is present, the last one is used. They can be combined with a change of newline convention; for example, a pattern can start with:

    (*ANY)(*BSR_ANYCRLF)

They can also be combined with the (*UTF8) or (*UCP) special sequences. Inside a character class, \R is treated as an unrecognized escape sequence, and so matches the letter "R" by default, but causes an error if PCRE_EXTRA is set.
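
To make the difference between the two settings concrete, here is a small sketch that runs both on a string containing U+2028 (line separator). The variable names are only for illustration, and the "\u{2028}" escape assumes PHP 7.0 or later.

```php
<?php
// "\u{2028}" is LINE SEPARATOR; the \u{...} escape needs PHP 7.0+.
$text = "a\u{2028}b\r\nc";

// (*BSR_UNICODE): with the u modifier, \R also matches U+2028.
$unicode = preg_replace('~(*BSR_UNICODE)\R~u', "\n", $text);

// (*BSR_ANYCRLF): \R matches only CR, LF, or CRLF; U+2028 is left alone.
$anycrlf = preg_replace('~(*BSR_ANYCRLF)\R~u', "\n", $text);

var_dump($unicode === "a\nb\nc");        // bool(true)
var_dump($anycrlf === "a\u{2028}b\nc");  // bool(true)
```

Spelling out the setting at the start of the pattern avoids depending on how the PCRE library behind a given PHP installation was built.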

Garate answered 20/10, 2011 at 13:31 Comment(1)
You can also use PHP_EOL to match the installation's end-of-line sequence: preg_replace('~\R~u', PHP_EOL, $string); - Cranach
S
21

To normalize newlines I always use:

    $str = preg_replace('~\r\n?~', "\n", $str);

It replaces the old Mac (\r) and the Windows (\r\n) newlines with the Unix equivalent (\n).

I prefer \n because it takes one byte instead of two. If you want \r\n instead, changing the replacement alone is not enough, because the pattern does not match a lone \n; extend the pattern to '~\r\n?|\n~' as well.
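
A short sketch of both directions, for comparison. The `|\n` alternative in the second pattern is an addition for the \r\n target and is not part of the original one-liner.

```php
<?php
$mixed = "one\r\ntwo\nthree\rfour";

// Target \n: \r\n? matches CRLF or a lone CR; a lone LF is already correct.
$unix = preg_replace('~\r\n?~', "\n", $mixed);

// Target \r\n: a lone \n has to be matched too, so it is added as an alternative.
$windows = preg_replace('~\r\n?|\n~', "\r\n", $mixed);

var_dump($unix === "one\ntwo\nthree\nfour");           // bool(true)
var_dump($windows === "one\r\ntwo\r\nthree\r\nfour");  // bool(true)
```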

Stanwinn answered 2/11, 2011 at 2:21 Comment(0)
C
11

How about

    $sNicetext = preg_replace('/\r\n|\r|\n/', "\r\n", $sNicetext);
Comintern answered 20/10, 2011 at 13:30 Comment(6)
Second string needs double quotes ;) - Garate
No, it doesn't. ideone.com/xZLsx - Why do you think it should need double quotes? - Comintern
Because escape sequences are parsed only in double quotes. It's okay for the regex because the regex engine parses those escape sequences itself, but not for any other string. Thus the second argument needs double quotes; otherwise you will output a plain \r\n instead of a newline. - Garate
False. Variables are only interpolated in double quotes. Escape sequences work everywhere. - Comintern
@Tomalek codepad.viper-7.com/XJpmkP (And as you don't trust me, have a look at the documentation: php.net/manual/en/language.types.string.php) - Garate
Wow. This... you just put that there! ;) Honestly, I'm totally surprised. I was so sure that variable interpolation was the only difference, I didn't even bother looking it up. Okay, thanks. Learned something. :-) -- and fixed my code sample. - Comintern
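
For anyone skimming the comments above, a minimal sketch of the quoting difference being discussed. The variable names `$wrong` and `$right` are only labels for this example.

```php
<?php
// Double quotes: \r and \n are escape sequences (2 bytes: CR, LF).
var_dump(strlen("\r\n")); // int(2)

// Single quotes: backslash, r, backslash, n (4 literal bytes).
var_dump(strlen('\r\n')); // int(4)

// In the pattern, single quotes are fine because PCRE itself
// interprets \r and \n. Nothing does that for the replacement.
$wrong = preg_replace('/\r\n|\r|\n/', '\r\n', "a\nb");
$right = preg_replace('/\r\n|\r|\n/', "\r\n", "a\nb");

var_dump($wrong === 'a\r\nb'); // bool(true): literal backslashes in the output
var_dump($right === "a\r\nb"); // bool(true): a real CRLF
```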
M
5

I think the smartest/simplest way to convert to CRLF is:

    $output = str_replace("\n", "\r\n", str_replace("\r", '', $input));

To convert to LF only:

    $output = str_replace("\r", '', $input);

It's much easier than regular expressions.
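
One caveat worth checking before relying on this: because every \r is simply dropped, text that uses a lone \r as its line break (old Mac style) loses those breaks. A small sketch, where `to_crlf` is a hypothetical wrapper name around the nested call from this answer:

```php
<?php
// Hypothetical wrapper around the nested str_replace() call from this answer.
function to_crlf(string $input): string
{
    return str_replace("\n", "\r\n", str_replace("\r", '', $input));
}

// CRLF and LF input is normalized as intended.
var_dump(to_crlf("a\r\nb\nc") === "a\r\nb\r\nc"); // bool(true)

// A lone CR is removed outright, so the two lines are joined.
var_dump(to_crlf("a\rb") === "ab"); // bool(true)
```

If the input can never contain lone \r line breaks, this is not a concern.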

Mayapple answered 22/2, 2017 at 7:49 Comment(0)
A
1
    $sNicetext = str_replace(["\r\n", "\r"], "\n", $sNicetext);

also works
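
A likely reason this works is that str_replace() handles the search strings in array order, so "\r\n" is consumed as a unit before the lone "\r" is looked at. The sketch below compares it with the reversed order; the variable names are only for illustration.

```php
<?php
$mixed = "one\r\ntwo\nthree\rfour";

// "\r\n" is listed first, so CRLF is replaced as a unit before lone "\r".
$good = str_replace(["\r\n", "\r"], "\n", $mixed);

// Reversed order: "\r" is replaced first, which turns "\r\n" into "\n\n".
$bad = str_replace(["\r", "\r\n"], "\n", $mixed);

var_dump($good === "one\ntwo\nthree\nfour");   // bool(true)
var_dump($bad === "one\n\ntwo\nthree\nfour");  // bool(true)
```

Note that this only normalizes to \n; with "\r\n" as the target, the replacement text would itself contain the characters being searched for.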

Amorist answered 12/1, 2021 at 3:14 Comment(1)
Please add some explanation to your answer such that others can learn from it. - Testament
