PHP: is the implode() function safe for multibyte strings?

The explode() function has a correlating multibyte-safe function in mb_split().

I don't see a correlating function for implode(). Does this imply that implode is already safe for multibyte strings?

Ligulate answered 19/12, 2011 at 17:20 Comment(7)
I'm having a hard time understanding why there needs to be a multi-byte safe split() in the first place - splitting a string is multi-byte safe by default, no? But that's a different question. - Pozsony
PHP stores all strings (AFAIK) as raw binary byte sequences, so in theory it should be possible to use explode() with multibyte strings as well, as long as you pass the correct binary representation of the split token. The same therefore applies to implode() - the binary sequence passed as the join delimiter will be used literally, so as long as your delimiter is correctly stored, there should be no problems. - Gooseneck
@DaveRandom: isn't it possible that a multibyte character might look like two single-byte characters? If one of those single-byte characters happens to be the delimiter, isn't it possible that you might end up splitting on a multibyte character unintentionally? - Ligulate
Why would your string contain multibyte and single byte characters? Wouldn't that be a corrupt string anyway? - Gooseneck
Oh I see what you mean, where the boundary of two characters overlaps to create the sequence... Well in that case yes, I suppose it could - but that is getting into a depth at which I am not qualified to comment. - Gooseneck
@daniel but in that case, you would have to be mixing two character sets, which is a circumstance that shouldn't happen? I can't quite get my head around it, but what you say probably points in the right direction. Maybe one needs to look beyond UTF-8 to understand this? I may ask a question about it later. - Pozsony
@Gooseneck Except that explode() will not return a string as an array if you try to split on the empty string, which makes explode limited. - Camillecamilo

As long as your delimiter and the strings in the array contain only well-formed multibyte sequences, there should not be any issues.

implode() is basically a fancy concatenation operator, and I can't imagine a scenario where concatenation is not multibyte-safe ;)
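
A minimal sketch of that point, assuming the script is saved as UTF-8 and the mbstring extension is available (the sample strings and variable names are made up for illustration):

```php
<?php
// Well-formed UTF-8 parts and a well-formed UTF-8 delimiter.
$parts = ['naïve', '日本語', 'über'];
$delimiter = ' · ';

// implode() just concatenates the bytes of each part with the delimiter.
$joined = implode($delimiter, $parts);

// The result is still well-formed UTF-8 ...
var_dump(mb_check_encoding($joined, 'UTF-8'));      // bool(true)

// ... and splitting on the same delimiter gives back the original parts.
var_dump(explode($delimiter, $joined) === $parts);  // bool(true)
```

The round trip works here because the delimiter does not occur inside any of the parts.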

Mantling answered 19/12, 2011 at 17:23 Comment(7)
I'm not completely sure what you mean by "well-formed multibyte sequence" in this context? (I agree with the rest, though) - Pozsony
Thanks. I'm using a space as a delimiter: mb_split(' ', $mbstring). Does this constitute a well-formed multibyte sequence? - Ligulate
@danielfaraday it depends if your script is stored in the multibyte charset that your string uses. If it isn't, then no, it isn't. - Gooseneck
@DaveRandom: could you expound? I'm not sure what you mean by storing the script in a charset. - Ligulate
Well, if your script was stored (i.e. saved to disk by your editor, or whatever) in a single byte character set, then the ' ' would be a single byte space, which is probably not valid in the target charset. - Gooseneck
@Pozsony Well, that the code point sequence is valid. I don't know how one could express that correctly. E.g. the delimiter should not have the first two bits set (in UTF-8), because it would form a character together with the next code point. - Mantling
@Pozsony Crap, I obviously mean code unit sequence, not code point sequence, sorry. - Mantling
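
To illustrate the "well-formed" condition from the comments above, here is a small sketch that joins the same parts once with an ordinary delimiter and once with a lone UTF-8 lead byte (a byte whose first two bits are set). It assumes a UTF-8 source file and the mbstring extension; the variable names are hypothetical.

```php
<?php
$parts = ['é', 'ü'];

// A complete UTF-8 sequence as delimiter: the result stays well-formed.
$good = implode(', ', $parts);
var_dump(mb_check_encoding($good, 'UTF-8'));   // bool(true)

// 0xC3 announces a two-byte character but is not followed by a
// continuation byte here, so the joined string is malformed UTF-8.
$bad = implode("\xC3", $parts);
var_dump(mb_check_encoding($bad, 'UTF-8'));    // bool(false)
```

implode() itself does not complain in either case, since it only concatenates bytes; whether the result is valid depends entirely on the inputs.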
