How can I detect non-Western characters?

I want to reject certain UTF-8 input on the server side, e.g. text in East Asian scripts, where example input might be " 伊 ".

However, I do want to keep supporting other Latin or "Latin-like" characters, such as the Welsh ŵ and ŷ, so checking against Latin-1 is not an option.

What are my options? (If language-specific, PHP preferred.)

Thanks very much.


Reasoning: browsers often lack support for many non-Western characters (e.g. in a different browser I just see a box in the question above), so for things like display names it is sometimes appropriate to restrict input, even where that would not be appropriate for message bodies.

Pyrogallol answered 5/8, 2010 at 3:35 Comment(4)
Do you mind if I ask why you don't want to allow some languages on an internationalized site? – Bibliomania
Fair question. It's just necessary for one field of a table; the rest of the website will support it. – Pyrogallol
So what is the subset of characters you're allowing? Does it fit an existing character set? If so, you can just iconv the string to the target encoding, discarding all invalid characters. – Deodar
browser support for a lot of non-western characters is often missing (e.g. on a different browser I just see a box in the question above), so for things like display names sometimes it's appropriate to restrict it even if it's not appropriate for message bodies – Pyrogallol

Just do

preg_match('/[^\\p{Common}\\p{Latin}]/u', $string)

where $string is a UTF-8 string. This returns 1 if the string contains any character outside the Latin and Common scripts, and 0 otherwise. (preg_match returns false instead if the input is not valid UTF-8.)

Example:

var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷaás??'));  //int(0)
var_dump(preg_match('/[^\\p{Common}\\p{Latin}]/u', 'sf..ŷݤaás??')); //int(1)
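For a display-name check, the same pattern could be wrapped in a small helper. The following is only a sketch under a few assumptions: the function name `containsNonLatin` and its `$allowCombiningMarks` parameter are hypothetical and not part of the answer above. It also optionally allows `\p{Inherited}`, because combining accents (as in a decomposed "w" followed by U+0302) belong to the Inherited script rather than to Latin or Common, and it treats a `false` result from `preg_match` (for example on invalid UTF-8) as an error instead of as "no match".

```php
<?php
// Hypothetical helper, not from the original answer.
// Returns true if $text contains a character outside the allowed scripts.
function containsNonLatin(string $text, bool $allowCombiningMarks = true): bool
{
    // Scripts from the answer: Common (digits, punctuation, spaces) and Latin.
    $allowed = '\p{Common}\p{Latin}';
    if ($allowCombiningMarks) {
        // Assumption: decomposed accents (e.g. "w" + U+0302) should pass too.
        $allowed .= '\p{Inherited}';
    }

    $result = preg_match('/[^' . $allowed . ']/u', $text);
    if ($result === false) {
        // With the /u modifier, preg_match fails on invalid UTF-8 input.
        throw new InvalidArgumentException(
            'Could not check input, preg error code: ' . preg_last_error()
        );
    }

    return $result === 1;
}
```

A caller would typically reject the field when `containsNonLatin($displayName)` is true, and treat the exception as malformed input.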
It answered 5/8, 2010 at 3:42 Comment(1)
Is there a list of named subpatterns anywhere? – Pyrogallol
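On the names used inside `\p{...}`: these are Unicode script names, and the PCRE pattern documentation (the "Unicode character properties" section of the PHP manual) lists the ones it supports. As a rough illustration only, the sketch below probes a string against a handful of script names that PCRE is known to accept. The helper name `scriptsPresent` and the choice of scripts are assumptions made for this example, and the list is far from complete.

```php
<?php
// Hypothetical helper: report which of a few Unicode scripts occur in $text.
function scriptsPresent(string $text): array
{
    // A small, incomplete selection of script names accepted by PCRE.
    $scripts = ['Latin', 'Greek', 'Cyrillic', 'Arabic', 'Han', 'Hiragana', 'Katakana'];

    $found = [];
    foreach ($scripts as $script) {
        // \p{Name} matches any single character belonging to that script.
        if (preg_match('/\p{' . $script . '}/u', $text) === 1) {
            $found[] = $script;
        }
    }

    return $found;
}

var_dump(scriptsPresent('abc 伊')); // expected to list Latin and Han
```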
