What are the Unicode ranges for Hindi accented characters?
C

3

4

I'm trying to gather a Unicode list of all the 'o'-like shapes in the Hindi character set. In fact, a list of any characters (in any language) that use separate characters to indicate an accent would be even better.

I intend to use this Unicode list in a RegExp.

I've been trying to edit a list of character ranges by outputting them in an input TextField, but editing this text causes weird issues (the keyboard cursor isn't placed on the correct character, selections suddenly disappear or warp incorrectly... in other words... HINDI HELL!)

I've tried this with Notepad++ too, but although it was more responsive, it eventually crapped out on me like it did in the Flash Player textfield. This seems to occur especially while removing the [] block (nulls?) characters. Some of them trigger odd behaviors.

Anyways, all I want is a list of the accents. A few examples are in the image below (but I would need ALL accents):

enter image description here

Thanks!

Crabwise answered 1/3, 2012 at 20:47 Comment(4)
In a language with proper Unicode regex, this would be [\p{IsDevanagari}&&\p{M}]... unfortunately I think only Java (and maybe Perl) support this.Ethiop
@Porges PCRE is used in PHP. So if Perl is correct, then PHP is too. Also: regular-expressions.infoStucco
@kirilloid: PCRE doesn't support character class intersection, and it doesn't support everything Perl does either. (You can emulate intersection with lookahead anyway.) But... this doesn't matter since he's using AS. :)Ethiop
This is useful information though. Something like that would be useful in AS; it would just be a matter of gathering these characters in an XML file and distributing it to the world :)Crabwise
D
6

You can find PDFs containing lists of Unicode ranges, grouped by script, here: http://unicode.org/charts/

For Hindi, you probably want Devanagari or Devanagari Extended.

Dagnydago answered 1/3, 2012 at 20:55 Comment(0)
E
3

Here is the character class for Devanagari combining marks:

[\u0901\u0902\u0903\u093c\u093e\u093f\u0940\u0941\u0942\u0943
 \u0944\u0945\u0946\u0947\u0948\u0949\u094a\u094b\u094c\u094d
 \u0951\u0952\u0953\u0954\u0962\u0963]

This is only the basic Devanagari block (not Devanagari Extended).
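
One way to cross-check a hand-written class like the one above is to generate it instead of typing it. The following is a minimal sketch in Python (not ActionScript), using the standard `unicodedata` module. The helper name `devanagari_marks` is made up for illustration, and the result depends on the Unicode version bundled with the interpreter, so a newer version may report more marks than the list above.

```python
import re
import unicodedata


def devanagari_marks(start=0x0900, end=0x097F):
    """Return code points in [start, end] whose General_Category is a mark (Mn, Mc, Me)."""
    return [
        cp for cp in range(start, end + 1)
        if unicodedata.category(chr(cp)).startswith("M")
    ]


marks = devanagari_marks()

# Build a character class using four-digit \uXXXX escapes.
char_class = "[" + "".join("\\u%04x" % cp for cp in marks) + "]"
pattern = re.compile(char_class)

print(char_class)

# Sample word: HA + VOWEL SIGN I + ANUSVARA + DA + VOWEL SIGN II.
# Removing the marks should leave only the two base consonants.
print(pattern.sub("", "\u0939\u093f\u0902\u0926\u0940"))
```

The printed class can then be pasted into the RegExp, which avoids editing the combining characters themselves in a text field.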

Ethiop answered 1/3, 2012 at 21:48 Comment(1)
Slightly more compact: [\u0901-\u0903\u093c\u093e-\u0949\u094a-\u094d\u0951-\u0954\u0962\u0963]Ashley
N
0

If you want the complete set (for all languages), you can do it programmatically. Start from the Unicode data file at ftp://ftp.unicode.org/Public/6.1.0/ucd/UnicodeData.txt, described by TR-44 (http://unicode.org/reports/tr44/#Property_Definitions).

You can use the Canonical_Combining_Class field (see http://unicode.org/reports/tr44/#Canonical_Combining_Class_Values) to filter the exact characters you want. I can't be more precise, because "accent" is a bit vague :-) You might also have to look at General_Category to get the filter right (and exclude certain marks, symbols, or punctuation).

And a script doing this would definitely be better than trying to mess with text editors. One of the characteristics of combining characters is that they combine :-) So you might get all kinds of puzzling results (like this: http://www.siao2.com/2006/02/17/533929.aspx :-)
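
The filtering described above could be sketched roughly as follows, again in Python. The function names `parse_marks` and `to_char_class` are hypothetical, and the `SAMPLE` lines are a handful of entries written in the UnicodeData.txt layout (semicolon-separated, with the code point in field 0, General_Category in field 2, and Canonical_Combining_Class in field 3) so that the example runs without a download. For the real thing, you would pass in the lines of the full file instead.

```python
def parse_marks(lines, require_nonzero_ccc=False):
    """Collect code points of marks (General_Category M*) from UnicodeData.txt-style lines."""
    marks = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        code_point = int(fields[0], 16)
        general_category = fields[2]
        ccc = int(fields[3])
        if not general_category.startswith("M"):
            continue
        if require_nonzero_ccc and ccc == 0:
            continue
        marks.append(code_point)
    return marks


def escape(cp):
    """Format a code point as a regex escape (Python syntax above U+FFFF)."""
    return "\\u%04x" % cp if cp <= 0xFFFF else "\\U%08x" % cp


def to_char_class(code_points):
    """Collapse code points into contiguous ranges and format a character class."""
    ranges = []
    for cp in sorted(code_points):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp  # extend the current range
        else:
            ranges.append([cp, cp])
    parts = [
        escape(lo) if lo == hi else escape(lo) + "-" + escape(hi)
        for lo, hi in ranges
    ]
    return "[" + "".join(parts) + "]"


# Sample entries in the UnicodeData.txt layout, for demonstration only.
SAMPLE = """\
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
0915;DEVANAGARI LETTER KA;Lo;0;L;;;;;N;;;;;
093C;DEVANAGARI SIGN NUKTA;Mn;7;NSM;;;;;N;;;;;
093E;DEVANAGARI VOWEL SIGN AA;Mc;0;L;;;;;N;;;;;
093F;DEVANAGARI VOWEL SIGN I;Mc;0;L;;;;;N;;;;;
094D;DEVANAGARI SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;
""".splitlines()

print(to_char_class(parse_marks(SAMPLE)))
# [\u0301\u093c\u093e-\u093f\u094d]
print(to_char_class(parse_marks(SAMPLE, require_nonzero_ccc=True)))
# [\u0301\u093c\u094d]
```

Note how the second call drops the Devanagari vowel signs: they have a Canonical_Combining_Class of 0, which is one reason General_Category is worth checking as well.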

Novick answered 4/3, 2012 at 11:45 Comment(0)
