How to retrieve the unicode decimal representation of the chars in a string containing hindi text?
G

5

2

I am using Visual Studio 2010 with C# to convert text into its Unicode values. For example, I have a string abc = "मेरा". This string contains 4 characters, and I need the Unicode value of each of the four. Please help me.

Galina answered 5/5, 2011 at 19:30 Comment(1)
See unicodelookup.com/#मेरा/1 (Revis)
M
3

When you write code like string abc = "मेरा";, you already have the text as Unicode (specifically, UTF-16), so you don't have to convert anything. If you want to access the individual characters, you can do that with a normal index: e.g. abc[1] is U+0947 (DEVANAGARI VOWEL SIGN E).

If you want to see the numeric representations of those characters, just cast them to integers. For example

abc.Select(c => (int)c)

gives the sequence of numbers 2350, 2375, 2352, 2366. If you want to see the hexadecimal representation of those numbers, pass a format string to ToString():

abc.Select(c => ((int)c).ToString("x4"))

returns the sequence of strings "092e", "0947", "0930", "093e".

Note that when I said numeric representations, I actually meant their encoding in UTF-16. For characters in the Basic Multilingual Plane (BMP), this is the same as their Unicode code point. The vast majority of characters in common use lie in the BMP, including the 4 Hindi characters presented here.
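To make the BMP caveat concrete, here is a minimal sketch that prints the UTF-16 code units of one BMP character and one character outside the BMP. The class and method names (SurrogateDemo, CodeUnitsHex) and the choice of U+1F600 as the non-BMP example are made up for illustration and are not part of the original answer.

```csharp
using System;

static class SurrogateDemo
{
    // Returns each UTF-16 code unit (char) of the string as four hex digits.
    public static string[] CodeUnitsHex(string s)
    {
        string[] result = new string[s.Length];
        for (int i = 0; i < s.Length; i++)
            result[i] = ((int)s[i]).ToString("x4");
        return result;
    }

    static void Main()
    {
        // U+092E is in the BMP: one char, equal to the code point.
        Console.WriteLine(string.Join(" ", CodeUnitsHex("\u092E")));     // 092e

        // U+1F600 is outside the BMP: two chars forming a surrogate pair,
        // neither of which is the code point itself.
        Console.WriteLine(string.Join(" ", CodeUnitsHex("\U0001F600"))); // d83d de00
    }
}
```

For the second string, casting each char to int gives the two surrogate values rather than the single code point 0x1F600, which is why the cast alone is only enough for BMP text.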

If you wanted to handle characters in other planes too, you could use code like the following.

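// Requires: using System; using System.Text;
// Encode the string as UTF-32, where every code point takes exactly 4 bytes.
byte[] bytes = Encoding.UTF32.GetBytes(abc);
int codePointCount = bytes.Length / 4;
int[] codePoints = new int[codePointCount];
// Read each 4-byte group back as one 32-bit integer.
for (int i = 0; i < codePointCount; i++)
    codePoints[i] = BitConverter.ToInt32(bytes, i * 4);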

Since UTF-32 encodes every (21-bit) code point directly as one 32-bit value, this yields the code points themselves. (Maybe there is a more straightforward solution, but I haven't found one.)
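One possibly simpler alternative is to walk the string with char.ConvertToUtf32, which exists in .NET Framework 2.0 and later. The following is a sketch, not the answerer's method: the names CodePointDemo and GetCodePoints are hypothetical, and it assumes the string is well-formed UTF-16 (ConvertToUtf32 throws on an unpaired surrogate).

```csharp
using System;
using System.Collections.Generic;

static class CodePointDemo
{
    // Returns the Unicode code points of a well-formed UTF-16 string.
    public static List<int> GetCodePoints(string s)
    {
        List<int> codePoints = new List<int>();
        for (int i = 0; i < s.Length; i++)
        {
            // Combines a surrogate pair into one code point when needed.
            codePoints.Add(char.ConvertToUtf32(s, i));

            // A surrogate pair occupies two chars, so skip the low surrogate.
            if (char.IsHighSurrogate(s[i]))
                i++;
        }
        return codePoints;
    }

    static void Main()
    {
        foreach (int codePoint in GetCodePoints("मेरा"))
            Console.WriteLine(codePoint); // 2350, 2375, 2352, 2366 (one per line)
    }
}
```

This avoids the intermediate byte array and does not need LINQ, so it should also compile against older target frameworks.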

Moving answered 5/5, 2011 at 19:56 Comment(6)
This is what I was looking for. Please tell me how abc.Select(c => (int)c) can be used to get the 4 values into a variable.Galina
@Deepak, what do you mean? The result of that is a sequence with those 4 values. If you want to put them in a variable, just do var chars = abc.Select(c => (int)c); like with any other code.Moving
You can then for example use foreach and Console.WriteLine() to write them out to console.Moving
abc.Select() gives an error, 'string' does not contain a definition for 'Select'... How are you able to run this code?Zane
@Zane You're probably missing using System.Linq; at the top of your C# file.Moving
@svick, it seems that using System.Linq requires .NET Framework 3.5 or later. My target framework is set to 2.0.Zane
P
3

Since a .NET char is a UTF-16 code unit, which for any character in the BMP is the same as its Unicode code point, you can simply enumerate all characters in a string:

var abc = "मेरा";

foreach (var c in abc)
{
    Console.WriteLine((int)c);
}

resulting in

2350
2375
2352
2366
Payton answered 5/5, 2011 at 19:57 Comment(0)
C
1

Use

System.Text.Encoding.UTF8.GetBytes(abc)

That will return your Unicode values.

Crist answered 5/5, 2011 at 19:34 Comment(3)
Thanks, but can you give me the full code so that I can store the result as a hexadecimal number?Galina
You're wrong. This will not return “Unicode values”, by which I assume you mean Unicode code points. This will return the bytes that represent the given string in UTF-8.Moving
Can anyone please help me get the Unicode values? For example, for म the value is 2350 in decimal.Galina
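To illustrate the objection raised in the comments, this small sketch compares the UTF-8 bytes of म with its code point. The class and method names (Utf8VersusCodePoints, BytesHex) are invented for the example.

```csharp
using System;
using System.Text;

static class Utf8VersusCodePoints
{
    // Returns the UTF-8 encoding of the string as dash-separated hex bytes.
    public static string BytesHex(string s)
    {
        return BitConverter.ToString(Encoding.UTF8.GetBytes(s));
    }

    static void Main()
    {
        string first = "\u092E"; // म

        // UTF-8 needs three bytes for this character.
        Console.WriteLine(BytesHex(first)); // E0-A4-AE

        // The code point is a single number, obtained by casting the char.
        Console.WriteLine((int)first[0]);   // 2350
    }
}
```

The three bytes E0 A4 AE are an encoding of the value 2350 (0x092E), not the value itself, so GetBytes is the wrong tool when the code points are what is wanted.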
T
1

If you are trying to convert files from a legacy encoding into Unicode:

Read the file, supplying the correct encoding of the source files, then write the file using the desired Unicode encoding scheme.

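    // "x-iscii-de" is the .NET name for ISCII Devanagari (code page 57002); use the encoding your files actually have.
    using (StreamReader reader = new StreamReader(@"C:\MyFile.txt", Encoding.GetEncoding("x-iscii-de")))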
    using (StreamWriter writer = new StreamWriter(@"C:\MyConvertedFile.txt", false, Encoding.UTF8))
    {
        writer.Write(reader.ReadToEnd());
    }

If you are looking for a mapping of Devanagari characters to the Unicode code points:

You can find the Devanagari code chart (block U+0900 to U+097F) on the Unicode Consortium website.

Note that Unicode code points are traditionally written in hexadecimal. So rather than the decimal number 2350, the code point would be written as U+092E, and it appears as 092E on the code chart.
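As a small sketch of that notation, the following formats each char of the string in the U+XXXX style. The names CodePointNotation and Format are hypothetical, and the helper assumes BMP characters only, where one char equals one code point.

```csharp
using System;

static class CodePointNotation
{
    // Formats a BMP character in the conventional U+XXXX notation.
    public static string Format(char c)
    {
        return "U+" + ((int)c).ToString("X4");
    }

    static void Main()
    {
        foreach (char c in "मेरा")
            Console.WriteLine(Format(c)); // U+092E, U+0947, U+0930, U+093E (one per line)
    }
}
```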

Tman answered 5/5, 2011 at 19:46 Comment(0)
F
1

If you have the string s = "मेरा", then you already have the answer.

This string contains four code points in the BMP which in UTF-16 are represented by 8 bytes. You can access them by index with s[i], with a foreach loop etc.

If you want the underlying 8 bytes, you can access them like so:

string str = @"मेरा";
// Encoding.Unicode is UTF-16 (little-endian); GetBytes is an instance method.
byte[] arr = System.Text.Encoding.Unicode.GetBytes(str);
Fabian answered 5/5, 2011 at 19:57 Comment(4)
The string contains 4 code points and it's represented as 4 chars or 8 bytes. Your code (when fixed) returns an array of 8 bytes.Moving
@Moving I can only see 2 code points. Can you explain where 4 comes from?Fabian
@Moving It seems my Hindi is not very good!! I've corrected my answer.Fabian
Yes, when rendered, you can only see two glyphs. That's because the string contains two combining marks: U+0947 (a non-spacing mark) and U+093E (a spacing combining mark).Moving
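The difference between code points and visible glyphs can be checked directly. This is a rough sketch using CharUnicodeInfo and StringInfo from System.Globalization; the names MarkDemo and Categories are made up for the example.

```csharp
using System;
using System.Globalization;

static class MarkDemo
{
    // Returns the Unicode general category of each char in the string.
    public static UnicodeCategory[] Categories(string s)
    {
        UnicodeCategory[] result = new UnicodeCategory[s.Length];
        for (int i = 0; i < s.Length; i++)
            result[i] = CharUnicodeInfo.GetUnicodeCategory(s[i]);
        return result;
    }

    static void Main()
    {
        string s = "मेरा";
        UnicodeCategory[] categories = Categories(s);

        // Letters are OtherLetter; the vowel signs are combining marks.
        for (int i = 0; i < s.Length; i++)
            Console.WriteLine("U+{0:X4} {1}", (int)s[i], categories[i]);

        // Text elements group each base letter with its combining marks.
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 2
    }
}
```

The string has four chars but two text elements, which matches the two glyphs seen on screen.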
