The Grammar
Let's cover the grammar of the Devanagari block very quickly. As a developer, there are two character classes you'll want to concern yourself with:
- Sign: This is a character that affects a previously occurring character. Example: ्. The light-colored circle indicates the location of the center of the character it is to be placed upon.
- Letter / Vowel / Other: This is a character that may be affected by signs. Example: क.
Applying ् to क gives क्. But combinations can extend, so क् followed by षति will actually become क्षति (in this case, the first character is rotated right by 90 degrees, some of its stylistic elements are modified, and it is attached to the left side of the second character).
My answer here does not try to handle these endless (and tremendously beautiful) combinations. It simply splits the string into clusters: either a single letter, or a single letter together with the sign characters that affect it. If the question is "what are the characters of this Devanagari string?", then this is the right way to go. Otherwise every combination of letters would form a unique character of a unique length, and most of the concepts and algorithms associated with letter systems would fail.
So, for instance, a word made up of these symbols would be...
(letter) (letter) (sign) (sign) (letter) (sign)
In this case, you'll want the result...
[
0=>(letter),
1=>(letter) (sign) (sign),
2=>(letter) (sign),
]
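The grouping rule above can be sketched on abstract labels before any Unicode is involved. This is only an illustration: `groupByLetter` and the label strings 'letter' and 'sign' are hypothetical names, not part of the code further down.

```javascript
// Hypothetical helper: groups class labels so each sign joins the letter before it.
function groupByLetter(labels) {
    var groups = [];
    labels.forEach(function (label) {
        if (label === 'sign' && groups.length > 0) {
            // A sign attaches to the most recent group.
            groups[groups.length - 1].push(label);
        } else {
            // A letter (or a leading sign with nothing to attach to) starts a new group.
            groups.push([label]);
        }
    });
    return groups;
}

console.log(groupByLetter(['letter', 'letter', 'sign', 'sign', 'letter', 'sign']));
// [['letter'], ['letter', 'sign', 'sign'], ['letter', 'sign']]
```

This version scans forward because the labels are already known. The real code scans in reverse, which reaches the same grouping.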
The Code
The logic then isn't too bad: loop over the string in reverse, so that each sign is seen before the letter it attaches to.
I understand the code below is JavaScript, but the same principles will apply in other languages. Set the sign-types (keyed by decimal character code)...
function getEndWordGroupings() {return {'2304':true,'2305':true,'2306':true,'2307':true,'2362':true,'2363':true,'2364':true,'2365':true,'2366':true,'2367':true,'2368':true,'2369':true,'2370':true,'2371':true,'2372':true,'2373':true,'2374':true,'2375':true,'2376':true,'2377':true,'2378':true,'2379':true,'2380':true,'2381':true,'2382':true,'2383':true,'2385':true,'2386':true,'2389':true,'2390':true,'2391':true,'2402':true,'2403':true,'2416':true,'2417':true,};}
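If the long literal is hard to audit, the same table could be built from inclusive code-point ranges. A minimal sketch, assuming the ranges below are a faithful transcription of the 35 codes listed above; `buildSignLookup` and `SIGN_RANGES` are hypothetical names introduced here.

```javascript
// Hypothetical constant: inclusive [first, last] ranges covering the codes in the table above.
var SIGN_RANGES = [
    [0x0900, 0x0903], // 2304-2307
    [0x093A, 0x094F], // 2362-2383
    [0x0951, 0x0952], // 2385-2386
    [0x0955, 0x0957], // 2389-2391
    [0x0962, 0x0963], // 2402-2403
    [0x0970, 0x0971]  // 2416-2417
];

// Hypothetical helper: expands the ranges into a {code: true} lookup object.
function buildSignLookup(ranges) {
    var lookup = {};
    ranges.forEach(function (range) {
        for (var code = range[0]; code <= range[1]; code++) {
            lookup[code] = true;
        }
    });
    return lookup;
}

var signLookup = buildSignLookup(SIGN_RANGES);
console.log(Object.keys(signLookup).length); // 35
console.log(signLookup[2381] === true);      // true (the sign ्)
```

The result has the same shape as the object returned by getEndWordGroupings(), so either could be used for the lookup.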
And convert string to chars...
function stringToChars(args) {
    var word = args.word;
    var chars = [];
    var endings = getEndWordGroupings();
    var incluster = false; // true while collected signs wait for their base letter
    var cluster = '';
    var whitespace = /\s/;
    // Walk backwards so that signs are seen before the letter they modify.
    for (var i = word.length - 1; i >= 0; i--) {
        var character = word.charAt(i);
        var charactercode = word.charCodeAt(i);
        if (incluster) {
            if (whitespace.test(character)) {
                // Signs with no base letter: emit them as their own cluster.
                incluster = false;
                chars.push(cluster);
                cluster = '';
            } else if (endings[charactercode]) {
                // Another sign on the same letter: keep collecting.
                cluster = character + cluster;
            } else {
                // The base letter completes the cluster.
                incluster = false;
                cluster = character + cluster;
                chars.push(cluster);
                cluster = '';
            }
        } else if (endings[charactercode]) {
            // First sign seen: start a new cluster.
            incluster = true;
            cluster = character;
        } else if (whitespace.test(character)) {
            // Whitespace outside a cluster is skipped, so no empty entries are added.
            continue;
        } else {
            chars.push(character);
        }
    }
    if (cluster.length > 0) {
        chars.push(cluster);
    }
    return chars.reverse();
}
console.log(stringToChars({'word':'क्षऀति'}));
The Results
Output:
["क्", "षऀ", "ति"]
If I had used plain parsing, the output would have been
["क", "्", "ष", "ऀ", "त", "ि"]
Hint: See the signs up above with a light circle in them? That light circle indicates the location of the character that the sign affects. Looking back at the converted output, it's very easy to see how the letters were combined into new characters. Neat!
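As a cross-check on the grouped output above, the same split can be sketched with a regular expression in engines that support Unicode property escapes (ES2018 and later). This swaps in a different technique from the reverse loop: it assumes a "sign" can be approximated by the Unicode Mark category (`\p{M}`), and `splitIntoClusters` is a hypothetical name.

```javascript
// Assumption: "sign" is approximated by the Unicode Mark category (\p{M}),
// which is close to, but not identical to, the hand-written table of sign codes.
function splitIntoClusters(text) {
    // Either one non-mark, non-whitespace character followed by any number of marks,
    // or a run of marks that has no base character in front of it.
    return text.match(/[^\p{M}\s]\p{M}*|\p{M}+/gu) || [];
}

console.log(splitIntoClusters('क्षऀति'));
// ["क्", "षऀ", "ति"]
```

The two definitions of "sign" do not match exactly. For example, code 2365 (U+093D) is in the hand-written table but is not a Mark, so results can differ on strings that contain it.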