How can I parse an Arabic Umm Al-Qura date string into a .NET DateTime object?

I have the following Arabic date in the Umm Al-Qura calendar that I want to parse into a .NET DateTime object:

الأربعاء‏، 17‏ ذو الحجة‏، 1436

This date is equivalent to September 30th 2015 in the Gregorian calendar.

I've been trying the following "standard" C# code to parse this date, but without success:

var cultureInfo = new CultureInfo("ar-SA");
cultureInfo.DateTimeFormat.Calendar = new UmAlQuraCalendar(); // the default one anyway

var dateFormat = "dddd، dd MMMM، yyyy"; //note the ، instead of ,

var dateString = "‏الأربعاء‏، 17‏ ذو الحجة‏، 1436";
DateTime date;
DateTime.TryParseExact(dateString, dateFormat, cultureInfo.DateTimeFormat, DateTimeStyles.AllowWhiteSpaces, out date);

No matter what I do, the result of TryParseExact is always false. How do I parse this string properly in .NET?

By the way, if I start from a DateTime object, I can create the exact date string above using ToString()'s overloads on DateTime without problems. I just can't do it the other way around apparently.

Buxtehude answered 30/9, 2015 at 8:36 Comment(4)
I think you can safely replace the first two lines by CultureInfo.GetCultureInfoByIetfLanguageTag("ar-SA"); but it doesn't seem to fix the issue.Levileviable
@Gimly: That's right, nothing changes if I use your proposed method instead of the constructor.Buxtehude
Can you post the ToString() results? Can you parse the results of ToString() back into a DateTime? The two strings can't be identical if you can parse the ToString() results. I would like to compare the two strings character by character to make sure they are identical.Peptide
@jdweng: Yes, interestingly enough, I can parse the ToString() result back into a DateTime. But the thing is, I see the date strings as identical in both cases in Visual Studio. I'm pretty sure by now that it has something to do with right-to-left text directionality...Buxtehude

Your date string is 30 characters long and contains four U+200F (decimal 8207) RIGHT-TO-LEFT MARK characters, but your date format does not.

// This gives a string 26 characters long
var str = new DateTime(2015, 9, 30).ToString(dateFormat, cultureInfo.DateTimeFormat);

RIGHT-TO-LEFT MARK is not whitespace, so DateTimeStyles.AllowWhiteSpaces does not skip over it.
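
One way to spot invisible characters like this is to dump every code unit together with its Unicode category. The snippet below is only a minimal sketch: `CodePointDump` and `Describe` are hypothetical names made up for illustration, and the sample text uses an explicit `\u200F` escape so the mark is visible in the source.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class CodePointDump
{
    // Hypothetical helper (not part of .NET): returns one
    // "U+XXXX Category" entry per UTF-16 code unit in the input.
    public static string[] Describe(string text)
    {
        return text
            .Select(c => string.Format("U+{0:X4} {1}", (int)c,
                                       CharUnicodeInfo.GetUnicodeCategory(c)))
            .ToArray();
    }

    static void Main()
    {
        // A short fragment containing one RIGHT-TO-LEFT MARK after the digits.
        var sample = "17\u200F ذو";
        foreach (var entry in Describe(sample))
            Console.WriteLine(entry);

        // U+200F is in the Format category, and char.IsWhiteSpace rejects it.
        Console.WriteLine(char.IsWhiteSpace('\u200F'));
    }
}
```

Running the same dump on the string to parse and on the ToString() output should make any extra U+200F entries stand out.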

If the string contains only the directional marks (RLM, LRM, ALM), you should probably just strip them out. The same goes for the isolates (LRI, RLI, FSI and the closing PDI) and the embeddings (LRE, RLE). You may not want to do that with LRO, though: it is often used with legacy data where the RTL characters are stored in reverse, i.e. left-to-right, order. In those cases you may want to actually reverse the characters.
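
As a rough sketch of the stripping approach, assuming the input looks like the string in the question: `BidiCleanup` and `StripBidiControls` are hypothetical names, not .NET APIs. The list of characters is a simplification that also removes U+202C (PDF) while leaving the overrides LRO/RLO in place, and whether the final parse succeeds may depend on the runtime's culture data for ar-SA.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class BidiCleanup
{
    // LRM, RLM, ALM, LRE, RLE, PDF, LRI, RLI, FSI, PDI.
    // The overrides LRO (U+202D) and RLO (U+202E) are deliberately kept,
    // because text wrapped in them may need reordering instead.
    private const string Strippable =
        "\u200E\u200F\u061C\u202A\u202B\u202C\u2066\u2067\u2068\u2069";

    // Hypothetical helper: drops the directional controls listed above.
    public static string StripBidiControls(string input)
    {
        return new string(input.Where(c => Strippable.IndexOf(c) < 0).ToArray());
    }

    static void Main()
    {
        var cultureInfo = new CultureInfo("ar-SA");
        cultureInfo.DateTimeFormat.Calendar = new UmAlQuraCalendar();
        var dateFormat = "dddd، dd MMMM، yyyy";

        // The question's string, with the four RLMs written as escapes.
        var raw = "\u200Fالأربعاء\u200F، 17\u200F ذو الحجة\u200F، 1436";
        var cleaned = StripBidiControls(raw);

        DateTime date;
        bool ok = DateTime.TryParseExact(cleaned, dateFormat, cultureInfo.DateTimeFormat,
                                         DateTimeStyles.AllowWhiteSpaces, out date);
        Console.WriteLine("removed {0} chars, parsed: {1}", raw.Length - cleaned.Length, ok);
    }
}
```

An explicit character list is used here instead of filtering on the whole Format category, since that category also contains characters such as zero-width joiners that may matter elsewhere in the text.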

Parsing dates from random places is a hard problem. You need a layered solution: try one method first, then another, in priority order, until you succeed. There is no 100% solution though, because people can type what they like.
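
A layered parser along those lines might look like the sketch below. This is an illustration under assumptions, not a complete solution: `LayeredDateParser` and `TryParseLayered` are hypothetical names, the layers and their order are just one plausible choice, and the demo uses the invariant culture with an ISO-style format only to keep the example deterministic.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class LayeredDateParser
{
    // Hypothetical helper: tries progressively more lenient strategies.
    public static bool TryParseLayered(string text, string[] formats,
                                       CultureInfo culture, out DateTime result)
    {
        const DateTimeStyles styles = DateTimeStyles.AllowWhiteSpaces;

        // Layer 1: the text exactly as received, against the known formats.
        if (DateTime.TryParseExact(text, formats, culture, styles, out result))
            return true;

        // Layer 2: the same formats after removing LRM, RLM and ALM.
        var cleaned = new string(text
            .Where(c => c != '\u200E' && c != '\u200F' && c != '\u061C')
            .ToArray());
        if (DateTime.TryParseExact(cleaned, formats, culture, styles, out result))
            return true;

        // Layer 3: fall back to the culture's general-purpose parser.
        return DateTime.TryParse(cleaned, culture, styles, out result);
    }

    static void Main()
    {
        DateTime date;
        // The embedded RLM makes layer 1 fail; layer 2 succeeds.
        bool ok = TryParseLayered("2015\u200F-09-30", new[] { "yyyy-MM-dd" },
                                  CultureInfo.InvariantCulture, out date);
        Console.WriteLine(ok);
    }
}
```

For the Umm Al-Qura case, the same method could be called with the ar-SA culture and the format taken from the document.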

See here for more information: http://www.unicode.org/reports/tr9/

Bakery answered 30/9, 2015 at 9:11 Comment(7)
So you are saying that I have to explicitly include the RTL mark characters in the date format, just like I do with the ، character?Buxtehude
Depends. Your string is hard-coded - Will the RLM always be there or only sometimes? Where is the data from? Will other Unicode directional marks appear? How will you handle LRO?Bakery
Both the date string and its format are coming from the underlying XML of an MS Word (docx) file, so I don't have much control over them. I was hoping that using the date, calendar and format I would be able to parse them into a DateTime object in a straightforward manner.Buxtehude
Parsing dates from random places is a hard problem. You need a layered solution, try first one method, then another in priority order until you succeed. There is no 100% solution though, because people can type what they like.Bakery
You were right, stripping out those special characters did the trick! Now the string can be parsed into a DateTime object. As for the data source, I wouldn't worry too much, since in my situation the date format is usually chosen by the user from a predefined set offered by MS Word (a date picker content control, to be specific). You might want to update your answer to include your comment.Buxtehude
You know the user can change their default date format, right?Bakery
I know, but as long as MS Word saves it correctly, it shouldn't be a problem.Buxtehude

This is a Right-To-Left culture, which means that the year will be rendered first. For example, the following code:

var cultureInfo = new CultureInfo("ar-SA");
cultureInfo.DateTimeFormat.Calendar = new UmAlQuraCalendar(); 
Console.WriteLine(String.Format(cultureInfo,"{0:dddd، dd MMMM، yyyy}",DateTime.Now));

produces الأربعاء، 17 ذو الحجة، 1436. Parsing this string works without problem:

var dateFormat = "dddd، dd MMMM، yyyy";
var dateString = "الأربعاء، 17 ذو الحجة، 1436";
DateTime date;
var result = DateTime.TryParseExact(dateString, dateFormat, cultureInfo.DateTimeFormat,
                                    DateTimeStyles.AllowWhiteSpaces, out date);
Debug.Assert(result);

PS: I don't know how to write the format string to parse the original input, because changing the position of what looks like a comma to me changes the actual characters rendered in the string.

Immunogenic answered 30/9, 2015 at 8:51 Comment(2)
I'm not sure I understood from your code how you realized that parsing that string works without problems. If I'm not mistaken, we're talking about exactly the same string as in my original post, "الأربعاء، 17 ذو الحجة، 1436".Buxtehude
I didn't realize it, I ran it. What I just realized though, is that copy/pasting your string to LinqPad reversed it to var dateString = "‏الأربعاء‏، 17‏ ذو الحجة‏، 1436";. Windows detects whether a Unicode string is RTL or not, and changes e.g. what the cursor arrows do, how text is pasted etc. As @Bakery answered though, the two strings are probably not the same.Immunogenic
