Reading dates from OpenXml Excel files
Asked Answered
F

4

11

I'm trying to read data from the .xlsx files using SharpZipLib to unpack it (in memory) and reading the inner xml files. Everything is fine but recognizing the dates - they're stored in julean format and I need to somehow recognize if a number is a date or only a number. In another topic (unfortunately it died and I need quick answer) I got to know some things from Mark Baker, but it's still not enough...

"Excel stores dates as a float value... the integer part being the number of days since 1/1/1900 (or 1/1/1904 depending on which calendar is being used), the fractional part being the proportion of a day (ie the time part)... made slightly more awkward by the fact that 1900 is considered a leap year.

The only thing that differentiates a data from a number is the number format mask. If you can read the format mask, you can use that to identify the value as a date rather than a number... then calculate the date value/formatting from the base date."

"But doesn't the attribute "s" for dates has always the value of "1"? I know it defines style, but maybe? ;)"

The s attribute references a style xf entry in styles.xml, and it won't always be entry 1 for dates... it all depends how many different styles are being used in the workbook. The style xf in turn references a number format mask. To identify a cell that contains a date, you need to perform the style xf -> numberformat lookup, then identify whether that numberformat mask is a date/time numberformat mask (rather than, for example, a percentage or an accounting numberformat mask)

"One more question - I'm now looking at the style.xml's content and in the section I see elements like: "<xf numFmtId="14" ... applyNumberFormat="1" />", "<xf numFmtId="1" ... applyNumberFormat="1" />", etc. but there is no <numFmts> section... Are there any "standard" formats? Or am I just missing something?"

Can someone please help me out? Thanks in advance.

Frenum answered 11/1, 2011 at 8:24 Comment(2)
Here a list of date format Ids msdn.microsoft.com/en-us/library/…Ricciardi
possible duplicate of What indicates an Office Open XML Cell contains a Date/Time value?Cutshall
P
18

You should find the numFmts section somewhere near the top of style.xml, as part of the styleSheet element

<?xml version="1.0" encoding="UTF-8" standalone="yes" ?> 
    <styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
        <numFmts count="3">
            <numFmt numFmtId="164" formatCode="[$-414]mmmm\ yyyy;@" /> 
            <numFmt numFmtId="165" formatCode="0.000" /> 
            <numFmt numFmtId="166" formatCode="#,##0.000" /> 
        </numFmts>

EDIT

I've been double-checking my xlsx reader code (it's been a long while since I delved into that part of the library); and there are built-in formats. Number format codes (numFmtId) less than 164 are "built-in".

The list that I have is incomplete:

0 = 'General';
1 = '0';
2 = '0.00';
3 = '#,##0';
4 = '#,##0.00';
5 = '$#,##0;\-$#,##0';
6 = '$#,##0;[Red]\-$#,##0';
7 = '$#,##0.00;\-$#,##0.00';
8 = '$#,##0.00;[Red]\-$#,##0.00';
9 = '0%';
10 = '0.00%';
11 = '0.00E+00';
12 = '# ?/?';
13 = '# ??/??';
14 = 'mm-dd-yy';
15 = 'd-mmm-yy';
16 = 'd-mmm';
17 = 'mmm-yy';
18 = 'h:mm AM/PM';
19 = 'h:mm:ss AM/PM';
20 = 'h:mm';
21 = 'h:mm:ss';
22 = 'm/d/yy h:mm';

37 = '#,##0 ;(#,##0)';
38 = '#,##0 ;[Red](#,##0)';
39 = '#,##0.00;(#,##0.00)';
40 = '#,##0.00;[Red](#,##0.00)';

44 = '_("$"* #,##0.00_);_("$"* \(#,##0.00\);_("$"* "-"??_);_(@_)';
45 = 'mm:ss';
46 = '[h]:mm:ss';
47 = 'mmss.0';
48 = '##0.0E+0';
49 = '@';

27 = '[$-404]e/m/d';
30 = 'm/d/yy';
36 = '[$-404]e/m/d';
50 = '[$-404]e/m/d';
57 = '[$-404]e/m/d';

59 = 't0';
60 = 't0.00';
61 = 't#,##0';
62 = 't#,##0.00';
67 = 't0%';
68 = 't0.00%';
69 = 't# ?/?';
70 = 't# ??/??';
Poster answered 11/1, 2011 at 8:45 Comment(17)
That's where the numFmts should be held. Was the workbook generated using Excel itself? If you open the file in question in MS Excel, does it recognise the cell values as dates?Poster
Yes and the cells are formatted as dates (if I put a number like 40180 in there it also converts it to date).Frenum
Sorry about the slow, hesitant response... it's been a long time since I looked at that bit of code, and I'd simply forgotten about the set of "built-ins". Your question has now prompted me to track down the others that I don't have in the list.Poster
Within the rest of my code, when I test to see if a format mask is a date or time format, I check for the presence of any of the ymdHis characters in the format mask; but watch out for any of these characters in a colour name (e.g. '#,##0 ;[Cyan](#,##0)') or any format masks that begin with '_' or '0 '.Poster
About "the others" - I think all of them are listed in the "Part 4" file here: ecma-international.org/news/TC45_current_work/… (pages 2134+)Frenum
Thanks, yes! I've just been reading through the relevant section of Part 4 of the ECMA spec, and they're predominantly date format variants for far eastern languages (Chines, Korean, Japanese, etc.)Poster
I'm glad I could also help you somehow. If I knew about the predefined formats, I would probably find this file earlier ;)Frenum
<numFmt numFmtId="7" formatCode=""£"#,##0.00;\-"£"#,##0.00" /><numFmt numFmtId="8" formatCode=""£"#,##0.00;[Red]\-"£"#,##0.00" />Caprification
They're currency formats. I just wondered whether, if I were to format a cell using Excel, the generated styles would tell me which built-in formats translated into which numFmt values. As it happened, it wasn't that simple because some of the "standard" Excel formatting options are not built-in, so Excel generates 164+ ids for them. I noticed 7 & 8 and was going to post that these were the built in ids for currencies using those formats, but my <cr> to get a new line created the post... and I couldn't edit it.Caprification
Thanks for clarifying that @Ross. If these are built-in, I suspect they're perhaps slightly more dynamic (perhaps selecting a currency based on country code) but I'll check up on themPoster
Well... I'm in the UK, so I would expect that the "£" is the currency symbol of the current enviroment/culture. Depending upon where you are, I would expect perhaps a "$" symbol.Caprification
Just out of interest really, but in the above list, (ie. numFmt 57), to what does the '[$-404]' relate ? Similarly, what does the 'e' in 'e/m/d' relate to ? It seems to be year, but why 'e' ? Epoch ?Caprification
The [$-404] is a locale setting... 404 is the locale ID for "Chinese - Taiwan" (msdn.microsoft.com/en-us/goglobal/bb964664.aspx for the list of locale IDs)Poster
I'm not certain about e, but using it in a date mask instead of yyyy gives me a value of 5280, which would be a base year of 3269BC. Not sure if this is the supposed biblical creation of the earth, or start of the chinese/mayan/whatever calendarPoster
You have a typo on #14, the one the questioner asked about.Cutshall
You have a typo on #14 (the one the questioner asked about). It should read 'm/d/yyyy' instead (without the dashes).Cutshall
Microsoft link for codes: learn.microsoft.com/en-us/previous-versions/office/developer/…Bertberta
C
10

Cells may have styles. These are uints that index cellXfs in the styleSheet. Each cellXfs item contains a set of attributes. The most important is NumberFormatID. If its value falls in the range 14-22 it is a "standard" date. If it falls in the range 165 - 180, it is a "formatted" date and will have a corresponding NumberingFormat attribute.

Standard Date

[x:c r="A2" s="2"][x:v]38046[/x:v][/x:c]

[x:xf numFmtId="14" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1" /] (ordinal position 2)

Formatted Date

[x:c r="A4" s="4"][x:v]38048[/x:v][/x:c]

[x:xf numFmtId="166" fontId="0" fillId="0" borderId="0" xfId="0" applyNumberFormat="1" /](ordinal position 4)

[x:numFmt numFmtId="166" formatCode="m/d;@" /]

This code extracts a list of style IDs that correspond to these date formats.

  private void GetDateStyles()
  {
     //
     // The only way to tell dates from numbers is by looking at the style index. 
     // This indexes cellXfs, which contains NumberFormatIds, which index NumberingFormats.
     // This method creates a list of the style indexes that pertain to dates.
     WorkbookStylesPart workbookStylesPart = (WorkbookStylesPart) UriPartDictionary["/xl/styles.xml"];
     Stylesheet styleSheet = workbookStylesPart.Stylesheet;
     CellFormats  cellFormats = styleSheet.CellFormats;

     int i = 0;
     foreach (CellFormat cellFormat in cellFormats)
     {
        uint numberFormatId = cellFormat.NumberFormatId;
        if ((numberFormatId >= 14 && numberFormatId <= 22) 
        || (numberFormatId >= 165u && numberFormatId <= 180u))
        {
           _DateStyles.Add(i.ToString());
        }
        i++;
     }
Caryl answered 9/3, 2011 at 19:22 Comment(3)
How to get UriPartDictionary?Real
Ok. I already know. _document = SpreadsheetDocument.Open(filePath, true); UriPartDictionary = BuildUriPartDictionary(_document); https://mcmap.net/q/589572/-excel-quot-refresh-all-quot-with-openxmlReal
link:MSDN not documented properly. But POI claims the same in link:Issue. No official confirmation in ECMA yet. As I know.Ricciardi
S
5

I would suggest that numFmtId="14" should be considered to be "Windows Short Date format" as in Australia this format will display a date as, "dd/mm/yy", and not "mm/dd/yy".

Sepoy answered 24/10, 2011 at 3:29 Comment(1)
Yep. I agree with you on this. Seems there are couple of versions of this list all over the Internet, and none of them take into account localization on type 14. I'm guessing others are wrong as well.Busby
C
2

There are two ways to get the date format for a cell.

You start by grabbing the "s" or StyleIndex. Note the date in numeric raw format below (40667):

<row r="1">
  <c r="A1" s="1">
    <v>40667</v>
  </c>
</row>

The "s" attribute in the cells nodes points to a zero-based array of styles.xml nodes starting at 0. This is the key to locating the date format, if any, that maps to the raw numeric date data. You see s=1, that points to the 2nd xf node in the following cell formatting styles.xml section of your Excel workbook:

   <cellXfs count="2">
     <xf numFmtId="0" ... />
     <xf numFmtId="14" ... />
   </cellXfs>

In the second node you see the numFmtId="14" value. That is the numberFormatID. It tells you that that is the id needed to determine what your date number should be presented in. But that number points to two possible places for the date format. If its number is in the range 14-22 its a built in style for date. If its outside that range its (possibly) a custom date format added by the excel file owner. You wont know until you check both places.

In the first case, if its a value 14-22, you will need to map it to one of the pre-built date formats every excel file has (mm-dd-yy, etc.). You can locate that table in the OpenXML SDK. Here is a sample of those with the numFmtId mapped to the built-in date formats....

14  mm-dd-yy
15  d-mmm-yy
16  d-mmm
17  mmm-yy
18  h:mm AM/PM

At this point you know its a date and what format its to be presented in. If its not one of those values, its likely a custom number. And you now have to search the styles.xml file again for a style node with a matching numFmtId value. Those nodes will contain the custom date format as follows:

    <numFmts count="2">
        <numFmt numFmtId="164" formatCode="mm/yyyy;@" /> 
        <numFmt numFmtId="165" formatCode="0.000" /> 
        <numFmt numFmtId="166" formatCode="#,##0.000" /> 
    </numFmts>

Note that if your numFmtId was 164, you found its custom date format. So to catch all these crazy date formats, custom and built in, your best bet is to maintain a range of acceptable "formats" as strings, locate your formatCode, then see if it matches one of the acceptable ones in your code.

Good Luck!

Curvaceous answered 7/9, 2017 at 4:36 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.