Having trouble reading excel file with the OpenXML sdk
Asked Answered
T

3

4

I have a function that reads from an excel file and stores the results in a DataSet. I have another function that writes to an excel file. When I try to read from a regular human-generated excel file, the excel reading function returns a blank DataSet, but when I read from the excel file generated by the writing function, it works perfectly fine. The function then will not work on a regular generated excel file, even when I just copy and paste the contents of the function generated excel file. I finally tracked it down to this, but I have no idea where to go from here. Is there something wrong with my code?

Here is the excel generating function:

public static Boolean writeToExcel(string fileName, DataSet data)
{
    Boolean answer = false;
    using (SpreadsheetDocument excelDoc = SpreadsheetDocument.Create(tempPath + fileName, SpreadsheetDocumentType.Workbook))
    {
        WorkbookPart workbookPart = excelDoc.AddWorkbookPart();
        workbookPart.Workbook = new Workbook();
        WorksheetPart worksheetPart = workbookPart.AddNewPart<WorksheetPart>();
        Sheets sheets = excelDoc.WorkbookPart.Workbook.AppendChild<Sheets>(new Sheets());
        Sheet sheet = new Sheet()
        {
            Id = excelDoc.WorkbookPart.GetIdOfPart(worksheetPart),
            SheetId = 1,
            Name = "Page1"
        };

        sheets.Append(sheet);

        CreateWorkSheet(worksheetPart, data);
        answer = true;
    }

    return answer;
}

private static void CreateWorkSheet(WorksheetPart worksheetPart, DataSet data)
{
    Worksheet worksheet = new Worksheet();
    SheetData sheetData = new SheetData();
    UInt32Value currRowIndex = 1U;
    int colIndex = 0;
    Row excelRow;
    DataTable table = data.Tables[0];

    for (int rowIndex = -1; rowIndex < table.Rows.Count; rowIndex++)
    {
        excelRow = new Row();
        excelRow.RowIndex = currRowIndex++;
        for (colIndex = 0; colIndex < table.Columns.Count; colIndex++)
        {
            Cell cell = new Cell()
            {
                CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),
                DataType = CellValues.String
            };

            CellValue cellValue = new CellValue();

            if (rowIndex == -1)
            {
                cellValue.Text = table.Columns[colIndex].ColumnName.ToString();
            }
            else
            {
                cellValue.Text = (table.Rows[rowIndex].ItemArray[colIndex].ToString() != "") ? table.Rows[rowIndex].ItemArray[colIndex].ToString() : "*";
            }

            cell.Append(cellValue);
            excelRow.Append(cell);
        }

        sheetData.Append(excelRow);
    }

    SheetFormatProperties formattingProps = new SheetFormatProperties()
    {
        DefaultColumnWidth = 20D,
        DefaultRowHeight = 20D
    };

    worksheet.Append(formattingProps);
    worksheet.Append(sheetData);
    worksheetPart.Worksheet = worksheet;
}

while the reading function is as following:

public static void readInventoryExcel(string fileName, ref DataSet set)
{
    using (SpreadsheetDocument spreadsheetDocument = SpreadsheetDocument.Open(fileName, false))
    {
        WorkbookPart workbookPart = spreadsheetDocument.WorkbookPart;
        WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();
        SheetData sheetData = worksheetPart.Worksheet.Elements<SheetData>().First();
        int count = -1;
        foreach (Row r in sheetData.Elements<Row>())
        {
            if (count >= 0)
            {
                DataRow row = set.Tables[0].NewRow();
                row["SerialNumber"]         = r.ChildElements[1].InnerXml;
                row["PartNumber"]           = r.ChildElements[2].InnerXml;
                row["EntryDate"]            = r.ChildElements[3].InnerXml;
                row["RetirementDate"]       = r.ChildElements[4].InnerXml;
                row["ReasonForReplacement"] = r.ChildElements[5].InnerXml;
                row["RetirementTech"]       = r.ChildElements[6].InnerXml;
                row["IncludeInMaintenance"] = r.ChildElements[7].InnerXml;
                row["MaintenanceTech"]      = r.ChildElements[8].InnerXml;
                row["Comment"]              = r.ChildElements[9].InnerXml;
                row["Station"]              = r.ChildElements[10].InnerXml;
                row["LocationStatus"]       = r.ChildElements[11].InnerXml;
                row["AssetName"]            = r.ChildElements[12].InnerXml;
                row["InventoryType"]        = r.ChildElements[13].InnerXml;
                row["Description"]          = r.ChildElements[14].InnerXml;
                set.Tables[0].Rows.Add(row);
            }
            count++;
        }
    }
}
Trichloride answered 6/8, 2014 at 16:38 Comment(0)
B
7

I think this is caused by the fact that you have only one sheet whereas Excel has three. I'm not certain but I think the sheets are returned in reverse order so you should change the line:

WorksheetPart worksheetPart = workbookPart.WorksheetParts.First();

to

WorksheetPart worksheetPart = workbookPart.WorksheetParts.Last();

It might be safer to search for the WorksheetPart if you can identify it by the sheet name. You need to find the Sheet first then use the Id of that to find the SheetPart:

private WorksheetPart GetWorksheetPartBySheetName(WorkbookPart workbookPart, string sheetName)
{
    //find the sheet first.
    IEnumerable<Sheet> sheets = workbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>().Where(s => s.Name == sheetName);

    if (sheets.Count() > 0)
    {
        string relationshipId = sheets.First().Id.Value;
        WorksheetPart worksheetPart = (WorksheetPart)workbookPart.GetPartById(relationshipId);
        return worksheetPart;
    }

    return null;
}

You can then use:

WorksheetPart worksheetPart = GetWorksheetPartBySheetName(workbookPart, "Sheet1");

There are a couple of other things I've noticed whilst looking at your code which you may (or may not!) be interested in:

In your code you are only reading the InnerXml so it might not matter to you but the way Excel stores strings is different to the way you are writing them so reading an Excel generated file may not give you the values you expect. In your example you are writing the string directly to the cell like this:

XML of Cell value

But Excel uses a SharedStrings concept where all strings are written to a separate XML file called sharedStrings.xml. That file contains the strings used in the Excel file with a reference and it's that value that is stored in the cell value in the sheet XML.

The sharedString.xml looks like this:

Shared Strings XML

And the Cell then looks like this:

Cell value with sharedString

The 47 in the <v> element is a reference to the 47th shared string. Note that the type (the t attribute) in your generated XML is str but the type in the Excel generated file is s. This denotes yours is an inline string and theirs is a shared string.

You can read the SharedStrings just as you would any other part:

var stringTable = workbookPart.GetPartsOfType<SharedStringTablePart>().FirstOrDefault();

if (stringTable != null)
{
    sharedString = stringTable.SharedStringTable.ElementAt(int.Parse(value)).InnerText;
}

Secondly, if you look at the cell reference that your code generates and the cell reference that Excel generates you can see you are only outputting the column and not the row (e.g. you output A instead of A1). To fix this you should change the line:

CellReference = Convert.ToString(Convert.ToChar(65 + colIndex)),

to

CellReference = Convert.ToString(Convert.ToChar(65 + colIndex) + rowIndex.ToString()),

I hope that helps.

Bronny answered 6/8, 2014 at 20:45 Comment(12)
That helped a lot. But now I'm running into another problem that I did not run into before. The returned SheetData contains all the rows of the sheet, not just the ones that I am using, which was causing my app to crash. Now that I found out why it was crashing, I'm a little lost as to how I should figure out how many rows actually have information in them. As for the other things, I really just took the writing code from the Internet, unlike the reading code. So thanks for the heads up. I will take a look into it as soon as I get out of this mess.Trichloride
Could you check the number of Cells in the Row matches what you expect and ignore it if it doesn't (maybe break out of the loop if your rows are always the first in the sheet). Something like r.Elements<Cell>().Count == x; where x is your expected number of cells?Bronny
Ok, now I ran into an issue where my some of my values are not what they are in the Excel spreadsheet, and I'm pretty sure it has to do with the shared strings you where talking about. So how do I get the shared string when I have the specific cell? Or am I supposed to get it some other way?Trichloride
@Trichloride Have a look at this question https://mcmap.net/q/376141/-openxml-to-create-a-datatable-from-excel-money-cell-value-incorrect. That should give you what you need.Bronny
Never mind, I got it. Those random numbers I was getting were the "value" variables. But now how do I know when excel will use a shared string and when it will just store the value, since some of the columns had the actual value? And thanks again for your patience.Trichloride
You can look at the cells type. cell.DataType == CellValues.SharedString.Bronny
So, I saw the question, but when I tried to do something similar, but with a ternary conditional operator, the statement is always evaluated as true that the the CellType is SharedString. Am I missing something or is my Excel spreadsheet just broken? row["SerialNumber"] = (((Cell)r.ChildElements[0]).DataType.Value == CellValues.SharedString) ? stringTable.SharedStringTable.ElementAt(int.Parse(r.ChildElements[0].InnerText)).InnerText : r.ChildElements[0].InnerText; This is the serial number, which is not shared, causing the program to crash.Trichloride
I think you should probably ask a new question for that as this one was solved by selecting the right sheet. I'm happy to take a look at a new question.Bronny
If you're still interested, here is my new question. #25187520 And thanks again for all your help.Trichloride
Nice post. Thanks a lot.Miscall
@Bronny This is an incredibly well structured answer, and very informative. I had completely forgotten about this Shared Strings Table, which caused me a great deal of grief. Thank you for going into so much detail with this response, your effort is really appreciated!Slowdown
Thank you @Stevo, it's so nice to get feedback and to know that an answer is helpful.Bronny
A
0

I ran into a similar issue a while back trying to do this for Word documents (procedurally generated worked fine, but human-generated did not). I found this tool to be very helpful:

http://www.microsoft.com/en-us/download/details.aspx?id=30425

Basically, it looks at a file and shows you the code that Microsoft would generate to read it, as well as the xml structure of the file itself. As usual for Microsoft products, there are quite a few menus and it's not very intuitive, but after clicking around for a bit you will be able to see exactly what is going on with any two files. I would recommend you open a working excel file and a non-working one and compare the difference to see what's causing your issue.

Appalachian answered 6/8, 2014 at 19:7 Comment(1)
I am using the OpenXml SDK for this project. However, it still gives me errors. Should I be looking at a specific class in the SDK?Trichloride
S
0

Below is the OpenXML code that I use to read in a particular Worksheet from an Excel file, into a DataTable.

First, here's how you'd call it:

DataTable dt = OpenXMLHelper.ExcelWorksheetToDataTable("C:\\SQL Server\\SomeExcelFile.xlsx", "Mikes Worksheet");

And here's the code:

    public class OpenXMLHelper
    {
        public static DataTable ExcelWorksheetToDataTable(string pathFilename, string worksheetName)
        {
            DataTable dt = new DataTable(worksheetName);

            using (SpreadsheetDocument document = SpreadsheetDocument.Open(pathFilename, false))
            {
                // Find the sheet with the supplied name, and then use that 
                // Sheet object to retrieve a reference to the first worksheet.
                Sheet theSheet = document.WorkbookPart.Workbook.Descendants<Sheet>().Where(s => s.Name == worksheetName).FirstOrDefault();
                if (theSheet == null)
                    throw new Exception("Couldn't find the worksheet: " + worksheetName);

                // Retrieve a reference to the worksheet part.
                WorksheetPart wsPart = (WorksheetPart)(document.WorkbookPart.GetPartById(theSheet.Id));
                Worksheet workSheet = wsPart.Worksheet;

                string dimensions = workSheet.SheetDimension.Reference.InnerText;       //  Get the dimensions of this worksheet, eg "B2:F4"

                int numOfColumns = 0;
                int numOfRows = 0;
                CalculateDataTableSize(dimensions, ref numOfColumns, ref numOfRows);
                System.Diagnostics.Trace.WriteLine(string.Format("The worksheet \"{0}\" has dimensions \"{1}\", so we need a DataTable of size {2}x{3}.", worksheetName, dimensions, numOfColumns, numOfRows));

                SheetData sheetData = workSheet.GetFirstChild<SheetData>();
                IEnumerable<Row> rows = sheetData.Descendants<Row>();

                string[,] cellValues = new string[numOfColumns, numOfRows];

                int colInx = 0;
                int rowInx = 0;
                string value = "";
                SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;

                //  Iterate through each row of OpenXML data
                foreach (Row row in rows)
                {
                    for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
                    {
                        //  *DON'T* assume there's going to be one XML element for each item in each row...
                        Cell cell = row.Descendants<Cell>().ElementAt(i);
                        if (cell.CellValue == null || cell.CellReference == null)
                            continue;                       //  eg when an Excel cell contains a blank string

                        //  Convert this Excel cell's CellAddress into a 0-based offset into our array (eg "G13" -> [6, 12])
                        colInx = GetColumnIndexByName(cell.CellReference);             //  eg "C" -> 2  (0-based)
                        rowInx = GetRowIndexFromCellAddress(cell.CellReference)-1;     //  Needs to be 0-based  

                        //  Fetch the value in this cell
                        value = cell.CellValue.InnerXml;
                        if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
                        {
                            value = stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
                        }

                        cellValues[colInx, rowInx] = value;
                    }
                    dt.Rows.Add(dataRow);
                }

                //  Copy the array of strings into a DataTable
                for (int col = 0; col < numOfColumns; col++)
                    dt.Columns.Add("Column_" + col.ToString());

                for (int row = 0; row < numOfRows; row++)
                {
                    DataRow dataRow = dt.NewRow();
                    for (int col = 0; col < numOfColumns; col++)
                    {
                        dataRow.SetField(col, cellValues[col, row]);
                    }
                    dt.Rows.Add(dataRow);
                }

#if DEBUG
                //  Write out the contents of our DataTable to the Output window (for debugging)
                string str = "";
                for (rowInx = 0; rowInx < maxNumOfRows; rowInx++)
                {
                    for (colInx = 0; colInx < maxNumOfColumns; colInx++)
                    {
                        object val = dt.Rows[rowInx].ItemArray[colInx];
                        str += (val == null) ? "" : val.ToString();
                        str += "\t";
                    }
                    str += "\n";
                }
                System.Diagnostics.Trace.WriteLine(str);
#endif
                return dt;
            }
        }

        private static void CalculateDataTableSize(string dimensions, ref int numOfColumns, ref int numOfRows)
        {
            //  How many columns & rows of data does this Worksheet contain ?  
            //  We'll read in the Dimensions string from the Excel file, and calculate the size based on that.
            //      eg "B1:F4" -> we'll need 6 columns and 4 rows.
            //
            //  (We deliberately ignore the top-left cell address, and just use the bottom-right cell address.)
            try
            {
                string[] parts = dimensions.Split(':');     // eg "B1:F4" 
                if (parts.Length != 2)
                    throw new Exception("Couldn't find exactly *two* CellAddresses in the dimension");

                numOfColumns = 1 + GetColumnIndexByName(parts[1]);     //  A=1, B=2, C=3  (1-based value), so F4 would return 6 columns
                numOfRows = GetRowIndexFromCellAddress(parts[1]);
            }
            catch
            {
                throw new Exception("Could not calculate maximum DataTable size from the worksheet dimension: " + dimensions);
            }
        }

        public static int GetRowIndexFromCellAddress(string cellAddress)
        {
            //  Convert an Excel CellReference column into a 1-based row index
            //  eg "D42"  ->  42
            //     "F123" ->  123
            string rowNumber = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^0-9 _]", "");
            return int.Parse(rowNumber);
        }

        public static int GetColumnIndexByName(string cellAddress)
        {
            //  Convert an Excel CellReference column into a 0-based column index
            //  eg "D42" ->  3
            //     "F123" -> 5
            var columnName = System.Text.RegularExpressions.Regex.Replace(cellAddress, "[^A-Z_]", "");
            int number = 0, pow = 1;
            for (int i = columnName.Length - 1; i >= 0; i--)
            {
                number += (columnName[i] - 'A' + 1) * pow;
                pow *= 26;
            }
            return number - 1;
        }
    }

Just to mention, some of our company's Excel Worksheets have one or more blank rows at the top. Strangely, this prevented some other OpenXML libraries from reading in such Worksheets properly.

This code deliberately creates a DataTable with one value for each of the cells in the Worksheet, even the blank ones at the top.

Stopwatch answered 5/5, 2017 at 8:10 Comment(1)
tested it. I see that SheetDimension is optional in the open xml spec, this is not reliable. Also SheetDimension return dimensions = "A1:XEZ64", so we need a DataTable of size 16380x64. 16380 columns. This was for some sample excelfile I had with a horizontal line spanning all cols but only a few cols with data. But a good starting point, it just need some refinement I think.Rangy

© 2022 - 2024 — McMap. All rights reserved.