C# OPEN XML: empty cells are getting skipped while getting data from EXCEL to DATATABLE
Asked Answered
C

5

26

Task

Import data from excel to DataTable

Problem

The cell that doesnot contain any data are getting skipped and the very next cell that has data in the row is used as the value of the empty colum. E.g

A1 is empty A2 has a value Tom then while importing the data A1 get the value of A2 and A2 remains empty

To make it very clear I am providing some screen shots below

This is the excel data

enter image description here

This is the DataTable after importing the data from excel enter image description here

Code

public class ImportExcelOpenXml
{
    public static DataTable Fill_dataTable(string fileName)
    {
        DataTable dt = new DataTable();

        using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(fileName, false))
        {

            WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
            IEnumerable<Sheet> sheets = spreadSheetDocument.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
            string relationshipId = sheets.First().Id.Value;
            WorksheetPart worksheetPart = (WorksheetPart)spreadSheetDocument.WorkbookPart.GetPartById(relationshipId);
            Worksheet workSheet = worksheetPart.Worksheet;
            SheetData sheetData = workSheet.GetFirstChild<SheetData>();
            IEnumerable<Row> rows = sheetData.Descendants<Row>();

            foreach (Cell cell in rows.ElementAt(0))
            {
                dt.Columns.Add(GetCellValue(spreadSheetDocument, cell));
            }

            foreach (Row row in rows) //this will also include your header row...
            {
                DataRow tempRow = dt.NewRow();

                for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
                {
                    tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
                }

                dt.Rows.Add(tempRow);
            }

        }

        dt.Rows.RemoveAt(0); //...so i'm taking it out here.

        return dt;
    }


    public static string GetCellValue(SpreadsheetDocument document, Cell cell)
    {
        SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
        string value = cell.CellValue.InnerXml;

        if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
        {
            return stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
        }
        else
        {
            return value;
        }
    }
}

My Thoughts

I think there is some problem with

public IEnumerable<T> Descendants<T>() where T : OpenXmlElement;

In case I want the count of columns using Descendants

IEnumerable<Row> rows = sheetData.Descendants<<Row>();
int colCnt = rows.ElementAt(0).Count();

OR

If I am getting the count of rows using Descendants

IEnumerable<Row> rows = sheetData.Descendants<<Row>();
int rowCnt = rows.Count();`

In both cases Descendants is skipping the empty cells

Is there any alternative of Descendants.

Your suggestions are highly appreciated

P.S: I have also thought of getting the cells values by using column names like A1, A2 but in order to do that I will have to get the exact count of columns and rows which is not possible by using Descendants function.

Cimex answered 19/3, 2016 at 9:23 Comment(4)
Empty cells don't have e Cell element so you can't find them.Bistro
@AlexanderDerck Then how to tackel this issue?Cimex
Using EPPlus library would be easier for this (it uses open xml sdk), see example hereBistro
You could also require the cell to always contain a value. If there is no mark, then you have the default value of zero.Brezin
J
32

Had there been some data in all the cells of a row then everything works fine. But if you happen to have even single empty cell in a row then things go haywire.

Why it is happening in first place?

The reason lies in below line of code:

row.Descendants<Cell>().Count()

Count() function gives you the number of non-empty cells in the row i.e. it will ignore all the empty cells while returning the count. So, when you pass row.Descendants<Cell>().ElementAt(i) as argument to GetCellValue method like this:

GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));

Then, it will find the content of the next non-empty cell, not necessarily the content of the cell at column index i e.g. if the first column is empty and we call ElementAt(1), it returns the value in the second column instead and our program logic gets messed up.

Solution: We need to deal with the occurrence of empty cells in the row i.e. we need to figure out the actual/effective column index of the target cell in case there were some empty cells before it in the given row. So, you need to substitute your for loop code below:

for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
      tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
}

with

for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
    Cell cell = row.Descendants<Cell>().ElementAt(i);
    int actualCellIndex = CellReferenceToIndex(cell);
    tempRow[actualCellIndex] = GetCellValue(spreadSheetDocument, cell);
}

Also, add below method in your code which is used in the above modified code snippet to obtain the actual/effective column index of any cell:

private static int CellReferenceToIndex(Cell cell)
{
    int index = 0;
    string reference = cell.CellReference.ToString().ToUpper();
    foreach (char ch in reference)
    {
        if (Char.IsLetter(ch))
        {
            int value = (int)ch - (int)'A';
            index = (index == 0) ? value : ((index + 1) * 26) + value;
        }
        else
        {
            return index;
        }
    }
    return index;
}

Note: Index in an Excel row start with 1 unlike various programming languages where it starts at 0.

Jarrod answered 19/1, 2018 at 9:37 Comment(2)
i think the method CellReferenceToIndex not okay for excel that exceeds a-z to aa,ab,... when z col exceed again it returns index from 0.... so if you have excel with many number of cols that wont workPaulinapauline
This is a very helpful answer! It's a such a shame the the OpenXML docs are such a clusterf*ck. Thank you very much.Concavoconcave
G
7
public void Read2007Xlsx()
{
    try
    {
        DataTable dt = new DataTable();
        using (SpreadsheetDocument spreadSheetDocument = SpreadsheetDocument.Open(@"D:\File.xlsx", false))
        {
            WorkbookPart workbookPart = spreadSheetDocument.WorkbookPart;
            IEnumerable<Sheet> sheets = spreadSheetDocument.WorkbookPart.Workbook.GetFirstChild<Sheets>().Elements<Sheet>();
            string relationshipId = sheets.First().Id.Value;
            WorksheetPart worksheetPart = (WorksheetPart)spreadSheetDocument.WorkbookPart.GetPartById(relationshipId);
            Worksheet workSheet = worksheetPart.Worksheet;
            SheetData sheetData = workSheet.GetFirstChild<SheetData>();
            IEnumerable<Row> rows = sheetData.Descendants<Row>();
            foreach (Cell cell in rows.ElementAt(0))
            {
                dt.Columns.Add(GetCellValue(spreadSheetDocument, cell));
            }
            foreach (Row row in rows) //this will also include your header row...
            {
                DataRow tempRow = dt.NewRow();
                int columnIndex = 0;
                foreach (Cell cell in row.Descendants<Cell>())
                {
                    // Gets the column index of the cell with data
                    int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));
                    cellColumnIndex--; //zero based index
                    if (columnIndex < cellColumnIndex)
                    {
                        do
                        {
                            tempRow[columnIndex] = ""; //Insert blank data here;
                            columnIndex++;
                         }
                         while (columnIndex < cellColumnIndex);
                     }//end if block
                     tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);                            
                     columnIndex++;
                 }//end inner foreach loop
                 dt.Rows.Add(tempRow);
             }//end outer foreach loop
        }//end using block
        dt.Rows.RemoveAt(0); //...so i'm taking it out here.
    }//end try
    catch (Exception ex)
    {
    }
}//end Read2007Xlsx method

/// <summary>
/// Given a cell name, parses the specified cell to get the column name.
/// </summary>
/// <param name="cellReference">Address of the cell (ie. B2)</param>
/// <returns>Column Name (ie. B)</returns>
public static string GetColumnName(string cellReference)
{
    // Create a regular expression to match the column name portion of the cell name.
    Regex regex = new Regex("[A-Za-z]+");
    Match match = regex.Match(cellReference);
    return match.Value;
} //end GetColumnName method

/// <summary>
/// Given just the column name (no row index), it will return the zero based column index.
/// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ). 
/// A length of three can be implemented when needed.
/// </summary>
/// <param name="columnName">Column Name (ie. A or AB)</param>
/// <returns>Zero based index if the conversion was successful; otherwise null</returns>
public static int? GetColumnIndexFromName(string columnName)
{                       
    //return columnIndex;
    string name = columnName;
    int number = 0;
    int pow = 1;
    for (int i = name.Length - 1; i >= 0; i--)
    {
        number += (name[i] - 'A' + 1) * pow;
        pow *= 26;
    }
    return number;
 } //end GetColumnIndexFromName method

public static string GetCellValue(SpreadsheetDocument document, Cell cell)
{
   SharedStringTablePart stringTablePart = document.WorkbookPart.SharedStringTablePart;
   if (cell.CellValue ==null)
   {
       return "";
   }
   string value = cell.CellValue.InnerXml;
   if (cell.DataType != null && cell.DataType.Value == CellValues.SharedString)
   {
        return stringTablePart.SharedStringTable.ChildElements[Int32.Parse(value)].InnerText;
    }
    else
    {
         return value;
    }
 }//end GetCellValue method
Gd answered 13/5, 2016 at 11:16 Comment(0)
I
0

Try this code. I have done little modifications and it worked for me:

public static DataTable Fill_dataTable(string filePath)
{
    DataTable dt = new DataTable();

    using (SpreadsheetDocument doc = SpreadsheetDocument.Open(filePath, false))
    {
        Sheet sheet = doc.WorkbookPart.Workbook.Sheets.GetFirstChild<Sheet>();
        Worksheet worksheet = doc.WorkbookPart.GetPartById(sheet.Id.Value) as WorksheetPart.Worksheet;
        IEnumerable<Row> rows = worksheet.GetFirstChild<SheetData>().Descendants<Row>();
        DataTable dt = new DataTable();
        List<string> columnRef = new List<string>();
        foreach (Row row in rows)
        {
            if (row.RowIndex != null)
            {
                if (row.RowIndex.Value == 1)
                {
                    foreach (Cell cell in row.Descendants<Cell>())
                    {
                        dt.Columns.Add(GetValue(doc, cell));
                            columnRef.Add(cell.CellReference.ToString().Substring(0, cell.CellReference.ToString().Length - 1));
                     }
                }
                else
                {
                    dt.Rows.Add();
                    int i = 0;
                    foreach (Cell cell in row.Descendants<Cell>())
                    {
                        while (columnRef(i) + dt.Rows.Count + 1 != cell.CellReference)
                        {
                            dt.Rows(dt.Rows.Count - 1)(i) = "";
                            i += 1;
                         }

                         dt.Rows(dt.Rows.Count - 1)(i) = GetValue(doc, cell);
                         i += 1;
                    }
                }
            }
        }
    }

    return dt;
}
Interclavicle answered 12/12, 2017 at 6:17 Comment(0)
C
0
foreach (Cell cell in row.Descendants<Cell>())
{
    while (columnRef[i] + (dt.Rows.Count + 1) != cell.CellReference)
    {
        dt.Rows[dt.Rows.Count - 1][i] = "";
        i += 1;
    }

    dt.Rows[dt.Rows.Count - 1][i] = GetValue(doc, cell);
    i++;
}
Circumfluent answered 22/12, 2022 at 10:45 Comment(2)
Please add a description to your code. Please don't post raw code answers.Normanormal
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.Literator
A
-1

It's not a design flaw, but it's a result of the design. Only the cells with a value will be stored in a matrix. This saves resources due to the fact there are often a lot of empty cells in a spreadsheet. In order to avoid the problem Cell has a property CellReference with the original cellAddress.

For example, the address you need is Z13, Cell.CellAddress returns the string "Z13" The only thing you have to do is to split the address can change the column part from Z to 26. Of course you have to subtract 1 from 26 in order to change the index from a 1 based matrix (Excel) to a zero based matrix (database)

Arnulfoarny answered 22/1, 2024 at 20:20 Comment(1)
Welcome to Stack Overflow! Thank you for your answer. Please provide more details about your solution. Code snippets, high quality descriptions, or any relevant information would be great. Clear and concise answers are more helpful and easier to understand for everyone. Edit your answer with specifics to raise the quality of your answer. For more information: How To: Write good answers. Happy coding!Rubber

© 2022 - 2025 — McMap. All rights reserved.