reading Excel Open XML is ignoring blank cells

I am using the accepted solution here to convert an excel sheet into a datatable. This works fine if I have "perfect" data but if I have a blank cell in the middle of my data it seems to put the wrong data in each column.

I think this is because in the below code:

row.Descendants<Cell>().Count()

is number of populated cells (not all columns) AND:

GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));

seems to find the next populated cell (not necessarily what is in that index) so if the first column is empty and i call ElementAt(0), it returns the value in the second column.

Here is the full parsing code.

DataRow tempRow = dt.NewRow();

for (int i = 0; i < row.Descendants<Cell>().Count(); i++)
{
    tempRow[i] = GetCellValue(spreadSheetDocument, row.Descendants<Cell>().ElementAt(i));
    if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
    {
        Console.Write(tempRow[i].ToString());
    }
}

Solution

This makes sense since Excel will not store a value for a cell that is null. If you open your file using the Open XML SDK 2.0 Productivity Tool and traverse the XML down to the cell level you will see that only the cells that have data are going to be in that file.

Your options are to insert blank data in the range of cells you are going to traverse or programmatically figure out a cell was skipped and adjust your index appropriately.

I made an example excel document with a string in cell reference A1 and C1. I then opened up the excel document in the Open XML Productivity Tool and here is the XML that was stored:

<x:row r="1" spans="1:3" 
   xmlns:x="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <x:c r="A1" t="s">
    <x:v>0</x:v>
  </x:c>
  <x:c r="C1" t="s">
    <x:v>1</x:v>
  </x:c>
</x:row>

Here you will see that the data corresponds to the first row and that only two cells worth of data are saved for that row. The data saved corresponds to A1 and C1 and that no cells with null values are saved.

To get the functionality that you need, you can traverse over the Cells as you are doing above, but you will need to check what the value the Cell is referencing and determine if any Cells have been skipped. to do that you will need two utility functions to get the Column Name from the cell reference and to then translate that column name into a zero based index:

    private static List<char> Letters = new List<char>() { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', ' ' };

    /// <summary>
    /// Given a cell name, parses the specified cell to get the column name.
    /// </summary>
    /// <param name="cellReference">Address of the cell (ie. B2)</param>
    /// <returns>Column Name (ie. B)</returns>
    public static string GetColumnName(string cellReference)
    {
        // Create a regular expression to match the column name portion of the cell name.
        Regex regex = new Regex("[A-Za-z]+");
        Match match = regex.Match(cellReference);

        return match.Value;
    }

    /// <summary>
    /// Given just the column name (no row index), it will return the zero based column index.
    /// Note: This method will only handle columns with a length of up to two (ie. A to Z and AA to ZZ). 
    /// A length of three can be implemented when needed.
    /// </summary>
    /// <param name="columnName">Column Name (ie. A or AB)</param>
    /// <returns>Zero based index if the conversion was successful; otherwise null</returns>
    public static int? GetColumnIndexFromName(string columnName)
    {
        int? columnIndex = null;

        string[] colLetters = Regex.Split(columnName, "([A-Z]+)");
        colLetters = colLetters.Where(s => !string.IsNullOrEmpty(s)).ToArray();

        if (colLetters.Count() <= 2)
        {
            int index = 0;
            foreach (string col in colLetters)
            {
                List<char> col1 = colLetters.ElementAt(index).ToCharArray().ToList();
                int? indexValue = Letters.IndexOf(col1.ElementAt(index));

                if (indexValue != -1)
                {
                    // The first letter of a two digit column needs some extra calculations
                    if (index == 0 && colLetters.Count() == 2)
                    {
                        columnIndex = columnIndex == null ? (indexValue + 1) * 26 : columnIndex + ((indexValue + 1) * 26);
                    }
                    else
                    {
                        columnIndex = columnIndex == null ? indexValue : columnIndex + indexValue;
                    }
                }

                index++;
            }
        }

        return columnIndex;
    }

Then you can iterate over the Cells and check to see what the cell reference is compared to the columnIndex. If it is less than then you add blank data to your tempRow, otherwise just read in the value contained in the cell. (Note: I did not test the code below, but the general idea should help):

DataRow tempRow = dt.NewRow();

int columnIndex = 0;
foreach (Cell cell in row.Descendants<Cell>())
{
   // Gets the column index of the cell with data
   int cellColumnIndex = (int)GetColumnIndexFromName(GetColumnName(cell.CellReference));

   if (columnIndex < cellColumnIndex)
   {
      do
      {
         tempRow[columnIndex] = //Insert blank data here;
         columnIndex++;
      }
      while(columnIndex < cellColumnIndex);
    }
    tempRow[columnIndex] = GetCellValue(spreadSheetDocument, cell);

    if (tempRow[i].ToString().IndexOf("Latency issues in") > -1)
    {
       Console.Write(tempRow[i].ToString());
    }
    columnIndex++;
}