I am reading excel sheet via python and trying to read only visible rows in python (not hidden or collapsed). I went through documentation of OPENPYXL and found that it has "hidden" and "collapsed" property. But once I read the excel sheet not always "hidden" or "collapsed" is true when column is hidden. My code is following
def read_visible_data_from_sheet(sheet):
data = []
# Iterate through rows
for row in sheet.iter_rows():
row_num = row[0].row
# Check if the row is hidden or has height set to 0
row_hidden = sheet.row_dimensions[row_num].hidden
row_height = sheet.row_dimensions[row_num].height
row_level = sheet.row_dimensions[row_num].outlineLevel
if row_hidden or (row_height is not None and row_height == 0):
continue # Skip hidden rows
# Check if any parent row is collapsed
is_collapsed = False
for parent_row_num in range(1, row_num):
if sheet.row_dimensions[parent_row_num].outlineLevel < row_level and sheet.row_dimensions[parent_row_num].hidden == True:
is_collapsed = True
break
if is_collapsed:
continue # Skip collapsed rows
visible_row = []
# Iterate through columns in the row
for cell in row:
col_letter = cell.column_letter
col_dim = sheet.column_dimensions.get(col_letter)
if col_dim:
col_hidden = col_dim.hidden
col_width = col_dim.width
else:
continue
# Check if the column is hidden or has width set to 0
if col_hidden or (col_width is not None and col_width == 0):
continue # Skip hidden columns
visible_row.append(cell.value)
# Append the visible row to the data list
if visible_row: # Avoid adding empty rows
data.append(visible_row)
# Convert to a DataFrame
df = pd.DataFrame(data)
return df, sheet
In some cases the sheet.column_dimensions contains columns which are visible while in other cases it doesn't contain columns in sheet.column_dimensions even if it is visible.
Is there a better way to deal with such cases? I am open on exploring any other library if necessary.
There may be some inconsistencies in how Openpyxl handles the hidden row and column information however there is a couple of things to keep in mind when searching for this detail.
As mentioned in the example since the Columns status is not going to change it seems wasteful to keep checking the status for every row your are copying.
Check the columns once make a list and then use that when cycling through the rows.
The following Code example creates a Visible Columns list and then uses it to copy all rows for the dataframe.
The code loops the Rows and on the first iteration creates the Visible Columns List visible_columns
from the first row's column status. Once the List is created it is used for the first and all subsequent rows.
The criteria for a visible columns is based on;
column_group
list and whether they are hidden or not since only the first column in a collapsed group is marked as hidden.The code then processes each row starting from the first row (used to create the list). After checking if the row is visible, copies the values from the row
per the visible_columns
List.
A row's visibility status is slightly different to columns. The row is determined to be visible purely by that row's hidden status in row_dimensions
Example Sheet
The following example Sheet was used;
The Sheet has the following Grouped and Hidden Rows and Columns;
"""
Hidden Rows & Columns
Columns;
C, width was set to 0
E & G, Hidden
J - L, Grouped, Collapsed
N & O, Grouped, Not collapsed
Rows;
7, Height was set to 0
15 - 19, Grouped, Collapsed
27, Empty Row
31 & 32, Hidden
35 - 37, Grouped, Not collapsed
"""
Code Sample
import pandas as pd
import openpyxl
from openpyxl.utils.cell import range_boundaries as rb
def read_visible_data_from_sheet(sheet):
data = []
visible_columns = []
# Get the number of hidden (collapsed) columns
hidden_grouped_cols = [] # The List of Grouped Columns that are hidden
for range_string in sheet.column_groups:
group_collapsed = sheet.column_dimensions[range_string[:1]].hidden
rng_boundaries = rb(range_string)
hidden_grouped_cols += [x for x in range(rng_boundaries[0], rng_boundaries[2]+1) if group_collapsed]
get_visible_columns = True # First Row iteration build a List of visible Columns
# Iterate rows (all rows or set min and max row to iterate over)
for row in sheet.iter_rows():
visible_row = []
# Build list of visible columns from first row checked
if get_visible_columns: # get_visible_columns is True for first row
for cell in row: # Looping the columns
col_info = sheet.column_dimensions[cell.column_letter]
# Create List from non hidden Columns including those in uncollapsed groups
if not col_info.hidden and cell.col_idx not in hidden_grouped_cols:
visible_columns.append(cell.column_letter)
get_visible_columns = False # First row builds the Visible Columns List. Subsequent rows can skip this step
# Add values from Row if it's not hidden
if not sheet.row_dimensions[row[0].row].hidden:
# Get List of values from all visible columns
visible_row = [cell.value for cell in row if cell.column_letter in visible_columns]
# Empty Row check
if all(e is None for e in visible_row): # Drop row if all values obtained are None
visible_row = None
# If visible_row contains data add to the data List
if visible_row: # Avoid adding no value row
data.append(visible_row)
return data
# Open Excel file and get Sheet
excelfile = 'foo.xlsx'
wb = openpyxl.load_workbook(excelfile)
ws = wb['Sheet1']
# Create dataframe from visible rows and columns
df = pd.DataFrame(read_visible_data_from_sheet(ws))
print(df)
Output
Visible Columns List; ['A', 'B', 'D', 'F', 'H', 'I', 'M', 'N', 'O', 'P']
Dataframe;
0 1 2 3 4 5 6 7 8 9
0 ROW# B D F H I M N O P
1 2 Date DMA Code Car Model Visitors Revenue col4 col5 col6 col7
2 3 2017-02-22 00:00:00 500 Focus 3 47 3 4 5 6
3 4 2017-02-11 00:00:00 500 Avenger 3 27 3 4 5 6
4 5 2017-03-22 00:00:00 500 Explorer 1 16 3 4 5 6
5 6 2017-07-03 00:00:00 500 3 Series 1 12 3 4 5 6
6 8 2017-07-15 00:00:00 500 Civic 16 100 3 4 5 6
7 9 2017-07-22 00:00:00 500 Accent 1 16 3 4 5 6
8 10 2017-07-24 00:00:00 500 ML 214 39 3 4 5 6
9 11 2017-08-28 00:00:00 500 RX 11 30 3 4 5 6
10 12 2017-06-03 00:00:00 500 Silverado 1500 3 30 3 4 5 6
11 13 2017-08-18 00:00:00 500 Neon 1 10 3 4 5 6
12 14 2017-03-04 00:00:00 501 Sonata 1 9 3 4 5 6
13 20 2017-02-02 00:00:00 501 Integra 16 112 3 4 5 6
14 21 2017-02-27 00:00:00 501 S4 3 31 3 4 5 6
15 22 2017-02-12 00:00:00 501 3 Series 10 107 3 4 5 6
16 23 2017-02-16 00:00:00 501 X3 8 72 3 4 5 6
17 24 2017-02-06 00:00:00 501 Camaro 12 246 3 4 5 6
18 25 2017-02-04 00:00:00 501 Impala 5 47 3 4 5 6
19 26 2017-02-07 00:00:00 501 Accord 10 909 3 4 5 6
20 28 2017-02-26 00:00:00 501 Aztek 10 7 3 4 5 6
21 29 2017-02-22 00:00:00 501 Ion 3 47 3 4 5 6
22 30 2017-02-22 00:00:00 501 Aerio 3 47 3 4 5 6
23 33 2017-03-06 00:00:00 501 Altima 2 16 3 4 5 6
24 34 2017-06-18 00:00:00 501 RDX 5 47 3 4 5 6
25 35 2017-06-19 00:00:00 501 M3 3 93 3 4 5 6
26 36 2017-06-17 00:00:00 501 Mustang 6 124 3 4 5 6
27 37 2017-06-09 00:00:00 501 Pilot 4 47 3 4 5 6
28 38 2017-06-17 00:00:00 501 Mark LT 3 47 3 4 5 6