pythonopenpyxlxlsxxlsxlsm

Is there a way of differentiating collapsed column in excel sheet via python


I am reading excel sheet via python and trying to read only visible rows in python (not hidden or collapsed). I went through documentation of OPENPYXL and found that it has "hidden" and "collapsed" property. But once I read the excel sheet not always "hidden" or "collapsed" is true when column is hidden. My code is following

def read_visible_data_from_sheet(sheet):
    data = []
    # Iterate through rows
    for row in sheet.iter_rows():
        row_num = row[0].row
        # Check if the row is hidden or has height set to 0
        row_hidden = sheet.row_dimensions[row_num].hidden
        row_height = sheet.row_dimensions[row_num].height
        row_level = sheet.row_dimensions[row_num].outlineLevel

        if row_hidden or (row_height is not None and row_height == 0):
            continue  # Skip hidden rows

        # Check if any parent row is collapsed
        is_collapsed = False
        for parent_row_num in range(1, row_num):
            if sheet.row_dimensions[parent_row_num].outlineLevel < row_level and sheet.row_dimensions[parent_row_num].hidden == True:
                is_collapsed = True
                break
        if is_collapsed:
            continue # Skip collapsed rows

        visible_row = []
        # Iterate through columns in the row
        for cell in row:
            col_letter = cell.column_letter
            col_dim = sheet.column_dimensions.get(col_letter)
            if col_dim:
                col_hidden = col_dim.hidden
                col_width = col_dim.width
            else:
                continue
            # Check if the column is hidden or has width set to 0
            if col_hidden or (col_width is not None and col_width == 0):
                continue  # Skip hidden columns

            visible_row.append(cell.value)
        # Append the visible row to the data list
        if visible_row:  # Avoid adding empty rows
            data.append(visible_row)

    # Convert to a DataFrame
    df = pd.DataFrame(data)
    return df, sheet

In some cases the sheet.column_dimensions contains columns which are visible while in other cases it doesn't contain columns in sheet.column_dimensions even if it is visible.

Is there a better way to deal with such cases? I am open on exploring any other library if necessary.


Solution

  • There may be some inconsistencies in how Openpyxl handles the hidden row and column information however there is a couple of things to keep in mind when searching for this detail.

    As mentioned in the example since the Columns status is not going to change it seems wasteful to keep checking the status for every row your are copying.
    Check the columns once make a list and then use that when cycling through the rows.

    The following Code example creates a Visible Columns list and then uses it to copy all rows for the dataframe.

    The code loops the Rows and on the first iteration creates the Visible Columns List visible_columns from the first row's column status. Once the List is created it is used for the first and all subsequent rows.
    The criteria for a visible columns is based on;

    The code then processes each row starting from the first row (used to create the list). After checking if the row is visible, copies the values from the row per the visible_columns List.

    A row's visibility status is slightly different to columns. The row is determined to be visible purely by that row's hidden status in row_dimensions

    Example Sheet
    The following example Sheet was used; enter image description here

    The Sheet has the following Grouped and Hidden Rows and Columns;

    """
    Hidden Rows & Columns
    
    Columns;
    C, width was set to 0
    E & G, Hidden
    J - L, Grouped, Collapsed
    N & O, Grouped, Not collapsed
    
    
    Rows;
    7, Height was set to 0
    15 - 19, Grouped, Collapsed
    27, Empty Row 
    31 & 32, Hidden
    35 - 37, Grouped, Not collapsed 
    """
    

    Code Sample

    import pandas as pd
    import openpyxl
    from openpyxl.utils.cell import range_boundaries as rb
    
    
    
    def read_visible_data_from_sheet(sheet):
        data = []
        visible_columns = []
    
        # Get the number of hidden (collapsed) columns
        hidden_grouped_cols = []  # The List of Grouped Columns that are hidden
        for range_string in sheet.column_groups:
            group_collapsed = sheet.column_dimensions[range_string[:1]].hidden
            rng_boundaries = rb(range_string)
            hidden_grouped_cols += [x for x in range(rng_boundaries[0], rng_boundaries[2]+1) if group_collapsed]
    
        get_visible_columns = True  # First Row iteration build a List of visible Columns
        # Iterate rows (all rows or set min and max row to iterate over)
        for row in sheet.iter_rows():
            visible_row = []
            # Build list of visible columns from first row checked
            if get_visible_columns:  # get_visible_columns is True for first row
                for cell in row:  # Looping the columns
                    col_info = sheet.column_dimensions[cell.column_letter]
                    # Create List from non hidden Columns including those in uncollapsed groups
                    if not col_info.hidden and cell.col_idx not in hidden_grouped_cols:
                        visible_columns.append(cell.column_letter)
    
                get_visible_columns = False  # First row builds the Visible Columns List. Subsequent rows can skip this step
    
            # Add values from Row if it's not hidden
            if not sheet.row_dimensions[row[0].row].hidden:
                # Get List of values from all visible columns
                visible_row = [cell.value for cell in row if cell.column_letter in visible_columns]
                # Empty Row check
                if all(e is None for e in visible_row):  # Drop row if all values obtained are None
                    visible_row = None
    
            # If visible_row contains data add to the data List
            if visible_row:  # Avoid adding no value row
                data.append(visible_row)
    
        return data
    
    
    # Open Excel file and get Sheet
    excelfile = 'foo.xlsx'
    wb = openpyxl.load_workbook(excelfile)
    ws = wb['Sheet1']
    
    # Create dataframe from visible rows and columns
    df = pd.DataFrame(read_visible_data_from_sheet(ws))
    
    print(df)
    

    Output

    Visible Columns List;  ['A', 'B', 'D', 'F', 'H', 'I', 'M', 'N', 'O', 'P']
    
    Dataframe;
           0                    1         2               3         4        5     6     7     8     9
    0   ROW#                    B         D               F         H        I    M      N     O     P
    1      2                 Date  DMA Code       Car Model  Visitors  Revenue  col4  col5  col6  col7
    2      3  2017-02-22 00:00:00       500           Focus         3       47     3     4     5     6
    3      4  2017-02-11 00:00:00       500         Avenger         3       27     3     4     5     6
    4      5  2017-03-22 00:00:00       500        Explorer         1       16     3     4     5     6
    5      6  2017-07-03 00:00:00       500        3 Series         1       12     3     4     5     6
    6      8  2017-07-15 00:00:00       500           Civic        16      100     3     4     5     6
    7      9  2017-07-22 00:00:00       500          Accent         1       16     3     4     5     6
    8     10  2017-07-24 00:00:00       500              ML       214       39     3     4     5     6
    9     11  2017-08-28 00:00:00       500              RX        11       30     3     4     5     6
    10    12  2017-06-03 00:00:00       500  Silverado 1500         3       30     3     4     5     6
    11    13  2017-08-18 00:00:00       500            Neon         1       10     3     4     5     6
    12    14  2017-03-04 00:00:00       501          Sonata         1        9     3     4     5     6
    13    20  2017-02-02 00:00:00       501         Integra        16      112     3     4     5     6
    14    21  2017-02-27 00:00:00       501              S4         3       31     3     4     5     6
    15    22  2017-02-12 00:00:00       501        3 Series        10      107     3     4     5     6
    16    23  2017-02-16 00:00:00       501              X3         8       72     3     4     5     6
    17    24  2017-02-06 00:00:00       501          Camaro        12      246     3     4     5     6
    18    25  2017-02-04 00:00:00       501          Impala         5       47     3     4     5     6
    19    26  2017-02-07 00:00:00       501          Accord        10      909     3     4     5     6
    20    28  2017-02-26 00:00:00       501           Aztek        10        7     3     4     5     6
    21    29  2017-02-22 00:00:00       501             Ion         3       47     3     4     5     6
    22    30  2017-02-22 00:00:00       501           Aerio         3       47     3     4     5     6
    23    33  2017-03-06 00:00:00       501          Altima         2       16     3     4     5     6
    24    34  2017-06-18 00:00:00       501             RDX         5       47     3     4     5     6
    25    35  2017-06-19 00:00:00       501              M3         3       93     3     4     5     6
    26    36  2017-06-17 00:00:00       501         Mustang         6      124     3     4     5     6
    27    37  2017-06-09 00:00:00       501           Pilot         4       47     3     4     5     6
    28    38  2017-06-17 00:00:00       501         Mark LT         3       47     3     4     5     6