excelpython-2.7pandasstyleframe

Subsetting a dataframe based on cell color and text color in excel sheet


I have an excel sheet with more than 1000 columns and 300 rows. Some these cells have normal data, while some cells have a background color of red and some cells have normal white background but the text is red in color. For example, my excel sheet looks like below:

enter image description here

I am reading this excel sheet into Python (pandas) to use it as a dataframe and perform further actions on it. However, the red text and red cells need to be treated differently than the normal cells.

Therefore, I would like to split the above table into 3 tables, such that: Table one has all the cells but the red background cells are empty. Table 2 has only those rows and columns where the text is red. Table 3 has only those rows and columns where the background is red.

I guess it cannot be done in Pandas. I tried using StyleFrame but failed.

Can anyone help in this regard? Is there any python package that is helpful in this case?


Solution

  • This is pretty much the way to achieve this. It is not pretty as StyleFrame wasn't really designed to be used this way.

    Reading the source Excel file

    import numpy as np
    from styleframe import StyleFrame, utils
    
    sf = StyleFrame.read_excel('test.xlsx', read_style=True, use_openpyxl_styles=False)
    

    1) All cells but cells with red background are empty

    def empty_red_background_cells(cell):
        if cell.style.bg_color in {utils.colors.red, 'FFFF0000'}:
            cell.value = np.nan
        return cell
    
    sf_1 = StyleFrame(sf.applymap(empty_red_background_cells))    
    print(sf_1)
    #      C1       C2 C3    C4      C5      C6
    # 0    a1      1.0  s   nan  1001.0  1234.0
    # 1    a2     12.0  s   nan  1001.0  4322.0
    # 2    a3      nan  s   nan  1001.0  4432.0
    # 3    a4    232.0  s   nan  1001.0  4432.0
    # 4    a5    343.0  s  99.0     nan     nan
    # 5    a6      3.0  s  99.0  1001.0  4432.0
    # 6    a7     34.0  s  99.0  1001.0  4432.0
    # 7    a8      5.0  s   nan  1001.0  4432.0
    # 8    a9      6.0  s  99.0  1001.0  4432.0
    # 9   a10    565.0  s  99.0     nan  4432.0
    # 10  a11   5543.0  s  99.0  1001.0  4432.0
    # 11  a12    112.0  s  99.0  1001.0     nan
    # 12  a13  34345.0  s  99.0  1001.0  4432.0
    # 13  a14      0.0  s  99.0     nan     nan
    # 14  a15    453.0  s  99.0  1001.0     nan
    

    2) Only cells with red text

    def only_cells_with_red_text(cell):
        return cell if cell.style.font_color in {utils.colors.red, 'FFFF0000'} else np.nan
          
    sf_2 = StyleFrame(sf.applymap(only_cells_with_red_text).dropna(axis=(0, 1), how='all'))
    # passing a tuple to pandas.dropna is deprecated since pandas 0.23.0, but this can be
    # avoided by simply calling dropna twice, once with axis=0 and once with axis=1
    
    print(sf_2)
    #         C2      C6
    # 7     nan   4432.0
    # 8     nan   4432.0
    # 9    565.0     nan
    # 10  5543.0     nan
    # 11   112.0     nan
    

    3) Only cells with red background

    def only_cells_with_red_background(cell):
        return cell if cell.style.bg_color in {utils.colors.red, 'FFFF0000'} else np.nan
    
    sf_3 = StyleFrame(sf.applymap(only_cells_with_red_background).dropna(axis=(0, 1), how='all'))
    # passing a tuple to pandas.dropna is deprecated since pandas 0.23.0, but this can be
    # avoided by simply calling dropna twice, once with axis=0 and once with axis=1
    
    print(sf_3)
    #        C4      C6
    # 0    99.0     nan
    # 1    99.0     nan
    # 2    99.0     nan
    # 3    99.0     nan
    # 13    nan  4432.0
    # 14    nan  4432.0