Subsetting a dataframe based on cell color and text color in excel sheet

I have an excel sheet with more than 1000 columns and 300 rows. Some these cells have normal data, while some cells have a background color of red and some cells have normal white background but the text is red in color. For example, my excel sheet looks like below:

I am reading this excel sheet into Python (pandas) to use it as a dataframe and perform further actions on it. However, the red text and red cells need to be treated differently than the normal cells.

Therefore, I would like to split the above table into 3 tables, such that: Table one has all the cells but the red background cells are empty. Table 2 has only those rows and columns where the text is red. Table 3 has only those rows and columns where the background is red.

I guess it cannot be done in Pandas. I tried using StyleFrame but failed.

Can anyone help in this regard? Is there any python package that is helpful in this case?

Solution

This is pretty much the way to achieve this. It is not pretty as StyleFrame wasn't really designed to be used this way.

Reading the source Excel file

import numpy as np
from styleframe import StyleFrame, utils

sf = StyleFrame.read_excel('test.xlsx', read_style=True, use_openpyxl_styles=False)

1) All cells but cells with red background are empty

def empty_red_background_cells(cell):
    if cell.style.bg_color in {utils.colors.red, 'FFFF0000'}:
        cell.value = np.nan
    return cell

sf_1 = StyleFrame(sf.applymap(empty_red_background_cells))    
print(sf_1)
#      C1       C2 C3    C4      C5      C6
# 0    a1      1.0  s   nan  1001.0  1234.0
# 1    a2     12.0  s   nan  1001.0  4322.0
# 2    a3      nan  s   nan  1001.0  4432.0
# 3    a4    232.0  s   nan  1001.0  4432.0
# 4    a5    343.0  s  99.0     nan     nan
# 5    a6      3.0  s  99.0  1001.0  4432.0
# 6    a7     34.0  s  99.0  1001.0  4432.0
# 7    a8      5.0  s   nan  1001.0  4432.0
# 8    a9      6.0  s  99.0  1001.0  4432.0
# 9   a10    565.0  s  99.0     nan  4432.0
# 10  a11   5543.0  s  99.0  1001.0  4432.0
# 11  a12    112.0  s  99.0  1001.0     nan
# 12  a13  34345.0  s  99.0  1001.0  4432.0
# 13  a14      0.0  s  99.0     nan     nan
# 14  a15    453.0  s  99.0  1001.0     nan

2) Only cells with red text

def only_cells_with_red_text(cell):
    return cell if cell.style.font_color in {utils.colors.red, 'FFFF0000'} else np.nan
      
sf_2 = StyleFrame(sf.applymap(only_cells_with_red_text).dropna(axis=(0, 1), how='all'))
# passing a tuple to pandas.dropna is deprecated since pandas 0.23.0, but this can be
# avoided by simply calling dropna twice, once with axis=0 and once with axis=1

print(sf_2)
#         C2      C6
# 7     nan   4432.0
# 8     nan   4432.0
# 9    565.0     nan
# 10  5543.0     nan
# 11   112.0     nan

3) Only cells with red background

def only_cells_with_red_background(cell):
    return cell if cell.style.bg_color in {utils.colors.red, 'FFFF0000'} else np.nan

sf_3 = StyleFrame(sf.applymap(only_cells_with_red_background).dropna(axis=(0, 1), how='all'))
# passing a tuple to pandas.dropna is deprecated since pandas 0.23.0, but this can be
# avoided by simply calling dropna twice, once with axis=0 and once with axis=1

print(sf_3)
#        C4      C6
# 0    99.0     nan
# 1    99.0     nan
# 2    99.0     nan
# 3    99.0     nan
# 13    nan  4432.0
# 14    nan  4432.0