pythonpandaspdfpdfminer

Pandas merge_asof / combine on both index and columns


i have a dataframe with 5 columns - the first 4 are coordinates (x1, x2, y2, y1) and last column is text. these data came from a PDF text extraction module called pdfminer.

data_rows = [ [1, 2, 9, 4, 'A'],
              [4, 6, 13, 8, 'B'],
              [9, 10, 17, 12, 'C'] ]

df = pd.DataFrame(data_rows)
df

   0   1   2   3  4
0  1   2   9   4  A
1  4   6  13   8  B
2  9  10  17  12  C

page = df.pivot(index=[0],columns=[3],values=[4])

After removing multi-index from columns:

page 
 
3   4    8    12
0               
1    A  NaN  NaN
4  NaN    B  NaN
9  NaN  NaN    C

The part I am encountering challenges is when I need to merge some rows/columns. For example if I have the following structure:

3   4    8    9    12
0               
1    A  NaN    D  NaN
4  NaN    B  NaN  NaN
7    E  NaN  NaN  NaN  
9  NaN  NaN    F    C

And I am attempting to get to the following - combine columns with a tolerance of 1 (cols 8 and 9 combined) and combine rows with a tolerance of 2 (rows 7 and 9 combined)

3   4    8    12
0               
1    A    D  NaN
4  NaN    B  NaN  
7    E    F    C  

is there an efficient way to do such combinations? combine_first doesn't have a tolerance parameter. merge_asof has a tolerance parameter but seems like only work on rows. thanks in advance!

edit: long story short - how to transform 2nd last table to last table with a tolerance parameter to control which columns/rows to combine


Solution

  • Use customer groupers and groupby.first:

    # group columns by diff of 1
    cols = df.columns.to_series()
    cols = cols.mask(cols.astype(int).diff().le(1)).ffill()
    
    # group indices by diff of 2
    idx = df.index.to_series()
    idx = idx.mask(idx.astype(int).diff().le(2)).ffill()
    
    (df.groupby(cols, axis=1, sort=False).first()
       .groupby(idx, sort=False).first()
     )
    

    NB. index and columns must be sorted by increasing values.

    Output:

          4  8    12
    1     A  D  None
    4  None  B  None
    7     E  F     C