i have a dataframe with 5 columns - the first 4 are coordinates (x1, x2, y2, y1) and last column is text. these data came from a PDF text extraction module called pdfminer.
data_rows = [ [1, 2, 9, 4, 'A'],
[4, 6, 13, 8, 'B'],
[9, 10, 17, 12, 'C'] ]
df = pd.DataFrame(data_rows)
df
0 1 2 3 4
0 1 2 9 4 A
1 4 6 13 8 B
2 9 10 17 12 C
page = df.pivot(index=[0],columns=[3],values=[4])
After removing multi-index from columns:
page
3 4 8 12
0
1 A NaN NaN
4 NaN B NaN
9 NaN NaN C
The part I am encountering challenges is when I need to merge some rows/columns. For example if I have the following structure:
3 4 8 9 12
0
1 A NaN D NaN
4 NaN B NaN NaN
7 E NaN NaN NaN
9 NaN NaN F C
And I am attempting to get to the following - combine columns with a tolerance of 1 (cols 8 and 9 combined) and combine rows with a tolerance of 2 (rows 7 and 9 combined)
3 4 8 12
0
1 A D NaN
4 NaN B NaN
7 E F C
is there an efficient way to do such combinations? combine_first doesn't have a tolerance parameter. merge_asof has a tolerance parameter but seems like only work on rows. thanks in advance!
edit: long story short - how to transform 2nd last table to last table with a tolerance parameter to control which columns/rows to combine
Use customer groupers and groupby.first
:
# group columns by diff of 1
cols = df.columns.to_series()
cols = cols.mask(cols.astype(int).diff().le(1)).ffill()
# group indices by diff of 2
idx = df.index.to_series()
idx = idx.mask(idx.astype(int).diff().le(2)).ffill()
(df.groupby(cols, axis=1, sort=False).first()
.groupby(idx, sort=False).first()
)
NB. index and columns must be sorted by increasing values.
Output:
4 8 12
1 A D None
4 None B None
7 E F C