I have a dataset of articles from PubMed. The DataFrame looks like this:
df = pd.DataFrame({"section_names":[["introduction","methods","section1","another section","discussion"],
["introduction","methods","discussion","other section","one more section","conclusion"]],
"sections":[[["intro text","another sentence"],["some text","some text", "more text"],["some text","some text"],["some text","some text"],["some text","some text"]],
[["intro text","another sentence"],["some text","some text"],["some text","more text","some text","more text"],["some text","some text"],["some text","some text"],["some text","some text"]]]})
So basically, the column section_names holds the names of all the sections in an article. The column sections holds the actual text, as a list of sentences, for each section named in section_names. As a first step I wanted to have each section in its own column, so I did this:
df_col = pd.DataFrame([dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])
The NaN values make sense because those sections are not present in that particular article (row); every column has at least one non-NaN value. With many articles using different section names, the number of columns grows drastically. In the original dataset, I actually have around 10,000 columns.
What I now want is to merge the columns so that there are at most 4 columns (introduction, methods, discussion, conclusion). I want to say something like: after the section named methods, merge all following sections up to discussion into methods, and after discussion, merge all following sections up to conclusion into discussion.
With this rule applied to our df, for the first article, section1 and another section will be merged with methods. For the second article, other section and one more section should be merged with discussion.
How do I do this?
One option is to create a column index based on where the desired columns are, then aggregate the rows of each group into lists:
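# df here is the wide, one-column-per-section frame (df_col in the question).
# Requires: import itertools; import numpy as np
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
# Each desired column starts a new group; the columns that follow it share its group id.
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    # Per row: concatenate the non-NaN lists in the group, or NaN if there are none.
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan,
        axis=1)
)
new_df.columns = desired_columns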
new_df:
introduction methods discussion conclusion
0 [intro text, another sentence] [some text, some text, more text, some text, some text, some text, some text] [some text, some text] NaN
1 [intro text, another sentence] [some text, some text] [some text, more text, some text, more text, some text, some text, some text, some text] [some text, some text]
The column index is created using:
df.columns.isin(desired_columns).cumsum()
Which produces groups like:
[1 2 2 2 3 3 3 4]
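To see why this works, here is a small sketch that prints each column name next to its group id. It assumes the column order produced by the question's dict/zip construction (order of first appearance across the articles); the names is_anchor and group_ids are only illustrative.

```python
import pandas as pd

# Column order of the wide frame built from the question's example data
# (assumed: order of first appearance across the articles).
columns = pd.Index([
    "introduction", "methods", "section1", "another section",
    "discussion", "other section", "one more section", "conclusion",
])
desired_columns = ["introduction", "methods", "discussion", "conclusion"]

# True wherever a desired column starts a new group.
is_anchor = columns.isin(desired_columns)
# Running count of anchors seen so far: non-anchor columns inherit
# the id of the closest desired column to their left.
group_ids = is_anchor.cumsum()

for name, gid in zip(columns, group_ids):
    print(gid, name)
```

Every column between two desired columns gets the id of the desired column on its left, which is exactly the "merge until the next known section" rule from the question.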
Complete Working Example:
import itertools
import numpy as np
import pandas as pd
# Same data as in the question.
df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one more section", "conclusion"]],
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]]]
})
df = pd.DataFrame(
[dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])
desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
    lambda x: x.agg(
        lambda r: list(itertools.chain.from_iterable(r.dropna()))
                  or np.nan,
        axis=1)
)
new_df.columns = desired_columns
print(new_df.to_string())
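Two caveats about the approach above. First, groupby(..., axis=1) is deprecated in recent pandas releases (from 2.1 onward, as far as I know), so it may warn or fail depending on your version. Second, the grouping depends on the column order of the wide frame, which is the order of first appearance across all articles, not the section order within each article. If that matters for your data, an alternative is to apply the merge rule per article directly on the original lists and skip the wide frame entirely. Below is a minimal sketch of that idea, not a drop-in replacement: merge_sections is a hypothetical helper, and dropping any text that comes before the first desired section is an assumption you may want to change.

```python
import pandas as pd

DESIRED = ["introduction", "methods", "discussion", "conclusion"]

def merge_sections(names, texts, desired=DESIRED):
    """Fold every non-desired section into the most recent desired one.

    Hypothetical helper: follows the section order of a single article.
    """
    merged = {}
    current = None
    for name, sentences in zip(names, texts):
        if name in desired:
            current = name
        if current is None:
            # Assumption: text before the first desired section is dropped.
            continue
        merged.setdefault(current, []).extend(sentences)
    return merged

# Same data as in the question.
df = pd.DataFrame({
    "section_names": [
        ["introduction", "methods", "section1", "another section", "discussion"],
        ["introduction", "methods", "discussion", "other section",
         "one more section", "conclusion"]],
    "sections": [
        [["intro text", "another sentence"], ["some text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]],
        [["intro text", "another sentence"], ["some text", "some text"],
         ["some text", "more text", "some text", "more text"],
         ["some text", "some text"], ["some text", "some text"],
         ["some text", "some text"]]],
})

# One dict per article; sections missing from an article become NaN.
new_df = pd.DataFrame(
    [merge_sections(n, t) for n, t in zip(df["section_names"], df["sections"])],
    columns=DESIRED,
)
print(new_df.to_string())
```

Since this never builds the 10,000-column intermediate frame, it should also be lighter on memory for the full dataset, though I have not measured it.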