pythonpandaspubmed

Pandas rule based column merging


I have a dataset with articles from pubmed. DataFrame looks like this:

df = pd.DataFrame({"section_names":[["introduction","methods","section1","another section","discussion"],
                                ["introduction","methods","discussion","other section","one  more section","conclusion"]],
               "sections":[[["intro text","another sentence"],["some text","some text", "more text"],["some text","some text"],["some text","some text"],["some text","some text"]],
                          [["intro text","another sentence"],["some text","some text"],["some text","more text","some text","more text"],["some text","some text"],["some text","some text"],["some text","some text"]]]})

So basically, the column section_names has has name of all the sections in an article. In column "sections", there is actual text in a list for each section names in section_names. As a first step I wanted to have each section in a column. So, I did this:

df_col = pd.DataFrame([dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])]):

resulting dataframe

The Value NaN makes sense because those section are not available for the particular column, for each column there will be at least one non NaN value. For a lot of articles with different section names, the number of columns will increase drastically. In the original dataset, I actually have around 10,000 columns.

What I now want is to merge the columns and have max 4 columns (Introduction, methods, discussion, conclusion). I want to say something like:

After a section name methods, merge all other sections until discussion with methods and after methods merge all until conclusion with discussion

With this rule in our df, for first article, section1 and another section will be merged with methods. For second article, other section and one more section should be merged with discussion.

How do I do this?


Solution

  • One option is to create a column index based on where the desired columns are, then aggregate the rows of each group into lists:

    desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
    new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
        lambda x: x.agg(
            lambda r: list(itertools.chain.from_iterable(r.dropna()))
                      or np.nan,
            axis=1)
    )
    new_df.columns = desired_columns
    

    new_df:

                         introduction                                                                        methods                                                                                discussion              conclusion
    0  [intro text, another sentence]  [some text, some text, more text, some text, some text, some text, some text]                                                                    [some text, some text]                     NaN
    1  [intro text, another sentence]                                                         [some text, some text]  [some text, more text, some text, more text, some text, some text, some text, some text]  [some text, some text]
    

    The column index is created using:

    df.columns.isin(desired_columns).cumsum()
    

    Which produces groups like:

    [1 2 2 2 3 3 3 4]
    

    Complete Working Example:

    import itertools
    
    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame({
        "section_names": [
            ["introduction", "methods", "section1", "anothersection", "discussion"],
            ["introduction", "methods", "discussion", "othersection",
             "onemoresection", "conclusion"]], "sections": [
            [["introtext", "anothersentence"], ["sometext", "sometext", "moretext"],
             ["sometext", "sometext"], ["sometext", "sometext"],
             ["sometext", "sometext"]],
            [["introtext", "anothersentence"], ["sometext", "sometext"],
             ["sometext", "moretext", "sometext", "moretext"],
             ["sometext", "sometext"], ["sometext", "sometext"],
             ["sometext", "sometext"]]]
    })
    
    df = pd.DataFrame(
        [dict(zip(*pair)) for pair in zip(df['section_names'], df['sections'])])
    
    desired_columns = ['introduction', 'methods', 'discussion', 'conclusion']
    new_df = df.groupby(df.columns.isin(desired_columns).cumsum(), axis=1).agg(
        lambda x: x.agg(
            lambda r: list(itertools.chain.from_iterable(r.dropna()))
                      or np.nan,
            axis=1)
    )
    new_df.columns = desired_columns
    print(new_df.to_string())