pythonpandasdataframeinformation-retrieval

Extract the values of a dataframe that correspond to a single element of another df


I have 2 pandas dfs (df1 & df2) as seen here:

df1 col1 col2 col3 col4 col5
row1 Dog Cat Bird Tree Lion
row2 Cat Dragon Bird Dog Tree
row3 Cat Dog Bird Tree Hippo
row4 Cat Tree Bird Ant Fish
row5 Cat Tree Monkey Dragon Ant
df2 col1 col2 col3 col4 col5
row1 3.219843 1.3631996 1.0051135 0.89303696 0.4313375
row2 2.8661892 1.4396228 0.7863044 0.539315 0.48167187
row3 2.5679462 1.3657334 0.9470184 0.79186934 0.48637152
row4 3.631389 0.94815284 0.7561722 0.6743943 0.5441728
row5 2.4727197 1.5941181 1.4069512 1.064051 0.48297918

The string elements of the df1 correspond to the values of df2. For both dataframes the condition exists that an element (or a value) does not repeat on the same row. But can be repeated on different rows.

For example Dog of row1 = 3.219843, bird of row3 = 0.9470184, bird of row4 = 0.7561722 etc.

I would like to extract the values for all unique elements of the 1st df into different arrays. Like:

Dog = [3.219843, 0.539315, 1.3657334]

Cat = [1.3631996, 2.8661892, 2.5679462, 3.631389, 2.4727197]

etc...

Any ideas?

Many thanks!


Solution

  • Assuming that your first columns df1 and df2 are the index of their respective df, we can extract the values for each unique animal in df1 by using the first df as a mask to extract all wanted values from the second one (the result is a new df with NaN in irrelevant cells, which can be turned into a 1-dimensional array with .stack().values).

    Construct the dataframes

    First of, create some test data. Please provide it in a form like this in future posts. That's what @mozway was talking about in the comments. It is greatly appreciated.

    (It's not always the case that somebody is willing to do all the copy-and-pasting necessary to get dataframes up and running for testing.)

    import pandas as pd
    import numpy as np
    
    index = ['row1', 'row2', 'row3', 'row4', 'row5']
    
    data1 = {'col1': ['Dog', 'Cat', 'Cat', 'Cat', 'Cat'],
             'col2': ['Cat', 'Dragon', 'Dog', 'Tree', 'Tree'],
             'col3': ['Bird', 'Bird', 'Bird', 'Bird', 'Monkey'],
             'col4': ['Tree', 'Dog', 'Tree', 'Ant', 'Dragon'],
             'col5': ['Lion', 'Tree', 'Hippo', 'Fish', 'Ant']}
    
    data2 = {'col1': [3.219843, 2.8661892, 2.5679462, 3.631389, 2.4727197],
             'col2': [1.3631996, 1.4396228, 1.3657334, 0.94815284, 1.5941181],
             'col3': [1.0051135, 0.7863044, 0.9470184, 0.7561722, 1.4069512],
             'col4': [0.89303696, 0.539315, 0.79186934, 0.6743943, 1.064051],
             'col5': [0.4313375, 0.48167187, 0.48637152, 0.5441728, 0.48297918]}
    
    df1 = pd.DataFrame(data1, index=index)
    df2 = pd.DataFrame(data2, index=index)
    

    Extract the data

    Since you didn't specify what data structure you need, this is the strategy outlined above in a dict comprehension:

    {animal: df2[df1.eq(animal)].stack().values for animal in np.unique(df1)}
    

    The result looks like this:

    {'Ant': array([0.6743943 , 0.48297918]),
     'Bird': array([1.0051135, 0.7863044, 0.9470184, 0.7561722]),
     'Cat': array([1.3631996, 2.8661892, 2.5679462, 3.631389 , 2.4727197]),
     'Dog': array([3.219843 , 0.539315 , 1.3657334]),
     'Dragon': array([1.4396228, 1.064051 ]),
     'Fish': array([0.5441728]),
     'Hippo': array([0.48637152]),
     'Lion': array([0.4313375]),
     'Monkey': array([1.4069512]),
     'Tree': array([0.89303696, 0.48167187, 0.79186934, 0.94815284, 1.5941181 ])}