I need to get all descendants point of links represented with side_a - side_b (in one dataframe) until reach for each side_a their end_point (in other dataframe). So:
df1:
side_a side_b
a b
b c
c d
k l
l m
l n
p q
q r
r s
df2:
side_a end_point
a c
b c
c c
k m
k n
l m
l n
p s
q s
r s
The point is to get all points for each side_a value until reach end_point from df2 for that value. If it has two end_point values (like "k" does) that it should be two lists.
I have some code but it's not written with this approach, it drops all rows from df1 if df1['side_a'] == df2['end_points']
and that causes certain problems. But if someone wants me to post the code I will, of course.
The desired output would be something like this:
side_a end_point
a [b, c]
b [c]
c [c]
k [l, m]
k [l, n]
l [m]
l [n]
p [q, r, s]
q [r, s]
r [s]
And one more thing, if there is the same both side, that point doesn't need to be listed at all, I can append it later, whatever it's easier.
import pandas as pd
import numpy as np
import itertools
def get_child_list(df, parent_id):
list_of_children = []
list_of_children.append(df[df['side_a'] == parent_id]['side_b'].values)
for c_, r_ in df[df['side_a'] == parent_id].iterrows():
if r_['side_b'] != parent_id:
list_of_children.append(get_child_list(df, r_['side_b']))
# to flatten the list
list_of_children = [item for sublist in list_of_children for item in sublist]
return list_of_children
new_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
for index, row in df1.iterrows():
temp_df = pd.DataFrame(columns=['side_a', 'list_of_children'])
temp_df['list_of_children'] = pd.Series(get_child_list(df1, row['side_a']))
temp_df['side_a'] = row['side_a']
new_df = new_df.append(temp_df)
So, the problem with this code is that works if I drop rows where side_a is equal to end_point from df2. I don't know how to implement condition that if catch the df2 in side_b column, then stop, don't go further.
Any help or hint is welcomed here, truly. Thanks in advance.
You can use networkx library and graphs:
import networkx as nx
G = nx.from_pandas_edgelist(df, source='side_a',target='side_b')
df2.apply(lambda x: [nx.shortest_path(G, x.side_a,x.end_point)[0],
nx.shortest_path(G, x.side_a,x.end_point)[1:]], axis=1)
Output:
side_a end_point
0 a [b, c]
1 b [c]
2 c []
3 k [l, m]
4 k [l, n]
5 l [m]
6 l [n]
7 p [q, r, s]
8 q [r, s]
9 r [s]