I'm trying to construct a simple edge list for an undirected graph containing all possible edges. I used to do it using the cartesian product of the node list by itself & then filtering out duplicated & self edges. This time the size of the input is too large to afford to store unnecessary edges momentarily. Thus, I'm trying to use nested loops to get the needed edges directly from the first time.
Here is the code I wrote:
node_list = ['A', 'B', 'C', 'D']
for i in node_list:
for j in node_list:
if i < j:
source.append(i)
target.append(j)
loop_data = pd.DataFrame({'source': source, 'target':target})
print(loop_data)
The output I get is pretty unexpected. Instead of saving the source & target nodes in their respective lists, the program is keeping both the source & the target nodes in both source & target columns. Here is the current state of the output.
source target
0 A A
1 B B
2 A A
3 C C
4 A A
5 D D
6 B B
7 C C
8 B B
9 D D
10 C C
11 D D
This is the expected form of output (ignore row indexing) :
source target
1 A B
2 A C
3 A D
6 B C
7 B D
11 C D
I can't find where the problem exists. The issue seems to be with appending to both of the source & target lists.
If you initialise each variable as the empty list ([]
) the code works as expected. So I suggest you review how you are initialising the source
and target
variables:
source = [] # empty list
target = [] # different empty list
node_list = ['A', 'B', 'C', 'D']
for i in node_list:
for j in node_list:
if i < j:
source.append(i)
target.append(j)
loop_data = pd.DataFrame({'source': source, 'target': target})
print(loop_data)
Output
source target
0 A B
1 A C
2 A D
3 B C
4 B D
5 C D
Since you want to construct an undirected graph, I suggest you use itertools.combinations
to generate the edges:
import pandas as pd
from itertools import combinations
node_list = ['A', 'B', 'C', 'D']
res = pd.DataFrame(data=combinations(node_list, r=2), columns=["source", "target"])
print(res)
Output
source target
0 A B
1 A C
2 A D
3 B C
4 B D
5 C D
The reason for using combinations
is that it avoids the Python level for
-loop therefore for larger lists it should be faster. It is also in my opinion more pythonic.
Timings
For a node_list
of 676 elements, I get the following timings:
%timeit c = combinations_approach(node_list)
19.7 ms ± 212 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit o = original_approach(node_list)
41.2 ms ± 64.1 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
where the functions combinations_approach
and original_approach
are:
def original_approach(node_list):
source = []
target = []
for i in node_list:
for j in node_list:
if i < j:
source.append(i)
target.append(j)
return pd.DataFrame({'source': source, 'target': target})
def combinations_approach(node_list):
return pd.DataFrame(combinations(sorted(node_list), r=2), columns=["source", "target"])