My data set a list of people either working together or alone.
I have have a row for each project and columns with names of all the people who worked on that project. If column 2 is the first empty column given a row it was a solo job, if column 4 is the first empty column given a row then there were 3 people working together.
My goal is to find which people have worked together, and how many times, so I want all pairs in the data set, treating A working with B the same as B working with A.
From this a square N x N would be created with every actor labeling the column and row and in cell (A,B) and (B,A) would have how many times that pair worked together, and this would be done for every pair.
I know of a 'pretty' quick way to do it in Excel but I want it automated, hopefully in Stata or Python, just in case projects are added or removed I can just 1-click the re-run and not have to re-do it every time.
An example of the data, in a comma delimited fashion:
A
A,B
B,C,E
B,F
D,F
A,B,C
D,B
E,C,B
X,D,A
Hope that helps!
Brice. F,D B F F,X,C C,F,D
Maybe something like this would get you started?
import csv
import collections
import itertools
grid = collections.Counter()
with open("connect.csv", "r", newline="") as fp:
reader = csv.reader(fp)
for line in reader:
# clean empty names
line = [name.strip() for name in line if name.strip()]
# count single works
if len(line) == 1:
grid[line[0], line[0]] += 1
# do pairwise counts
for pair in itertools.combinations(line, 2):
grid[pair] += 1
grid[pair[::-1]] += 1
actors = sorted(set(pair[0] for pair in grid))
with open("connection_grid.csv", "w", newline="") as fp:
writer = csv.writer(fp)
writer.writerow([''] + actors)
for actor in actors:
line = [actor,] + [grid[actor, other] for other in actors]
writer.writerow(line)
[edit: modified to work under Python 3.2]
The key modules are (1)csv
, which makes reading and writing csv files much simpler; (2) collections
, which provides an object called a Counter
-- like a defaultdict(int)
, which you could use if your Python doesn't have Counter
, it's a dictionary which automatically generates default values so you don't have to, and here the default count is 0; and (3) itertools
, which has a combinations
function to get all the pairs.
which produces
,A,B,C,D,E,F,X
A,1,2,1,1,0,0,1
B,2,1,3,1,2,1,0
C,1,3,0,1,2,2,1
D,1,1,1,0,0,3,1
E,0,2,2,0,0,0,0
F,0,1,2,3,0,1,1
X,1,0,1,1,0,1,0
You could use itertools.product
to make building the array a little more compact, but since it's only a line or two I figured it was as simple to do it manually.