I am trying to count each row in Pandas dataframe that contain two specified values, but the problem is that the values can be in any column. How can I loop the part with;
for col in con.columns:
counts += len(con[(con[col]==name1)])
...so that I have two conditions, that require name1 and name2 to exist on the same row? I would use the cross tabulation, but the dataset is way too large for that.
con = pd.read_csv("keyword connections.txt", sep="\t")
key = pd.read_csv("keyword short list.txt", sep="\t")
text_file = open("Output.txt", "w")
i = 0
j = 0
k = 0
for i in range(0,key.size):
name1 = key.iloc[i]["Keyword"]
for j in range(0,key.size):
name2 = key.iloc[j]["Keyword"]
counts = 0
for col in con.columns:
counts += len(con[(con[col]==name1)])
text_file.write(name1+"\t"+name2+"\t"+str(counts)+"\n")
j += 1
k += 1
i += 1
text_file.close()
print(k)
Assuming this example:
col1 col2 col3
0 name1 other name2
1 other name2 other
2 name2 name1 other
3 other other other
You could use several masks built with eq
and any
and combine them with &
to perform boolean indexing:
m1 = df.eq('name1').any(axis=1)
m2 = df.eq('name2').any(axis=1)
out = df[m1 & m2]
Or, from a list of values:
names = ['name1', 'name2']
out = df[np.logical_and.reduce([df.eq(n).any(axis=1)
for n in names])]
Alternatively, you could use a set
of the wanted values and agg
:
S = {'name1', 'name2'}
out = df[df.agg(S.issubset, axis=1)]
Output:
col1 col2 col3
0 name1 other name2
2 name2 name1 other
Intermediates:
col1 col2 col3 m1 m2 m1&m2 as_set S.issubset
0 name1 other name2 True True True {name1, other, name2} True
1 other name2 other False True False {other, name2} False
2 name2 name1 other True True True {name1, other, name2} True
3 other other other False False False {other} False