I have a dataframe
sample1 0 0 0 0 0 1 1 1 1 1 1 1 1 L1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0 L1-1
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0 L1-1-1
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0 L1-1-1-1
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0 L1-1-2
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0 L1-1-2-1
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0 L1-1-3
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1 L1-2
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0 L1-2-1
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1 L1-2-2
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0 L2
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0 L2-1
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0 L2-1-1
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0 L2-1-1-1
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0 L2-2
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0 L2-3
As you can see, each row is clustered.
I want to name "lineage-based" labeling to each sample.
For example, sample1 will be lin1 because it is first to appear, sample2 will be lin1-1.
Sample3 will be lin1-1-1, sample4 will be lin1-1-1-1.
Next, sample5 will be lin1-2, sample6 will be lin1-2-1...
Sample11 will be a new start for the lineage, lin2.
My original idea for the naming was.
"sample1 is lin1, if next sample is included in the previous sample, lin1 + "-1" if not, lin(1+1)"
sample1 -> lin1
sample2 -> lin1-1 (sample2 is included in sample1)
sample3 -> lin1-1-1 (sample3 is included in sample2)
sample4 -> lin1-1-1-1 (sample4 is included in sample3)
sample5 -> lin1-1-2 (sample5 is not included in sample4) .... logic like this.
I couldn't make this logic into a python script.
This can be done in several steps.
Step 1. Data preprocessing
Sort the data in descending order and remove duplicate, otherwise it may not work. Assume done.
import numpy as np
data = '''sample1 0 0 0 0 0 1 1 1 1 1 1 1 1
sample2 0 0 0 0 0 1 1 1 1 1 0 0 0
sample3 0 0 0 0 0 1 1 0 0 0 0 0 0
sample4 0 0 0 0 0 1 0 0 0 0 0 0 0
sample5 0 0 0 0 0 0 0 1 1 0 0 0 0
sample6 0 0 0 0 0 0 0 1 0 0 0 0 0
sample7 0 0 0 0 0 0 0 0 0 1 0 0 0
sample8 0 0 0 0 0 0 0 0 0 0 1 1 1
sample9 0 0 0 0 0 0 0 0 0 0 1 1 0
sample10 0 0 0 0 0 0 0 0 0 0 0 0 1
sample11 1 1 1 1 1 0 0 0 0 0 0 0 0
sample12 1 1 1 0 0 0 0 0 0 0 0 0 0
sample13 1 1 0 0 0 0 0 0 0 0 0 0 0
sample14 1 0 0 0 0 0 0 0 0 0 0 0 0
sample15 0 0 0 1 0 0 0 0 0 0 0 0 0
sample16 0 0 0 0 1 0 0 0 0 0 0 0 0'''
data = [x.split() for x in data.split('\n')]
data = [x[1:] for x in data]
data = np.array(data, dtype=int)
data
array([[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1],
[0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0]])
Step 2. Encode the sample to position. Each element is a frozenset.
nrow, ncol = data.shape
def to_position(sample):
ncol = len(sample)
return frozenset(i for i in range(ncol) if sample[i] == 1)
position = [to_position(data[i]) for i in range(nrow)]
# print(position)
Step 3. Assign each sample position to a cluster, where the cluster is represented as a tuple for now.
def assign_cluster(sample, clusters, parent):
if parent not in clusters:
clusters[parent] = sample
elif sample < clusters[parent]:
# Find child
parent = parent + (0,)
assign_cluster(sample, clusters, parent)
else:
# Find siblings
parent = parent[:-1] + (parent[-1] + 1, )
assign_cluster(sample, clusters, parent)
clusters = {}
root = (0,)
clusters[root] = position[0]
for i in range(1, nrow):
sample = position[i]
assign_cluster(sample, clusters, parent=root)
# print(clusters)
Step 4. Convert cluster to string and show result.
def cluster_to_string(c):
c = [str(_ + 1) for _ in c]
return 'L' + '-'.join(c)
position_dict = {v: k for k, v in clusters.items()}
for sample in data:
sample = to_position(sample)
c = position_dict[sample]
print(cluster_to_string(c))
L1
L1-1
L1-1-1
L1-1-1-1
L1-1-2
L1-1-2-1
L1-1-3
L1-2
L1-2-1
L1-2-2
L2
L2-1
L2-1-1
L2-1-1-1
L2-2
L2-3