I have longitudinal data as follows:
import pandas as pd
# Define the updated data with samples only in 'sample_A' or 'sample_B'
data = {
'gene_id': ['gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3'],
'position': [1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5],
'value': [5.1, 5.5, 5.7, 6.0, 6.3,
6.3, 6.5, 6.7, 6.8, 5.1,
2.3, 2.5, 2.7, 3.0, 3.1,
3.1, 3.2, 3.3, 3.4, 2.3,
3.7, 3.8, 3.9, 4.0, 4.0,
4.0, 4.1, 4.2, 4.3, 3.7],
'sample': ['sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A']
}
# Create the DataFrame
df = pd.DataFrame(data)
My goal is to cluster gene value profiles and then see how those clusters correspond to samples. Here, a profile is defined as follows: take a sample, take a gene_id, then take all (position, value) tuples within the resulting subset.
By clustering, I mean that I want to understand how the shapes and amplitudes of the curves traced by the profiles group together. As a start, a simple KMeans would be fine.
After clustering, the idea is to restore to each profile the sample it came from, then plot the cluster space and see how the samples are distributed.
I've seen solutions in R for this, but haven't seen any in Python. Any help is appreciated.
Don't pivot the dataframe. This is possible with a call to kmeans2. How many clusters you want is up to you.
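As a rough sketch of what that call takes and returns, here is kmeans2 on a handful of toy (position, value) points. The points are invented for illustration and are not the question's data, and minit='++' is an assumption chosen here to make the toy example stable; the code further down keeps the default initialisation.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy (position, value) points, invented for illustration:
# two groups that differ strongly in value.
points = np.array([
    [1.0, 2.0], [2.0, 2.1], [3.0, 2.2],
    [1.0, 20.0], [2.0, 20.1], [3.0, 20.2],
])

# k is the number of clusters; seed makes the initialisation reproducible.
# minit='++' (k-means++ initialisation) is a choice made for this sketch.
centroids, labels = kmeans2(points, k=2, minit='++', seed=0)

print(centroids)  # one row per cluster, one column per feature
print(labels)     # one integer cluster label per input row
```

The second return value lines up row for row with the input, which is why it can be assigned straight to a new dataframe column.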
There's an infinite number of ways to visualise this, so let's randomly pick one: plot original points by all four variables, with position and value spatial; plot the cluster centroids as crosses; and then circle all points in a colour corresponding to their cluster:
import numpy as np
import scipy.cluster.vq
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
df = pd.DataFrame({
'gene_id': ('gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_1', 'gene_1', 'gene_1', 'gene_1', 'gene_1',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_2', 'gene_2', 'gene_2', 'gene_2', 'gene_2',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3',
'gene_3', 'gene_3', 'gene_3', 'gene_3', 'gene_3'),
'position': (1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5,
1, 2, 3, 4, 5),
'value': (5.1, 5.5, 5.7, 6.0, 6.3,
6.3, 6.5, 6.7, 6.8, 5.1,
2.3, 2.5, 2.7, 3.0, 3.1,
3.1, 3.2, 3.3, 3.4, 2.3,
3.7, 3.8, 3.9, 4.0, 4.0,
4.0, 4.1, 4.2, 4.3, 3.7),
'sample': ('sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A',
'sample_A', 'sample_A', 'sample_A', 'sample_A', 'sample_B',
'sample_B', 'sample_B', 'sample_B', 'sample_B', 'sample_A')
})
centroid_data, df['cluster_label'] = scipy.cluster.vq.kmeans2(
    data=df[['position', 'value']], k=4, seed=0,
)
centroids = pd.DataFrame(
    index=pd.RangeIndex(name='cluster_label', stop=len(centroid_data)),
    columns=('position', 'value'),
    data=centroid_data,
)
print(df)
print(centroids)
fig, ax = plt.subplots()
sns.scatterplot(ax=ax, data=df, x='position', y='value', hue='sample', style='gene_id')
cmap = plt.cm.rainbow(np.linspace(0, 1, len(centroids)))
for (label, cluster), color in zip(df.groupby('cluster_label'), cmap):
    # Mark this cluster's centroid with a cross.
    ax.scatter(
        [centroids.loc[label, 'position']],
        [centroids.loc[label, 'value']], s=60, color=color, marker='+',
    )
    # Circle every point assigned to this cluster in the same colour.
    ax.scatter(
        cluster['position'], cluster['value'], s=120, color=color, marker='o', facecolors='none',
    )
plt.show()
gene_id position value sample cluster_label
0 gene_1 1 5.1 sample_A 1
1 gene_1 2 5.5 sample_A 1
2 gene_1 3 5.7 sample_A 1
3 gene_1 4 6.0 sample_A 0
4 gene_1 5 6.3 sample_B 0
5 gene_1 1 6.3 sample_B 1
6 gene_1 2 6.5 sample_B 1
7 gene_1 3 6.7 sample_B 1
8 gene_1 4 6.8 sample_B 0
9 gene_1 5 5.1 sample_A 0
10 gene_2 1 2.3 sample_A 3
11 gene_2 2 2.5 sample_A 3
12 gene_2 3 2.7 sample_A 3
13 gene_2 4 3.0 sample_A 2
14 gene_2 5 3.1 sample_B 2
15 gene_2 1 3.1 sample_B 3
16 gene_2 2 3.2 sample_B 3
17 gene_2 3 3.3 sample_B 3
18 gene_2 4 3.4 sample_B 2
19 gene_2 5 2.3 sample_A 2
20 gene_3 1 3.7 sample_A 3
21 gene_3 2 3.8 sample_A 3
22 gene_3 3 3.9 sample_A 3
23 gene_3 4 4.0 sample_A 2
24 gene_3 5 4.0 sample_B 2
25 gene_3 1 4.0 sample_B 3
26 gene_3 2 4.1 sample_B 3
27 gene_3 3 4.2 sample_B 3
28 gene_3 4 4.3 sample_B 2
29 gene_3 5 3.7 sample_A 2
position value
cluster_label
0 4.5 6.050000
1 2.0 5.966667
2 4.5 3.475000
3 2.0 3.400000
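To relate the cluster labels back to samples without pivoting, one option (a sketch, not the only way) is to cross-tabulate the cluster_label column against sample, and against (sample, gene_id) to get a per-profile view. The frame is rebuilt compactly here so the snippet runs on its own; by_sample and by_profile are names introduced for this illustration, and the exact labels may differ between SciPy versions.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans2

# Rebuild the question's frame compactly (same rows, same order).
df = pd.DataFrame({
    'gene_id': np.repeat(['gene_1', 'gene_2', 'gene_3'], 10),
    'position': np.tile(np.arange(1, 6), 6),
    'value': [5.1, 5.5, 5.7, 6.0, 6.3, 6.3, 6.5, 6.7, 6.8, 5.1,
              2.3, 2.5, 2.7, 3.0, 3.1, 3.1, 3.2, 3.3, 3.4, 2.3,
              3.7, 3.8, 3.9, 4.0, 4.0, 4.0, 4.1, 4.2, 4.3, 3.7],
    'sample': np.tile(['sample_A'] * 4 + ['sample_B'] * 5 + ['sample_A'], 3),
})

# Same clustering call as above, on the (position, value) columns.
_, df['cluster_label'] = kmeans2(
    data=df[['position', 'value']].to_numpy(), k=4, seed=0,
)

# Number of points per cluster and sample.
by_sample = pd.crosstab(df['cluster_label'], df['sample'])

# Number of points per profile, i.e. per (sample, gene_id), and cluster.
by_profile = pd.crosstab([df['sample'], df['gene_id']], df['cluster_label'])

print(by_sample)
print(by_profile)
```

Each row of by_profile sums to the five points of that profile, so it shows at a glance which clusters a given sample's curves pass through.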