How to generate clusters based on multiples values of a data frame in R

I have a data obtained from a blastn analyses, to generate it, I used genes from 3 samples (using prokka) and then generate to generate the database, and then generate a for loop to generate the blastn analysis of each sample vs DB.fa (all vs all ) it generated S01, S02, S03, similar to:

S01[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample")] %>% .[1:20, ]
           qseqid         sseqid  pident qcovs      lrqs sample
1  LHCELNOF_00001 LHCELNOF_00001 100.000   100 1.0000000    S01
2  LHCELNOF_00001 GDFMIKHC_00335  98.580   100 0.9695652    S01
3  LHCELNOF_00001 ACBMEBGO_00001  96.487   100 1.0000000    S01
4  LHCELNOF_00002 LHCELNOF_00002 100.000   100 1.0000000    S01
5  LHCELNOF_00002 GDFMIKHC_00336  98.910   100 1.0000000    S01
6  LHCELNOF_00002 ACBMEBGO_00002  95.913   100 1.0000000    S01
7  LHCELNOF_00003 LHCELNOF_00003 100.000   100 1.0000000    S01
8  LHCELNOF_00003 GDFMIKHC_00337  99.259   100 1.0000000    S01
9  LHCELNOF_00003 ACBMEBGO_00003  98.519   100 1.0000000    S01
10 LHCELNOF_00004 LHCELNOF_00004 100.000   100 1.0000000    S01
11 LHCELNOF_00004 GDFMIKHC_00338  98.386    99 0.9995864    S01
12 LHCELNOF_00004 ACBMEBGO_00004  97.600    99 0.9995864    S01
13 LHCELNOF_00005 LHCELNOF_00005 100.000   100 1.0000000    S01
14 LHCELNOF_00005 ACBMEBGO_00005  97.691    99 0.9954023    S01
15 LHCELNOF_00005 GDFMIKHC_00339  97.011   100 1.0000000    S01
16 LHCELNOF_00006 LHCELNOF_00006 100.000   100 1.0000000    S01
17 LHCELNOF_00006 GDFMIKHC_00340  97.075   100 0.9672856    S01
18 LHCELNOF_00006 ACBMEBGO_00006  91.320   100 0.9754902    S01
19 LHCELNOF_00007 LHCELNOF_00007 100.000   100 1.0000000    S01
20 LHCELNOF_00007 GDFMIKHC_00341  97.932   100 1.0000000    S01

so I want to generate a cluster of ids based in 3 values, here I just show the columns that I´m using:

from the imported files in R

S01[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample")] %>% head()
          qseqid         sseqid  pident qcovs      lrqs sample
1 LHCELNOF_00001 LHCELNOF_00001 100.000   100 1.0000000    S01
2 LHCELNOF_00001 GDFMIKHC_00335  98.580   100 0.9695652    S01
3 LHCELNOF_00001 ACBMEBGO_00001  96.487   100 1.0000000    S01
4 LHCELNOF_00002 LHCELNOF_00002 100.000   100 1.0000000    S01
5 LHCELNOF_00002 GDFMIKHC_00336  98.910   100 1.0000000    S01
6 LHCELNOF_00002 ACBMEBGO_00002  95.913   100 1.0000000    S01

the rest of the samples are similar !!!

to generate a single file:

SAMPLES <- rbind(S01, S02, S03)

so, to generate a cluster based in similarity of 3 values:

pident >= 90, qcovs >= 90, lrqs >= 0.9

if some ids presented those values are considered a cluster: example

SAMPLES[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample")] %>% head()
          qseqid         sseqid  pident qcovs      lrqs sample
1 LHCELNOF_00001 LHCELNOF_00001 100.000   100 1.0000000    S01
2 LHCELNOF_00001 GDFMIKHC_00335  98.580   100 0.9695652    S01
3 LHCELNOF_00001 ACBMEBGO_00001  96.487   100 1.0000000    S01
4 LHCELNOF_00002 LHCELNOF_00002 100.000   100 1.0000000    S01
5 LHCELNOF_00002 GDFMIKHC_00336  98.910   100 1.0000000    S01
6 LHCELNOF_00002 ACBMEBGO_00002  95.913   100 1.0000000    S01

if LHCELNOF_00001 in qseqid presented pident, qcovs and lrqs 90 and 0.9 extract the ids in sseqid and those represent a cluster:

filter(SAMPLES, qseqid== "LHCELNOF_00001" & pident>=90 & qcovs >=90 & lrqs >=0.9 ) %>% .[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample")]
          qseqid         sseqid  pident qcovs      lrqs sample
1 LHCELNOF_00001 LHCELNOF_00001 100.000   100 1.0000000    S01
2 LHCELNOF_00001 GDFMIKHC_00335  98.580   100 0.9695652    S01
3 LHCELNOF_00001 ACBMEBGO_00001  96.487   100 1.0000000    S01

it means that LHCELNOF_00001 are equal to GDFMIKHC_00335 and ACBMEBGO_00001 (the 3 ids present in sseqid column), so those ids are in the same cluster, so I want to add an extra column based in those ids, similar to

in S01 will be like:

          qseqid         sseqid  pident qcovs      lrqs sample  cluster
1 LHCELNOF_00001 LHCELNOF_00001 100.000   100 1.0000000    S01  cluster_1
2 LHCELNOF_00001 GDFMIKHC_00335  98.580   100 0.9695652    S01  cluster_1
3 LHCELNOF_00001 ACBMEBGO_00001  96.487   100 1.0000000    S01  cluster_1

so I made a loop with SAMPLE to extract the clusters and then use join with each sample (S01, S02, S03) first tried:

U1 <- unique(SAMPLES$qseqid)
U2 <- unique(SAMPLES$sseqid)
U <- c(U1, U2)
U <- unique(U)

l<- c()
for(i in 1:length(U)){
    x <- filter(SAMPLES, qseqid==U[i] & pident>=90 & qcovs >=90 & lrqs >=0.9 ) %>% .["sseqid"]
    x["cluster"] <- paste("cluster", i, sep = "_")
    l[[i]] <- x
    names(l)[i] <- paste("cluster", i, sep = "_") 
}

ldf <- do.call(rbind.data.frame, l)
rownames(ldf) <- NULL

head(ldf)
          sseqid   cluster
1 LHCELNOF_00001 cluster_1
2 GDFMIKHC_00335 cluster_1
3 ACBMEBGO_00001 cluster_1
4 LHCELNOF_00002 cluster_2
5 GDFMIKHC_00336 cluster_2
6 ACBMEBGO_00002 cluster_2

but this method generate multiples clusters with the same ids due to all ids of the cluster 1
are in qseqid, if I have 3 samples, I will have 3 clusters with the same ids (LHCELNOF_00001, GDFMIKHC_00335 and ACBMEBGO_00001), just with different name of the cluster (cluster 1, cluster 4584, cluster 9485)

filter(ldf, cluster=="cluster_1")
          sseqid   cluster
1 LHCELNOF_00001 cluster_1
2 GDFMIKHC_00335 cluster_1
3 ACBMEBGO_00001 cluster_1

LHCELNOF_00001, GDFMIKHC_00335, ACBMEBGO_00001 are in 3 different cluster

here I just show 1 id

filter(ldf, sseqid=="LHCELNOF_00001")
          sseqid      cluster
1 LHCELNOF_00001    cluster_1
2 LHCELNOF_00001 cluster_4584
3 LHCELNOF_00001 cluster_9485

because the id LHCELNOF_00001 generate the cluster_1, then GDFMIKHC_00335 generate cluster 4584, and ACBMEBGO 00001 the cluster 9485

so I tried to make a filter to avoid redundancy:

  rm(U1, U2, U, l, i, ldf, x)

U1 <- unique(SAMPLES$qseqid)
U2 <- unique(SAMPLES$sseqid)
U <- c(U1, U2)
U <- unique(U)


l<- c()
ids <- matrix(nrow=length(U), ncol=1)

for(i in 1:length(U)){
    if(i==1){
        x <- filter(SAMPLES, qseqid == U[i]) %>% .["sseqid"]
        x["cluster"] <- paste("cluster", i, sep = "_")
        l[[i]] <- x
        names(l)[i] <- paste("cluster", i, sep = "_")
        ids[i] <- x$sseqid
    }
    else{
# filter the ids previously annexed in a cluster 1
        f <- filter(SAMPLES, !qseqid %in% ids[,1])
# then continue with the clustering !!!
        x <- filter(f, qseqid == U[i]) %>% .["sseqid"]
        x["cluster"] <- paste("cluster", i, sep = "_")
        l[[i]] <- x
        names(l)[i] <- paste("cluster", i, sep = "_")
        ids[i] <- x$sseqid
        
    }    
}

with this method generate some warnings and it take some minutes to complete when I made it with 20 or more samples !!!

Error in `[<-.data.frame`(`*tmp*`, "cluster", value = "cluster_1547") : 
  replacement has 1 row, data has 0
In addition: There were 50 or more warnings (use warnings() to see the first 50)

other problem, is that some ids don’t generate the clusters

ldf <- do.call(rbind.data.frame, l)
rownames(ldf) <- NULL

colnames(ldf)[1] <- "qseqid"

S01x <- left_join(S01, ldf, by ="qseqid")


S01x %>% .[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample", "cluster")] %>% .[1:10, ]
           qseqid         sseqid  pident qcovs      lrqs sample   cluster
1  LHCELNOF_00001 LHCELNOF_00001 100.000   100 1.0000000    S01 cluster_1
2  LHCELNOF_00001 GDFMIKHC_00335  98.580   100 0.9695652    S01 cluster_1
3  LHCELNOF_00001 ACBMEBGO_00001  96.487   100 1.0000000    S01 cluster_1
4  LHCELNOF_00002 LHCELNOF_00002 100.000   100 1.0000000    S01 cluster_2
5  LHCELNOF_00002 GDFMIKHC_00336  98.910   100 1.0000000    S01 cluster_2
6  LHCELNOF_00002 ACBMEBGO_00002  95.913   100 1.0000000    S01 cluster_2
7  LHCELNOF_00003 LHCELNOF_00003 100.000   100 1.0000000    S01 cluster_3
8  LHCELNOF_00003 GDFMIKHC_00337  99.259   100 1.0000000    S01 cluster_3
9  LHCELNOF_00003 ACBMEBGO_00003  98.519   100 1.0000000    S01 cluster_3
10 LHCELNOF_00004 LHCELNOF_00004 100.000   100 1.0000000    S01 cluster_4

Show all NA in cluster column, and also filter the ids that are equal in qseqid and sseqid (the same id in both columns, and those must not be assigned because are the same elements )

S01x %>% filter(is.na(cluster) & !qseqid == sseqid) %>% .[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample", "cluster")] %>% .[1:10, ]
           qseqid         sseqid pident qcovs lrqs sample cluster
1  LHCELNOF_01549 GDFMIKHC_01944 97.857   100    1    S01    <NA>
2  LHCELNOF_01549 GDFMIKHC_01901 97.857   100    1    S01    <NA>
3  LHCELNOF_01550 GDFMIKHC_01999 99.177   100    1    S01    <NA>
4  LHCELNOF_01551 GDFMIKHC_02000 99.333   100    1    S01    <NA>
5  LHCELNOF_01552 ACBMEBGO_01584 90.891   100    1    S01    <NA>
6  LHCELNOF_01554 ACBMEBGO_01602 96.835   100    1    S01    <NA>
7  LHCELNOF_01555 LHCELNOF_01602 99.174   100    1    S01    <NA>
8  LHCELNOF_01555 GDFMIKHC_02005 98.623   100    1    S01    <NA>
9  LHCELNOF_01558 LHCELNOF_01681 99.531   100    1    S01    <NA>
10 LHCELNOF_01558 ACBMEBGO_01569 95.305   100    1    S01    <NA>

if I review the SAMPLE or S01x with LHCELNOF_01549 to identified the values

filter(S01x, qseqid=="LHCELNOF_01549") %>% .[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample", "cluster")]
          qseqid         sseqid  pident qcovs lrqs sample cluster
1 LHCELNOF_01549 LHCELNOF_01549 100.000   100    1    S01    <NA>
2 LHCELNOF_01549 GDFMIKHC_01944  97.857   100    1    S01    <NA>
3 LHCELNOF_01549 GDFMIKHC_01901  97.857   100    1    S01    <NA>

it were assigned as NA in cluster column but LHCELNOF_01549, GDFMIKHC_01944 and GDFMIKHC_01901 presented the identity (numbers) to generate a cluster, those must be the same cluster

filter(SAMPLES, qseqid %in% c("LHCELNOF_01549", "GDFMIKHC_01944", "GDFMIKHC_01901") ) %>% .[c("qseqid", "sseqid", "pident", "qcovs", "lrqs", "sample")]
           qseqid         sseqid  pident qcovs      lrqs sample
1  LHCELNOF_01549 LHCELNOF_01549 100.000   100 1.0000000    S01
2  LHCELNOF_01549 GDFMIKHC_01944  97.857   100 1.0000000    S01
3  LHCELNOF_01549 GDFMIKHC_01901  97.857   100 1.0000000    S01
4  GDFMIKHC_01901 GDFMIKHC_01901 100.000   100 1.0000000    S03
5  GDFMIKHC_01901 LHCELNOF_01549  97.857   100 1.0000000    S03
6  GDFMIKHC_01901 GDFMIKHC_01944  96.667   100 1.0000000    S03
7  GDFMIKHC_01944 GDFMIKHC_01944 100.000   100 1.0000000    S03
8  GDFMIKHC_01944 LHCELNOF_01549  97.857   100 1.0000000    S03
9  GDFMIKHC_01944 GDFMIKHC_01901  96.667   100 1.0000000    S03
10 GDFMIKHC_01944 GDFMIKHC_01943 100.000    11 0.2678571    S03

Any idea to generate the clustering based on the similitude or help me to solve the errors ??

I don’t necessary need a loop, just to solve the problem !!!

Thanks so much !!!

Solution

One idea is to create a new cluster variable by sorting the sseqid's in each cluster and concatenating them together. Then by comparing this variable you can see if any cluster's contain identical genes and remove the duplicate ones. Here's a simple example, with only 3 genes and 1 measure of similarity.

library(tidyverse)
dat <- expand.grid(qid = c("A","B","C"), sid = c("A","B","C"))
dat$similarity <- c(1, 0.8, 0.1, 0.8, 1, 0.2, 0.1, 0.2, 1)
dat
#>   qid sid similarity
#> 1   A   A        1.0
#> 2   B   A        0.8
#> 3   C   A        0.1
#> 4   A   B        0.8
#> 5   B   B        1.0
#> 6   C   B        0.2
#> 7   A   C        0.1
#> 8   B   C        0.2
#> 9   C   C        1.0
U1 <- unique(dat$qid)
U2 <- unique(dat$sid)
U <- c(U1, U2)
U <- unique(U)
l<- c()
for(i in 1:length(U)){
  x <- filter(dat, (dat$qid == U[i]) & (dat$similarity > 0.7)) %>% .["sid"]
  x["cluster"] <- paste("cluster", i, sep = "_")
  l[[i]] <- x
  names(l)[i] <- paste("cluster", i, sep = "_") 
}
ldf <- do.call(rbind.data.frame, l)
rownames(ldf) <- NULL
ldf
#>   sid   cluster
#> 1   A cluster_1
#> 2   B cluster_1
#> 3   A cluster_2
#> 4   B cluster_2
#> 5   C cluster_3

# Create a variable displaying the elements in a consistent method
clust <- data.frame(cluster = unique(ldf$cluster))
clust$elements <- character(nrow(clust))
for(i in 1:nrow(clust)) {
  clust[i, "elements"] <- 
    paste(sort(ldf[ldf$cluster == clust$cluster[[i]],"sid"]), 
          collapse = "")
}
clust
#>     cluster elements
#> 1 cluster_1       AB
#> 2 cluster_2       AB
#> 3 cluster_3        C
# remove duplicates
clust_to_keep <- clust[!duplicated(clust$elements), "cluster"]
out <- ldf[ldf$cluster %in% clust_to_keep,]
out
#>   sid   cluster
#> 1   A cluster_1
#> 2   B cluster_1
#> 5   C cluster_3

^{Created on 2022-11-17 with reprex v2.0.2}