How is it possible to find out which "represented sequence(s)" are represented by which “representative sequence(s)”?
For instance, in the example below, is there a way to retrieve the original 627 sequences represented by "r1"?
library(TraMineR)
data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, diss=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)
[>] criterion: density
[>] 2000 sequence(s) in the original data set
[>] 4 representative sequences
[>] overall quality: 0.08113734
[>] statistics for the representative set:
        na na(%)  nb nb(%)    SD   MD    DC    V      Q
r1     627  31.4 225 11.25  4566 7.28  4856 4.73   5.97
r2     577  28.8 123  6.15  4305 7.46  5175 5.05  16.81
r3     411  20.5 115  5.75  2658 6.47  2394 4.34 -11.04
r4     385  19.2  93  4.65  3006 7.81  3393 5.57  11.42
Total 2000 100.0 556 27.80 14535 7.27 15818 7.91   8.11
na: number of assigned objects
nb: number of objects in the neighborhood
SD: sum of the na distances to the representative
MD: mean of the na distances to the representative
DC: sum of the na distances to the center of the complete set
V: discrepancy of the subset
Q: quality of the representative
A complementary question: it would help to have more explanation of how "na" and "nb" should be read and interpreted. For example, do the 4 representative sequences (r1, r2, r3, r4) represent all 2000 sequences or just the 556 sequences?
I tried to find answers to my questions.
The sequences assigned to each representative can be retrieved from the "Distances" attribute of the object returned by seqrep. I illustrate using your example:
## "Distances" attribute of object returned by seqrep
rep.dist <- attr(biofam.rep,"Distances")
rep.dist[1:9,] # first 9 rows, to show what it looks like
# 1692 221 1167 1245
# 1167 NA NA 0.000000 NA
# 514 NA 10.000000 NA NA
# 1013 NA 9.794079 NA NA
# 275 NA 1.945416 NA NA
# 2580 NA 5.954724 NA NA
# 773 1.96818 NA NA NA
# 1187 13.89761 NA NA NA
# 47 NA 9.704456 NA NA
# 2091 NA NA 3.957049 NA
## retrieving assigned representative
rep.grp <- apply(rep.dist, 1, which.min)
seqdplot(biofam.seq, group=rep.grp, border=NA)
## sequences assigned to 1st representative
seq.rep1 <- biofam.seq[rep.grp==1,]
nrow(seq.rep1)
# 627
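As a quick cross-check of the steps above, the sketch below reproduces the same logic on a small made-up matrix that only mimics the layout of the "Distances" attribute (one row per sequence, one column per representative, NA where the sequence is not assigned). The names toy.dist, toy.grp, toy.na and toy.ids1 and all the values are illustrative assumptions, not TraMineR output.

```r
## Toy stand-in for attr(biofam.rep, "Distances") (values are made up)
toy.dist <- matrix(NA_real_, nrow = 5, ncol = 2,
                   dimnames = list(c("1167", "514", "773", "47", "2091"),
                                   c("1692", "221")))
toy.dist["1167", "221"]  <- 0.0
toy.dist["514",  "221"]  <- 10.0
toy.dist["773",  "1692"] <- 1.97
toy.dist["47",   "221"]  <- 9.70
toy.dist["2091", "1692"] <- 3.96

## Column index of the only non-NA entry in each row,
## i.e., the representative to which the sequence is assigned
toy.grp <- apply(toy.dist, 1, which.min)

## Number of sequences assigned to each representative
toy.na <- table(factor(toy.grp, levels = seq_len(ncol(toy.dist))))
print(toy.na)

## Row names of the sequences assigned to the 1st representative
toy.ids1 <- names(toy.grp)[toy.grp == 1]
print(toy.ids1)
```

Applied to the real object, table(rep.grp) should reproduce the na column of the summary, and names(rep.grp)[rep.grp == 1] should give the row names of the sequences assigned to r1, assuming the rows of the "Distances" matrix carry the row names of biofam.seq as the printed extract suggests.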
Regarding your complementary question:
Each sequence is assigned to the closest representative sequence, and na[i] is the total number of sequences assigned to ri.
Now, the neighborhood of each representative is defined by the pradius argument of seqrep (by default, 10% of the maximum distance). nb[i] is the number out of the na[i] sequences that are in the neighborhood of ri.
A sequence can be assigned to a representative without being in its neighborhood. It can also be in the neighborhood of a representative but be assigned (i.e., closer) to another representative.
For the example, the sum of the nb's tells us that 556 sequences are covered, i.e., in the neighborhood of at least one of the representatives. The sum of the na's is always the total number of sequences.
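The difference between na and nb can be illustrated on a toy example. The sketch below follows the definitions given in this answer and is not TraMineR's internal code: the distance matrix d, the radius value, and the use of "<=" for neighborhood membership are all assumptions made for illustration.

```r
## Hypothetical distances from 6 sequences (rows) to 2 representatives (columns)
d <- matrix(c(0, 1, 2, 9, 8, 5,
              9, 8, 6, 0, 1, 5.5),
            ncol = 2, dimnames = list(paste0("s", 1:6), c("r1", "r2")))

## Neighborhood radius: stand-in for pradius * maximum distance (assumed values)
radius <- 0.25 * 10

## Each sequence is assigned to its closest representative
assigned <- apply(d, 1, which.min)

## Distance of each sequence to the representative it is assigned to
d.assigned <- d[cbind(seq_len(nrow(d)), assigned)]

## Is the sequence inside the neighborhood of its own representative?
in.nbhd <- d.assigned <= radius

## na: assigned sequences; nb: assigned sequences that are also in the neighborhood
na <- table(factor(assigned, levels = 1:2))
nb <- table(factor(assigned[in.nbhd], levels = 1:2))

print(rbind(na = na, nb = nb))
print(sum(nb) / sum(na))   # share of sequences covered
```

Here s6 is assigned to r1 (distance 5) but lies outside the radius of 2.5, so it counts in na but not in nb: the na's sum to all 6 sequences while the nb's sum to 5.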