traminer

Mapping represented sequences with representative sequences


How can I find out which sequences are represented by which representative sequence?

In the following example, is there a way to retrieve the original 627 sequences represented by "r1"?

data(biofam)
biofam.lab <- c("Parent", "Left", "Married", "Left+Marr",
"Child", "Left+Child", "Left+Marr+Child", "Divorced")
biofam.seq <- seqdef(biofam, 10:25, labels=biofam.lab)
## Computing the distance matrix
costs <- seqsubm(biofam.seq, method="TRATE")
biofam.om <- seqdist(biofam.seq, method="OM", sm=costs)
## Representative set using the neighborhood density criterion
biofam.rep <- seqrep(biofam.seq, diss=biofam.om, criterion="density")
biofam.rep
summary(biofam.rep)

[>] criterion: density 
[>] 2000 sequence(s) in the original data set
[>] 4 representative sequences
[>] overall quality: 0.08113734 
[>] statistics for the representative set:


        na na(%)  nb nb(%)    SD   MD    DC    V      Q
r1     627  31.4 225 11.25  4566 7.28  4856 4.73   5.97
r2     577  28.8 123  6.15  4305 7.46  5175 5.05  16.81
r3     411  20.5 115  5.75  2658 6.47  2394 4.34 -11.04
r4     385  19.2  93  4.65  3006 7.81  3393 5.57  11.42
Total 2000 100.0 556 27.80 14535 7.27 15818 7.91   8.11

    na: number of assigned objects
    nb: number of objects in the neighborhood
    SD: sum of the na distances to the representative
    MD: mean of the na distances to the representative
    DC: sum of the na distances to the center of the complete set
    V: discrepancy of the subset
    Q: quality of the representative

A complementary question: it would be helpful to have more explanation of how "na" and "nb" should be read and interpreted. For example, do the 4 representative sequences (r1, r2, r3, r4) represent all 2000 sequences or just the 556 sequences?

I tried to find answers to my questions.


Solution

  • The sequences assigned to each representative can be retrieved from the "Distances" attribute of the object returned by seqrep. I illustrate this by following up on your example:

    ## "Distances" attribute of object returned by seqrep
    rep.dist <- attr(biofam.rep,"Distances")
    rep.dist[1:9,] # first 9 rows, to show what it looks like
    
    #          1692       221     1167 1245
    # 1167       NA        NA 0.000000   NA
    # 514        NA 10.000000       NA   NA
    # 1013       NA  9.794079       NA   NA
    # 275        NA  1.945416       NA   NA
    # 2580       NA  5.954724       NA   NA
    # 773   1.96818        NA       NA   NA
    # 1187 13.89761        NA       NA   NA
    # 47         NA  9.704456       NA   NA
    # 2091       NA        NA 3.957049   NA
    
    ## retrieving assigned representative
    rep.grp <- apply(rep.dist, 1, which.min)
    
    seqdplot(biofam.seq, group=rep.grp, border=NA)
    


    ## sequences assigned to 1st representative
    seq.rep1 <- biofam.seq[rep.grp==1,]
    nrow(seq.rep1)
    
    # 627 
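
    The same lookup extends to all representatives at once. Below is a minimal sketch on a small hand-made matrix that mimics the layout of the "Distances" attribute shown above (one row per sequence, one column per representative, NA where the sequence is not assigned). The object names toy.dist, toy.grp and members, as well as all values and row names, are invented for illustration.

    ```r
    ## Toy stand-in for attr(biofam.rep, "Distances"); all values are made up
    toy.dist <- matrix(NA_real_, nrow = 6, ncol = 2,
                       dimnames = list(c("s1", "s2", "s3", "s4", "s5", "s6"),
                                       c("r1", "r2")))
    toy.dist[c("s1", "s3", "s4"), "r1"] <- c(0, 2.5, 7)
    toy.dist[c("s2", "s5", "s6"), "r2"] <- c(1, 0, 4)

    ## which.min ignores NA, so it returns the column of the only filled entry
    toy.grp <- apply(toy.dist, 1, which.min)

    ## Row names of the sequences assigned to each representative
    members <- split(rownames(toy.dist), colnames(toy.dist)[toy.grp])
    members$r1

    ## Number of sequences per representative (the "na" column)
    table(toy.grp)
    ```

    With the real data, rep.dist takes the place of toy.dist, and biofam.seq[rep.grp == k, ] returns the sequences assigned to the k-th representative.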
    

    Regarding your complementary question:

    Each sequence is assigned to its closest representative sequence, and na[i] is the total number of sequences assigned to ri.

    The neighborhood of each representative is defined by the pradius argument of seqrep (by default, 10% of the maximum distance). nb[i] is the number of sequences, out of the na[i] assigned to ri, that lie in the neighborhood of ri.

    A sequence can be assigned to a representative without being in its neighborhood. It can also be in the neighborhood of a representative but be assigned (i.e., closer) to another representative.

    For the example, the sum of the nb's tells us that 556 sequences are covered, i.e., in the neighborhood of at least one of the representatives. The sum of the na's is always the total number of sequences.
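
    The difference between na and nb can be checked by hand on a tiny example. The sketch below uses base R only and follows the reading given above (assign each object to its closest representative, then count how many of the assigned objects fall within the radius). It is not TraMineR's internal code; the data x, the choice of representatives rep.idx, and the radius computed as 10% of the maximum pairwise distance are assumptions made for illustration.

    ```r
    ## Toy data: 6 points on a line, distances are absolute differences
    x <- c(0, 1, 2, 10, 11, 20)
    d <- as.matrix(dist(x))

    rep.idx <- c(2, 5)          # hypothetical representatives: points 2 and 5
    radius <- 0.10 * max(d)     # assumed neighborhood radius (here 2)

    ## Distances from every point to each representative
    d.rep <- d[, rep.idx, drop = FALSE]

    ## Closest representative and the distance to it
    assigned <- apply(d.rep, 1, which.min)
    d.assigned <- d.rep[cbind(seq_along(x), assigned)]

    ## na: objects assigned to each representative
    na <- tabulate(assigned, nbins = length(rep.idx))

    ## nb: assigned objects that also lie within the radius
    nb <- tabulate(assigned[d.assigned <= radius], nbins = length(rep.idx))

    rbind(na, nb)
    ```

    Here every point is assigned to a representative, so sum(na) equals the number of points, while the point at 20 is too far from its representative to be counted in nb.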