
Complete subsequences in TraMineR


The seqefsub function gives interesting results when searching for subsequences. I work on sequences composed of geographical locations. Is there a way to know whether a listed subsequence is a complete subsequence?

I provide an example.

library(TraMineR)
id = c(rep(1,5), rep(2,3), rep(3,3), rep(4,3), rep(5,2),  rep(6,2), rep(7,3), rep(8,3))
begin = c(1963, 1969, 1969, 1974, 2004, 1971, 1976, 1984, 1996, 1998, 2011, 1997, 2008, 2011, 1967, 1971, 1972, 1985, 1971, 1980, 1986, 1974, 2000, 2002)
end = c(1969, 1969, 1974, 2004, 2012, 1976, 1984, 2012, 1998, 2011, 2012, 2008, 2011, 2012, 1971, 2012, 1985 ,2012 ,1980 ,1986 ,2012 ,2000 ,2002 ,2012)
status = c(1, 5, 6, 5, 1, 1, 5, 1, 1, 3, 8, 1, 3, 1, 1, 5, 1, 8, 1, 5, 1, 1, 8, 1)
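df <- data.frame(id, begin, end, status)
# Convert the spell data to one state per year (STS format)
df.seq1 <- seqformat(df, from = "SPELL", to = "STS", process = FALSE)
# Build the state sequence object
df.seq2 <- seqdef(df.seq1, informat = "STS")
# Turn the state sequences into event sequences, one event per transition
df.seq3 <- seqecreate(df.seq2, tevent = "transition")

# Search for frequent subsequences among the event sequences
fsubseq <- seqefsub(df.seq3, min.support = 1)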

There are 8 sequences, in which status corresponds to different geographical locations. The time unit is one year. The object fsubseq, returned by seqefsub, lists the subsequences found together with their support:

             Subsequence Support Count
1                    (*)   0.875     7
2              (*)-(*>1)   0.875     7
3                  (*>1)   0.875     7
4        (*)-(*>1)-(1>5)   0.375     3
5              (*)-(1>5)   0.375     3
6            (*>1)-(1>5)   0.375     3
7                  (1>5)   0.375     3
8                  (5>1)   0.375     3
9        (*)-(*>1)-(1>3)   0.250     2
10 (*)-(*>1)-(1>5)-(5>1)   0.250     2
11       (*)-(*>1)-(1>8)   0.250     2
12       (*)-(*>1)-(5>1)   0.250     2
13             (*)-(1>3)   0.250     2
14       (*)-(1>5)-(5>1)   0.250     2
15             (*)-(1>8)   0.250     2
16             (*)-(5>1)   0.250     2
17           (*>1)-(1>3)   0.250     2
18     (*>1)-(1>5)-(5>1)   0.250     2
19           (*>1)-(1>8)   0.250     2
20           (*>1)-(5>1)   0.250     2
21                 (1>3)   0.250     2
22           (1>5)-(5>1)   0.250     2
23                 (1>8)   0.250     2

What I call a "complete subsequence" is a subsequence that encompasses all the successive states of one individual. In this example there are seven: 1/6/5/1, 1/5/1, 1/3/8, 1/3/1, 1/5, 1/8, 1/8/1. The complete subsequence 1/5/1 corresponds to line 10. The complete subsequences are difficult to spot in the list. So my question is whether there is a way to filter the complete subsequences out of the list.


Solution

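  • From what I understand, what you call "complete subsequences" are the sequences of distinct successive states.

    The distinct successive states are obtained from the state sequences (df.seq2 here, not the event sequences df.seq3) with seqdss, and the frequencies of the resulting sequences with seqtab. In the output, each element is written as state/duration, so 1/1-5/1-1/1 is the sequence 1, 5, 1. So, we get the frequencies of what you call "complete subsequences" with: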

    seqtab(seqdss(df.seq2))
    #                 Freq Percent
    # 1/1-5/1-1/1        2      25
    # 1/1-3/1-1/1        1      12
    # 1/1-3/1-8/1        1      12
    # 1/1-5/1            1      12
    # 1/1-6/1-5/1-1/1    1      12
    # 1/1-8/1            1      12
    # 1/1-8/1-1/1        1      12
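
To make the idea of distinct successive states concrete, here is a minimal base R sketch of the same two steps: collapsing runs of identical consecutive states, then tabulating the results. It does not use TraMineR. The helper name `collapse_runs` and the three yearly state vectors are made up for illustration and are not the data from the question.

```r
# Hypothetical helper (not a TraMineR function): collapse consecutive
# repeats of the same state and join the remaining states with `sep`.
collapse_runs <- function(states, sep = "/") {
  paste(rle(as.character(states))$values, collapse = sep)
}

# Made-up yearly state vectors for three individuals (illustrative only).
yearly <- list(
  a = c(1, 1, 5, 5, 5, 1),
  b = c(1, 1, 1, 8, 8),
  c = c(1, 5, 5, 1, 1, 1)
)

# One "complete subsequence" per individual.
complete <- vapply(yearly, collapse_runs, character(1))
print(complete)

# Frequency of each complete subsequence across individuals.
freq <- table(complete)
print(freq)
```

Individuals a and c both reduce to 1/5/1 and b reduces to 1/8, which is the kind of per-sequence summary that seqdss and seqtab produce for a real state sequence object.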