[SOLVED] Fitting a VLMC to very long sequences

I am trying to fit a VLMC (variable-length Markov chain) to a dataset in which the longest sequence is 296 states long. This is my code:

# Load libraries
library(PST)
library(RCurl)
library(TraMineR)

# Load and transform data
x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/241ef39125ecb55a85b43d7f4cd3d58f617b2ecf/challenge_level.csv")
data <- read.csv(text = x)

data.seq <- seqdef(data[,2:ncol(data)], missing = NA, right = NA, nr = "*")
S1 <- pstree(data.seq, ymin = 0.01, lik = TRUE, with.missing = TRUE, nmin = 2)

This, however, yields the following error:

Error in res[i, , drop = FALSE] : subscript out of bounds

How can I fit the model to data with sequences this long? Are there any good justifications for limiting the length within the model?

Solution

The problem comes from your data. If you do not set L in pstree(), you are asking for a model of maximum order. The fitting process produces an error at L=8: you set nmin=2, but at this order only one context occurs at least twice (n >= 2):

> cprob(data.seq, L=8, nmin=2)
 [>] 21 sequences, min/max length: 19/296
 [>] computing prob., L=8, 2043 distinct context(s)
 [>] removing 1894 context(s) where n<2
 [>] total time: 0.156 secs
                        EX  FA I1  I2 I3 N1 N2 N3 NR QU TR [n]
I2-I3-FA-I3-EX-I3-EX-I2  0 0.5  0 0.5  0  0  0  0  0  0  0   2

Fitting the model with L=8 works fine:

S1 <- pstree(data.seq, ymin = 0.01, lik = TRUE, nmin = 2, L = 8)

 [>] 21 sequence(s) - min/max length: 19/296
 [>] max. depth L=8, nmin=2, ymin=0.01
     [L]  [nodes]
       0        1
       1       11
       2       99
       3      368
       4      340
       5      126
       6       34
       7        4
       8        1
 [>] computing sequence(s) likelihood ... (0.804 secs)
 [>] total time: 2.968 secs

Again, you don't need the 'missing', 'right' or 'nr' options in seqdef(), nor 'with.missing' in pstree().

Best, Alexis