I'm a communication scientist and a total newbie with TraMineR and sequence analysis. I have a (relatively large) dataset that includes the app usage of study participants. My aim is to identify sequences of app categories used in succession.
The original dataset looks like this:
Participant ID | Session ID | App category | Start time (Unix time) | End time (Unix time) |
---|---|---|---|---|
0001 | 0001_1 | Communication | 1614868224 | 1614868236 |
0001 | 0001_1 | Social Media | 1614868236 | 1614868265 |
0002 | 0002_1 | Games | 1614868265 | 1614868320 |
... | ... | ... | ... | ... |
Accordingly, I have two levels of analysis: (1) the participants and (2) the sessions.
In the first step, my aim is to identify sequences of app categories used in succession. A session is a coherent usage sequence between switching the smartphone screen on and off. The dataset comprises just under 400 participants, with each participant having around 2,000-5,000 sessions (~1.4 million sessions in the whole dataset).
library(TraMineR)

# distinct app categories observed in the data
labels = seqstatl(sample$app_category)
states = 1:length(labels)

# build the state sequence object from the spell data (id, begin, end, state)
session_seq = seqdef(data = sample,
                     var = c("session", "begin", "end", "app_category"),
                     informat = "SPELL",
                     states = states,
                     labels = labels,
                     process = FALSE)

print(session_seq[1:15, ], format = "SPS")

# substitution costs from the transition rates between states observed in the sequence data
cost = seqsubm(session_seq, method = "TRATE", with.missing = TRUE)

# compute the OM distances using the cost matrix and the default indel cost of 1
session_seq_OM = seqdist(session_seq, method = "OM", sm = cost, with.missing = TRUE)
# --> function crashed due to lack of RAM
I have already made first attempts with subsamples and have run into the following question about the computing resources required: I need a relatively large amount of computing power even for a subsample. Is it possible to make the calculation more resource-efficient? Is it an option to split the dataset, compute the sequence distances in batches, and merge them later, or will this distort my results?
I have already created the sequence object in STS format (530 sequences and 1,222,844 variables, i.e., time positions) for a subset of the dataset (the mobile sessions of one participant, n ≈ 4,000; the structure of the data is as described above) and then wanted to calculate the sequence distances ("OM"). However, I was unable to do so due to the high computing resources required: the calculation was cancelled for lack of RAM even on a machine with 1 TB of RAM.
I am also happy to receive further tips for reading. The TraMineR User Guide has already helped me a lot.
Your question encompasses multiple aspects and I'll try to answer some of them here.
In sequence analysis (SA), three counts matter: the number n of sequences, the length of the sequences, and the size |A| of the alphabet (i.e., the number of different tokens that can appear in the sequences). The computation time of pairwise dissimilarities such as OM depends on these three counts. In your case, from what I understand, n = 530 sequences, the sequence length is 1,222,844 positions, and |A| is the number of distinct app categories.
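A quick back-of-envelope calculation with these figures (taken from your post) shows why the sequence length is the critical count here: the alignment behind OM grows quadratically with the length of the compared sequences.

```r
## Back-of-envelope with the figures from the question: each pairwise OM
## comparison fills a dynamic-programming table that grows quadratically
## with the sequence length, and there are n*(n-1)/2 comparisons.
L <- 1222844        # sequence length (time positions)
n <- 530            # number of sequences
L^2                 # ~1.5e12 cells for a single comparison
n * (n - 1) / 2     # ~140,000 pairwise comparisons
```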
So, you have a serious issue with the sequence lengths. Here, you must pay attention to two aspects: time unit and sequence alignment.
We do not know what the time unit corresponds to. If the time unit is seconds, then a sequence length of 1,222,844 corresponds to about 340 hours. When comparing 340-hour-long sequences, I doubt that it really makes sense to consider differences of a few seconds in the time spent in an application. Therefore, I would suggest using a coarser granularity, which would reduce the sequence length.
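If you have already built the state sequence object at a one-second granularity, one possibility is seqgranularity() from the TraMineRextras package. A minimal sketch, assuming seconds as the time unit and a target granularity of one minute:

```r
## Sketch: collapse each block of 60 consecutive positions (seconds) into
## one position (a minute), keeping the most frequent state in the block.
library(TraMineRextras)
session_seq_min <- seqgranularity(session_seq, tspan = 60, method = "mostfreq")
```

Alternatively, you can rescale the begin and end columns of the spell data (e.g., integer-divide the Unix timestamps by 60) before calling seqdef().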
Also, we do not know how you obtained these 1,222,844 variables. If sequences were aligned on calendar (observation) time and were not all observed over the same period, the 340 hours may well correspond to the length of the period over which observations were made, even if none of the individual sequences was observed over that whole period. If this is the case, I suggest aligning the sequences on process time, i.e., using the time elapsed since the start of observation.
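A minimal sketch of such an alignment, assuming your spell data have the columns session, begin and end (Unix seconds) as in your example: express each spell relative to the start of its own session (here additionally coarsened to minutes) before calling seqdef().

```r
## Sketch: align each session on process time by measuring begin/end
## relative to the session's own start; positions are in minutes and
## start at 1. Column names follow the example in the question.
sample$t0      <- ave(sample$begin, sample$session, FUN = min)
sample$begin_p <- (sample$begin - sample$t0) %/% 60 + 1
sample$end_p   <- pmax((sample$end - sample$t0) %/% 60 + 1, sample$begin_p)

session_seq <- seqdef(data = sample,
                      var = c("session", "begin_p", "end_p", "app_category"),
                      informat = "SPELL",
                      process = FALSE)
```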
Regarding what you call "batching", i.e., running the computation of the dissimilarities in successive batches, I cannot see a simple way of doing that when the limiting factor is the sequence length. In particular, with dissimilarities such as OM, which allow for time warp when comparing sequences, splitting the time frame and computing dissimilarities on successive intervals would lose the time warp around the split points.
Note that some alternative dissimilarity measures such as OMspell can be much faster to compute than OM for long sequences when the number of spells is significantly lower than the sequence length.
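For illustration, a sketch of how such a spell-based distance could be requested with seqdist(), using seqcost() to obtain matching substitution and indel costs (the expcost value shown is just the default):

```r
## Sketch: OMspell aligns the sequences of spells (state + duration)
## rather than every single time position. seqcost() returns both a
## substitution-cost matrix and a compatible indel cost.
costs <- seqcost(session_seq, method = "TRATE", with.missing = TRUE)
session_seq_OMspell <- seqdist(session_seq, method = "OMspell",
                               sm = costs$sm, indel = costs$indel,
                               expcost = 0.5, with.missing = TRUE)
```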
In conclusion, I encourage you to carefully examine the question of the sequence length: granularity and sequence alignment.