rtraminer

Problem with big data (?) during computation of sequence distances using TraMineR


I am trying to run an optimal matching analysis using TraMineR but it seems that I am encountering an issue with the size of the dataset. I have a big dataset of European countries which contains employment spells. I have more than 57,000 sequences which are 48 units long and consist of 9 distinct states. In order to get an idea of the analysis, here is the head of sequence object employdat.sts:

[1] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[2] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[3] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  
[4] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  
[5] EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-EF-...  
[6] ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-ST-...  

In a shorter SPS format, this reads as follows:

Sequence               
[1] "(EF,48)"              
[2] "(EF,48)"              
[3] "(ST,48)"              
[4] "(ST,36)-(MS,3)-(EF,9)"
[5] "(EF,48)"              
[6] "(ST,24)-(EF,24)"

After passing this sequence object to the seqdist() function, I get the following error message:

employdat.om <- seqdist(employdat.sts, method="OM", sm="CONSTANT", indel=4)    
[>] creating 9x9 substitution-cost matrix using 2 as constant value  
[>] 57160 sequences with 9 distinct events/states  
[>] 12626 distinct sequences  
[>] min/max sequence length: 48/48  
[>] computing distances using OM metric  
Error in .Call(TMR_cstringdistance, as.integer(dseq), as.integer(dim(dseq)),  : negative length vectors are not allowed

Is this error related to the huge number of distinct, long sequences? I am using a x64-machine with 4GB RAM and I have also tried it on a machine with 8-GB RAM which reproduced the error message. Does someone know a way to tackle this error? Besides, analyses for each single country using the same syntax with an index for the country worked well and produced meaningful results.


Solution

  • I never saw this error code before, but it might well be due to your high number of sequences. There are at least two things you can try to do:

    Hope this helps.