r, regression, sequence, traminer

seqformat creates sequences with NA values in R


I'm using seqformat from the TraMineR package in R to analyze sequences of events.

My real dataset is huge, so I'm practicing on this small example to understand how the function works:

   
    Location_Id   Event         Start_day   End_day   temp   year
    1             Sever snow    6           12        4      2014
    1             Medium snow   15          21        6      2016
    2             Sever snow    7           8         3      2013

I used this command:

    sts.data <- seqformat(df, from="SPELL", to="STS", id="Event",
            begin="Start_day", end="End_day", status="temp", limit=3)

When I run the command, I get this message:

    [!!] max of 'end' column > limit! Sequences truncated at limit= 3
    [>] converting SPELL data into 2 STS sequences (internal format)

The output contains only NA values:

                     a1    a2    a3
    Sever snow       NA    NA    NA
    Medium snow      NA    NA    NA

I'm not sure whether the problem is that the end parameter must be greater than the begin parameter for all events, or whether something else is wrong.

Any thoughts on why the sequences are not created successfully?


Solution

  • The limit argument sets the maximum length of the sequences. With limit=3 only positions (days) 1 to 3 are kept. In your data the first valid information is at day 6, so all three positions are NAs.

    The latest valid information is on day 21. To avoid truncation of the sequences, set limit=21 or larger. Note also that the function may produce unexpected results when ids are not contiguous. Since you are using Event as id, I sort the rows of df by Event to make ids contiguous.

    df <- read.table(header=TRUE, text = "
    Location_Id     Event       Start_day    End_day   temp    year
         1         Sever.snow       6              12     4     2014          
         1         Medium.snow      15             21     6     2016          
         2         Sever.snow       7              8      3     2013
                     ")
    ## Event used as id: sort to make identical ids contiguous
    df <- df[order(df[,"Event"]),]
    sts.data <- seqformat(df, from="SPELL", to="STS", id="Event",
            begin="Start_day", end="End_day", status="temp",limit=21)
    sts.data
    #             a1 a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a13 a14 a15 a16 a17 a18 a19 a20 a21
    # Medium.snow NA NA NA NA NA NA NA NA NA  NA  NA  NA  NA  NA   6   6   6   6   6   6   6
    # Sever.snow  NA NA NA NA NA  4  3  3  4   4   4   4  NA  NA  NA  NA  NA  NA  NA  NA  NA
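To make the effect of limit concrete, here is a minimal base-R sketch. It derives the smallest non-truncating limit from the data and mimics the spell-to-position expansion. The helper spell_to_sts is hypothetical and written only for illustration; it is not TraMineR's implementation, and it assumes that later rows overwrite earlier ones where spells of the same id overlap, which is consistent with the output shown above.

```r
df <- read.table(header = TRUE, text = "
Location_Id Event       Start_day End_day temp year
1           Sever.snow  6         12      4    2014
1           Medium.snow 15        21      6    2016
2           Sever.snow  7         8       3    2013
")

# Smallest limit that avoids truncation: the latest end position in the data
seq_limit <- max(df$End_day)

# Sort by the id column so that rows sharing an id are contiguous
df_sorted <- df[order(df$Event), ]

# Hypothetical helper (illustration only, not TraMineR code):
# one row per id, one column per position, NA where no spell covers it
spell_to_sts <- function(d, limit) {
  ids <- as.character(unique(d$Event))
  out <- matrix(NA_integer_, nrow = length(ids), ncol = limit,
                dimnames = list(ids, paste0("a", seq_len(limit))))
  for (i in seq_len(nrow(d))) {
    b <- d$Start_day[i]
    e <- min(d$End_day[i], limit)  # positions beyond limit are cut off
    if (b <= e) out[as.character(d$Event[i]), b:e] <- d$temp[i]
  }
  out
}

short <- spell_to_sts(df_sorted, 3)          # every spell starts after day 3
full  <- spell_to_sts(df_sorted, seq_limit)  # nothing is truncated
print(short)
print(full)
```

The value in seq_limit can then be passed as the limit argument of seqformat instead of hard-coding 21.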