Tags: r, entropy, traminer

I can't work out the entropy formula for TraMineR


I used the ggseqdplot() function from the ggseqplot package to plot the entropy computed by the TraMineR library. To familiarise myself with the formula, I tried, unsuccessfully, to compute it by hand on simple example data.

Alphabet size: 3

Probability of each event: 1/26, 22/26, 3/26

The call ggseqdplot(mine_mvad.seq, with.entropy = TRUE) gives me 0.265087328, which is the normalised entropy plotted by default.

The formula is given here, pp. 77-78. I tried:

    -1/log(3) / ((1/26*(log(1/26)) + 22/26*log(22/26) + 3/26*log(3/26)))

and

    -(1/26*log(1/26) + 22/26*log(22/26) + 3/26*log(3/26))

but I don't get the right result. I must have missed something in the formula.


Solution

  • Shannon's entropy of a distribution $(p_1,...,p_n)$ is $-\sum_i p_i \log(p_i)$. In TraMineR, the entropy is computed with the natural logarithm.
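As a quick sketch of this formula (plain base R, no TraMineR needed; `shannon` is a hypothetical helper name), applied to the distribution from the question:

```r
# Shannon entropy with the natural logarithm, as used by TraMineR:
# H(p) = -sum_i p_i * log(p_i)
shannon <- function(p) -sum(p * log(p))

p <- c(1/26, 22/26, 3/26)
shannon(p)            # non-normalized entropy, about 0.516
shannon(p) / log(3)   # normalized by the log of the alphabet size, about 0.470
```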

    TraMineR computes both the longitudinal entropy (i.e., the entropy of the distribution within each sequence) with function seqient and the cross-sectional entropy of the state distribution at each position with function seqstatd. Function ggseqdplot (from package ggseqplot) plots the cross-sectional distributions and the cross-sectional entropies. The plotted values are taken from the outcome of seqstatd.

    In your example, we do not know to what the distribution c(1/26, 22/26, 3/26) corresponds. If it is the distribution inside a sequence of length 26 you should compare with the outcome of seqient. If it is the distribution, say at the first position, across 26 sequences, then you should compare with the outcome of seqstatd (the entropy is the third element of the returned list or $Entropy). Both return normalized values. The normalization is done by dividing by the maximal possible entropy for the given alphabet, which is the log of the alphabet size. In both cases, you can get non-normalized values by specifying norm=FALSE.

    To illustrate, I build a set of 26 sequences of length 26 using alphabet {"a","b","c"}.

    library(TraMineR)
    sdat <- c("a/1-b/22-c/3","b/2-c/24","c/5-b/10-a/11")
    sda <- sdat[c(1,rep(2,22),rep(3,3))]
    sdsts <- seqformat(sda, from="SPS", to="STS", 
                  SPS.in = list(xfix = "", sdsep = "/"), stsep="-")
    myseq <- seqdef(sdsts, cnames=1:26)
    seqIplot(myseq)
    

Index plot of myseq

    Now, we compute the cross-sectional entropies and display the first five values:

    myseq.distr <- seqstatd(myseq)
    myseq.distr$Entropy[1:5]
    ##         1         2         3         4         5 
    ## 0.4695343 0.3255263 0.1483905 0.1483905 0.1483905
    

    The first value (i.e., the cross-sectional entropy at the first position) is obtained as:

    ent <- -((1/26*(log(1/26)) + 22/26*log(22/26) + 3/26*log(3/26)))
    ent/log(3)
    ## [1] 0.4695343
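The same hand computation works for the other positions. For instance, at position 2 the cross-sectional distribution implied by the example sequences is 23 sequences in state "b" and 3 in "c" (a base-R sketch):

```r
# Cross-sectional distribution at position 2: 23 b's, 3 c's out of 26 sequences
p2 <- c(23/26, 3/26)
ent2 <- -sum(p2 * log(p2))
ent2 / log(3)   # normalized entropy, about 0.326
```

Note that the normalization still divides by log(3), the log of the full alphabet size, even though only two states occur at that position.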
    

    The cross-sectional distributions and entropies are plotted with

    seqdHplot(myseq)
    

State distribution plot of myseq with the entropy line

    We get the longitudinal entropy of the first sequence with

    seqient(myseq[1,])
    ##       Entropy
    ## [1] 0.4695343
    

and the non-normalized value with

    seqient(myseq[1,], norm=FALSE)
    ##       Entropy
    ## [1] 0.5158361
    

    The value of the normalized entropy is the same as the entropy computed transversally at the first position because, by construction of our example data, the distribution within the first sequence is the same as the cross-sectional distribution at the first position.
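By contrast, the other sequences have different within-sequence distributions, so their longitudinal entropies differ. For example, for the second sequence, "b/2-c/24" (2 positions in "b", 24 in "c"), the same by-hand computation gives (a base-R sketch):

```r
# Within-sequence distribution of "b/2-c/24": 2 b's and 24 c's over 26 positions
p2nd <- c(2/26, 24/26)
ent <- -sum(p2nd * log(p2nd))   # non-normalized longitudinal entropy
ent / log(3)                    # normalized by log(3), the maximal entropy for 3 states
```

This normalized value is lower than that of the first sequence, reflecting the more concentrated distribution of states within this sequence.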