r · term-document-matrix · quanteda

Term document entropy calculation


Using a dtm (document-term matrix) it is possible to obtain the term frequencies.

Is there an easy way to calculate the entropy weighting? It gives higher weight to terms that occur with low frequency in only some of the documents.

entropy = 1 + (Σj pij log2(pij)) / log2(n)

pij = tfij / Σj tfij

where tfij is the number of times word i occurs in document j, and n is the number of documents.
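For concreteness, here is a minimal base-R sketch of the formula above on a made-up 3-term × 2-document matrix (my own toy example, not from any package):

```r
# toy term frequencies: 3 terms (rows) x 2 documents (columns)
tf <- matrix(c(1, 1,   # spread evenly across both documents
               2, 0,   # concentrated in a single document
               3, 1),
             nrow = 3, byrow = TRUE)
n <- ncol(tf)                       # number of documents
p <- tf / rowSums(tf)               # p_ij = tf_ij / sum_j tf_ij
logp <- log2(p)
logp[p == 0] <- 0                   # convention: 0 * log(0) = 0
entropy <- 1 + rowSums(p * logp) / log2(n)
round(entropy, 4)
# [1] 0.0000 1.0000 0.1887
```

The evenly spread term gets weight 0 and the fully concentrated term gets weight 1, which is exactly the "higher weight for terms concentrated in fewer documents" behaviour described above.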


Solution

  • Here's a function for doing that, although it could be improved by maintaining sparsity in the p_ij and log computations (this is how dfm_tfidf() is implemented, for instance). Note that I changed the formula slightly: according to https://en.wikipedia.org/wiki/Latent_semantic_analysis#Mathematics_of_LSI (among other sources), there should be a minus sign in front of the sum.

    library("quanteda")
    textstat_entropy <- function(x, base = exp(1), k = 1) {
        # this works because of R's recycling and column-major order, but requires t()
        p_ij <- t(t(x) / colSums(x))
    
        log_p_ij <- log(p_ij, base = base)
        k - colSums(p_ij * log_p_ij / log(ndoc(x), base = base), na.rm = TRUE)
    }
    
    textstat_entropy(data_dfm_lbgexample, base = 2)
    #        A        B        C        D        E        F        G        H        I        J        K 
    # 1.000000 1.000000 1.000000 1.000000 1.000000 1.045226 1.045825 1.117210 1.173655 1.277210 1.378934 
    #        L        M        N        O        P        Q        R        S        T        U        V 
    # 1.420161 1.428939 1.419813 1.423840 1.436201 1.440159 1.429964 1.417279 1.410566 1.401663 1.366412 
    #        W        X        Y        Z       ZA       ZB       ZC       ZD       ZE       ZF       ZG 
    # 1.302785 1.279927 1.277210 1.287621 1.280435 1.211205 1.143650 1.092113 1.045825 1.045226 1.000000 
    #        ZH       ZI       ZJ       ZK 
    # 1.000000 1.000000 1.000000 1.000000 
    

    This matches the entropy weighting function gw_entropy() from the lsa package when the base is e:

    library("lsa")
    all.equal(
        gw_entropy(as.matrix(t(data_dfm_lbgexample))),
        textstat_entropy(data_dfm_lbgexample, base = exp(1))
    )
    # [1] TRUE
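
    The sparsity improvement mentioned at the top of this answer could be sketched with the Matrix package: since p_ij * log(p_ij) is zero (by convention) wherever tf_ij is zero, only the nonzero cells of the matrix need to be visited. The entropy_sparse() helper below is a hypothetical illustration, not a quanteda function; it takes any documents-by-terms sparse matrix (a dfm is stored as a sparse Matrix internally, so the coercion inside the function handles it):

    ```r
    library("Matrix")

    # Hypothetical sparse variant: only the nonzero cells are ever touched,
    # so a large sparse matrix is never expanded into a dense one
    entropy_sparse <- function(x, base = 2, k = 1) {
        x <- as(x, "CsparseMatrix")           # documents in rows, terms in columns
        j <- rep(seq_len(ncol(x)), diff(x@p)) # column (term) index of each nonzero
        p <- x@x / colSums(x)[j]              # p_ij, computed for nonzero cells only
        s <- vapply(split(p * log(p, base = base), j), sum, numeric(1))
        out <- rep(k, ncol(x))                # all-zero columns contribute nothing
        out[as.integer(names(s))] <- k - s / log(nrow(x), base = base)
        setNames(out, colnames(x))
    }

    # small check on a 2-document x 3-term matrix
    m <- Matrix(c(1, 0, 3,
                  1, 2, 1), nrow = 2, byrow = TRUE, sparse = TRUE)
    round(entropy_sparse(m, base = 2), 4)
    # [1] 2.0000 1.0000 1.8113
    ```

    The arithmetic mirrors the k - colSums(...) convention of textstat_entropy() above, so the two should agree; the saving is that p and log(p) are vectors with one entry per nonzero cell rather than full dense matrices.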