r tidytext zipf

Tidy text: Compute Zipf's law from the following term-document matrix


I tried the code from http://tidytextmining.com/tfidf.html. My result can be seen in this image.

My question is: How can I rewrite the code to produce the negative relationship between the term frequency and the rank?

The following is the term-document matrix. Any comments are highly appreciated.

# Zipf's law

freq_rk <- DTM_words %>%
  group_by(document) %>%
  mutate(rank = row_number(),
         term_frequency = count / total)

freq_rk %>%
  ggplot(aes(rank, term_frequency, color = document)) +
  geom_line(size = 1.2, alpha = 0.8)


DTM_words
 # A tibble: 4,530 x 5
     document       term count     n total
        <chr>      <chr> <dbl> <int> <dbl>
 1        1      activ     1     1   109
 2        1 agencydebt     1     1   109
 3        1     assess     1     1   109
 4        1      avail     1     1   109
 5        1     balanc     2     1   109
 # ... with 4,520 more rows

Solution

  • To use row_number() to get rank, you first need to make sure that your data frame is sorted by the number of times each word is used in a document (the count column in your data). Let's look at an example. It sounds like you are starting with a document-term matrix that you are tidying? (I'm going to use some example data that is similar to yours, a DTM from quanteda.)

    library(tidyverse)
    library(tidytext)
    
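    data("data_corpus_inaugural", package = "quanteda")
    # Tokenize first: recent quanteda versions no longer build a dfm straight from a corpus
    inaug_dfm <- quanteda::dfm(quanteda::tokens(data_corpus_inaugural), verbose = FALSE)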
    
    ap_td <- tidy(inaug_dfm)
    ap_td
    #> # A tibble: 44,725 x 3
    #>           document   term count
    #>              <chr>  <chr> <dbl>
    #>  1 1789-Washington fellow     3
    #>  2 1793-Washington fellow     1
    #>  3      1797-Adams fellow     3
    #>  4  1801-Jefferson fellow     7
    #>  5  1805-Jefferson fellow     8
    #>  6    1809-Madison fellow     1
    #>  7    1813-Madison fellow     1
    #>  8     1817-Monroe fellow     6
    #>  9     1821-Monroe fellow    10
    #> 10      1825-Adams fellow     3
    #> # ... with 44,715 more rows
    

    Notice that here, you have a tidy data frame with one word per document per row, but it is not ordered by count, the number of times that each word was used in each document. If we used row_number() here to try to assign rank, the result would not be meaningful, because the rows are in no particular frequency order.
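
    To make that concrete, here is a minimal sketch on a made-up four-word document (the toy tibble and its counts are hypothetical, not taken from your data). It compares the "rank" that row_number() gives before and after sorting.

    ```r
    library(dplyr)

    # Hypothetical toy counts for a single document (not from the question's data).
    toy <- tibble(
      document = "doc1",
      term     = c("balanc", "the", "assess", "of"),
      count    = c(2, 10, 1, 5)
    )

    # Without sorting, row_number() just numbers the rows in their current order.
    unsorted_rank <- toy %>%
      group_by(document) %>%
      mutate(rank = row_number()) %>%
      ungroup()

    # After sorting by descending count, row 1 really is the most frequent term.
    sorted_rank <- toy %>%
      arrange(desc(count)) %>%
      group_by(document) %>%
      mutate(rank = row_number()) %>%
      ungroup()

    unsorted_rank$term[unsorted_rank$rank == 1]  # "balanc", only because it is the first row
    sorted_rank$term[sorted_rank$rank == 1]      # "the", the most frequent term
    ```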

    Instead, we can arrange this by descending count.

    ap_td <- tidy(inaug_dfm) %>%
      group_by(document) %>%
      arrange(desc(count)) 
    
    ap_td
    #> # A tibble: 44,725 x 3
    #> # Groups:   document [58]
    #>         document  term count
    #>            <chr> <chr> <dbl>
    #>  1 1841-Harrison   the   829
    #>  2 1841-Harrison    of   604
    #>  3     1909-Taft   the   486
    #>  4 1841-Harrison     ,   407
    #>  5     1845-Polk   the   397
    #>  6   1821-Monroe   the   360
    #>  7 1889-Harrison   the   360
    #>  8 1897-McKinley   the   345
    #>  9 1841-Harrison    to   318
    #> 10 1881-Garfield   the   317
    #> # ... with 44,715 more rows
    

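    Now we can use row_number() to get rank, because the data frame is actually ranked/arranged/ordered/sorted/however you want to say it. Since the data frame is grouped by document, row_number() restarts at 1 within each document, which is why the rank column in the output below does not simply count up row by row.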

    ap_td <- tidy(inaug_dfm) %>%
      group_by(document) %>%
      arrange(desc(count)) %>%
      mutate(rank = row_number(),
             total = sum(count),
             `term frequency` = count / total)
    
    ap_td
    #> # A tibble: 44,725 x 6
    #> # Groups:   document [58]
    #>         document  term count  rank total `term frequency`
    #>            <chr> <chr> <dbl> <int> <dbl>            <dbl>
    #>  1 1841-Harrison   the   829     1  9178       0.09032469
    #>  2 1841-Harrison    of   604     2  9178       0.06580954
    #>  3     1909-Taft   the   486     1  5844       0.08316222
    #>  4 1841-Harrison     ,   407     3  9178       0.04434517
    #>  5     1845-Polk   the   397     1  5211       0.07618499
    #>  6   1821-Monroe   the   360     1  4898       0.07349939
    #>  7 1889-Harrison   the   360     1  4744       0.07588533
    #>  8 1897-McKinley   the   345     1  4383       0.07871321
    #>  9 1841-Harrison    to   318     4  9178       0.03464807
    #> 10 1881-Garfield   the   317     1  3240       0.09783951
    #> # ... with 44,715 more rows
    
    ap_td %>%
      ggplot(aes(rank, `term frequency`, color = document)) +
      geom_line(alpha = 0.8, show.legend = FALSE) + 
      scale_x_log10() +
      scale_y_log10()
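
    Applied to your own column names, the same steps might look like the sketch below. The DTM_words tibble here is a small made-up stand-in with the same column names as yours, and total is recomputed only so that the toy data stays self-contained. I also use .by_group = TRUE so the sort happens within each document, which is a stylistic choice rather than a requirement.

    ```r
    library(dplyr)
    library(ggplot2)

    # Hypothetical stand-in for the question's DTM_words (same column names, made-up rows).
    DTM_words <- tibble(
      document = c("1", "1", "1", "2", "2", "2"),
      term     = c("activ", "balanc", "assess", "avail", "activ", "balanc"),
      count    = c(1, 4, 2, 3, 1, 6)
    )

    freq_rk <- DTM_words %>%
      group_by(document) %>%
      arrange(desc(count), .by_group = TRUE) %>%  # sort within each document first
      mutate(rank = row_number(),                 # now row 1 is the most frequent term
             total = sum(count),
             term_frequency = count / total) %>%
      ungroup()

    # Log-log axes, as in the inaugural-address plot above.
    p <- ggplot(freq_rk, aes(rank, term_frequency, color = document)) +
      geom_line(alpha = 0.8) +
      scale_x_log10() +
      scale_y_log10()

    p
    ```

    With the rows sorted before ranking, term_frequency can only stay flat or fall as rank increases, which is the negative relationship you were looking for.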