
Quanteda: How do I create a corpus and plot dispersion of words?


I have some data which looks like this:

  date      signs  horoscope                                                      newspaper   
  <chr>     <chr>  <chr>                                                          <chr>       
1 06-06-20~ ARIES  Your week falls neatly into distinct phases. The completion o~ Indian Expr~
2 06-06-20~ TAURUS You're coming to the end of an emotional period, when you've ~ Indian Expr~
3 06-06-20~ GEMINI Passions are still running high, and the degree of emotional ~ Times of In~
4 06-06-20~ CANCER First things first - don't rush it! There is still a great de~ Indian Expr~
5 06-06-20~ LEO    The greatest pressures are coming from all directions at once~ Indian Expr~

I would like to create a corpus from this data in which all horoscopes are grouped by newspaper and sign, with each group forming one document.

For example, all ARIES in the newspaper Times of India should be one document, but arranged chronologically in order of date (their index should be ordered by date).

Since I don't know how to group this text by newspaper and sign, I tried creating a separate corpus for each newspaper. This is what I tried:


# Create a dataframe of only Times of India text
h_toi <- horoscopes %>%
  filter(newspaper == "Times of India") %>%
  select(-c("newspaper"))
  
# Create a corpus out of this
horo_corp_toi <- corpus(h_toi, text_field = "horoscope")

# Create docids
docids <- paste(h_toi$signs)

# Use this as docnames
docnames(horo_corp_toi) <- docids

head(docnames(horo_corp_toi), 5)
# [1] "ARIES.1"  "TAURUS.1" "GEMINI.1" "CANCER.1" "LEO.1" 

But as you can see, the docnames for the corpus are "ARIES.1", "TAURUS.1", and so on. This is a problem because when I plot it with quanteda's textplot_xray(), thousands of documents are plotted instead of just 12 documents, one for each sign:

# Plot lexical dispersion of love in all signs 
kwic(tokens(horo_corp_toi), pattern = "love") %>%
    textplot_xray()


Instead, I would like to be able to do something like this: [image of the desired plot, not shown]

I am not able to get this visualization because I don't know how to manipulate and create the corpus initially. How can I do this, and what am I doing wrong?

Sample DPUT is here


Solution

  • Since the question asks how to group by both sign and newspaper, let me answer that one first.

    library("quanteda")
    ## Package version: 3.1.0
    ## Unicode version: 13.0
    ## ICU version: 69.1
    ## Parallel computing: 12 of 12 threads used.
    ## See https://quanteda.io for tutorials and examples.
    library("quanteda.textplots")
    
    ## horoscopes <- [per linked dput in OP]
    
    corp <- corpus(horoscopes, text_field = "horoscope")
    toks <- tokens(corp)
    
    # grouped by sign and newspaper
    tokens_group(toks, groups = interaction(signs, newspaper)) %>%
      kwic(pattern = "love") %>%
      textplot_xray()
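
As for what went wrong in the question: assigning docnames does not merge documents. As the output in the question shows, repeated names are made unique with suffixes such as "ARIES.1", so the corpus still holds one document per row. It is tokens_group() that concatenates documents. A minimal sketch on made-up data (the `toy` data frame and its texts are hypothetical, not the OP's horoscopes) compares the document count before and after grouping:

```r
library("quanteda")

# Hypothetical toy data with the same column names as the OP's data frame
toy <- data.frame(
  signs = c("ARIES", "TAURUS", "ARIES", "ARIES"),
  newspaper = c("Times of India", "Times of India",
                "Indian Express", "Times of India"),
  horoscope = c("love comes first", "no rush today",
                "work before love", "love again"),
  stringsAsFactors = FALSE
)

toks_toy <- tokens(corpus(toy, text_field = "horoscope"))
ndoc(toks_toy)            # one document per row of the data frame

# interaction() builds one factor level per sign/newspaper combination;
# tokens_group() concatenates all documents that share a level
grouped_toy <- tokens_group(toks_toy, groups = interaction(signs, newspaper))
ndoc(grouped_toy)         # one document per observed combination
docnames(grouped_toy)     # names combine the sign and the newspaper
```

With the default fill = FALSE, combinations that never occur in the data should not produce empty documents.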
    

    To get one plot per newspaper, as requested in the question, you can loop through the newspapers and group only by signs. Note that fewer than 12 signs appear here because the sample data provided does not cover the whole zodiac.

    # separate kwic for each newspaper
    for (i in unique(toks$newspaper)) {
      thiskwic <- toks %>%
        tokens_subset(newspaper == i) %>%
        tokens_group(groups = signs) %>%
        kwic(pattern = "love")
      # ggplot objects are not auto-printed inside a for loop, so print explicitly
      print(
        textplot_xray(thiskwic) +
          ggplot2::ggtitle(paste("Lexical dispersion plot -", toupper(i)))
      )
    }
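
The question also asks for each grouped document to run chronologically. One way to approach this, as far as I can tell, is to sort the data frame by date before building the corpus, since tokens_group() appears to concatenate documents in their existing order. The sketch below uses hypothetical toy data and assumes the dates are day-month-year strings; the dates in the question are truncated, so adjust the `format` argument to match the real data.

```r
library("quanteda")

# Hypothetical toy data, deliberately out of chronological order
toy_dates <- data.frame(
  date = c("08-06-2020", "06-06-2020", "07-06-2020"),
  signs = c("ARIES", "ARIES", "ARIES"),
  horoscope = c("third day", "first day", "second day"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: dates are formatted as day-month-year
toy_dates$date_parsed <- as.Date(toy_dates$date, format = "%d-%m-%Y")

# Sort rows chronologically before creating the corpus
toy_dates <- toy_dates[order(toy_dates$date_parsed), ]
rownames(toy_dates) <- NULL

toks_sorted <- tokens(corpus(toy_dates, text_field = "horoscope"))
grouped_sorted <- tokens_group(toks_sorted, groups = signs)

# Tokens of the grouped document, in the order they were concatenated
as.list(grouped_sorted)$ARIES
```

Because textplot_xray() plots token positions within each document, sorting first should make the x-axis of each row follow the publication order.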