Tags: r, ggplot2, lda, topic-modeling, tbl

Cannot access data from a tbl_df object for use in ggplot2


I'm working on a set of LDA models to compare their predictive accuracy on topic assignments. A short description follows.

For the first method, I used the per-document-per-topic assignment, extracting the topic with the highest "gamma" (15 topics in total) for each document. For the second, I used Chang and Blei's (2009) rtm method to get a topic prediction per document per word/token, and selected the most frequent topic in a given document as the predicted topic for that document. Finally, I merged both predictions, with topic as the column for the first method and consensus as the column for the second, matched by document ID and keeping the original document text. The data (named assignments) can be accessed here (330 x 6, not very large).
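For illustration, the two assignment steps and the merge could look roughly like the sketch below. The toy data frames doc_topics and token_topics, their column names, and their values are hypothetical stand-ins rather than the actual model output, and freq here is simply the token count of the winning topic.

```r
library(dplyr)

# Hypothetical per-document-per-topic probabilities (gamma) from an LDA fit
doc_topics <- data.frame(
  document = c(1, 1, 2, 2),
  topic    = c(1, 2, 1, 2),
  gamma    = c(0.8, 0.2, 0.3, 0.7)
)

# Hypothetical token-level topic predictions, one row per token
token_topics <- data.frame(
  document = c(1, 1, 1, 2, 2, 2),
  topic    = c(1, 1, 2, 2, 2, 1)
)

# Method 1: keep the topic with the highest gamma in each document
by_gamma <- doc_topics %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(document, topic)

# Method 2: count token-level topics per document, keep the most frequent one
consensus <- token_topics %>%
  dplyr::count(document, topic, name = "freq") %>%
  group_by(document) %>%
  slice_max(freq, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(document, consensus = topic, freq)

# Merge both predictions by document ID
assignments_toy <- inner_join(by_gamma, consensus, by = "document")
```

Each row of assignments_toy then holds one document with its gamma-based topic, its token-based consensus topic, and the frequency behind that consensus.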

I tried to visualize the predictive accuracy of the two methods with ggplot2, using the per-document-per-topic method as the baseline on the y-axis and the rtm method on the x-axis, with the following code:

library(foreign)
library(topicmodels)
library(tm)
library(tidyr)
library(plyr)
library(ggplot2)
library(lda)
library(igraph)
library(scales)

load("~/assignments.Rdata")

assignments %>%
  count(topic, consensus, wt_var = freq) %>%
  group_by(topic) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(consensus, topic, fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", label = percent_format()) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1),
        panel.grid = element_blank()) +
  labs(x = "RTM assignments",
       y = "Documents came from",
       fill = "% of assignments")

However, I received an error at the count(topic, consensus, ...) line: Error in count(., topic, consensus, wt_var = freq) : unused argument (consensus). If I remove consensus from that line, I get a different error instead: Error in count(., topic, wt_var = freq) : object 'topic' not found. I suspected this could be an S4 class issue (or maybe not), so I tried a few things. First, I put the group_by() variable in quotes (""), but that didn't work either; I got Error in sum(n) : invalid 'type' (closure) of argument.

Then I used tbl_df(assignments) to convert assignments to a tibble. Again, it didn't work: R still could not find the consensus and topic columns in the tibble.

I am really confused and would like to have someone take a look at my code and enlighten me on this.

Thanks.


Solution

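  • I think the problem is in the data manipulation in the first half of the pipeline, not in ggplot2. The wt_var argument belongs to plyr::count(), which expects column names as strings; that is the function R ends up calling when plyr is attached after dplyr (or dplyr is not attached at all), and it explains both the "unused argument (consensus)" and the "object 'topic' not found" errors. With dplyr attached last, I counted by topic and consensus together so the count differentiates between the pairs (rather than just returning a sum of frequencies per topic), weighted by freq, and then grouped by topic before mutate so that percent is each pair's share within its topic:

    library(dplyr)  # attach after plyr so dplyr's verbs are not masked

    # assumes assignments has a numeric freq column, as the question suggests
    assignments_2 <- assignments %>%
        dplyr::count(topic, consensus, wt = freq) %>%
        group_by(topic) %>%
        mutate(percent = n / sum(n))

    If that puts the data into the format you want, you should then be able to pipe assignments_2 into your ggplot() call and plot your graph!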
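A quick way to check the whole pipeline is to run it on a small made-up data frame first. The sketch below is only an illustration: toy and its values are invented, and it assumes the real assignments has topic, consensus, and a numeric freq column, as the question suggests. The dplyr:: prefixes make sure the dplyr verbs are used even if plyr gets attached later.

```r
library(dplyr)
library(ggplot2)
library(scales)

# Invented stand-in for `assignments`
toy <- data.frame(
  topic     = c(1, 1, 2, 2, 2),
  consensus = c(1, 2, 2, 2, 1),
  freq      = c(3, 1, 2, 2, 4)
)

# Weighted count per (topic, consensus) pair, then share within each topic
shares <- toy %>%
  dplyr::count(topic, consensus, wt = freq) %>%
  dplyr::group_by(topic) %>%
  dplyr::mutate(percent = n / sum(n)) %>%
  dplyr::ungroup()

# Same tile plot as in the question, with the axes treated as discrete
p <- ggplot(shares, aes(factor(consensus), factor(topic), fill = percent)) +
  geom_tile() +
  scale_fill_gradient2(high = "red", labels = percent_format()) +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  labs(x = "RTM assignments",
       y = "Documents came from",
       fill = "% of assignments")
```

Printing p draws the heat map. If the percent values within each topic do not sum to 1, the grouping before mutate() is the first thing to check.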