I am doing a basic, exploratory topic analysis of responses to two different questions and am trying to visualize the results by question. I am working in RStudio and using an RMarkdown file. The example dataset here is much smaller than what I'm working with, but it should be enough to show the problem. Below is all the code to build the sparse matrix and the gamma table; it all runs fine.
library(tidyverse)
library(tidytext)
library(stm)
#here is a representative example of my data
Term <- c("y57","t44","y57","t44","y57","t44","t44","y57")
Question <- c(1,1,1,1,2,2,2,2)
Id <- c(1,2,1,2,3,4,4,3)
Text <- c("words that are all here in this dataframe", "other sorts of things to meet the needs of the data", "stuff and the like about such and such and this that and the other", "et cetera and so on and so forth and on ad nauseum", "bla bla shockablooey the hooey is newey to youey", "wooly sheep are superior to all other sheep", "come together in this hour of great trial", "right words are different from wrong ones")
df_data <- data.frame(term = Term, question = Question, id = Id, text = Text)
#unnesting the words into a new dataframe
df_tidy <-
  df_data %>%
  unnest_tokens(word, text)
#setting up the necessary pieces for the topic analysis plot
df_sparse <-
  df_tidy %>%
  count(id, word) %>%
  filter(n > 1) %>%
  cast_sparse(id, word, n)
set.seed(216)
topic_model_5 <- stm(df_sparse, K = 5)
df_gamma_5 <-
  tidy(topic_model_5,
       matrix = "gamma",
       document_names = rownames(df_sparse))
My issue is in the final preparation for plotting, in which I want to sort the topics by a variable (question) to get two plots. I am trying to use the left_join function between "df_data" and "df_gamma_5". At least that's what I think this segment is trying to do...
#object type troubleshooting that made sense to me
df_data$question <- as.factor(df_data$question)
df_data$id <- as.character(df_data$id)
#what I can't get unstuck, which I think has to be from the left_join somehow
df_gamma_5 %>%
left_join(
df_data %>%
select(question, document = id) %>%
mutate(question, fct_inorder(question)),
relationship = "many-to-many"
) %>%
mutate(topic = factor(topic)) +
ggplot(aes(gamma, topic, fill = topic)) +
geom_boxplot(alpha = 0.7, show.legend = TRUE) +
ggtitle("topics by question") +
facet_wrap(vars(term)) %>%
print()
The error message that I get from this line reads:
Joining with `by = join_by(document)`
Warning: Detected an unexpected many-to-many relationship between `x` and `y`.
Error in `fortify()`:
! `data` must be a <data.frame>, or an object coercible by
`fortify()`, or a valid <data.frame>-like object coercible by
`as.data.frame()`, not a <uneval> object.
ℹ Did you accidentally pass `aes()` to the `data` argument?
Run `rlang::last_trace()` to see where the error occurred.
This is driving me nuts because I followed an online example using Taylor Swift lyric data (https://www.youtube.com/watch?v=rXDv0ZuX0Fc&t=216s) and the code I wrote for that example worked just fine. The plot I want is essentially the same as the video, except instead of plots by album (n=11) I want plots by question (n=2). In a more complex analysis I'd like a 2x2 with each plot sample selected by term and question, but that's for another day. I suspect that the issue has something to do with the fact that, unlike Taylor Swift's catalog where lyrics for each song are a distinct observation, I have two different text observations for each id variable. I don't know if that's the problem though, and even if I did, I don't know how to solve it.
I'm ten days into learning R (and coding generally), so any help at all will be miles above me grasping at straws alone. Thank you!
This might bring you closer. There are several typos, and your desired result is not entirely clear.
Changes in the last code block:
(1) Make either document of df_gamma_5 an integer or document from df_data a character. You might want to do that in a previous step.
(2) What is mutate(question, fct_inorder(question)) for? If it is indeed needed, use mutate(question = fct_inorder(question)) instead.
(3) mutate(topic = factor(topic)) can be done inside aes().
(4) There is no term variable in the joined data. I changed vars(term) to ~question inside facet_wrap().
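One more thing worth checking in the original block: the line before ggplot() ends in + rather than a pipe. Because %>% binds tighter than +, R evaluates ggplot(aes(...)) on its own, so aes() lands in the data argument, which matches the hint in the error message. The sketch below reproduces this on a small made-up data frame (toy is hypothetical, not your data) and then shows the piped version; treat it as an illustration of the pattern rather than a drop-in fix.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for the joined gamma table
toy <- tibble(topic = c(1, 1, 2, 2), gamma = c(0.2, 0.4, 0.6, 0.8))

# Broken pattern: `+` after mutate() means ggplot() never receives the data,
# so aes() is taken as the `data` argument and ggplot() throws an error
broken <- tryCatch(
  toy %>% mutate(topic = factor(topic)) + ggplot(aes(gamma, topic)),
  error = function(e) conditionMessage(e)
)

# Working pattern: pipe the data INTO ggplot(), then add layers with `+`
fixed <- toy %>%
  mutate(topic = factor(topic)) %>%
  ggplot(aes(gamma, topic, fill = topic)) +
  geom_boxplot(alpha = 0.7)
```

Here broken holds the caught error message, while fixed is a regular ggplot object that can be printed.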
df_gamma_5 |>
  mutate(document=as.character(document)) |>
  # or as.integer()/strtoi() for df_data
  left_join(df_data |> select(question, document=id),
            relationship='many-to-many') |> # could be skipped
  ggplot(aes(x=gamma, y=topic, fill=factor(topic))) +
  geom_boxplot(alpha=.7, show.legend=TRUE) +
  facet_wrap(~question) +
  ggtitle('topics by question')
Plot
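To make points (1) and (2) concrete, here is a minimal sketch on made-up tables. gamma_toy and data_toy are hypothetical stand-ins for df_gamma_5 and df_data, with the key deliberately stored as character in one and numeric in the other. It shows that the join refuses mismatched key types, that an unnamed expression in mutate() creates an oddly named extra column, and that converting the key and naming the column explicitly gives the expected result.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-ins for df_gamma_5 and df_data
gamma_toy <- tibble(document = c("1", "2"), topic = c(1L, 1L), gamma = c(0.3, 0.7))
data_toy  <- tibble(id = c(2, 1), question = c("b", "a"))

# (1) Character vs. numeric keys: left_join() stops with a type error
mismatch <- tryCatch(
  left_join(gamma_toy, select(data_toy, question, document = id), by = "document"),
  error = function(e) "join failed"
)

# (2) An unnamed expression in mutate() adds a NEW column named after the call
unnamed_cols <- names(mutate(data_toy, fct_inorder(question)))

# Convert the key first and name the result explicitly, then join
joined <- gamma_toy %>%
  left_join(
    data_toy %>%
      mutate(id = as.character(id), question = fct_inorder(question)) %>%
      select(question, document = id),
    by = "document"
  )
```

After this, joined has one question value per document, and the factor levels follow the order in which the questions first appear in data_toy.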
Recommendation. I would suppress either the y-axis (y) or the fill aesthetic (fill). Two indicators for one variable are somewhat misleading and visually distracting.
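One possible way to follow this recommendation, sketched on made-up data (toy is hypothetical), is to keep the fill colours but drop the legend, so that the y-axis is the only place where topic is labelled. Dropping fill entirely would work just as well; which one you prefer is a matter of taste.

```r
library(ggplot2)

# Hypothetical toy data: three gamma values for each of two topics
toy <- data.frame(
  topic = factor(rep(1:2, each = 3)),
  gamma = c(0.1, 0.2, 0.3, 0.6, 0.7, 0.8)
)

# Keep fill for colour, but hide the redundant legend
p <- ggplot(toy, aes(x = gamma, y = topic, fill = topic)) +
  geom_boxplot(alpha = 0.7, show.legend = FALSE) +
  ggtitle("topics by question")
```

Printing p draws one horizontal box per topic with no legend beside the panel.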
Note. I use single quotes (') instead of double quotes ("). I use the native base R pipe operator |> instead of the somewhat outdated {magrittr} pipe operator %>%. Finally, I tend to avoid spaces on either side of =. This is nothing more than personal preference.