Please have a look at the self-contained example at the end of the post. I simplified the reprex and you can download the dfm (document-feature matrix) from
https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7
A couple of things happen that I do not understand. The same error comes up in "What causes 'subscript out of bounds' error in STM topic modeling with missing data?", but here I give a reproducible example.
Any help for 1) and 2) is appreciated!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
library(stm)
#> stm v1.3.6.1 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
library(RCurl)
library(readtext)
#>
#> Attaching package: 'readtext'
#> The following object is masked from 'package:quanteda':
#>
#> texts
library(tidytext)
library(ggplot2)
## Download the dfm matrix from
## https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7
dfm_mat <- readRDS("dfm_mat.RDS")
## see https://rstudio-pubs-static.s3.amazonaws.com/406792_9287b832dd9e413f97243628cb2f7ddb.html
## convert the dfm to a format suitable to stm.
dfm2stm <- convert(dfm_mat, to = "stm")
model.stm <- stm(dfm2stm$documents, dfm2stm$vocab, K = 9, data = dfm2stm$meta,
                 init.type = "Spectral")
#> Beginning Spectral Initialization
#> Calculating the gram matrix...
#> Finding anchor words...
#> .........
#> Recovering initialization...
#> ...........................
#> Initialization complete.
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 1 (approx. per word bound = -6.780)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 2 (approx. per word bound = -6.762, relative change = 2.715e-03)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 4.260e-05)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 4 (approx. per word bound = -6.761, relative change = 1.602e-05)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 5 (approx. per word bound = -6.761, relative change = 1.024e-05)
#> Topic 1: europe, can, european, new, need
#> Topic 2: union, need, europe, today, us
#> Topic 3: europe, union, work, european, need
#> Topic 4: union, need, europe, today, us
#> Topic 5: europe, can, european, new, need
#> Topic 6: europe, union, work, european, need
#> Topic 7: union, need, europe, today, us
#> Topic 8: europe, can, european, new, need
#> Topic 9: accelerate, union, need, europe, us
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Model Converged
## I make the model tidy.
## See https://juliasilge.com/blog/sherlock-holmes-stm/
stm_tidy <- tidy(model.stm)
gpl <- stm_tidy |>
  group_by(topic) |>
  top_n(10, beta) |>
  ungroup() |>
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) |>
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics")
gpl
## I can fit a model by stm with a chosen number of topics to the data
### Now I try determining the optimal number of topics using the searchK function
### See https://stackoverflow.com/questions/64989642/use-dfm-in-searchk-calcuation
set.seed(02138)
K <- 5:15
model_search <- searchK(dfm2stm$documents, dfm2stm$vocab, K,
                        data = dfm2stm$meta)
#> Beginning Spectral Initialization
#> Calculating the gram matrix...
#> Finding anchor words...
#> .....
#> Recovering initialization...
#> ...........................
#> Initialization complete.
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 1 (approx. per word bound = -6.781)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 2 (approx. per word bound = -6.761, relative change = 2.956e-03)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 2.235e-05)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Model Converged
#> Error in missing$docs[[i]]: subscript out of bounds
## This fails but I do not understand why....
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.4.4 tidytext_0.4.1 readtext_0.90 RCurl_1.98-1.13
#> [5] stm_1.3.6.1 quanteda_3.3.1 dplyr_1.1.3
#>
#> loaded via a namespace (and not attached):
#> [1] janeaustenr_1.0.0 utf8_1.2.4 generics_0.1.3 slam_0.1-50
#> [5] bitops_1.0-7 stringi_1.7.12 lattice_0.22-5 digest_0.6.33
#> [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.2 fastmap_1.1.1
#> [13] plyr_1.8.9 Matrix_1.6-2 httr_1.4.7 stopwords_2.3
#> [17] fansi_1.0.5 scales_1.2.1 cli_3.6.1 rlang_1.1.2
#> [21] tokenizers_0.3.0 munsell_0.5.0 reprex_2.0.2 withr_2.5.2
#> [25] yaml_2.3.7 tools_4.3.2 reshape2_1.4.4 colorspace_2.1-0
#> [29] fastmatch_1.1-4 vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.4
#> [33] stringr_1.5.0 fs_1.6.3 pkgconfig_2.0.3 RcppParallel_5.1.7
#> [37] pillar_1.9.0 gtable_0.3.4 data.table_1.14.8 glue_1.6.2
#> [41] Rcpp_1.0.11 xfun_0.41 tibble_3.2.1 tidyselect_1.2.0
#> [45] knitr_1.45 farver_2.1.1 htmltools_0.5.7 SnowballC_0.7.1
#> [49] rmarkdown_2.25 labeling_0.4.3 compiler_4.3.2
Created on 2023-11-14 with reprex v2.0.2
I think what is happening is this: with only three documents in your dfm_mat, searchK() is trying by default to drop half of them to use for a held-out set. This is causing many features to be zero, which means they are dropped from the vocab by default in estimating the topic models used in searchK(). stm() needs only non-zero features, but searchK() considers the vocab set to be fixed, so it's breaking some code inside the function. (I did not check this in the code, however.)
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 603
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 583
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 582
These are the three sample options for dropping 1 of the 3 documents (0.50 rounded up).
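The same effect can be reproduced without the original data. The following is a minimal base-R sketch, assuming a made-up three-document count matrix (`toy`, with hypothetical document and feature names, not the asker's dfm_mat). Dropping any one row leaves the features that occur only in that row with a column sum of zero, which is what the `colSums(...) == 0` check above counts.

```r
# Toy document-feature counts (hypothetical data, not the asker's dfm_mat).
toy <- matrix(
  c(2, 1, 0, 0, 1,
    1, 0, 3, 0, 0,
    1, 0, 0, 2, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"),
                  c("europe", "union", "work", "today", "need"))
)

# For each document, count the features that become all-zero
# once that document is removed from the matrix.
zero_after_drop <- sapply(seq_len(nrow(toy)), function(i) {
  sum(colSums(toy[-i, , drop = FALSE]) == 0)
})
names(zero_after_drop) <- rownames(toy)
zero_after_drop
```

With so few documents, every possible drop removes at least one feature entirely, so a vocabulary fixed before the split no longer matches the documents that remain.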
You would need to contact the stm package maintainers about a potential bug report. Or, for your problem, use more documents and trim the features with low frequencies.
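As a rough illustration of the trimming suggestion, here is a sketch using quanteda's `dfm_trim()` on a small made-up corpus. The texts and the threshold `min_docfreq = 2` are assumptions chosen for the example, not values tuned for the data in the question. It keeps only features that occur in at least two documents, so removing a single document cannot reduce any remaining feature to a zero count.

```r
library(quanteda)

# Hypothetical mini-corpus standing in for a larger collection.
txt <- c(d1 = "europe needs a stronger union",
         d2 = "the union needs new work",
         d3 = "europe can work today",
         d4 = "today the union needs europe")

dfm_toy <- dfm(tokens(txt))

# Keep only features that appear in at least 2 documents.
dfm_trimmed <- dfm_trim(dfm_toy, min_docfreq = 2)

# Check how many documents and features remain before converting for stm.
ndoc(dfm_trimmed)
nfeat(dfm_trimmed)
```

Whether this is enough for searchK() to run still depends on having enough documents for a held-out split, so it is a starting point rather than a guaranteed fix.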