Please have a look at the self-contained example at the end of the post. I simplified the reprex and you can download the dfm (document-feature matrix) from
https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7
A couple of things which I do not understand happen
What causes 'subscript out of bounds' error in STM topic modeling with missing data?
but here I give a reproducible example.
Any help for 1) and 2) is appreciated!
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 15.0
#> ICU version: 72.1
#> Parallel computing: 4 of 4 threads used.
#> See https://quanteda.io for tutorials and examples.
library(stm)
#> stm v1.3.6.1 successfully loaded. See ?stm for help.
#> Papers, resources, and other materials at structuraltopicmodel.com
library(RCurl)
library(readtext)
#>
#> Attaching package: 'readtext'
#> The following object is masked from 'package:quanteda':
#>
#> texts
library(tidytext)
library(ggplot2)
## Download the dfm matrix from
## https://e.pcloud.link/publink/show?code=XZmHFDZeObPiNtsGWfzuBlnVw2ryzATt1X7
dfm_mat <- readRDS("dfm_mat.RDS")
## see https://rstudio-pubs-static.s3.amazonaws.com/406792_9287b832dd9e413f97243628cb2f7ddb.html
## convert the dfm to a format suitable to stm.
dfm2stm <- convert(dfm_mat, to = "stm")
model.stm <- stm(dfm2stm$documents, dfm2stm$vocab, K = 9, data = dfm2stm$meta,
init.type = "Spectral")
#> Beginning Spectral Initialization
#> Calculating the gram matrix...
#> Finding anchor words...
#> .........
#> Recovering initialization...
#> ...........................
#> Initialization complete.
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 1 (approx. per word bound = -6.780)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 2 (approx. per word bound = -6.762, relative change = 2.715e-03)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 4.260e-05)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 4 (approx. per word bound = -6.761, relative change = 1.602e-05)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 5 (approx. per word bound = -6.761, relative change = 1.024e-05)
#> Topic 1: europe, can, european, new, need
#> Topic 2: union, need, europe, today, us
#> Topic 3: europe, union, work, european, need
#> Topic 4: union, need, europe, today, us
#> Topic 5: europe, can, european, new, need
#> Topic 6: europe, union, work, european, need
#> Topic 7: union, need, europe, today, us
#> Topic 8: europe, can, european, new, need
#> Topic 9: accelerate, union, need, europe, us
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Model Converged
## I make the model tidy.
## See https://juliasilge.com/blog/sherlock-holmes-stm/
stm_tidy <- tidy(model.stm)
gpl <- stm_tidy |>
group_by(topic) |>
top_n(10, beta) |>
ungroup() |>
mutate(topic = paste0("Topic ", topic),
term = reorder_within(term, beta, topic)) |>
ggplot(aes(term, beta, fill = as.factor(topic))) +
geom_col(alpha = 0.8, show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(x = NULL, y = expression(beta),
title = "Highest word probabilities for each topic",
subtitle = "Different words are associated with different topics")
gpl

## I can fit a model by stm with a chosen number of topics to the data
### Now I try determining the optimal number of topics using the searchK function
### See https://stackoverflow.com/questions/64989642/use-dfm-in-searchk-calcuation
set.seed(02138)
K <- 5:15
model_search <- searchK(dfm2stm$documents, dfm2stm$vocab, K,
data = dfm2stm$meta)
#> Beginning Spectral Initialization
#> Calculating the gram matrix...
#> Finding anchor words...
#> .....
#> Recovering initialization...
#> ...........................
#> Initialization complete.
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 1 (approx. per word bound = -6.781)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 2 (approx. per word bound = -6.761, relative change = 2.956e-03)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Completing Iteration 3 (approx. per word bound = -6.761, relative change = 2.235e-05)
#> ...
#> Completed E-Step (0 seconds).
#> Completed M-Step.
#> Model Converged
#> Error in missing$docs[[i]]: subscript out of bounds
## This fails but I do not understand why....
sessionInfo()
#> R version 4.3.2 (2023-10-31)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 12 (bookworm)
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.11.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.11.0
#>
#> locale:
#> [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
#> [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
#> [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Europe/Brussels
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.4.4 tidytext_0.4.1 readtext_0.90 RCurl_1.98-1.13
#> [5] stm_1.3.6.1 quanteda_3.3.1 dplyr_1.1.3
#>
#> loaded via a namespace (and not attached):
#> [1] janeaustenr_1.0.0 utf8_1.2.4 generics_0.1.3 slam_0.1-50
#> [5] bitops_1.0-7 stringi_1.7.12 lattice_0.22-5 digest_0.6.33
#> [9] magrittr_2.0.3 evaluate_0.23 grid_4.3.2 fastmap_1.1.1
#> [13] plyr_1.8.9 Matrix_1.6-2 httr_1.4.7 stopwords_2.3
#> [17] fansi_1.0.5 scales_1.2.1 cli_3.6.1 rlang_1.1.2
#> [21] tokenizers_0.3.0 munsell_0.5.0 reprex_2.0.2 withr_2.5.2
#> [25] yaml_2.3.7 tools_4.3.2 reshape2_1.4.4 colorspace_2.1-0
#> [29] fastmatch_1.1-4 vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.4
#> [33] stringr_1.5.0 fs_1.6.3 pkgconfig_2.0.3 RcppParallel_5.1.7
#> [37] pillar_1.9.0 gtable_0.3.4 data.table_1.14.8 glue_1.6.2
#> [41] Rcpp_1.0.11 xfun_0.41 tibble_3.2.1 tidyselect_1.2.0
#> [45] knitr_1.45 farver_2.1.1 htmltools_0.5.7 SnowballC_0.7.1
#> [49] rmarkdown_2.25 labeling_0.4.3 compiler_4.3.2
Created on 2023-11-14 with reprex v2.0.2
I think what is happening is this: With only three documents in your dfm_mat, the searchK() is trying by default to drop half of them to use for a held-out set. This is causing many features to be zero, which means they are dropped from the vocab by default in estimating the topic models used in searchK(). stm() needs only non-zero features, but searchK() considers the vocab set to be fixed, so it's breaking some code inside the function. (I did not check this in the code however.)
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 603
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 583
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 582
These are the three sample options for dropping 1 of the 3 documents (0.50 rounded up).
You would need to contact the stm package maintainers about a potential bug report. Or, for your problem, use more documents and trim those with low frequencies.