Creating bigrams from unigrams doesn't seem to work in Japanese in {quanteda}. I can hack the text with gsub(), but I hope there's a better way. I can't post a complete reprex because SO won't allow posts with Japanese or Chinese text, so please don't ping/ask me about reprex--it's beyond my control.
library(quanteda)
text <- readLines("https://laits.utexas.edu/~mr56267/example.txt")
test_corpus <- corpus(text)
test_tokens <- tokens(test_corpus, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_compound(pattern = phrase("休","浜"), concatenator = "")
test_tokens
test_kwic <- kwic(test_tokens, pattern = "休浜", window = 5)
test_kwic
It should be pattern = phrase("休 浜") or pattern = list(c("休","浜")). phrase() simply splits the string on whitespace and creates a list, so the whole multi-token sequence has to be passed as one space-separated string.
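Here is a minimal sketch of the difference that runs without the example file. The toy token vector (including the extra token "法") and the object names are made up for illustration, and as.tokens() is used so the result does not depend on how the tokenizer segments Japanese text. As far as I can tell, the second argument in phrase("休","浜") is taken as the separator rather than as a second token, which would explain why nothing gets compounded.

```r
library(quanteda)

# Hypothetical pre-tokenized document standing in for the real text
toks <- as.tokens(list(doc1 = c("休", "浜", "法")))

# phrase() splits one string on whitespace into a multi-token pattern
pat_phrase <- phrase("休 浜")

# The same pattern written out by hand as a list of character vectors
pat_list <- list(c("休", "浜"))

# Either pattern should join the two adjacent tokens into one
comp_phrase <- tokens_compound(toks, pattern = pat_phrase, concatenator = "")
comp_list   <- tokens_compound(toks, pattern = pat_list, concatenator = "")

print(comp_phrase)

# The compounded token can then be searched for as a single unit
print(kwic(comp_phrase, pattern = "休浜", window = 5))
```

With the real data, only the pattern argument in the original pipeline needs to change; the rest of the tokens() and kwic() calls can stay as they are.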