Tags: r, text-mining, quanteda

Tokenization of Compound Words not Working in Quanteda in Japanese


Creating bigrams from unigrams doesn't seem to work for Japanese in {quanteda}. I can hack the text with gsub(), but I hope there's a better way. I can't post a complete reprex because SO won't allow posts containing Japanese or Chinese text, so please don't ping me asking for a reprex--it's beyond my control.

library(quanteda) 
text <- readLines("https://laits.utexas.edu/~mr56267/example.txt")
test_corpus <- corpus(text)

test_tokens <- tokens(test_corpus, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_compound(pattern = phrase("休","浜"), concatenator = "")
test_tokens

test_kwic <- kwic(test_tokens, pattern = "休浜", window = 5)
test_kwic

Solution

  • It should be pattern = phrase("休 浜") or pattern = list(c("休","浜")).

    phrase() simply splits the string on whitespace and creates a list of character vectors, so each multi-token pattern must be given as one space-separated string (or directly as a list of character vectors).
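
A minimal sketch of the corrected call. Since the contents of the linked example.txt aren't reproduced in the question, the tokens below are invented stand-ins built with as.tokens(); the fix itself is only the pattern = phrase("休 浜") part:

```r
library(quanteda)

# Hypothetical pre-tokenized document containing the two unigrams
# "休" and "浜" that we want merged into one compound token.
toks <- as.tokens(list(doc1 = c("休", "浜", "は", "静か", "だ")))

# phrase("休 浜") splits on the space into c("休", "浜"),
# i.e. a two-token sequence to match; concatenator = "" joins
# the matched tokens with no separator.
toks <- tokens_compound(toks, pattern = phrase("休 浜"), concatenator = "")

as.character(toks[[1]])
# the two unigrams are now the single token "休浜"

# kwic() then finds the compound as one token
kwic(toks, pattern = "休浜", window = 5)
```

pattern = list(c("休", "浜")) is equivalent, since phrase() just builds that list for you.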