Tags: r, nlp, corpus, quanteda

R: How to count the total number of tokens in a corpus?


I have created a quanteda corpus called readtext_corpus containing 190 texts. I would like to count the total number of tokens (words) in the corpus. I tried the function ntoken(), but it returns the number of tokens per text, not the total across all 190 texts.


Solution

  • ntoken() returns a vector with one count per text, so you can simply wrap it in sum() to get the total. Here is an example:

    test <- c("testing string number 1","testing string number 2")
    
    sum(quanteda::ntoken(test))
    

    Result:

    > quanteda::ntoken(test)
    text1 text2 
        4     4 
    > sum(quanteda::ntoken(test))
    [1] 8
    > 
    

    If you are using pipes, which is pretty common with quanteda (this assumes %>% is available, e.g. via library(magrittr)):

    > quanteda::ntoken(test) %>% sum()
    [1] 8
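
The same idea applied to a corpus object rather than a character vector can be sketched as below. This is a minimal sketch, not a definitive recipe: `example_corpus` and its two texts are hypothetical stand-ins for the asker's `readtext_corpus`. It tokenizes explicitly with `tokens()` first, which makes it visible whether punctuation marks are counted as tokens; I believe newer quanteda versions also prefer tokenizing before calling `ntoken()`, but treat that as an assumption.

```r
library(quanteda)

# Hypothetical stand-in for the asker's readtext_corpus
texts <- c(doc1 = "This is the first text.",
           doc2 = "Here is a second, slightly longer text.")
example_corpus <- corpus(texts)

# Tokenize explicitly: once keeping punctuation, once dropping it
toks_all   <- tokens(example_corpus)
toks_words <- tokens(example_corpus, remove_punct = TRUE)

# ntoken() gives one count per text; sum() collapses them to a corpus total
total_tokens <- sum(ntoken(toks_all))    # punctuation marks count as tokens
total_words  <- sum(ntoken(toks_words))  # words only

print(total_tokens)
print(total_words)
```

For the real corpus, replacing `example_corpus` with `readtext_corpus` should give the total over all 190 texts. Whether to pass `remove_punct = TRUE` depends on whether you want a token count or a word count.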