r nlp stemming

Changing the output of the text_tokens function in R


I have a question regarding text mining with the corpus package and its text_tokens() function. I want to use the function for stemming and removing stop words, and I need to apply it to a large amount of data (almost 1,000,000 comments). However, I have problems with the output that text_tokens() produces. Here is a basic example of my data and code:

library(tidyverse)
library(corpus)
library(stopwords)

text <- data.frame(comment_id = 1:2,
                   comment_content = c("Hallo mein Name ist aaron","Vielen Lieben Dank für das Video"))


tmp <- text_tokens(text$comment_content, 
                   text_filter(stemmer = "de",drop = stopwords("german")))

My problem is that I want a data.frame as output, with comment_id in the first column and the word token in the second column. The output I would like looks as follows:

df <- data.frame(comment_id = c(1,1,1,2,2,2),
                 comment_tokens = c("hallo","nam","aaron","lieb","dank","video"))


I tried different do.call() variants (cbind/rbind), but they don't give me the result I need. What is the function I'm looking for? Is it map() from the tidyverse?

Thank you in advance.

Cheers,

Aaron


Solution

  • Here's an option using imap_dfr from purrr:

    library(corpus)
    library(dplyr)
    library(purrr)
    
    text <- data.frame(comment_id = 1:2,
                       comment_content = c("Hallo mein Name ist aaron","Vielen Lieben Dank für das Video"))
    
    
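    # Tokenize, stem, and drop German stop words; the result is a list
    # with one character vector of tokens per comment.
    tmp <- text_tokens(text$comment_content,
                       text_filter(stemmer = "de", drop = corpus::stopwords_de)) %>%
      # x is the token vector, y is its position in the list
      purrr::imap_dfr(function(x, y) {
        tibble(
          comment_id = y,
          comment_tokens = x
        )
      })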
    
    tmp
    #> # A tibble: 6 × 2
    #>   comment_id comment_tokens
    #>        <int> <chr>         
    #> 1          1 hallo         
    #> 2          1 nam           
    #> 3          1 aaron         
    #> 4          2 lieb          
    #> 5          2 dank          
    #> 6          2 video
    

    Or, if you prefer purrr's formula shorthand for the anonymous function:

    tmp <- text_tokens(text$comment_content, text_filter(stemmer = "de",drop = corpus::stopwords_de)) %>% 
      purrr::imap_dfr(~ tibble(comment_id = .y, comment_tokens = .x))
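
One caveat with the approach above: imap_dfr() supplies the position of each element in the token list, which matches comment_id only because the ids here happen to be 1, 2, and so on. Below is a minimal base R sketch that carries over the real ids instead. It assumes text_tokens() returns one character vector per comment, in input order; the ids 101 and 205 are made up for illustration.

```r
library(corpus)

# Hypothetical ids that are not simply 1..n, to show they are preserved
text <- data.frame(
  comment_id = c(101L, 205L),
  comment_content = c("Hallo mein Name ist aaron",
                      "Vielen Lieben Dank für das Video"),
  stringsAsFactors = FALSE
)

# One character vector of stemmed tokens per comment
tokens <- text_tokens(text$comment_content,
                      text_filter(stemmer = "de", drop = corpus::stopwords_de))

# Repeat each id once per token, then flatten the token list
df <- data.frame(
  comment_id = rep(text$comment_id, lengths(tokens)),
  comment_tokens = unlist(tokens),
  stringsAsFactors = FALSE
)
```

Because rep(), lengths(), and unlist() are vectorized, this avoids building one small tibble per comment, which may help with around a million comments.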