I have a question regarding text mining with the corpus package and its function text_tokens(). I want to use the function for stemming and removing stop words, and I have a large amount of data (almost 1,000,000 comments) to apply it to. However, I'm having trouble with the output that text_tokens() produces. Here is a basic example of my data and code:
library(tidyverse)
library(corpus)
library(stopwords)
text <- data.frame(comment_id = 1:2,
                   comment_content = c("Hallo mein Name ist aaron", "Vielen Lieben Dank für das Video"))
tmp <- text_tokens(text$comment_content,
                   text_filter(stemmer = "de", drop = stopwords("german")))
My problem now is that I want a data.frame as output, with comment_id in the first column and the word tokens in the second column. The output I would like looks as follows:
df <- data.frame(comment_id = c(1, 1, 1, 2, 2, 2),
                 comment_tokens = c("hallo", "nam", "aaron", "lieb", "dank", "video"))
I tried various do.call() combinations (with cbind/rbind), but they don't give me the result I need. What function am I looking for? Is it map() from the tidyverse?
Thank you in advance.
Cheers,
Aaron
Here's an option using imap_dfr() from purrr:
library(corpus)
library(dplyr)
library(purrr)
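text <- data.frame(comment_id = 1:2,
                   comment_content = c("Hallo mein Name ist aaron", "Vielen Lieben Dank für das Video"))
# Tokenize, stem and drop German stop words, then bind one tibble per comment.
# In the callback, x is the token vector of one comment and y is its position
# in the input.
tmp <- text_tokens(text$comment_content,
                   text_filter(stemmer = "de", drop = corpus::stopwords_de)) %>%
  purrr::imap_dfr(function(x, y) {
    tibble(
      comment_id = y,
      comment_tokens = x
    )
  })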
tmp
#> # A tibble: 6 × 2
#>   comment_id comment_tokens
#>        <int> <chr>
#> 1          1 hallo
#> 2          1 nam
#> 3          1 aaron
#> 4          2 lieb
#> 5          2 dank
#> 6          2 video
Or, if you prefer purrr's formula shorthand for the anonymous function:
tmp <- text_tokens(text$comment_content,
                   text_filter(stemmer = "de", drop = corpus::stopwords_de)) %>%
  purrr::imap_dfr(~ tibble(comment_id = .y, comment_tokens = .x))
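Note that imap_dfr() fills comment_id with the position of each comment in the input, which only matches the real IDs here because they happen to be 1:2. With close to a million comments it may also be worth avoiding one tibble per comment. As a rough base R sketch of that idea (tokens_to_df is a made-up helper name, not part of corpus or purrr, and the hard-coded tokens list is only a stand-in for what text_tokens() returns in the example above), you can repeat each ID once per token and flatten the list in a single step:

```r
# Hypothetical helper (not part of corpus or purrr): flatten a list of
# token vectors into a long data frame with one row per token.
tokens_to_df <- function(ids, tokens) {
  # One ID per comment is required, in the same order as the token list
  stopifnot(length(ids) == length(tokens))
  data.frame(
    # lengths() gives the number of tokens per comment, so each ID is
    # repeated exactly as often as its comment has tokens
    comment_id = rep(ids, lengths(tokens)),
    comment_tokens = unlist(tokens, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Stand-in for the list produced by text_tokens() in the example above
tokens <- list(c("hallo", "nam", "aaron"), c("lieb", "dank", "video"))
df <- tokens_to_df(1:2, tokens)
df
```

With the real data you would pass text$comment_id and the result of text_tokens() instead of the stand-ins. A comment that has no tokens left after filtering simply contributes no rows.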