Using the tidytext package, I want to transform my tibble into a one-token-per-document-per-row format. I converted the text column of my tibble from factor to character, but I still get the same error.
text_df <- tibble(line = 1:3069, text = text)
My tibble looks like this, with a column as character:
# A tibble: 3,069 x 2
line text$text
<int> <chr>
However, when I try to apply unnest_tokens:
text_df %>%
unnest_tokens(word, text$text)
I always get the same error:
Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.
What is the issue in my code?
Your text column is probably itself a data frame with a single text column:
library(tibble)
library(dplyr, warn.conflicts = FALSE)
library(tidytext)
text <- data.frame(text= c("hello world", "this is me"), stringsAsFactors = FALSE)
text_df <- tibble(line = 1:2, text = text)
text_df
#> # A tibble: 2 x 2
#> line text$text
#> <int> <chr>
#> 1 1 hello world
#> 2 2 this is me
text_df %>%
unnest_tokens(word, text$text)
Error in check_input(x) :
Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.
Replace the nested data frame with its inner text column, then proceed:
text_df <- mutate(text_df, text = text$text)
# or if your text is stored as factor
# text_df <- mutate(text_df, text = as.character(text$text))
text_df
#> # A tibble: 2 x 2
#> line text
#> <int> <chr>
#> 1 1 hello world
#> 2 2 this is me
text_df %>%
unnest_tokens(word, text)
#> # A tibble: 5 x 2
#> line word
#> <int> <chr>
#> 1 1 hello
#> 2 1 world
#> 3 2 this
#> 4 2 is
#> 5 2 me
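Alternatively, the nested column can be avoided when the tibble is first built, by passing the character vector rather than the whole data frame. This is a minimal sketch using the same toy data as above; it assumes your original text object is a one-column data frame, which may not match your real data exactly.

```r
library(tibble)

# Toy data (assumption): a one-column data frame, not a bare character vector
text <- data.frame(text = c("hello world", "this is me"), stringsAsFactors = FALSE)

# Pull the inner column out before building the tibble;
# as.character() also covers the case where it is stored as a factor
text_df <- tibble(line = seq_len(nrow(text)), text = as.character(text$text))

# The text column is now a plain character vector, which unnest_tokens() accepts
is.character(text_df$text)
```

With the tibble built this way, unnest_tokens(word, text) should work without the extra mutate() step.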
It's a good idea to use str(), or sometimes summary(), names() or unclass(), to diagnose this sort of issue:
text <- data.frame(text= c("hello world", "this is me"), stringsAsFactors = FALSE)
text_df <- tibble(line = 1:2, text = text)
str(text_df)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 2 obs. of 2 variables:
#> $ line: int 1 2
#> $ text:'data.frame': 2 obs. of 1 variable:
#> ..$ text: chr "hello world" "this is me"
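For a wider tibble, reading the str() output column by column can get tedious. One possible shortcut, sketched here on the same toy data, is to test each column with is.data.frame(); the variable name nested is only illustrative.

```r
library(tibble)

# Same toy setup as above: text is a data frame, so it becomes a nested column
text <- data.frame(text = c("hello world", "this is me"), stringsAsFactors = FALSE)
text_df <- tibble(line = 1:2, text = text)

# One logical per column: TRUE where the column is itself a data frame
nested <- vapply(text_df, is.data.frame, logical(1))

# Names of the offending columns
names(nested)[nested]
#> [1] "text"
```

Any column reported here needs to be unpacked (for example with mutate(), as shown earlier) before it is passed to unnest_tokens().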