r, text, text-mining

Randomly reshuffle word order in a string


I have a large data frame of texts, and I want to randomly reshuffle the order of the words in each string.

To give a concrete example, my data looks something like this:

library(stringi)
library(tidyverse)

set.seed(123)

n <- 100
df <- data.frame(id = 1:n,
                 text = stri_rand_lipsum(n))

# Some preprocessing
df <- df %>%
  mutate(text = tolower(text),
         text = gsub("[[:punct:]]", "", text))

I want to reshuffle word order at random in each string found in the variable text.

I found several ways to reshuffle the letters of a string, but none to reshuffle the order of its words. Does anybody know how to do it? An important factor is that my data consists of millions of rows, so the approach needs to be suitable for large data sets as well.

Thanks!


Solution

  • We can use strsplit to split each string on the space character " ", use sample on the resulting words to put them in a random order, and paste them back into one string. For efficiency, I would assign the result directly to a new column instead of going through mutate. However, I have not benchmarked this, so I'm not sure how efficient the code is.

    df$random_text <- sapply(strsplit(df$text, " "), \(x) paste(sample(x), collapse = " "))
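The same idea can be wrapped in a small function and checked on a toy input. This is a minimal sketch, not a benchmarked solution: the helper names shuffle_words and sorted_words are made up for illustration, and the two changes from the one-liner above (fixed = TRUE in strsplit and vapply instead of sapply) are assumptions about what might help on millions of rows, not measured results.

```r
# Hypothetical helper (not part of the original answer):
# shuffle the words of every string in a character vector.
shuffle_words <- function(x) {
  # fixed = TRUE treats " " as a literal string instead of a regular expression
  words <- strsplit(x, " ", fixed = TRUE)
  # vapply() enforces exactly one character string per input element
  vapply(words, function(w) paste(sample(w), collapse = " "), character(1))
}

set.seed(123)
texts <- c("the quick brown fox", "lorem ipsum dolor sit amet")
shuffled <- shuffle_words(texts)

# Sanity check: every shuffled string holds the same words as its input
sorted_words <- function(x) lapply(strsplit(x, " ", fixed = TRUE), sort)
same_words <- identical(sorted_words(texts), sorted_words(shuffled))
```

On the data frame from the question this would be used as df$random_text <- shuffle_words(df$text). Because sample draws from the random number generator, calling set.seed beforehand makes the shuffle reproducible.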