Tags: r, twitter, vectorization, stringr

unlist keeping the same number of elements (vectorized)


I am trying to extract all hashtags from some tweets and obtain, for each tweet, a single string containing all of its hashtags. I am using str_extract_all from stringr, which returns a list of character vectors. My problem is that I cannot unlist it while keeping the same number of elements as the list (that is, the number of tweets). Example:

This is a vector of tweets of length 3:

a <- "rt @ugh_toulouse: #mondial2014 : le top 5 des mannequins brésiliens http://www.ladepeche.fr/article/2014/06/01/1892121-mondial-2014-le-top-5-des-mannequins-bresiliens.html #brésil "
b <- "rt @30millionsdamis: beauté de la nature : 1 #baleine sauve un naufragé ; elles pourtant tellement menacées par l'homme... http://example.com/xqrqhd #instinctanimal "
c <- "rt @onlyshe31: elle siège toujours!!!!!!!  marseille. nouveau procès pour la députée - 01/06/2014 - ladépêche.fr http://www.ladepeche.fr/article/2014/06/01/1892035-marseille-nouveau-proces-pour-la-deputee.html #toulouse "
all <- c(a, b, c)

Now I use str_extract_all to extract the hashtags:

library(stringr)
ex <- str_extract_all(all, "#(.+?)[ |\n]")

If I now use unlist I get a vector of length 5:

undesired <- unlist(ex)
> undesired
[1] "#mondial2014 "    "#brésil "        
[3] "#baleine "        "#instinctanimal "
[5] "#toulouse " 

What I want is something like the following. However, this is very inefficient because it is not vectorized, and it takes forever (really!) on a smallish data frame of tweets:

desired <- c()
for (i in 1:length(ex)){
  desired[i] <- paste(ex[[i]], collapse = " ")
}

> desired
[1] "#mondial2014  #brésil "    
[2] "#baleine  #instinctanimal "
[3] "#toulouse " 

Help!


Solution

  • You could use stringi, which may be faster for big datasets:

    library(stringi)
    sapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'), paste, collapse=' ')
    #[1] "#mondial2014  #brésil "     "#baleine  #instinctanimal "
    #[3] "#toulouse " 
    

    A for loop can also be fast if you preallocate the output vector to its final length:

    desired <- character(length(ex))
    for (i in seq_along(ex)){
      desired[i] <- paste(ex[[i]], collapse = " ")
    }
    

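    Or you could use vapply, which would be faster than sapply and a bit safer because it checks that each result is a single string (contributed by @Richie Cotton). Note that toString joins the hashtags with ", " rather than " ":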

    vapply(ex, toString, character(1))
    #[1] "#mondial2014 , #brésil "     "#baleine , #instinctanimal "
    #[3] "#toulouse "                 
    

    Or, as suggested by @Ananda Mahto:

    vapply(stri_extract_all_regex(all, '#(.+?)[ |\n]'),
           stri_flatten, character(1L), collapse = " ")
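
    For comparison, here is a minimal base-R sketch of the same idea (extract per tweet, then collapse each list element to one string). It is a variation, not part of the answers above: the sample tweets are hypothetical, and the pattern "#\\S+" is an assumption swapped in for the original regex so that the matches carry no trailing whitespace.

```r
# Hypothetical sample tweets (not the asker's data); the second has no hashtag
tweets <- c("rt @x: #mondial2014 : top 5 #baleine ",
            "no hashtags here",
            "news #toulouse")

# Assumed pattern: '#' followed by one or more non-whitespace characters
tags <- regmatches(tweets, gregexpr("#\\S+", tweets))

# Collapse each tweet's hashtags into a single string;
# vapply enforces that every result is exactly one character value
result <- vapply(tags, paste, character(1), collapse = " ")

result
length(result) == length(tweets)
```

    A tweet with no hashtags yields an empty string here, so the result always has one element per tweet.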