I am working in R and using the replace_emoticon function from the textclean package to replace emoticons with their corresponding words:
library(textclean)
test_text <- "i had a great experience xp :P"
replace_emoticon(test_text)
[1] "i had a great e tongue sticking out erience tongue sticking out tongue sticking out "
As seen above, the function works, but it also replaces character sequences that look like an emoticon even when they occur inside a word (for example, the "xp" in "experience"). While looking for a solution, I found the following override of the function, which claims to fix the problem:
replace_emoticon <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
  trimws(gsub(
    "\\s+",
    " ",
    mgsub_regex(x, paste0('\\b\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
  ))
}
replace_emoticon(test_text)
[1] "i had a great experience tongue sticking out :P"
However, while it does solve the problem with the word "experience", it creates a new one: it no longer replaces ":P", which is an emoticon and should be replaced by the function.
Furthermore, I know the error occurs with "xp", but I am not sure whether other emoticons are also replaced incorrectly when they appear as part of a word.
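One way to get a rough list of the emoticons at risk is to filter for those made up only of letters and digits, since these are the ones that can appear inside ordinary words. The sketch below is only an illustration: find_wordlike_emoticons is a hypothetical helper (not part of textclean), and the sample vector is hand-picked rather than taken from the lexicon.

```r
# Hypothetical helper (not part of textclean): keep only the emoticons that
# consist entirely of letters and digits, i.e. those that can hide in a word.
find_wordlike_emoticons <- function(emoticons) {
  emoticons[grepl("^[[:alnum:]]+$", emoticons)]
}

# Small hand-picked sample, for illustration only
sample_emoticons <- c("xp", ":P", ":)", "XD", "<3")
wordlike <- find_wordlike_emoticons(sample_emoticons)
print(wordlike)

# With the lexicon package installed, the full table could be checked the same way:
# find_wordlike_emoticons(lexicon::hash_emoticons[["x"]])
```

Emoticons that only start or end with a letter (such as ":P") are not caught by this filter, so it should be read as a lower bound on the problematic cases.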
Is there a way to tell the replace_emoticon function to only replace emoticons when they are not part of a word?
Thank you!
Wiktor is right, the word boundary check is causing the problem. I have adjusted it slightly in the function below by dropping the leading word boundary. One issue remains: an emoticon is not replaced if it is immediately followed by a word with no space in between. Whether that matters depends on your data. See the examples below.
Note: I have added this information to the textclean issue tracker.
replace_emoticon2 <- function(x, emoticon_dt = lexicon::hash_emoticons, ...){
  trimws(gsub(
    "\\s+",
    " ",
    mgsub_regex(x, paste0('\\Q', emoticon_dt[['x']], '\\E\\b'), paste0(" ", emoticon_dt[['y']], " "))
  ))
}
# works
replace_emoticon2("i had a great experience xp :P")
[1] "i had a great experience tongue sticking out tongue sticking out"
replace_emoticon2("i had a great experiencexp:P:P")
[1] "i had a great experience tongue sticking out tongue sticking out tongue sticking out"
# does not work:
replace_emoticon2("i had a great experience xp :Pnewword")
[1] "i had a great experience tongue sticking out :Pnewword"
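The behaviour of the word-boundary check can be reproduced in base R without textclean. The snippet below is a minimal sketch: has_match is a hypothetical helper introduced here, and the patterns are written out for a single emoticon rather than built from the lexicon.

```r
# \b only matches between a word character and a non-word character
# (or at the start/end of the string next to a word character).
has_match <- function(pattern, text) grepl(pattern, text, perl = TRUE)

# Leading \b: the space and ":" are both non-word characters, so there is
# no boundary in front of ":P" and the pattern cannot match.
both_sides <- has_match("\\b\\Q:P\\E\\b", "xp :P")

# Trailing \b only: "P" is followed by the end of the string, a boundary.
trailing_only <- has_match("\\Q:P\\E\\b", "xp :P")

# Trailing \b fails when a letter follows "P" directly.
glued <- has_match("\\Q:P\\E\\b", "xp :Pnewword")

# "xp" inside "experience" is followed by "e", so the trailing \b protects it.
inside_word <- has_match("\\Qxp\\E\\b", "experience")

print(c(both_sides, trailing_only, glued, inside_word))
# FALSE TRUE FALSE FALSE
```

This is why removing the leading \b restores the ":P" replacement while still leaving "experience" alone, and also why ":Pnewword" is not replaced.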
New function added:
This version is based on stringi and on the regex-escaping function by Wiktor from this post. It only replaces an emoticon when it is preceded by whitespace.
replace_emoticon_new <- function (x, emoticon_dt = lexicon::hash_emoticons, ...)
{
  regex_escape <- function(string) {
    gsub("([][{}()+*^${|\\\\?.])", "\\\\\\1", string)
  }
  stringi::stri_replace_all(x,
                            regex = paste0("\\s+", regex_escape(emoticon_dt[["x"]])),
                            replacement = paste0(" ", emoticon_dt[['y']]),
                            vectorize_all = FALSE)
}
test_text <- "Hello :) Great experience! xp :) :P"
replace_emoticon_new(test_text)
[1] "Hello smiley Great experience! tongue sticking out smiley tongue sticking out"
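Because the pattern requires at least one whitespace character in front of the emoticon, an emoticon at the very start of the string is left untouched. The sketch below shows this for a single hand-escaped emoticon in base R; the "(?:^|\\s+)" variant is an assumption of mine, not something from textclean or the function above, and it has not been tried against the full lexicon.

```r
# Patterns for one emoticon, ":)", escaped by hand.
with_space <- "\\s+:\\)"          # whitespace required in front
with_start <- "(?:^|\\s+):\\)"    # start of string OR whitespace in front

text <- ":) Hello :)"

# The leading ":)" survives because nothing precedes it.
space_only <- gsub(with_space, " smiley", text, perl = TRUE)

# Allowing "^" replaces the leading emoticon too; trimws() removes the
# extra space that the replacement adds at the start.
start_or_space <- trimws(gsub(with_start, " smiley", text, perl = TRUE))

print(space_only)      # ":) Hello smiley"
print(start_or_space)  # "smiley Hello smiley"
```

If leading emoticons matter in your data, the same alternation could be pasted in front of the escaped emoticons in place of "\\s+".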