r string unicode

Rupee Unicode replacement not working in R


I'm trying to replace currency symbols in my corpus with text, for instance replacing $ with dollar. For example:

x <- "i have \u20AC and \u0024 and \u00A3 and \u00A5 and \u20B9"
"i have € and $ and £ and ¥ and \u20b9"

The Unicode escapes display correctly for every currency except the rupee. What could be the problem?

My second issue is that when I use sub or gsub, the Unicode replacement works for every symbol except the dollar.

sub('\u0024', 'dollar', x) ## which gives me
"i have € and $ and £ and ¥ and \u20b9dollar"

Replacing the dollar sign does work if I put it in a character class:

gsub("[$]", "dollar", x)

Solution

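  • The rupee sign is stored correctly in x. It is most likely only print() that shows it as the escape \u20b9, which happens when the current locale cannot represent the character. To view your x with the rupee in it, use cat: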

    > cat(x, sep="\n")
    i have € and $ and £ and ¥ and ₹

    To replace the dollar, use a literal string replacement by adding fixed=TRUE. That way you do not have to escape the $ symbol, which otherwise denotes the end of the string in a regex:

    > x <- gsub("$", "dollar", x, fixed=TRUE)
    > cat(x, sep="\n")
    i have € and dollar and £ and ¥ and ₹

    When you do not pass fixed=TRUE, sub and gsub parse "$" as a regex pattern, and in a regex, $ matches the end of the string. That is why, in your result, dollar is appended after the rupee instead of replacing the $ symbol.
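To make the difference concrete, here is a minimal sketch comparing the anchor behaviour with three ways of matching a literal dollar sign, followed by one possible way to replace every currency symbol in a loop. The variable names (by_fixed, by_escape, by_class, symbols, words) and the replacement words are only illustrative assumptions, not part of the original question.

```r
x <- "i have \u20AC and \u0024 and \u00A3 and \u00A5 and \u20B9"

# An unescaped "$" is the end-of-string anchor, so "dollar" is appended
# and the literal $ in the text is left untouched.
appended <- sub("$", "dollar", x)

# Three ways to match a literal dollar sign instead.
by_fixed  <- gsub("$", "dollar", x, fixed = TRUE)  # no regex at all
by_escape <- gsub("\\$", "dollar", x)              # escaped metacharacter
by_class  <- gsub("[$]", "dollar", x)              # character class

# Hypothetical lookup of symbols and replacement words (illustrative only).
symbols <- c("\u20AC", "\u0024", "\u00A3", "\u00A5", "\u20B9")
words   <- c("euro", "dollar", "pound", "yen", "rupee")

# fixed = TRUE treats every symbol literally, so "$" needs no special case.
y <- x
for (i in seq_along(symbols)) {
  y <- gsub(symbols[i], words[i], y, fixed = TRUE)
}
cat(y, sep = "\n")
```

Using fixed=TRUE throughout keeps the loop uniform, since none of the symbols has to be checked for regex metacharacters first.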