r filter cyrillic

Filter does not work on Cyrillic alphabet with breve "й" in R


I am confused by filtering in R, which does not seem to work for Cyrillic letters with a breve (й). Here is a simplified version of my data frame; the real one contains 1363 rows.

df <- data_frame(lex_1 = c('гей', "лесбиянка", "гей", "трансгендер", "гей", "лесбиянка"),
                 lex_2 = c("лесбиянка", "лесбиянка", "трансгендер", "трансгендер", "гей", "гей"),
                 w1w2 = c(10, 20, 25, 40, 65, 90),
                 w1 = c(1, 2, 3, 4, 5, 6),
                 w2 = c(6, 5, 4, 3, 2, 1))

From this data frame, I want to keep only the rows where lex_1 is "гей", so I used this code:

kolokasi_gay <- df[df$lex_1=="гей",]

However, this filter does not retrieve anything; the result has zero rows.

But when I change the filter word to "трансгендер", it works perfectly fine.

kolokasi_gay <- df[df$lex_1=="трансгендер",]


Where does this go wrong? Why does the filter fail only for words containing the breve letter "й"?

Your help is appreciated, thank you.

I tried changing the word I filter on, and it works whenever the word does not contain a breve diacritic.


Solution

  • That character has its own single Unicode code point (U+0439, decimal 1081):

    utf8ToInt("й")
    [1] 1081
    

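    Your data frame, however, seems to store it as two code points: the letter "и" (1080) followed by a combining breve (774). The two forms look identical on screen but do not compare as equal.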

    strsplit(df$lex_1[1], "")[[1]][3:4]
    [1] "и" "̆" 
    
    utf8ToInt(df$lex_1[1])[3:4]
    [1] 1080  774
    

    You can check for both with

    grepl(
      paste0(intToUtf8(c(1080, 774)), "|", intToUtf8(1081)),
      df$lex_1)
    [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
    

    or, more generally, replace the two-code-point sequence with the single-code-point character

    df$new <- gsub(
      intToUtf8(c(1080, 774)), intToUtf8(1081), df$lex_1)
    
    df$new == "гей"
    [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE
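
Because the two spellings are indistinguishable on screen, it may help to reproduce the problem with explicit Unicode escapes. The following is only a minimal base-R sketch: the names `composed`, `decomposed`, and `fix_short_i` are made up for illustration, and the helper handles only the lowercase "й", not other decomposed letters.

```r
# The same word spelled two ways, written with escapes so the encoding is explicit.
composed   <- "\u0433\u0435\u0439"        # й as one code point (U+0439)
decomposed <- "\u0433\u0435\u0438\u0306"  # и (U+0438) + combining breve (U+0306)

# They render alike but are different sequences of code points.
utf8ToInt(composed)      # 1075 1077 1081
utf8ToInt(decomposed)    # 1075 1077 1080 774
composed == decomposed   # FALSE

# Hypothetical helper: fold the decomposed sequence into the single code point.
fix_short_i <- function(x) {
  gsub("\u0438\u0306", "\u0439", x, fixed = TRUE)
}

fix_short_i(decomposed) == composed   # TRUE
```

Applying such a helper to the column before comparing, for example `df[fix_short_i(df$lex_1) == "гей", ]`, should then return the expected rows, provided the literal in the script is itself the single-code-point form. If the stringi package is available, `stringi::stri_trans_nfc()` performs full Unicode NFC normalization and would cover other decomposed characters as well.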