I am confused by a filter in R that does not seem to work for the Cyrillic letter with a breve (й). Here is a simplified version of my dataframe, which contains 1363 rows in full.
df <- data_frame(lex_1 = c('гей', "лесбиянка", "гей", "трансгендер", "гей", "лесбиянка"),
lex_2 = c("лесбиянка", "лесбиянка", "трансгендер", "трансгендер", "гей", "гей"),
w1w2 = c(10, 20, 25, 40, 65, 90),
w1 = c(1, 2, 3, 4, 5, 6),
w2 = c(6, 5, 4, 3, 2, 1))
From this dataframe, I want to keep only the rows where lex_1 is "гей", so I used this code:
kolokasi_gay <- df[df$lex_1=="гей",]
However, the filter returns no rows at all.
But when I change the filter word to "трансгендер", it works perfectly fine.
kolokasi_gay <- df[df$lex_1=="трансгендер",]
Where does this go wrong? Why does the filter fail only for words containing the breve letter "й"?
Your help is appreciated, thank you.
I tried to change the word that I want to filter, and it worked when the word does not have any breve diacritic.
There is a single Unicode code point for that character (U+0439, decimal 1081):
utf8ToInt("й")
[1] 1081
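Your dataframe seems to use the decomposed form instead, in which the letter is stored as two code points: the base letter "и" (U+0438, decimal 1080) followed by a combining breve (U+0306, decimal 774). The two forms look identical on screen but are different strings, so == does not match them. Splitting the first value, and then copying the character straight from the data, shows both code points: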
strsplit(df$lex_1[1], "")[[1]][3:4]
[1] "и" "̆"
utf8ToInt("й")
[1] 1080 774
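Because the two forms are indistinguishable on screen, it can help to build them from explicit code points rather than typing or pasting them. The following is a minimal base R sketch; the variable names composed and decomposed are only illustrative and do not come from the question.

```r
# Build both spellings of the letter from explicit code points,
# so there is no doubt about which form each string holds.
composed   <- intToUtf8(1081)          # U+0439, one code point
decomposed <- intToUtf8(c(1080, 774))  # U+0438 followed by U+0306

# They render the same but are different strings.
composed == decomposed   # FALSE

# nchar() counts code points by default, so the lengths differ.
nchar(composed)          # 1
nchar(decomposed)        # 2

# Recover the code points of each form.
utf8ToInt(composed)      # 1081
utf8ToInt(decomposed)    # 1080 774
```

Running utf8ToInt() on a suspicious value from your own data is a quick way to tell which form it uses.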
You can check for both with
grepl(
  paste0(intToUtf8(c(1080, 774)), "|", intToUtf8(1081)),
  df$lex_1)
[1] TRUE FALSE TRUE FALSE TRUE FALSE
Or, more generally, replace the two-code-point sequence with the single code point before comparing:
df$new <- gsub(
  intToUtf8(c(1080, 774)), intToUtf8(1081), df$lex_1)
df$new == "гей"
[1] TRUE FALSE TRUE FALSE TRUE FALSE
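The gsub() approach handles this one letter. If other decomposed characters might be present (for example "ё" stored as "е" plus a combining diaeresis), normalizing the whole column to Unicode NFC is a more general option. The sketch below assumes the stringi package is installed, which the original post does not mention; the vector lex is a made-up stand-in for df$lex_1.

```r
library(stringi)  # assumption: stringi is installed; not used in the original post

# Stand-in for df$lex_1: the same word once in decomposed form
# (U+0438 + U+0306) and once in precomposed form (U+0439).
lex <- c("\u0433\u0435\u0438\u0306", "\u0433\u0435\u0439")
target <- "\u0433\u0435\u0439"

lex == target              # only the precomposed entry matches

# Normalize to NFC so that base letter + combining mark pairs
# are replaced by their precomposed equivalents.
lex_nfc <- stri_trans_nfc(lex)

lex_nfc == target          # both entries match after normalization
```

With real data this would be something like df$lex_1 <- stri_trans_nfc(df$lex_1), after which the original filter df[df$lex_1 == "гей", ] should work, provided the literal in your script is itself in precomposed form.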