I am currently learning how to do data analysis in RStudio, using an SPSS dataset as an example. I am having trouble with the results of an open-ended question in which people wrote the region they come from. The same answer is often written in slightly different ways, so the variants are treated as different values even though they refer to the same region.
Example:
x <- c("Bucharest", "ploiesti", "Focsani", "bucharest", "sinaia",
       "Ploiești", "Sinaia", "BUCHAREST", "Bucharest", "Ploiesti")
table(x)
If I make a table from it, the result is:
> table(x)
x
bucharest Bucharest BUCHAREST Focsani ploiesti Ploiesti Ploiești
1 2 1 1 1 1 1
sinaia Sinaia
1 1
I'm not sure this is the best example, since my actual problem concerns a variable (column) in a dataset, but I hope it helps.
I tried using the str_to_title() function from the stringr package, but I get the following warning:
Warning message:
In stri_trans_totitle(string, opts_brkiter = stri_opts_brkiter(locale = locale)) :
argument is not an atomic vector; coercing
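This warning usually means that something other than a plain character vector, for example a whole data frame or a list, was passed to str_to_title(). A minimal sketch of the difference, assuming a hypothetical data frame `df` with a column `region` standing in for the imported SPSS data:

```r
library(stringr)

# Hypothetical data frame standing in for the imported SPSS data
df <- data.frame(region = c("Bucharest", "ploiesti", "BUCHAREST"))

# Passing the whole data frame (a list, not an atomic vector) is the kind of
# call that produces the "not an atomic vector; coercing" warning:
# str_to_title(df)

# Passing a single column, which is an atomic character vector, avoids it
titled <- str_to_title(df$region)
print(titled)
```

If the column was imported with labels or as a factor, wrapping it in as.character() first may also be needed; that depends on how the SPSS file was read.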
I want to find a way to make the answers uniform (e.g. turn every version of "Bucharest" into one spelling that is recognized as the same answer, and do the same for the other answers), and then build a table showing how many times each answer occurs.
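library(dplyr)
library(stringr)

x <- data.frame(region = c("Bucharest", "ploiesti", "Focsani",
    "bucharest", "sinaia", "Ploiești", "Sinaia", "BUCHAREST", "Bucharest", "Ploiesti")) %>%
  # Title-case every answer, then replace the Romanian 'ș' with a plain 's'
  mutate(uniformName = str_to_title(region),
         uniformName = str_replace_all(uniformName, 'ș', 's')) %>%
  # Count how often each cleaned-up name occurs
  group_by(uniformName) %>%
  summarise(count = n())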
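The replacement above targets 'ș' only. If the real column contains other diacritics, a more general variant may be easier to maintain. The following is a sketch, not a definitive solution: it works on the plain vector from the question, and `normalise_region` is a hypothetical helper name introduced here for illustration. It relies on stringi's "Latin-ASCII" transliterator to strip diacritics.

```r
library(stringi)

x <- c("Bucharest", "ploiesti", "Focsani", "bucharest", "sinaia",
       "Ploiești", "Sinaia", "BUCHAREST", "Bucharest", "Ploiesti")

# Hypothetical helper: strip diacritics, trim stray whitespace, then title-case
normalise_region <- function(v) {
  v <- stri_trans_general(v, "Latin-ASCII")  # e.g. 'ș' becomes 's'
  v <- stri_trim_both(v)                     # drop leading/trailing spaces
  stri_trans_totitle(v)                      # "BUCHAREST" and "bucharest" become "Bucharest"
}

# Frequency table of the cleaned-up answers
counts <- table(normalise_region(x))
print(counts)
```

For a dataset column, the same helper could be applied as `table(normalise_region(df$region))`, where `df` is whatever name the imported data has. Genuine misspellings (as opposed to case or accent differences) would still need a manual lookup table or fuzzy matching.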