My dataset contains the following typos:
> unique(d$gender)
[1] "k" "kobieta" "M" "K" "mężczyzna" "21" "m" "Mężczyzna"
> unique(d$age)
[1] 19 NA 21 20 30 32 22 25 29
In fact, in some rows the two values were entered in the wrong columns: 21 ended up in gender and age is NA. On top of that, gender is coded inconsistently: every value starting with 'k' means female ('F') and every value starting with 'm' means male ('M'). I wrote the following commands to fix the gender variable:
> d$gender = ifelse(d$gender == 'K', 'F',
+ ifelse(d$gender == 'kobieta', 'F', ifelse(d$gender == 'k', 'F',
+ ifelse(d$gender == 'mężczyzna', 'M', ifelse(d$gender == '21', 'M',
+ ifelse(d$gender == 'm', 'M', ifelse(d$gender == 'Mężczyzna', 'M',
+ ifelse(d$gender == 'M', 'M', 'M'))))))))
>
> unique(d$gender)
[1] "F" "M"
But I don't know how to do the same for the age variable, or whether this approach is the right one. Does anyone have any suggestions?
This is the dput() result:
d <- structure(
list(
ID = rep("SS49", 37),
gender = rep("M", 37),
age = rep(37, 37),
vab5 = c(34, 34, 34, 34, 34, 437, 37, 37, 37, 437, 437, 34, 37, 34, 437,
437, 37, 437, 437, 34, 437, 37, 37, 37, 34, 437, 34, 37, 34, 437,
34, 437, 37, 34, 437, 37, 37),
vab3 = factor(rep(1L, 37), levels = 1:2, labels = c("0", "1")),
besp = c("NO","NO","NO","_","NO","_","NO","NO","NO","_","NO","NO","_",
"NO","NO","_","NO","NO","_","NO","_","_","_","NO","NO","NO","_",
"_","NO","NO","_","NO","NO","NO","NO","NO"),
act = factor(c(1L,1L,1L,2L,1L,2L,1L,1L,1L,2L,1L,1L,2L,1L,1L,2L,1L,1L,2L,
1L,2L,2L,2L,1L,1L,1L,2L,2L,1L,1L,2L,1L,1L,1L,1L,1L),
labels = c("0","1")),
group = c("by","by","by","b","by","b","by","by","by","b","by","by","b",
"by","by","b","by","by","b","by","b","b","b","by","by","by","b",
"b","by","by","b","by","by","by","by","by"),
# Changed from here on
response_time = runif(37, 0.2, 1.5), # random values
condition_code = sample(letters[1:5], 37, replace = TRUE),
signal_strength = round(runif(37, 0.4, 1.5), 3),
trial_number = 1:37,
left_dots = sample(45:55, 37, replace = TRUE),
right_dots = sample(45:55, 37, replace = TRUE),
task_type = sample(c("taskA", "taskB", "taskC"), 37, replace = TRUE),
is_correct = factor(sample(c(0, 1), 37, replace = TRUE), labels = c("no", "yes")),
go_nogo = sample(c("go", "nogo"), 37, replace = TRUE),
accuracy = factor(sample(c(0, 1), 37, replace = TRUE), labels = c("low", "high")),
category_code = factor(sample(1:5, 37, replace = TRUE)),
difficulty_level = sample(c("easy", "medium", "hard"), 37, replace = TRUE)
),
row.names = c(NA, -37L),
class = c("tbl_df", "tbl", "data.frame")
)
I don't know what the problem with age is, but the ifelse statement can be rewritten as follows.
If there are no anomalies in the d$gender field:
d$gender2 = ifelse(tolower(substr(d$gender,1,1)) == "k", "F", "M")
If there are anomalies in the d$gender field:
d$gender2 = ifelse(tolower(substr(d$gender,1,1)) == "k", "F",
ifelse(tolower(substr(d$gender,1,1)) == "m" | d$gender == "21", "M", "Other"))
I think this is a more convenient method, and you could use some variation of it.
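One such variation, as a minimal sketch: instead of nesting ifelse calls, look the first letter up in a named vector. The names gender_raw, lookup and gender_clean are made up for this illustration, and the toy vector only mirrors the values listed in the question. Anything that starts with neither 'k' nor 'm' (such as "21") comes out as NA, so it stays visible instead of being silently recoded.

```r
# Toy vector mirroring the values shown in the question (illustrative only).
gender_raw <- c("k", "kobieta", "M", "K", "mężczyzna", "21", "m", "Mężczyzna")

# Lookup table keyed by the lower-cased first letter.
lookup <- c(k = "F", m = "M")

# Take the first character, lower-case it, and index the lookup by name.
first_letter <- tolower(substr(gender_raw, 1, 1))
gender_clean <- unname(lookup[first_letter])  # keys not in the lookup give NA

print(gender_clean)
```

On the real data this would be applied as d$gender2 = unname(lookup[tolower(substr(d$gender, 1, 1))]), followed by a look at the rows where the result is NA.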
As for age, I don't know what you want to do.
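If the situation is the one the question describes, that is, the age was typed into the gender column and age was left as NA, one possible reading is to move the number back into age. The sketch below makes that assumption explicit; d_toy and swapped are made-up names, and the toy rows are not taken from the real data.

```r
# Hypothetical toy data: in row 3 the age was typed into the gender column.
d_toy <- data.frame(
  gender = c("k", "M", "21", "kobieta"),
  age    = c(19, 30, NA, 22),
  stringsAsFactors = FALSE
)

# Rows where gender consists only of digits and age is missing.
swapped <- grepl("^[0-9]+$", d_toy$gender) & is.na(d_toy$age)

# Move the number into age, then mark gender as unknown for those rows.
d_toy$age[swapped]    <- as.numeric(d_toy$gender[swapped])
d_toy$gender[swapped] <- NA

print(d_toy)
```

This leaves gender as NA in the affected rows because the true value is not recoverable from the number alone. If the gender of those participants is known from another source, it can be assigned there instead.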