
R: How to fix errors in specific dataset rows by using ifelse() function or other methods


My dataset contains the following typos:

> unique(d$gender)
[1] "k"         "kobieta"   "M"         "K"         "mężczyzna" "21"        "m"         "Mężczyzna"

> unique(d$age)
[1] 19 NA 21 20 30 32 22 25 29

In the rows with "21" for gender and NA for age, the two values have been swapped. In addition, the gender variable uses inconsistent labels: every value starting with 'k' means female ('F') and every value starting with 'm' means male ('M'). I wrote the following commands to fix the gender variable:

> d$gender = ifelse(d$gender == 'K', 'F', 
+            ifelse(d$gender =='kobieta', 'F', ifelse(d$gender == 'k', 'F', 
+            ifelse(d$gender == "mężczyzna", 'M',ifelse(d$gender == '21', 'M',
+            ifelse(d$gender == 'm', 'M', ifelse(d$gender == 'Mężczyzna', 'M', 
+            ifelse(d$gender == 'M', 'M', 'M'))))))))
> 
> unique(d$gender)
[1] "F" "M"

But I don't know how to do the same for the age variable, or whether this method is the right approach. Does anyone have any suggestions?

This is the dput() result:

 d <- structure(
  list(
    ID = rep("SS49", 37),
    gender = rep("M", 37),
    age = rep(37, 37),
    vab5 = c(34, 34, 34, 34, 34, 437, 37, 37, 37, 437, 437, 34, 37, 34, 437, 
             437, 37, 437, 437, 34, 437, 37, 37, 37, 34, 437, 34, 37, 34, 437, 
             34, 437, 37, 34, 437, 37, 37),
    vab3 = factor(rep(1L, 37), labels = c("0", "1")),
    besp = c("NO","NO","NO","_","NO","_","NO","NO","NO","_","NO","NO","_",
             "NO","NO","_","NO","NO","_","NO","_","_","_","NO","NO","NO","_",
             "_","NO","NO","_","NO","NO","NO","NO","NO"),
    act = factor(c(1L,1L,1L,2L,1L,2L,1L,1L,1L,2L,1L,1L,2L,1L,1L,2L,1L,1L,2L,
                   1L,2L,2L,2L,1L,1L,1L,2L,2L,1L,1L,2L,1L,1L,1L,1L,1L),
                 labels = c("0","1")),
    group = c("by","by","by","b","by","b","by","by","by","b","by","by","b",
              "by","by","b","by","by","b","by","b","b","b","by","by","by","b",
              "b","by","by","b","by","by","by","by","by"),
    
    # Changed from here on
    response_time = runif(37, 0.2, 1.5),    # random values
    condition_code = sample(letters[1:5], 37, replace = TRUE),
    signal_strength = round(runif(37, 0.4, 1.5), 3),
    trial_number = 1:37,
    left_dots = sample(45:55, 37, replace = TRUE),
    right_dots = sample(45:55, 37, replace = TRUE),
    task_type = sample(c("taskA", "taskB", "taskC"), 37, replace = TRUE),
    is_correct = factor(sample(c(0, 1), 37, replace = TRUE), labels = c("no", "yes")),
    go_nogo = sample(c("go", "nogo"), 37, replace = TRUE),
    accuracy = factor(sample(c(0, 1), 37, replace = TRUE), labels = c("low", "high")),
    category_code = factor(sample(1:5, 37, replace = TRUE)),
    difficulty_level = sample(c("easy", "medium", "hard"), 37, replace = TRUE)
  ),
  row.names = c(NA, -37L),
  class = c("tbl_df", "tbl", "data.frame")
)

Solution

  • I'm not sure what the problem with age is, but the ifelse() statement can be rewritten as follows:

    If there are no anomalies in the d$gender field:

    d$gender2 = ifelse(tolower(substr(d$gender,1,1)) == "k", "F", "M") 
    

    If there are anomalies in the d$gender field:

    d$gender2 = ifelse(tolower(substr(d$gender,1,1)) == "k", "F",
                       ifelse(tolower(substr(d$gender,1,1)) == "m" | d$gender == "21", "M", "Other"))
    

    I think this is a more convenient method, and you could use some variation of it.

    As for age, I don't know what you want to do.
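
For the age part of the question, one possible approach is sketched below. It assumes, as the question describes, that a row is "swapped" whenever gender contains only digits and age is NA. The toy data frame is illustrative and not the asker's real data, and the `swapped` helper vector is a name introduced here. Because the original age cell was NA, the true gender of those rows is not recoverable, so this sketch sets it to NA rather than guessing 'M' as the question's code does.

```r
# Toy data reproducing the problem from the question (illustrative values only)
d <- data.frame(
  gender = c("k", "kobieta", "M", "21", "mężczyzna"),
  age    = c(19, 21, 20, NA, 30),
  stringsAsFactors = FALSE
)

# Assumption: a row is swapped when gender is all digits and age is missing
swapped <- grepl("^[0-9]+$", d$gender) & is.na(d$age)

# Move the number into age; the real gender is unknown for these rows
d$age[swapped]    <- as.numeric(d$gender[swapped])
d$gender[swapped] <- NA

# Recode the remaining labels by their first letter, as in the answer above
first    <- tolower(substr(d$gender, 1, 1))
d$gender <- ifelse(first == "k", "F", ifelse(first == "m", "M", NA))
```

Fixing the swap before recoding keeps the "21" from being silently turned into a gender label. If the missing gender values should default to some category instead, replace the `NA` assignment with that value.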