One-hot-encoding multi-byte string values in R


I collected some data from a survey that asked respondents to rank their preferences for players' profiles:

profile1: Tom, center, pitcher
profile2: Pete, right, hitter
profile3: Clay, left, hitter
profile4: Tom, right, fielder
profile5: Pete, left, fielder
profile6: Clay, center, pitcher

However, because I was unfamiliar with the questionnaire development software, the responses I collected are stored as multi-byte string values like the following (one newline-separated string per respondent), which are then read into R:

preferences <- data.frame(pref = c("1. Pete, right, hitter\n2. Clay, center, pitcher\n3. Tom, right, fielder\n4. Tom, center, pitcher\n5. Clay, left, hitter\n6. Pete, left, fielder",
"1. Tom, right, fielder\n2. Clay, center, pitcher\n3. Pete, left, fielder\n4. Pete, right, hitter\n5. Tom, center, pitcher\n6. Clay, left, hitter",
"1. Clay, left, hitter\n2. Tom, center, pitcher\n3. Pete, right, hitter\n4. Pete, left, fielder\n5. Clay, center, pitcher\n6. Tom, right, fielder"))

Is there a way to map each respondent's ranked choices to one column per player profile listed above, kind of like one-hot encoding (OHE), so that each cell holds the rank the respondent gave that profile? The result should be in the following format:

df <- data.frame(profile1 = c(4, 5, 2), profile2 = c(1, 4, 3), profile3 = c(5, 6, 1), profile4 = c(3, 1, 6), profile5 = c(6, 3, 4), profile6 = c(2, 2, 5))

df

  profile1 profile2 profile3 profile4 profile5 profile6
1        4        1        5        3        6        2
2        5        4        6        1        3        2
3        2        3        1        6        4        5

Any suggestions would be appreciated.


Solution

    library(stringr)  # provides str_replace_all()

    preferences <- data.frame(pref = c("1. Pete, right, hitter\n2. Clay, center, pitcher\n3. Tom, right, fielder\n4. Tom, center, pitcher\n5. Clay, left, hitter\n6. Pete, left, fielder",
    "1. Tom, right, fielder\n2. Clay, center, pitcher\n3. Pete, left, fielder\n4. Pete, right, hitter\n5. Tom, center, pitcher\n6. Clay, left, hitter",
    "1. Clay, left, hitter\n2. Tom, center, pitcher\n3. Pete, right, hitter\n4. Pete, left, fielder\n5. Clay, center, pitcher\n6. Tom, right, fielder"), stringsAsFactors = FALSE)
    
    # Profiles in the order profile1, ..., profile6
    profiles <- c(
      "Tom, center, pitcher",
      "Pete, right, hitter",
      "Clay, left, hitter",
      "Tom, right, fielder",
      "Pete, left, fielder",
      "Clay, center, pitcher"
    )
    
    # For each respondent: split the string into its ranked lines, strip the
    # leading "N. " prefix, then look up the position (= rank) of each profile
    df <- data.frame(do.call(rbind, lapply(preferences$pref, function(x) {
      match(
        profiles,
        str_replace_all(strsplit(x, "\\n")[[1]], "^[0-9]+\\. ", "")
      )
    })))
    
    # Name the columns profile1, ..., profile6
    names(df) <- paste0("profile", seq_along(profiles))
    
    df
    
    #   profile1 profile2 profile3 profile4 profile5 profile6
    # 1        4        1        5        3        6        2
    # 2        5        4        6        1        3        2
    # 3        2        3        1        6        4        5
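
To see why the `match()` call gives the ranks, it may help to walk through a single respondent. The sketch below uses the first response from the question and the same `profiles` vector. It swaps in base R's `sub()` for `str_replace_all()` only to stay dependency-free. The `stopifnot()` guard is an addition of mine, assuming you would want a loud failure if a profile label does not match exactly (for example because of stray whitespace).

```r
# First respondent's raw answer, copied from the question
x <- "1. Pete, right, hitter\n2. Clay, center, pitcher\n3. Tom, right, fielder\n4. Tom, center, pitcher\n5. Clay, left, hitter\n6. Pete, left, fielder"

profiles <- c(
  "Tom, center, pitcher",
  "Pete, right, hitter",
  "Clay, left, hitter",
  "Tom, right, fielder",
  "Pete, left, fielder",
  "Clay, center, pitcher"
)

# Step 1: split on newlines; element i is the choice ranked i-th
ranked <- strsplit(x, "\n", fixed = TRUE)[[1]]

# Step 2: strip the leading "1. ", "2. ", ... prefix (base sub() instead of stringr)
ranked <- sub("^[0-9]+\\. ", "", ranked)

# Step 3: the position of each profile within `ranked` is that profile's rank
ranks <- match(profiles, ranked)

# Guard (my addition): an NA would mean a profile label failed to match exactly
stopifnot(!anyNA(ranks))

ranks
# [1] 4 1 5 3 6 2
```

The result is the first row of the desired data frame. `lapply()` repeats these steps for every respondent and `rbind` stacks the rows.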