rdataframefor-loopif-statementupdating

R - How to change column values based on a combination of values in other columns in that data frame?


I'm working with translation data and am trying to distinguish between simple typos and actual text modifications; typos are identified as those alterations that do not occur within 7 keystrokes after or before another alteration or broken word (longer pause within a word; those I've managed to identify). Ideally, the code would also check if any other alterations or broken words occur in the same word, regardless of how many keystrokes that is removed from the present one. The variable Issue takes the values 'BW' for broken word, 'ALT' for alterations, and 'none' if new text is produced smoothly. Each 'Id' represents a keystroke. 'Count' keeps track of the word count, i.e. all keystrokes contributing to the first word are labeled 1, the second word 2, etc.

I'd like to divvy up the 'ALT' group into 'ALT' and 'Typo' by determining for each 'ALT' if another issue ('ALT' or 'BW') pops up within 7 keystrokes before or after it, or within the same word, regardless of how long that word is (i.e. for each 'Id' with the same 'Count'). If this is not the case, the 'ALT' should be considered a 'Typo'.

I've tried a bunch of nested for and if statements, but these tend to get problematic or claim that arguments 'have length 0' when this is not the case; I'm not that skilled a coder to actually get them to work. In my final attempt, below, I'd settled for any issues within 7 keystrokes before or after the 'ALT', and not taken into account the word itself, although that is not ideal.

Example dataset:

T01 <- structure(list(Id = 1:100, Count = c(1L, 1L, 1L, 1L, 1L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 
7L, 7L, 8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 10L), Issue = c("none", 
"none", "none", "none", "none", "none", "BW", "none", "ALT", 
"ALT", "ALT", "BW", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "none", "none", "ALT", 
"none", "none", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "BW", "ALT", "ALT", "ALT", 
"ALT", "ALT", "ALT", "BW", "BW", "none", "BW", "ALT", "ALT", 
"BW", "none", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "none", "none", "none", 
"none", "ALT", "none", "none", "none", "ALT", "BW", "none", "BW", 
"ALT", "ALT", "ALT", "BW", "BW", "none", "none", "BW", "none", 
"none", "none", "none", "none", "none", "none", "none", "none", 
"none", "none", "none", "none", "none", "none")), row.names = c(NA, 
100L), class = "data.frame")

Final coding attempt:

library(dplyr)

Changes <- which(T01$Issue == 'Alt') # Identify which issues are alterations, and therefore possibly typos
T01$Typo <- F           #Create a column in the dataframe to store values
for (index in 1:length(T01$Id)) {     # i.e. for each row
for (x in 1:7) {                      # check up to 7 rows before or after this one
if (T01$Id[index] %in% Changes) {     # Only check indices where T01$Issue == 'ALT'
if ((T01$Issue[index-x] != 'ALT') & (T01$Issue[index-x] != 'BW') & 
# no alterations or broken words up to seven keystrokes before this one
(T01$Issue[index+x] != 'ALT') & (T01$Issue[index+x] != 'BW')) 
# no alterations or broken words up to seven keystrokes after this one
{T01$Typo[index] <- T}}}}

Hoping someone here can help out!


Solution

  • We can do it with group_by(Count) and zoo::rollapply().

    rollapply creates a window and apply a function, partial = TRUE is to consider incomplete windows, ie there isn't 7 observations before

    library(dplyr)
    
    T01 %>% 
      arrange(Id, Count) %>%
      group_by(Count) %>% 
      mutate(teste = ifelse(Issue == "ALT", 
                            ifelse(
                              zoo::rollapply(Issue,
                                             width = list(c(-7:-1,1:7)), 
                                             \(x) any(x == "ALT" | x == "BW"),
                                             partial = TRUE),
                              "Typo","ALT"),
                            Issue
                            )
             ) 
    #> # A tibble: 100 × 4
    #> # Groups:   Count [10]
    #>       Id Count Issue teste
    #>    <int> <int> <chr> <chr>
    #>  1     1     1 none  none 
    #>  2     2     1 none  none 
    #>  3     3     1 none  none 
    #>  4     4     1 none  none 
    #>  5     5     1 none  none 
    #>  6     6     2 none  none 
    #>  7     7     2 BW    BW   
    #>  8     8     2 none  none 
    #>  9     9     2 ALT   Typo 
    #> 10    10     2 ALT   Typo 
    #> # … with 90 more rows
    

    Created on 2023-02-24 with reprex v2.0.2