
Identifying, extracting and counting patterns in sequences


I'm working with a data frame that contains only two columns: one holds a unique ID generated by a virtual machine, and the other holds a name. However, the name column may also contain the string "ERROR".

The objective is to create a script that identifies every occurrence of the string "ERROR" and captures the names immediately before and after it, along with the unique ID assigned to that "ERROR" row. To illustrate, consider the following example:

If I have this data:

ID NAMES
1 James
3 ERROR
6 Keras
88 Kelly
53 Micheal
55 ERROR
7 Cindy
834 Keras

Then we would like to end up with the following list:

ID NAMES
3 James-Keras
55 Micheal-Cindy

This is because the first "ERROR" had an ID of 3 and sat between the names James (before) and Keras (after); the next "ERROR" had an ID of 55 and sat between Micheal and Cindy. If "ERROR" is at the top or the bottom of the list, we should include only whatever name we find. It is OK to have, say, "NA-NAME" if "ERROR" was found at the top.

But here is where it gets tricky: if we run into consecutive "ERROR" strings, we should always use the very last one in the run (reading top to bottom) as the "guide". For instance:

If I have this data set:

ID NAMES
1 James
3 ERROR
6 ERROR
88 ERROR
53 Jude
55 ERROR
7 Cindy
834 Keras

then we want to end up with:

ID NAMES
88 James-Jude
55 Jude-Cindy

This is because "ERROR" was repeated three times consecutively and the last occurrence was at ID 88, so we take that row as the reference and record the names before and after the run. Another way to see this is to treat consecutive "ERROR" strings as a block: we record the names before and after each block, together with the ID of the block's last row.
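
One way to make this "block" view concrete in R is run-length encoding. The following is only a minimal sketch, assuming the NAMES column is available as a plain character vector; the names `nm`, `block` and `last_err` are illustrative and not part of the original data.

```r
# Hypothetical vector mirroring the NAMES column of the second example
nm <- c("James", "ERROR", "ERROR", "ERROR", "Jude", "ERROR", "Cindy", "Keras")

# Run-length encode the ERROR / non-ERROR indicator
r <- rle(nm == "ERROR")

# Label every row with the index of the run (block) it belongs to
block <- rep(seq_along(r$values), r$lengths)
block
# [1] 1 2 2 2 3 4 5 5

# Row position of the last "ERROR" in each block
last_err <- cumsum(r$lengths)[r$values]
last_err
# [1] 4 6
```

Rows 4 and 6 carry IDs 88 and 55 in the example, which are the IDs expected in the output.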


Solution

  • We may create a function to do this

    f1 <- function(dat) {
    
        # Label each row with the index of its run of ERROR / non-ERROR values
        grp <- with(rle(dat$NAMES == "ERROR"), rep(seq_along(values), lengths))
        # Last row of every run: supplies the ID to report and the name before a block
        subdat1 <- dat[!duplicated(grp, fromLast = TRUE), ]
        # First row of every run: supplies the name after a block
        subdat2 <- dat[!duplicated(grp), ]
        # Runs alternate, so ERROR rows are never adjacent in subdat1
        ind <- which(subdat1$NAMES == "ERROR")
        do.call(rbind, lapply(ind, function(i) 
            data.frame(ID = subdat1$ID[i], NAMES = paste(subdat1$NAMES[i - 1], 
            subdat2$NAMES[i + 1], sep = "-"))))
    }
    

    -testing

    > f1(df1)
      ID         NAMES
    1  3   James-Keras
    2 55 Micheal-Cindy
    > f1(df2)
      ID      NAMES
    1 88 James-Jude
    2 55 Jude-Cindy
    

    data

    df1 <- structure(list(ID = c(1L, 3L, 6L, 88L, 53L, 55L, 7L, 834L), NAMES = c("James", 
    "ERROR", "Keras", "Kelly", "Micheal", "ERROR", "Cindy", "Keras"
    )), class = "data.frame", row.names = c(NA, -8L))
    
    df2 <- structure(list(ID = c(1L, 3L, 6L, 88L, 53L, 55L, 7L, 834L), NAMES = c("James", 
    "ERROR", "ERROR", "ERROR", "Jude", "ERROR", "Cindy", "Keras")), 
     class = "data.frame", row.names = c(NA, 
    -8L))
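
The question allows "NA-NAME" when a block of "ERROR" rows sits at the top or bottom of the data. As a hedged sketch of how that could be handled, the variant below works directly on run boundaries and pads a missing neighbour with NA. The name `f2` and the small `edge` data frame are hypothetical, introduced only for illustration.

```r
# Hypothetical variant: one output row per block of consecutive "ERROR" rows,
# with NA where the neighbouring name does not exist.
f2 <- function(dat) {
  r      <- rle(dat$NAMES == "ERROR")
  ends   <- cumsum(r$lengths)          # last row of each run
  starts <- ends - r$lengths + 1L      # first row of each run
  blk    <- which(r$values)            # runs that are ERROR blocks

  before <- starts[blk] - 1L           # row just before each block
  after  <- ends[blk] + 1L             # row just after each block
  before[before < 1L] <- NA            # block at the top: no previous name
  after[after > nrow(dat)] <- NA       # block at the bottom: no next name

  data.frame(ID    = dat$ID[ends[blk]],
             NAMES = paste(dat$NAMES[before], dat$NAMES[after], sep = "-"))
}

# Made-up data with "ERROR" at both ends
edge <- data.frame(ID = c(10L, 20L, 30L),
                   NAMES = c("ERROR", "Ann", "ERROR"))
res <- f2(edge)
res
# IDs 10 and 30, with NAMES "NA-Ann" and "Ann-NA"
```

Indexing a vector with NA returns NA, and `paste` turns it into the string "NA", which is what produces the "NA-Ann" style labels.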