Tags: r, list, dataframe, run-length-encoding

Remove consecutive duplicates per row with RLE and check logic of sequence in R


I have a two-step data cleaning problem for a dataset with patient pathways (e.g. Arrival -> Area A -> Ward). This is an example of what the data looks like:

df <- data.frame(Patient = c(1,2,3,4,5),
                 Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                 Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                 Area3 = c("Area B", "Diagnostics", "Area B", "Area A", NA),
                 Area4 = c("Ward", "Ward", "Area B", "Area C", NA),
                 Area5 = c(NA, NA, "Ward", "Arrival", NA)
)

Step 1: Removing duplicates in consecutive columns

Some patients have the same value in consecutive columns, e.g. patient 2 (Diagnostics -> Diagnostics) and patient 3 (Area B -> Area B). I need each pathway to contain no consecutive repeats.

I have partly solved this using apply() and rle():

df1 <- apply(df,1,rle)

However, this gives me a (large) list with the values and lengths. How can I transfer that back into a data frame of the above form (i.e. keeping patient ID and values)? I have tried various versions of do.call, rbindlist() and unlist() but none of them seem to work for me.

Step 2: Check logic of pathways

Assume we now have a clean dataset:

dfclean <- data.frame(Patient = c(1,2,3,4,5),
                 Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                 Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                 Area3 = c("Area B", "Ward", "Area B", "Area A", NA),
                 Area4 = c("Ward", NA, "Ward", "Area C", NA),
                 Area5 = c(NA, NA, NA, "Arrival", NA)
)

Now I need to check the logic of the pathways. To do so, I have a second dataset that lists all possible pathways, and for every pathway in dataset 1 I need to check whether it is "possible" according to dataset 2. Suppose dataset 2 looks like this:

df2 <- data.frame(Patient = c(1,2,3,4,5),
                 Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                 Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                 Area3 = c("Area B", "Area A", "Area B", "Area A", NA),
                 Area4 = c("Ward", "Ward", "Ward", "Area C", NA),
                 Area5 = c(NA, NA, NA, NA, NA)
)

I would like to create a variable that indicates TRUE for valid pathways (e.g. Patient 1) and FALSE for invalid pathways (e.g. Patient 4). I have no idea how to do that...


Solution

  • Step 1:

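    # Collapse consecutive duplicates in each row, then pad with NA to the original width
    df[,-1] <- data.frame(t(apply(df[,-1], 1, function(z) {
      r <- rle(z)  # run-length encode the row; r$values has one entry per run
      c(r$values, rep(NA, length(z) - length(r$values)))  # right-pad with NA
    })))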
    df
    #   Patient    Area1       Area2  Area3  Area4   Area5
    # 1       1 Arrival1      Area A Area B   Ward    <NA>
    # 2       2 Arrival1 Diagnostics   Ward   <NA>    <NA>
    # 3       3 Arrival2      Area A Area B   Ward    <NA>
    # 4       4 Arrival1      Area B Area A Area C Arrival
    # 5       5 Arrival2        <NA>   <NA>   <NA>    <NA>
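
To see why the padding is needed, it may help to trace a single row by hand. The following is a minimal sketch using patient 2's pathway; the variable names `z`, `r` and `padded` are only for illustration. Note that `rle()` treats each `NA` as its own run, so trailing `NA`s are kept rather than collapsed.

```r
# Patient 2's areas, taken from the example data in the question
z <- c("Arrival1", "Diagnostics", "Diagnostics", "Ward", NA)

# Run-length encode: one entry in `values` per run of identical neighbours
r <- rle(z)
r$values   # "Arrival1" "Diagnostics" "Ward" NA
r$lengths  # 1 2 1 1

# The deduplicated row is shorter than the original, so pad it with NA
# on the right to keep every row the same width
padded <- c(r$values, rep(NA, length(z) - length(r$values)))
padded     # "Arrival1" "Diagnostics" "Ward" NA NA
```

Because every row is padded back to the same length, `apply()` returns a matrix, which `t()` and `data.frame()` can then turn back into the original shape.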
    

  • Step 2: (tbd, pending "possible pathways")
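
Pending more detail on what counts as a "possible" pathway, one way Step 2 might be approached with the `dfclean` and `df2` tables from the question is sketched below. It assumes a pathway is valid only if its complete sequence of areas (including the `NA` padding) matches some row of `df2` exactly, and that the `Patient` column of `df2` is irrelevant to the check. `path_key` and `area_cols` are hypothetical names introduced here.

```r
# Example data copied from the question
dfclean <- data.frame(Patient = c(1,2,3,4,5),
                 Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                 Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                 Area3 = c("Area B", "Ward", "Area B", "Area A", NA),
                 Area4 = c("Ward", NA, "Ward", "Area C", NA),
                 Area5 = c(NA, NA, NA, "Arrival", NA))
df2 <- data.frame(Patient = c(1,2,3,4,5),
                 Area1 = c("Arrival1", "Arrival1", "Arrival2", "Arrival1", "Arrival2"),
                 Area2 = c("Area A", "Diagnostics", "Area A", "Area B", NA),
                 Area3 = c("Area B", "Area A", "Area B", "Area A", NA),
                 Area4 = c("Ward", "Ward", "Ward", "Area C", NA),
                 Area5 = c(NA, NA, NA, NA, NA))

# Columns that make up a pathway (assumed to be the same in both tables)
area_cols <- paste0("Area", 1:5)

# Hypothetical helper: collapse each row's areas into one string,
# e.g. "Arrival1 -> Area A -> Area B -> Ward -> NA"
path_key <- function(d) do.call(paste, c(d[area_cols], sep = " -> "))

# A pathway is flagged TRUE if its key appears among the allowed keys
dfclean$valid <- path_key(dfclean) %in% path_key(df2)
dfclean[, c("Patient", "valid")]
#   Patient valid
# 1       1  TRUE
# 2       2 FALSE
# 3       3  TRUE
# 4       4 FALSE
# 5       5  TRUE
```

This agrees with the expectation in the question (patient 1 valid, patient 4 invalid). If validity should instead be judged transition by transition, or if shorter prefixes of an allowed pathway should also count, the matching rule would need to change accordingly.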