rdataframesequenceinformation-extractionidentify

Identify sequences in alphanumeric strings in R


I am attempting to create a flag for when transaction IDs are sequential. For reasons that I will not get into here, these can be a red flag. The problem that I am having is that the IDs are not standardized. Some can be numeric only and others are alphanumeric. I want to identify cases where a vendor's transactions are sequential for X number (how many in a row TBD) of transactions. The data will be grouped by vendor and I'd like to extract all rows and columns for instances where a sequence is present. A much simplified example below:

df <- read.table(text=
"Vendor        'Transaction ID'
 ACME               1
 ACME               2
 ACME               3
 JDOE               A1
 JDOE               A6
 JDOE               A10
 XYZ                B12
 XYZ                B13
 XYZ                B14", header=TRUE)

In this instance I would want to extract the Vendor and Transaction ID for all ACME and all XYZ rows since they are in a sequence. I've done some research but not found an approach that seems to fit what I'm trying to do.


Solution

  • One option is to convert the Transaction_ID to a number using parse_number, then you can create a grouping column for consecutive values within each group. Then, you can filter depending on how many consecutive numbers you want to have. Here, I use greater than 1, but you can update the filter statement depending on your desired criteria. Then, I ungroup and remove the extra columns - Transaction_ID2 and grp.

    library(dplyr)
    library(readr)
    
    df %>%
      mutate(Transaction_ID2 = parse_number(Transaction_ID)) %>%
      group_by(Vendor, grp = cumsum(c(1, diff(Transaction_ID2) != 1))) %>%
      filter(n() > 1) %>% 
      ungroup %>% 
      select(-c(Transaction_ID2, grp))
    

    Output

      Vendor Transaction_ID
      <chr>  <chr>         
    1 ACME   1             
    2 ACME   2             
    3 ACME   3             
    4 XYZ    B12           
    5 XYZ    B13           
    6 XYZ    B14