I am attempting to create a flag for when transaction IDs are sequential. For reasons that I will not get into here, these can be a red flag. The problem that I am having is that the IDs are not standardized. Some can be numeric only and others are alphanumeric. I want to identify cases where a vendor's transactions are sequential for X number (how many in a row TBD) of transactions. The data will be grouped by vendor and I'd like to extract all rows and columns for instances where a sequence is present. A much simplified example below:
df <- read.table(text=
"Vendor 'Transaction ID'
ACME 1
ACME 2
ACME 3
JDOE A1
JDOE A6
JDOE A10
XYZ B12
XYZ B13
XYZ B14", header=TRUE)
In this instance I would want to extract the Vendor and Transaction ID for all ACME and all XYZ rows since they are in a sequence. I've done some research but not found an approach that seems to fit what I'm trying to do.
One option is to convert the Transaction_ID
to a number using parse_number
, then you can create a grouping column for consecutive values within each group. Then, you can filter
depending on how many consecutive numbers you want to have. Here, I use greater than 1, but you can update the filter
statement depending on your desired criteria. Then, I ungroup and remove the extra columns - Transaction_ID2
and grp
.
library(dplyr)
library(readr)
df %>%
mutate(Transaction_ID2 = parse_number(Transaction_ID)) %>%
group_by(Vendor, grp = cumsum(c(1, diff(Transaction_ID2) != 1))) %>%
filter(n() > 1) %>%
ungroup %>%
select(-c(Transaction_ID2, grp))
Output
Vendor Transaction_ID
<chr> <chr>
1 ACME 1
2 ACME 2
3 ACME 3
4 XYZ B12
5 XYZ B13
6 XYZ B14