I have a file that sometimes has additional values. Most people have a single SSN, but some people have multiple SSNs and they are separated on multiple lines. What is the regex I need for this? My desired output is an R list of SSNs.
The source file was read in using read_lines() as object f:
dput(f)
c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766",
" 123456777", " 123456788", " 123456799",
" 123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------",
"DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane",
"")
and it appears in the text file as:
------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane
My current regex is: "SSN:\s+\d{9}(\n\s\d{9})?" and I've tried various regex options like dotall and multiline without success.
Regarding output structure, my preference is as simple as possible, ideally as a data.frame. (I'm looking for suggestions from you as to best practices). I realize that's awkward for multiples in a data.frame as it would have to extend across multiple columns. Otherwise a list would be able to handle multiples, or a single cell in a data.frame separated by commas.
Thanks
For the given format, you can use read.fwf
library(tidyverse)
read.fwf(textConnection(text), widths = c(8, 25)) %>%
filter(!grepl("^--", V1)) %>% na.omit() %>%
mutate(Phase = cumsum(grepl("DOB:", V1)),
V1 = gsub(":","",na_if(trimws(V1), ""))) %>% fill(V1)
V1 V2 Phase
1 DOB 12/23/23 1
2 SSN 123456766 1
3 SSN 123456777 1
4 SSN 123456788 1
5 SSN 123456799 1
6 SSN 123456700 1
7 Address 5 Green Lane 1
10 DOB 12/33/23 2
11 SSN 223456766 2
12 Address 22 Blue Lane 2
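If you only need the SSNs from that long table, you can filter and group them by Phase. The following is a minimal sketch in base R; `res` is a hypothetical name for the result above, re-typed here from the printed output so the snippet runs on its own.

```r
# `res` stands in for the data frame printed above (re-typed by hand).
res <- data.frame(
  V1 = c("DOB", rep("SSN", 5), "Address", "DOB", "SSN", "Address"),
  V2 = c("12/23/23", "123456766", "123456777", "123456788", "123456799",
         "123456700", "5 Green Lane", "12/33/23", "223456766", "22 Blue Lane"),
  Phase = c(rep(1, 7), rep(2, 3)),
  stringsAsFactors = FALSE
)

# Keep only the SSN rows.
ssn_rows <- res[res$V1 == "SSN", ]

# Option 1: a list with one character vector of SSNs per Phase.
ssn_by_phase <- split(ssn_rows$V2, ssn_rows$Phase)

# Option 2: one row per Phase, SSNs collapsed into a comma-separated cell.
ssn_wide <- aggregate(V2 ~ Phase, data = ssn_rows, FUN = paste, collapse = ", ")
```

The list keeps every SSN as its own element, which is usually easier to work with later. The comma-separated column is closer to the "single cell" layout mentioned in the question.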
This question was originally about the SSNs, not the whole data structure. Assuming you only want the SSNs as a vector, you can use grep to find the lines that start with "SSN:" and "Address:", take the lines from each "SSN:" line up to the line before the matching "Address:" line, and then use gsub to remove "SSN:" and any run of whitespace (\\s+):
lines <- readLines(textConnection(text))
unlist(mapply(function(start, end) {
gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
}, grep("SSN:", lines), grep("Address:", lines)))
[1] "123456766" "123456777" "123456788" "123456799" "123456700" "223456766"
Or, to get a list, remove the unlist():
mapply(function(start, end) {
gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
}, grep("SSN:", lines), grep("Address:", lines))
[[1]]
[1] "123456766" "123456777" "123456788" "123456799" "123456700"
[[2]]
[1] "223456766"
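One caveat: mapply() simplifies its result by default, so if every record happened to have the same number of SSNs you would get a vector or matrix back instead of a list. A sketch of the same idea with Map(), which always returns a list, is below. The input vector is copied from the dput(f) in the question, and the names `starts`, `ends` and `ssn_list` are only illustrative.

```r
# Input copied from the dput(f) output in the question.
lines <- c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766",
           " 123456777", " 123456788", " 123456799",
           " 123456700", "Address: 5 Green Lane", "", "",
           "------------ Phase 2 ------------",
           "DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane", "")

# Positions of the first SSN line and of the Address line of each record.
# This assumes every record has exactly one "SSN:" line followed by one
# "Address:" line, as in the sample data.
starts <- grep("^SSN:", lines)
ends   <- grep("^Address:", lines)

# Map() never simplifies, so the result is a list regardless of how many
# SSNs each record contains.
ssn_list <- Map(function(start, end) {
  block <- lines[start:(end - 1)]            # SSN line plus continuation lines
  gsub("\\s+", "", sub("^SSN:", "", block))  # drop the label and whitespace
}, starts, ends)
```

Calling mapply(..., SIMPLIFY = FALSE) would have the same effect.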
Data used above:
text <- "------------ Phase 1 ------------
DOB: 12/23/23
SSN: 123456766
123456777
123456788
123456799
123456700
Address: 5 Green Lane
------------ Phase 2 ------------
DOB: 12/33/23
SSN: 223456766
Address: 22 Blue Lane "
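As for why the regex in the question never matches across lines: read_lines() returns one vector element per line, so no element contains a "\n", and the multiline and dotall flags cannot change that. The pattern also ends in `?`, which allows at most one continuation line. If you prefer the regex route, one option is to collapse the vector into a single string first. The sketch below does that in base R; `f` is copied from the dput in the question, and `txt`, `blocks` and `ssn_list` are illustrative names.

```r
# Input copied from the dput(f) output in the question.
f <- c("------------ Phase 1 ------------", "DOB: 12/23/23", "SSN: 123456766",
       " 123456777", " 123456788", " 123456799",
       " 123456700", "Address: 5 Green Lane", "", "",
       "------------ Phase 2 ------------",
       "DOB: 12/33/23", "SSN: 223456766", "Address: 22 Blue Lane", "")

# Put the newlines back so a single pattern can span several lines.
txt <- paste(f, collapse = "\n")

# "SSN:" plus nine digits, followed by zero or more continuation lines
# (newline, whitespace, nine digits). `*` replaces the original `?`.
pattern <- "SSN:\\s+\\d{9}(\\n\\s+\\d{9})*"
blocks  <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Pull the individual nine-digit numbers out of each matched block.
ssn_list <- regmatches(blocks, gregexpr("\\d{9}", blocks))
```

Here `ssn_list` is a list with one character vector per "SSN:" block, which matches the list output asked for in the question.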