
Regex: finding values across multiple lines


I have a file that sometimes has additional values. Most people have a single SSN, but some people have multiple SSNs and they are separated on multiple lines. What is the regex I need for this? My desired output is an R list of SSNs.

The source file was read in using read_lines() as object f:

dput(f)
c("------------ Phase 1 ------------", "DOB:    12/23/23", "SSN:    123456766", 
"        123456777", "        123456788", "        123456799", 
"        123456700", "Address: 5 Green Lane", "", "", "------------ Phase 2 ------------", 
"DOB:    12/33/23", "SSN:    223456766", "Address: 22 Blue Lane", 
"")

and it appears in the text file as:

------------ Phase 1 ------------  
DOB:    12/23/23  
SSN:    123456766  
        123456777  
        123456788  
        123456799  
        123456700  
Address: 5 Green Lane  
  
  
------------ Phase 2 ------------  
DOB:    12/33/23  
SSN:    223456766  
Address: 22 Blue Lane  

My current regex is: "SSN:\s+\d{9}(\n\s\d{9})?" and I've tried various regex options like dotall and multiline without success.

For the output structure, I'd prefer something as simple as possible, ideally a data.frame (suggestions on best practice are welcome). I realize a data.frame is awkward for multiple values, since they would have to spread across several columns. The alternatives would be a list, which handles multiples naturally, or a single data.frame cell with the values separated by commas.

Thanks


Solution

  • 1 read.fwf

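    The file has a fixed-width layout, so you can read it with read.fwf, splitting each line into an 8-character label column (V1) and a value column (V2).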

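    library(tidyverse)
    # `text` is the sample input defined under "Test data" below
    read.fwf(textConnection(text), widths = c(8, 25)) %>%
      filter(!grepl("^--", V1)) %>%              # drop the "---- Phase n ----" separators
      na.omit() %>%                              # drop the blank lines, whose value column is NA
      mutate(Phase = cumsum(grepl("DOB:", V1)),  # record counter: goes up at each DOB line
             V1 = gsub(":", "", na_if(trimws(V1), ""))) %>%
      fill(V1)                                   # carry the "SSN" label down to the continuation lines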
      
    
            V1              V2 Phase
    1      DOB      12/23/23       1
    2      SSN     123456766       1
    3      SSN     123456777       1
    4      SSN     123456788       1
    5      SSN     123456799       1
    6      SSN     123456700       1
    7  Address  5 Green Lane       1
    10     DOB      12/33/23       2
    11     SSN     223456766       2
    12 Address  22 Blue Lane       2
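
If one row per phase is preferred, with multiple SSNs in a single comma-separated cell as the question suggests, a base R sketch working directly on the character vector f from the question could look like this. The names is_ssn, phase, by_phase and res are made up for illustration, and the code assumes that every record starts with a "---- Phase n ----" separator line and that every SSN is exactly nine digits.

```r
# `f` as given by dput(f) in the question: one element per line
f <- c("------------ Phase 1 ------------", "DOB:    12/23/23", "SSN:    123456766",
       "        123456777", "        123456788", "        123456799",
       "        123456700", "Address: 5 Green Lane", "", "",
       "------------ Phase 2 ------------", "DOB:    12/33/23",
       "SSN:    223456766", "Address: 22 Blue Lane", "")

# TRUE for "SSN:" lines and for the indented continuation lines
is_ssn <- grepl("^(SSN:)?\\s+\\d{9}\\s*$", f)

# running record counter: goes up by one at each separator line
phase <- cumsum(grepl("^-+ Phase", f))

# keep only the digits, then group the numbers by record
by_phase <- split(gsub("\\D", "", f[is_ssn]), phase[is_ssn])

# one row per record, SSNs collapsed into a single comma-separated cell
res <- data.frame(
  Phase = as.integer(names(by_phase)),
  SSN = vapply(by_phase, paste, character(1), collapse = ", ", USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
print(res)
```

by_phase is itself a named list with one character vector per record, so it can be used as is if a list is preferred over the collapsed cell.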
    

    2 gsub

    This question was originally about the SSNs, not about the whole data structure. Assuming you only want the SSNs as a vector, you can use grep to find the lines containing "SSN:" and "Address:", take the lines from each "SSN:" line up to (but not including) the following "Address:" line, and strip "SSN:" and any run of whitespace (\\s+) with gsub. This relies on every record having an "Address:" line directly after its SSN block.

    lines <- readLines(textConnection(text))
    
    unlist(mapply(function(start, end) {
      gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
    }, grep("SSN:", lines), grep("Address:", lines)))
    
    [1] "123456766" "123456777" "123456788" "123456799" "123456700" "223456766"
    

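    Or, to get a list with one element per record, drop the unlist() and pass SIMPLIFY = FALSE so that mapply always returns a list

    mapply(function(start, end) {
      gsub("\\s+", "", gsub("SSN:", "", lines[start:(end - 1)]))
    }, grep("SSN:", lines), grep("Address:", lines), SIMPLIFY = FALSE)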
    
    [[1]]
    [1] "123456766" "123456777" "123456788" "123456799" "123456700"
    
    [[2]]
    [1] "223456766"
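
As for the regex in the question: it cannot match across lines because read_lines() returns one string per line, so no element of f contains a "\n" (and \n\s\d{9} followed by ? would allow only a single space and a single continuation line anyway). A sketch of a regex-only route, assuming the layout shown in the question and no trailing whitespace in f: paste the lines back into one string, match each SSN block, then pull the nine-digit numbers out of each block. The names txt, blocks and ssn_list are illustrative.

```r
# `f` as given by dput(f) in the question: one element per line
f <- c("------------ Phase 1 ------------", "DOB:    12/23/23", "SSN:    123456766",
       "        123456777", "        123456788", "        123456799",
       "        123456700", "Address: 5 Green Lane", "", "",
       "------------ Phase 2 ------------", "DOB:    12/33/23",
       "SSN:    223456766", "Address: 22 Blue Lane", "")

# join the lines so that "\n" actually exists in the string being searched
txt <- paste(f, collapse = "\n")

# one match per record: "SSN:", the first number, then any number of
# continuation lines made of whitespace followed by nine digits
block_pattern <- "SSN:\\s+\\d{9}(\\n\\s+\\d{9})*"
blocks <- regmatches(txt, gregexpr(block_pattern, txt, perl = TRUE))[[1]]

# extract the individual nine-digit numbers from each block
ssn_list <- regmatches(blocks, gregexpr("\\d{9}", blocks, perl = TRUE))
print(ssn_list)
```

ssn_list has one character vector per record; wrapping it in unlist() gives a single flat vector instead.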
    

    Test data

    text <- "------------ Phase 1 ------------  
    DOB:    12/23/23  
    SSN:    123456766  
            123456777  
            123456788  
            123456799  
            123456700  
    Address: 5 Green Lane  
      
      
    ------------ Phase 2 ------------  
    DOB:    12/33/23  
    SSN:    223456766  
    Address: 22 Blue Lane  "