regexrubybanking

Regex for bank transaction parsing


How do I parse and extract the 4 important columns from a text table of the following format? These are bank transaction line items extracted from a PDF using Ruby's pdf-reader package - as you can see the spacing between columns is very irregular between various columns.

11/4                      Stripe Transfer St-XYZ Agnostic Computers                      582.30
11/4                      Recurring Payment authorized on 11/01 Digitalocean.Com                                           12.00
11/4                      Purchase authorized on 11/01 Google *Gsuite_Get                                                  24.00
11/4                      Purchase authorized on 11/02 Amazon Web Service                                                 460.15
11/4                      Purchase authorized on 11/02 Amazon Web Service                                                   8.07           2,903.09
11/5                      Recurring Payment authorized on 11/03 Atlassian                                                  15.00           2,888.09
11/6                      Recurring Payment authorized on 11/04 Pipedrive Inc NY NY                                        24.00           2,864.09
11/12                     Foobar Retail Dis 211011 ABCDEFGH                            8,031.44
11/12                     Wire Trans Svc Charge - Sequence: 999999999999 Srf#                                              45.00
11/12                     WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC                                             5,000.00           5,850.53
11/14                      Purchase authorized on 11/13 Microconf Microconf.Com MN                                            100.00           5,702.53

The above transactions are extracted from a bank PDF with the following visual layout Table column names

Need to parse the bold colums via a regexp:

  1. Date - dd/mm format - always present
  2. Check number - always empty and may be ignored (alphanumeric single word?)
  3. Description - Text with dates, numbers, special characters - always present
  4. Credits - currency amount (only for deposits)
  5. Debits - currency amount (only for payments)
  6. Balance - currency amount (appears sporadically, not important)

I have only been able to get as far as /^(\d{1,2}\/\d{1,2})\s+/mg for extracting the mm/dd. Should I start chomping the amounts from the right, but then there are no clear seperator patterns!


Solution

  • The spacing between the columns is irregular but always seems to be more than 2. In that case you can use 3 capture group and an optional 4th part with also a capture group for the Debits part.

    ^(\d{1,2}\/\d{1,2})\s{2,}(\S.*?)\s{2,}(\d{1,3}(?:,\d{3})*\.\d{2})(?:\s{2,}(\d{1,3}(?:,\d{3})*\.\d{2}))?
    

    In parts the pattern matches:

    See a rubular regex demo