How do I parse and extract the 4 important columns from a text table of the following format? These are bank transaction line items extracted from a PDF using Ruby's pdf-reader
package - as you can see the spacing between columns is very irregular between various columns.
11/4 Stripe Transfer St-XYZ Agnostic Computers 582.30
11/4 Recurring Payment authorized on 11/01 Digitalocean.Com 12.00
11/4 Purchase authorized on 11/01 Google *Gsuite_Get 24.00
11/4 Purchase authorized on 11/02 Amazon Web Service 460.15
11/4 Purchase authorized on 11/02 Amazon Web Service 8.07 2,903.09
11/5 Recurring Payment authorized on 11/03 Atlassian 15.00 2,888.09
11/6 Recurring Payment authorized on 11/04 Pipedrive Inc NY NY 24.00 2,864.09
11/12 Foobar Retail Dis 211011 ABCDEFGH 8,031.44
11/12 Wire Trans Svc Charge - Sequence: 999999999999 Srf# 45.00
11/12 WT 211012-999999 ABCD Bank Limited /Bnf=FOOBARINC 5,000.00 5,850.53
11/14 Purchase authorized on 11/13 Microconf Microconf.Com MN 100.00 5,702.53
The above transactions are extracted from a bank PDF with the following visual layout
Need to parse the bold colums via a regexp:
I have only been able to get as far as /^(\d{1,2}\/\d{1,2})\s+/mg
for extracting the mm/dd. Should I start chomping the amounts from the right, but then there are no clear seperator patterns!
The spacing between the columns is irregular but always seems to be more than 2. In that case you can use 3 capture group and an optional 4th part with also a capture group for the Debits part.
^(\d{1,2}\/\d{1,2})\s{2,}(\S.*?)\s{2,}(\d{1,3}(?:,\d{3})*\.\d{2})(?:\s{2,}(\d{1,3}(?:,\d{3})*\.\d{2}))?
In parts the pattern matches:
^
Start of string(\d{1,2}\/\d{1,2})\s{2,}
Capture group 1 Match 1,2 digits /
1,2 digits and 2 or more whitespace chars(\S.*?)\s{2,}
Capture group 2 Match at least a single non whitespace chars and as least as possible characters until the next occurrence of 2 or more whitespace chars(\d{1,3}(?:,\d{3})*\.\d{2})
Capture group 3 Match the number format(?:
Non capture group
\s{2,}
Match 2 or more whitespace chars(\d{1,3}(?:,\d{3})*\.\d{2})
Capture group 4, match the number format)?
Close the non capture group and make it optionalSee a rubular regex demo