I am trying to implement a parsing function that will grab data from parts of a .txt file created using pdftotext. I hate PDFs! Essentially, I use pdftotext on a PDF file using the -raw option and I get a file like this:
SPORTS FANZ Order #62659
June 24, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Virginia Tech Hokies Basketball Cassell Coliseum Panoramic
Picture
Virginia Tech Hokies Panoramic Picture Select
VAT5M
1 of 1
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States
Example with phone number and quantity of 2:
SPORTS FANZ Order #12345
June 24, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
(123) 123-4567
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Virginia Tech Hokies Basketball Cassell Coliseum Panoramic
Picture
Virginia Tech Hokies Panoramic Picture Select
VAT5M
2 of 2
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States
Example with phone number (different format) and two SKUs:
SPORTS FANZ Order #58083
January 6, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
+12345678900
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Nebraska Cornhuskers Women's Volleyball Devaney Center Panoramic Picture
Nebraska Cornhuskers Panoramic Picture Select Frame
UNE11M
1 of 1
Kansas City Chiefs Super Bowl 54 Champions Panoramic Picture
Kansas City Chiefs SB 54 Champions Panoramic Picture Unframed
NFLSBC20CHF
1 of 1
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States
I've implemented the following code already to grab some of the data:
Function for grabbing text between two strings
Public Function SuperMid(ByVal strMain As String, str1 As String, str2 As String, Optional reverse As Boolean) As String
'DESCRIPTION: Extract the portion of a string between the two substrings defined in str1 and str2.
'DEVELOPER: Ryan Wells (wellsr.com)
'HOW TO USE: - Pass the argument your main string and the 2 strings you want to find in the main string.
' - This function will extract the values between the end of your first string and the beginning
' of your next string.
' - If the optional boolean "reverse" is true, an InStrRev search will occur to find the last
' instance of the substrings in your main string.
Dim i As Integer, j As Integer, temp As Variant
On Error GoTo errhandler:
If reverse = True Then
i = InStrRev(strMain, str1)
j = InStrRev(strMain, str2)
If Abs(j - i) < Len(str1) Then j = InStrRev(strMain, str2, i)
If i = j Then 'try to search 2nd half of string for unique match
j = InStrRev(strMain, str2, i - 1)
End If
Else
i = InStr(1, strMain, str1)
j = InStr(1, strMain, str2)
If Abs(j - i) < Len(str1) Then j = InStr(i + Len(str1), strMain, str2)
If i = j Then 'try to search 2nd half of string for unique match
j = InStr(i + 1, strMain, str2)
End If
End If
If i = 0 And j = 0 Then GoTo errhandler:
If j = 0 Then j = Len(strMain) + Len(str2) 'just to make it arbitrarily large
If i = 0 Then i = Len(strMain) + Len(str1) 'just to make it arbitrarily large
If i > j And j <> 0 Then 'swap order
temp = j
j = i
i = temp
temp = str2
str2 = str1
str1 = temp
End If
i = i + Len(str1)
SuperMid = Mid(strMain, i, j - i)
Exit Function
errhandler:
MsgBox "Error extracting strings. Check your input" & vbNewLine & vbNewLine & "Aborting", , "Strings not found"
End
End Function
Extraction Sub
Sub extractPDF()
Dim phoneNumber, shippingInfo, shippingAddress, itemInfo, poNumber As String
Dim iTxtFile As Integer
Dim strFile As String
Dim strFileText As String
strFile = "C:\blah\blah\blah\#62875.txt"
iTxtFile = FreeFile
Open strFile For Input As FreeFile
strFileText = Input(LOF(iTxtFile), iTxtFile)
Close iTxtFile
Dim regexPattern As String
Dim regex As Object
Dim matches As Object
Dim match As Object
' Regular expression pattern
regexPattern = "Order #\d{5}"
' Create a regular expression object
Set regex = CreateObject("VBScript.RegExp")
' Set the pattern and ignore case
With regex
.Pattern = regexPattern
.IgnoreCase = True
End With
' Perform the search
Set matches = regex.Execute(strFileText)
' Loop through the matches
For Each match In matches
' Print the matched value
poNumber = Right(match, 5)
Next match
shippingInfo = SuperMid(strFileText, "SHIP TO", "BILL TO")
shippingAddress = SuperMid(shippingInfo, "", "United States")
phoneNumber = Application.WorksheetFunction.Clean(SuperMid(shippingInfo, "United States", "BILL TO"))
itemInfo = SuperMid(strFileText, "ITEMS QUANTITY", "Thank you for shopping with us!")
Debug.Print "PO #: " & poNumber
Debug.Print "Phone Number: " & phoneNumber
Debug.Print shippingAddress
Debug.Print itemInfo
End Sub
This gets me the shipping info, which I further break down into shipping address and phone number (if applicable), PO #, and the block of text containing the item information. What I'm struggling with is how to extract SKU and quantity data from the itemInfo block. Based on previous PDFs, the SKU line is always followed by the quantity line. So, in this example, SKU is VAR5M and quantity is 1 (if it was 2 it would say 2 of 2). Any ideas on the best way to implement what I need? Is there a better way to implement my needs than what I've already designed? Thanks for your help.
If you have a text string stored in cell A1 and you would get SKU and Quantity with following code.
Sub Demo()
Dim objRegExp As Object
Dim objMatches As Object
Set objRegExp = CreateObject("vbscript.regexp")
With objRegExp
.IgnoreCase = True
.Global = True
.Pattern = "([A-Z0-9]+)\s*(\d+) of \d+"
If .Test([a1]) Then
Set objMatches = objRegExp.Execute([a1])
For Each objMtch In objMatches
With objMtch.submatches
If .Count = 2 Then
SKU = .Item(0)
QTY = .Item(1)
Debug.Print "SKU:" & SKU & vbNewLine _
& "Quantity:" & QTY
End If
End With
Next
End If
End With
Set objMatches = Nothing
Set objRegExp = Nothing
End Sub