excelvbapdftotext

Parsing data from pdftotext .txt file output using VBA


I am trying to implement a parsing function that will grab data from parts of a .txt file created using pdftotext. I hate PDFs! Essentially, I use pdftotext on a PDF file using the -raw option and I get a file like this:

SPORTS FANZ Order #62659
June 24, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Virginia Tech Hokies Basketball Cassell Coliseum Panoramic
Picture
Virginia Tech Hokies Panoramic Picture Select
VAT5M
1 of 1
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States

Example with phone number and quantity of 2:

SPORTS FANZ Order #12345
June 24, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
(123) 123-4567
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Virginia Tech Hokies Basketball Cassell Coliseum Panoramic
Picture
Virginia Tech Hokies Panoramic Picture Select
VAT5M
2 of 2
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States

Example with phone number (different format) and two SKUs:

SPORTS FANZ Order #58083
January 6, 2023
SHIP TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
+12345678900
BILL TO
Abe Lincoln
123 Main St
New York, NY 12345
United States
ITEMS QUANTITY
Nebraska Cornhuskers Women's Volleyball Devaney Center Panoramic Picture
Nebraska Cornhuskers Panoramic Picture Select Frame
UNE11M
1 of 1
Kansas City Chiefs Super Bowl 54 Champions Panoramic Picture
Kansas City Chiefs SB 54 Champions Panoramic Picture Unframed
NFLSBC20CHF
1 of 1
Thank you for shopping with us!
Sports Fanz
123 Liberty St, Chester NY 12345, United States

I've implemented the following code already to grab some of the data:

Function for grabbing text between two strings

Public Function SuperMid(ByVal strMain As String, str1 As String, str2 As String, Optional reverse As Boolean) As String
'DESCRIPTION: Extract the portion of a string between the two substrings defined in str1 and str2.
'DEVELOPER: Ryan Wells (wellsr.com)
'HOW TO USE: - Pass the argument your main string and the 2 strings you want to find in the main string.
' - This function will extract the values between the end of your first string and the beginning
' of your next string.
' - If the optional boolean "reverse" is true, an InStrRev search will occur to find the last
' instance of the substrings in your main string.
Dim i As Integer, j As Integer, temp As Variant
On Error GoTo errhandler:
If reverse = True Then
    i = InStrRev(strMain, str1)
    j = InStrRev(strMain, str2)
    If Abs(j - i) < Len(str1) Then j = InStrRev(strMain, str2, i)
    If i = j Then 'try to search 2nd half of string for unique match
        j = InStrRev(strMain, str2, i - 1)
    End If
Else
    i = InStr(1, strMain, str1)
    j = InStr(1, strMain, str2)
    If Abs(j - i) < Len(str1) Then j = InStr(i + Len(str1), strMain, str2)
    If i = j Then 'try to search 2nd half of string for unique match
        j = InStr(i + 1, strMain, str2)
    End If
End If
If i = 0 And j = 0 Then GoTo errhandler:
If j = 0 Then j = Len(strMain) + Len(str2) 'just to make it arbitrarily large
If i = 0 Then i = Len(strMain) + Len(str1) 'just to make it arbitrarily large
If i > j And j <> 0 Then 'swap order
    temp = j
    j = i
    i = temp
    temp = str2
    str2 = str1
    str1 = temp
End If
i = i + Len(str1)
SuperMid = Mid(strMain, i, j - i)
Exit Function
errhandler:
MsgBox "Error extracting strings. Check your input" & vbNewLine & vbNewLine & "Aborting", , "Strings not found"
End
End Function

Extraction Sub

Sub extractPDF()
    
    Dim phoneNumber, shippingInfo, shippingAddress, itemInfo, poNumber As String
    Dim iTxtFile As Integer
    Dim strFile As String
    Dim strFileText As String
    strFile = "C:\blah\blah\blah\#62875.txt"
    iTxtFile = FreeFile
    Open strFile For Input As FreeFile
    strFileText = Input(LOF(iTxtFile), iTxtFile)
    Close iTxtFile
    
    Dim regexPattern As String
    Dim regex As Object
    Dim matches As Object
    Dim match As Object
    
    ' Regular expression pattern
    regexPattern = "Order #\d{5}"
    
    ' Create a regular expression object
    Set regex = CreateObject("VBScript.RegExp")
    
    ' Set the pattern and ignore case
    With regex
        .Pattern = regexPattern
        .IgnoreCase = True
    End With
    
    ' Perform the search
    Set matches = regex.Execute(strFileText)
    
    ' Loop through the matches
    For Each match In matches
        ' Print the matched value
        poNumber = Right(match, 5)
    Next match
    
    shippingInfo = SuperMid(strFileText, "SHIP TO", "BILL TO")
    shippingAddress = SuperMid(shippingInfo, "", "United States")
    phoneNumber = Application.WorksheetFunction.Clean(SuperMid(shippingInfo, "United States", "BILL TO"))
    itemInfo = SuperMid(strFileText, "ITEMS QUANTITY", "Thank you for shopping with us!")
    Debug.Print "PO #: " & poNumber
    Debug.Print "Phone Number: " & phoneNumber
    Debug.Print shippingAddress
    Debug.Print itemInfo

End Sub

This gets me the shipping info, which I further break down into shipping address and phone number (if applicable), PO #, and the block of text containing the item information. What I'm struggling with is how to extract SKU and quantity data from the itemInfo block. Based on previous PDFs, the SKU line is always followed by the quantity line. So, in this example, SKU is VAR5M and quantity is 1 (if it was 2 it would say 2 of 2). Any ideas on the best way to implement what I need? Is there a better way to implement my needs than what I've already designed? Thanks for your help.


Solution

  • If you have a text string stored in cell A1 and you would get SKU and Quantity with following code.

    Sub Demo()
        Dim objRegExp As Object
        Dim objMatches As Object
        Set objRegExp = CreateObject("vbscript.regexp")
        With objRegExp
            .IgnoreCase = True
            .Global = True
            .Pattern = "([A-Z0-9]+)\s*(\d+) of \d+"
            If .Test([a1]) Then
                Set objMatches = objRegExp.Execute([a1])
                For Each objMtch In objMatches
                    With objMtch.submatches
                        If .Count = 2 Then
                            SKU = .Item(0)
                            QTY = .Item(1)
                            Debug.Print "SKU:" & SKU & vbNewLine _
                                & "Quantity:" & QTY
                        End If
                    End With
                Next
            End If
        End With
        Set objMatches = Nothing
        Set objRegExp = Nothing
    End Sub