excelvbaacrobatpdf-extraction

Convert PDF to text file using VBA and Adobe Acrobat XI standard


Part 3 of a previous post.

The task: I am attempting to iterate over a series of URLs presented in excel and generate complete text files for each.

So far: The VBA solution in the previous post uses Word to open the PDF Url, and from this generate a text file. This is a much better improvement than using power query. However, I have come across one instance where it fails as it recognises the text on the first page as an image; therefore, this text is omitted in the generated text file.

As an example of a truncated text file generated: https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b is an example where the generated PDF text file has text missing.

As advised, I have downloaded a version of Adobe Acrobat (Adobe Acrobat XI Standard), which enables Adobe Acrobat 10.0 Type Library Reference (and a bunch of others). I hope this will enable more accurate text conversion but will need help to develop this script.

So, for this problem, I wish to modify the following code to swap Word for Abode to perform the conversion. What I am unsure of, however, is whether Adobe can open a PDF from a URL in the same way word can. I think I may need to introduce some steps to download the PDF first in the folder and then for each convert those. Ideally, It would work as the above code but if this is necessary, then so be it.

M Code from the previous answer:

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
    
    Set fso = New FileSystemObject
    Set oWd = CreateObject("word.application")
    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("D2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            Set oDoc = Nothing
            On Error Resume Next 'ignore error if no document...
            Set oDoc = oWd.Documents.Open(url)
            On Error GoTo 0      'stop ignoring errors
            If Not oDoc Is Nothing Then
                filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
                Debug.Print filePath
                'open text stream as unicode
                Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                fileStream.Write oDoc.Range.Text
                fileStream.Close
                oDoc.Close
                c.Interior.Color = vbGreen 'flag OK
            Else
                c.Interior.Color = vbRed   'flag problem
            End If
        End If 'have url
    Next c
    
    oWd.Quit
End Sub

As for developing this script, I have the following from what I have found online, which now successfully performs the conversion using Acrobat for the previously difficult PDF so Good start.

Sub convertpdf2()

    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim Filename As String
    Dim jsObj As Object
    Dim NewFileName As String

    Filename = "C:\temp\name1.pdf"
    NewFileName = "C:\temp\name1.txt"

    Set AcroXApp = CreateObject("AcroExch.App")
    'AcroXApp.Show

    Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
    AcroXAVDoc.Open Filename, "Acrobat"

    Set AcroXPDDoc = AcroXAVDoc.GetPDDoc


    Set jsObj = AcroXPDDoc.GetJSObject


    jsObj.SaveAs NewFileName, "com.adobe.acrobat.plain-text"


    AcroXAVDoc.Close False
    AcroXApp.Hide
    AcroXApp.Exit

    End Sub

Where name1 is a downloaded PDF.

So, in essence I am trying to:

I think if I can achieve this, then I will be able to work out how to merge this with the previous post so that it can loop through a given number of URLs to generate the text files as well as handle the non-unicofe characters.

Will update if I make any progress

Setup in Excel: enter image description here

URLs:

https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6

Update: Attempting to add Error handling Highlighting

M Code:

#If VBA7 Then
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
      Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
    Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
    Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
        ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If

Sub Tester1()
    Dim filePath As String
    Dim fso As FileSystemObject, url
    Dim fileStream As TextStream, ws As Worksheet
    Dim c As Range, fileRoot As String, pdfPath As String
    
    
    Dim AcroXApp As Acrobat.AcroApp
    Dim AcroXAVDoc As Acrobat.AcroAVDoc
    Dim AcroXPDDoc As Acrobat.AcroPDDoc
    Dim jsObj As Object

    
    Set ws = Worksheets("Data")     'use a specific worksheet reference
    fileRoot = ws.Range("C2").Value 'read this once
    If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
    
    For Each c In ws.Range("B2:B" & ws.Cells(Rows.count, "B").End(xlUp).row).Cells
        url = Trim(c.Value)
        If LCase(url) Like "http?:*" Then  'has a URL
            
            pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
            
            Debug.Print pdfPath, c.Value
            
            DownloadFile url, pdfPath
            
            
            Set AcroXApp = CreateObject("AcroExch.App")
            Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
            AcroXAVDoc.Open pdfPath, "Acrobat" & c.Offset(0, -1).Value & ".txt"
            Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
        
            If Error <> 0 Then
            
            c.Interior.Color = vbRed   'flag problem
            
            Else
    
            Set jsObj = AcroXPDDoc.GetJSObject
            jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
            AcroXAVDoc.Close False
            AcroXApp.Hide
            AcroXApp.Exit
            
            c.Interior.Color = vbGreen 'flag OK
            
            End If
             
            
        End If 'have url
        
        
    Next c


    kill pdfPath ' toggle on/off to keep PDF


End Sub

Function DownloadFile(sURL, sSaveAs) As Boolean
    DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function

Solution

  • Tested:

    #If VBA7 Then
        Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
          Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
            ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
    #Else
        Private Declare Function URLDownloadToFile Lib "urlmon" _
          Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
            ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
    #End If
    
    Function DownloadFile(sURL, sSaveAs) As Boolean
        DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
    End Function
    
    Sub Tester()
        Dim filePath As String
        Dim fso As FileSystemObject, url
        Dim fileStream As TextStream, ws As Worksheet
        Dim c As Range, fileRoot As String, pdfPath As String, success As Boolean
        
        Set ws = Worksheets("Data")     'use a specific worksheet reference
        fileRoot = ws.Range("D2").Value 'read this once
        If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
        
        For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
            url = Trim(c.Value)
            If LCase(url) Like "http?:*" Then  'has a URL
                
                pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
                DownloadFile url, pdfPath
                
                If IsErrorPage(pdfPath) Then
                    c.Interior.Color = vbRed
                    c.Offset(0, 1).Value = "No PDF returned" 'flag in col C
                Else
                    success = ConvertPdf2(pdfPath, fileRoot & c.Offset(0, -1).Value & ".txt")
                    c.Interior.Color = IIf(success, vbGreen, vbRed)
                    c.Offset(0, 1).Value = IIf(success, "OK", "PDF not openable")
                End If
                
            End If 'have url
        Next c
    End Sub
    
    'returns true if conversion was successful (based on whether `Open` succeeded or not)
    Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
        Dim AcroXApp As Acrobat.AcroApp
        Dim AcroXAVDoc As Acrobat.AcroAVDoc
        Dim AcroXPDDoc As Acrobat.AcroPDDoc
        Dim jsObj As Object, success As Boolean
    
        Set AcroXApp = CreateObject("AcroExch.App")
        Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
        success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
        If success Then
            Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
            Set jsObj = AcroXPDDoc.GetJSObject
            jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
            AcroXAVDoc.Close False
        End If
        AcroXApp.Hide
        AcroXApp.Exit
        ConvertPdf2 = success 'report success/failure
    End Function
    
    'is the supposed PDF actually an HTML error page?
    Function IsErrorPage(path) As Boolean
        Dim txt
        txt = CreateObject("scripting.filesystemobject"). _
                      OpenTextFile(path, 1).ReadAll()
        IsErrorPage = InStr(1, txt, "<!DOCTYPE", vbTextCompare) > 0
    End Function