excelvbaoffice365pdf-scraping

Run-time error '5' VBA when running against specific PDF


I have the following Code in VBA following an answer to my last question, which iterates over a list of URLs and generates a text file using the word to extract the text.

For the following URL however; https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16, the code fails with Run-time error '5' Invalid procedure call or argument.

Whats strange is the text of the PDF prints to the console but wont write to the text file.

I can't quite understand why this happens as the PDF appears to be no different to the others, which are successful.

VBA Code: requires Microsoft Scripting Runtime reference

Sub Tester()

    Dim filePath As String
    Dim fso As FileSystemObject
    Set fso = New FileSystemObject
    Dim fileStream As TextStream
    
    Dim oWd As Object, oDoc As Object, c As Range
    
    Set oWd = CreateObject("word.application")
    
    n = 1
    
    For Each c In Range("B2:B200").Cells
    
    filePath = Range("D2").Value & "\" & Range("A" & n).Value & ".txt"
    
    Debug.Print filePath
    
    Set fileStream = fso.CreateTextFile(filePath)
   
    Debug.Print c.Value
        
        With oWd.Documents.Open(c.Value)

            Debug.Print .Range.Text
            'write to a file...
            fileStream.WriteLine .Range.Text
            fileStream.Close
            
        End With
        
        n = n + 1
        
    Next c
    
    oWd.Quit
    
End Sub

SetUp: enter image description here

URLs:

https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6

So firstly, I think it's likely that the issue with the last URL can be solved. However, I think I also need to introduce some error handling that generates a blank text file and moves on to the next but im not sure how to achieve this.

I am not very competent with VBA, I have specified to repeat for rows B2:B200, but ideally, it would be good if, regardless of how many URLs, it just works to that number.

Also not sure if the logic in my code is particularly robust/if there is a better way to extract text from a URL.

The expected output is:

enter image description here

And here is an example of the generated Text file.

enter image description here


Solution

  • Seems like the document returned from that problem URL contained some characters which couldn't be written to a non-unicode text file.

    See comments inline:

    Sub Tester()
    
        Dim filePath As String
        Dim fso As FileSystemObject, url
        Dim fileStream As TextStream, ws As Worksheet
        Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
        
        Set fso = New FileSystemObject
        Set oWd = CreateObject("word.application")
        
        Set ws = Worksheets("Data")     'use a specific worksheet reference
        fileRoot = ws.Range("D2").Value 'read this once
        If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
        
        For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
            url = Trim(c.Value)
            If LCase(url) Like "http?:*" Then  'has a URL
                Set oDoc = Nothing
                On Error Resume Next 'ignore error if no document...
                Set oDoc = oWd.Documents.Open(url)
                On Error GoTo 0      'stop ignoring errors
                If Not oDoc Is Nothing Then
                    filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
                    Debug.Print filePath
                    'open text stream as unicode
                    Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
                    fileStream.Write oDoc.Range.Text
                    fileStream.Close
                    oDoc.Close
                    c.Interior.Color = vbGreen 'flag OK
                Else
                    c.Interior.Color = vbRed   'flag problem
                End If
            End If 'have url
        Next c
        
        oWd.Quit
    End Sub