Part 3 of a previous post.
The task: I am attempting to iterate over a series of URLs presented in excel and generate complete text files for each.
So far: The VBA solution in the previous post uses Word to open the PDF Url, and from this generate a text file. This is a much better improvement than using power query. However, I have come across one instance where it fails as it recognises the text on the first page as an image; therefore, this text is omitted in the generated text file.
As an example of a truncated text file generated: https://hpvchemicals.oecd.org/ui/handler.axd?id=b4b38713-7580-4843-86fd-614821a6f72b is an example where the generated PDF text file has text missing.
As advised, I have downloaded a version of Adobe Acrobat (Adobe Acrobat XI Standard), which enables Adobe Acrobat 10.0 Type Library
Reference (and a bunch of others). I hope this will enable more accurate text conversion but will need help to develop this script.
So, for this problem, I wish to modify the following code to swap Word for Abode to perform the conversion. What I am unsure of, however, is whether Adobe can open a PDF from a URL in the same way word can. I think I may need to introduce some steps to download the PDF first in the folder and then for each convert those. Ideally, It would work as the above code but if this is necessary, then so be it.
M Code from the previous answer:
Sub Tester()
Dim filePath As String
Dim fso As FileSystemObject, url
Dim fileStream As TextStream, ws As Worksheet
Dim oWd As Object, oDoc As Object, c As Range, fileRoot As String
Set fso = New FileSystemObject
Set oWd = CreateObject("word.application")
Set ws = Worksheets("Data") 'use a specific worksheet reference
fileRoot = ws.Range("D2").Value 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
url = Trim(c.Value)
If LCase(url) Like "http?:*" Then 'has a URL
Set oDoc = Nothing
On Error Resume Next 'ignore error if no document...
Set oDoc = oWd.Documents.Open(url)
On Error GoTo 0 'stop ignoring errors
If Not oDoc Is Nothing Then
filePath = fileRoot & c.Offset(0, -1).Value & ".txt" 'filename from ColA
Debug.Print filePath
'open text stream as unicode
Set fileStream = fso.CreateTextFile(filePath, overwrite:=True, Unicode:=True)
fileStream.Write oDoc.Range.Text
fileStream.Close
oDoc.Close
c.Interior.Color = vbGreen 'flag OK
Else
c.Interior.Color = vbRed 'flag problem
End If
End If 'have url
Next c
oWd.Quit
End Sub
As for developing this script, I have the following from what I have found online, which now successfully performs the conversion using Acrobat for the previously difficult PDF so Good start.
Sub convertpdf2()
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim Filename As String
Dim jsObj As Object
Dim NewFileName As String
Filename = "C:\temp\name1.pdf"
NewFileName = "C:\temp\name1.txt"
Set AcroXApp = CreateObject("AcroExch.App")
'AcroXApp.Show
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
AcroXAVDoc.Open Filename, "Acrobat"
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs NewFileName, "com.adobe.acrobat.plain-text"
AcroXAVDoc.Close False
AcroXApp.Hide
AcroXApp.Exit
End Sub
Where name1 is a downloaded PDF.
So, in essence I am trying to:
I think if I can achieve this, then I will be able to work out how to merge this with the previous post so that it can loop through a given number of URLs to generate the text files as well as handle the non-unicofe characters.
Will update if I make any progress
URLs:
https://hpvchemicals.oecd.org/ui/handler.axd?id=e19d2799-0c16-496d-a607-b09330dd28a7
https://hpvchemicals.oecd.org/ui/handler.axd?id=40da06b1-a855-4c0c-bc21-bbc856dca725
https://hpvchemicals.oecd.org/ui/handler.axd?id=c4967546-1f5e-472a-b629-a2998323735b
https://hpvchemicals.oecd.org/ui/handler.axd?id=bde5e625-83ee-423d-aa70-eb0e453088e4
https://hpvchemicals.oecd.org/ui/handler.axd?id=621c4f55-ef3c-4b99-bb98-e6aaf3f436dd
https://hpvchemicals.oecd.org/ui/handler.axd?id=26e1420d-f9b7-4768-b6fa-d345f54e7683
https://hpvchemicals.oecd.org/ui/handler.axd?id=263f3491-90c7-4c3a-b43e-4c4e9395bcea
https://hpvchemicals.oecd.org/ui/handler.axd?id=b78d39a9-26c2-48ff-aadc-cb056a89f08b
https://hpvchemicals.oecd.org/ui/handler.axd?id=97a7b56f-ebaf-4416-8b4b-88b19ca3bd16
https://hpvchemicals.oecd.org/ui/handler.axd?id=c6c3b7c1-9239-40d9-b51a-85a15e2411d6
Update: Attempting to add Error handling Highlighting
M Code:
#If VBA7 Then
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If
Sub Tester1()
Dim filePath As String
Dim fso As FileSystemObject, url
Dim fileStream As TextStream, ws As Worksheet
Dim c As Range, fileRoot As String, pdfPath As String
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim jsObj As Object
Set ws = Worksheets("Data") 'use a specific worksheet reference
fileRoot = ws.Range("C2").Value 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
For Each c In ws.Range("B2:B" & ws.Cells(Rows.count, "B").End(xlUp).row).Cells
url = Trim(c.Value)
If LCase(url) Like "http?:*" Then 'has a URL
pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
Debug.Print pdfPath, c.Value
DownloadFile url, pdfPath
Set AcroXApp = CreateObject("AcroExch.App")
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
AcroXAVDoc.Open pdfPath, "Acrobat" & c.Offset(0, -1).Value & ".txt"
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
If Error <> 0 Then
c.Interior.Color = vbRed 'flag problem
Else
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
AcroXAVDoc.Close False
AcroXApp.Hide
AcroXApp.Exit
c.Interior.Color = vbGreen 'flag OK
End If
End If 'have url
Next c
kill pdfPath ' toggle on/off to keep PDF
End Sub
Function DownloadFile(sURL, sSaveAs) As Boolean
DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function
Tested:
#If VBA7 Then
Private Declare PtrSafe Function URLDownloadToFile Lib "urlmon" _
Alias "URLDownloadToFileA" (ByVal pCaller As LongPtr, ByVal szURL As String, _
ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As LongPtr) As Long
#Else
Private Declare Function URLDownloadToFile Lib "urlmon" _
Alias "URLDownloadToFileA" (ByVal pCaller As Long, ByVal szURL As String, _
ByVal szFileName As String, ByVal dwReserved As Long, ByVal lpfnCB As Long) As Long
#End If
Function DownloadFile(sURL, sSaveAs) As Boolean
DownloadFile = (URLDownloadToFile(0, sURL, sSaveAs, 0, 0) = 0)
End Function
Sub Tester()
Dim filePath As String
Dim fso As FileSystemObject, url
Dim fileStream As TextStream, ws As Worksheet
Dim c As Range, fileRoot As String, pdfPath As String, success As Boolean
Set ws = Worksheets("Data") 'use a specific worksheet reference
fileRoot = ws.Range("D2").Value 'read this once
If Right(fileRoot, 1) <> "\" Then fileRoot = fileRoot & "\" 'ensure terminating \
For Each c In ws.Range("B2:B" & ws.Cells(Rows.Count, "B").End(xlUp).row).Cells
url = Trim(c.Value)
If LCase(url) Like "http?:*" Then 'has a URL
pdfPath = fileRoot & "PDF_" & c.Offset(0, -1).Value & ".pdf"
DownloadFile url, pdfPath
If IsErrorPage(pdfPath) Then
c.Interior.Color = vbRed
c.Offset(0, 1).Value = "No PDF returned" 'flag in col C
Else
success = ConvertPdf2(pdfPath, fileRoot & c.Offset(0, -1).Value & ".txt")
c.Interior.Color = IIf(success, vbGreen, vbRed)
c.Offset(0, 1).Value = IIf(success, "OK", "PDF not openable")
End If
End If 'have url
Next c
End Sub
'returns true if conversion was successful (based on whether `Open` succeeded or not)
Function ConvertPdf2(pdfPath As String, textPath As String) As Boolean
Dim AcroXApp As Acrobat.AcroApp
Dim AcroXAVDoc As Acrobat.AcroAVDoc
Dim AcroXPDDoc As Acrobat.AcroPDDoc
Dim jsObj As Object, success As Boolean
Set AcroXApp = CreateObject("AcroExch.App")
Set AcroXAVDoc = CreateObject("AcroExch.AVDoc")
success = AcroXAVDoc.Open(pdfPath, "Acrobat") '<<< returns false if fails
If success Then
Set AcroXPDDoc = AcroXAVDoc.GetPDDoc
Set jsObj = AcroXPDDoc.GetJSObject
jsObj.SaveAs textPath, "com.adobe.acrobat.plain-text"
AcroXAVDoc.Close False
End If
AcroXApp.Hide
AcroXApp.Exit
ConvertPdf2 = success 'report success/failure
End Function
'is the supposed PDF actually an HTML error page?
Function IsErrorPage(path) As Boolean
Dim txt
txt = CreateObject("scripting.filesystemobject"). _
OpenTextFile(path, 1).ReadAll()
IsErrorPage = InStr(1, txt, "<!DOCTYPE", vbTextCompare) > 0
End Function