excelvbaperformanceweb-scrapingmsxml2

Excel VBA Web Scraping Tables Not Responding; MSXML2.ServerXMLhttp.6.0 Method


I have built a web scraper using Excel VBA that does the following:

  1. Reads one link at a time from a list of links in a sheet called "CIK_Links".
  2. It Goes to the link, reads its responsetext, and if in that responsetext it finds a hyperlink whose innerHTML reads, "(List of all Funds and Classes/Contracts for", then it saves that link into a variable and creates another MSXML2.ServerXMLhttp.6.0 object.
  3. After creating the object it then finds the 3rd table in the responsetext, loops through and finds specific elements of that table, and then outputs those values to Excel in a sheet called "Parsed_Tables".
  4. The code then goes to the next link on the "CIK_Links" sheet and repeats steps 1-3. Note: there are about 640,000 links in the sheet, but I am running loop for only a few thousand at a time. And yes, I have tried running it for as little as 10, 20, 100 at a time, but the issue still persists.

The issue I am having is that the as soon as I hit run, I receive message "Excel is not responding", but the code still runs in the background. The code works perfectly and is very fast considering what I am asking it to do, but obviously I need to optimize it even more to prevent it from overloading Excel. It would be helpful to find some way to avoid writing the parsed HTML to Excel on every iteration, however, I don't know how I could write the data in the format that I need it without doing so. An array solution would be great, but one would have to do quite a lot of manipulation to the data in the array before writing it to Excel, possibly even subsetting/slicing the array. I need help as I have exhausted all of my knowledge and I have done quite a bit of research over the course of building this application. I am even open to using other technologies like python and the beautifulsoup library, I just wouldn't know how to output the table data to a csv file in the format that I need it. Thanks in advance!

Here is the File: TrustTable_Parse.xlsb

Disclaimer: I have a B.S. in math and I taught myself how to code in VBA, SQL, and R by implementing many of my own projects in each language. Point being, if my code looks weird or you think that I am doing things inefficiently, it's because I haven't been coding for years and I don't know any better, lol.

Below is my code:

Option Explicit

Sub Final_Parse_TrustTables()

Dim HTML As New HTMLDocument
Dim http As Object
Dim links As Object
Dim Url, Trst As String
Dim link As HTMLHtmlElement
Dim i As Long

Dim http2 As Object
Dim HTML2 As New HTMLDocument
Dim tbl As Object
Dim ele As HTMLHtmlElement

Dim wb As Workbook
Dim ws, ws_2 As Worksheet

    'sets ScreenUpdating to false _ 
     turns off event triggers, ect.
    OptimizeCode_Begin


 Set wb = ThisWorkbook

 Set ws = wb.Sheets("CIK_Links")

 'Creates this object to see if Trust table exists
 Set http = CreateObject("MSXML2.ServerXMLhttp.6.0")

  'Loops through the list of URL's _
  in the 'CIK_Links' Sheet
  For i = 2 To 3000

   'List of URL's
    Url = ws.Range("C1").Cells(i, 1).Value2

    'Gets webpage to check _
    if Trust table exists
    On Error Resume Next
    http.Open "GET", Url, False
    http.send


    'Runs code If the website sent a valid response to our request _
    for FIRST http object
    If Err.Number = 0 Then

     If http.Status = 200 Then

      'If the website sent a valid response to our request _
      for SECOND http object "http2"
      If Err.Number = 0 Then

       If http2.Status = 200 Then

        HTML.body.innerHTML = http.responseText

        Set links = HTML.getElementsByTagName("a")

        'Determines if there is a trust table and if so _
        then it creates the http2 object and gets the _
        trust table responsetext 
        Trst = "(List all Funds and Classes/Contracts for"
        For Each link In links
            'Link is returned in responsetext with "about:/" at _
            the beginning instead of https://www.sec.gov/, so I _
            used this to replace the "about:/"
            If InStr(link.innerHTML, Trst) > 0 Then
                link = Replace(link, "about:/", "https://www.sec.gov/")
                Debug.Print link

        'Creates this object to go to trust table webpage
        Set http2 = CreateObject("MSXML2.ServerXMLhttp.6.0")

        'Gets webpage to parse _
        trust table
        On Error Resume Next
        http2.Open "GET", link, False
        http2.send

            HTML2.body.innerHTML = http2.responseText

                'If there exists a Trust, then this refers to the _
                3rd table on the trust table webpage; _
                note ("table")(3)
                On Error Resume Next
                Set tbl = HTML2.getElementsByTagName("table")(3)

                Set ws_2 = wb.Sheets("Parsed_Tables")

                With ws_2

                    For Each ele In tbl.getElementsByTagName("tr")
                    'First finds rows with Class/Con numbers
                    If InStr(ele.innerText, "C00") Then
                     'Pulls Class/Con Numbers, note children(2)
                       'output to col E sheet
                        .Cells(Rows.Count, "E"). _
                        End(xlUp).Offset(1, 0).Value2 = ele.Children(2).innerText

                      'Outputs Share Class, children(3)
                        'Output to col F sheet
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, 0).Value2 = ele.Children(3).innerText

                      'Not not all Funds have Ticker _
                       so this keeps the module from _
                       asking for object to be set
                      On Error Resume Next
                      'Outputs Ticker to excel
                         'Reads the last value in Col F and offsets Ticker to _
                         to show directly in adjacent cel in Col G
                         .Cells(Rows.Count, "F"). _
                         End(xlUp).Offset(0, 1).Value2 = ele.Children(4).innerText

                    'Pulls SIC number
                    ElseIf InStr(ele.innerText, "S00") Then
                        'Offsets from col F to be placed in col C
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -3).Value2 = ele.Children(1).innerText

                      'Pulls Fund Name
                        'Offsets from col F to col D
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -2).Value2 = ele.Children(2).innerText

                    'Pulls CIK number
                    ElseIf InStr(ele.Children(0).innerText, "000") Then
                        'Offset from col F to col A
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -5).Value2 = ele.Children(0).innerText

                      'Pulls Trust Name
                        'Offsets from col F to col B
                        .Cells(Rows.Count, "F"). _
                        End(xlUp).Offset(1, -4).Value2 = ele.Children(1).innerText

                    End If

                    'Counts the number of iterations of the loop _
                     and places it in the lower left corner of the _
                     workbook
                     Application.StatusBar = "Current Iteration: " & i

                   Next

               End With

            End If

         Next

        End If

        Else
        MsgBox "Error loading webpage", vbExclamation, "Alert!!!"
        Exit Sub

      End If
      On Error GoTo 0

     End If

     Else
     MsgBox "Error loading webpage", vbExclamation, "Alert!!!"
     Exit Sub

    End If

On Error GoTo 0

 If i Mod 1000 = 0 Then
  ActiveWorkbook.Save
  Application.Wait (Now + TimeValue("0:00:03"))
 End If

Next i

    'sets everything back to normal after running code 
    OptimizeCode_End

End Sub

The following is a sample of the links listed in the CIK_Links Sheet:

https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=3&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=11&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=13&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=14&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=17&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=18&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2110&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2135&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2145&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2663&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2664&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2691&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=2768&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=3521&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=3794&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=4123&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=4405&owner=include&count=02
https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=4568&owner=include&count=02

Solution

  • I don't think your code will run unless there is at least one On Error Resume Next hiding some runtime errors. For example, you have If http2.Status = 200 Then before you have instantiated the http2 object.

    Below is a method that could definitely be improved but uses a class to hold the xmlhttp object and provides methods for retrieving the required info. The layout of your desired table makes parsing the actual webpage particularly complicated. You may wish to stay with that. I have chosen to use the table structure as is. Perhaps, this may provide you with a framework at least. You would add your custom optimisation sub calls into this.


    TODO:

    See if an estimate can be made for an oversize results array that can hold all the results rather an array of arrays so the write out can be done in go. If I have time I will make this amendment.


    Class clsHTTP

    Option Explicit
    
    Private http As Object
    Const SEARCH_TERM As String = "(List all Funds and Classes/Contracts"
    
    Private Sub Class_Initialize()
        Set http = CreateObject("MSXML2.XMLHTTP")
    End Sub
    
    Public Function GetString(ByVal Url As String, Optional ByVal search As Boolean = False) As String
        Dim sResponse As String
        searchTermFound = False
        With http
            .Open "GET", Url, False
            .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
            .send
            sResponse = StrConv(.responseBody, vbUnicode)
            If InStr(sResponse, SEARCH_TERM) > 0 Then searchTermFound = True
            GetString = sResponse
        End With
    End Function
    
    Public Function GetLink(ByVal html As HTMLDocument) As String
        Dim i As Long, nodeList As Object
        Set nodeList = html.querySelectorAll("a")
        GetLink = vbNullString
        For i = 0 To nodeList.Length - 1
            If InStr(nodeList.item(i).innerText, SEARCH_TERM) > 0 Then
                GetLink = Replace$(nodeList.item(i).href, "about:/", "https://www.sec.gov/")
                Exit For
            End If
        Next
    End Function
    
    Public Function GetInfo(ByVal html As HTMLDocument) As Variant
        Dim CIK As String, table As HTMLTable, tables As Object, tRows As Object
        Dim arr(), tr As Object, td As Object, r As Long, c As Long
    
        Set tables = html.querySelectorAll("table")
    
        If tables.Length > 3 Then
            CIK = "'" & html.querySelector(".search").innerText
            Set table = tables.item(3)
            Set tRows = table.getElementsByTagName("tr")
            ReDim arr(1 To tRows.Length, 1 To 6)
            Dim numColumns As Long, numBlanks As Long
    
            For Each tr In tRows
                numColumns = tr.getElementsByTagName("td").Length
                r = r + 1: c = 2: numBlanks = 0
                If r > 4 Then
                    arr(r - 4, 1) = CIK
                    For Each td In tr.getElementsByTagName("td")
                        If td.innerText = vbNullString Then numBlanks = numBlanks + 1
                        arr(r - 4, c) = td.innerText
                        c = c + 1
                    Next td
                    If numBlanks = numColumns Then Exit For
                End If
            Next
        Else
            ReDim arr(1, 1)
            GetInfo = arr
            Exit Function
        End If
    
        arr = Application.Transpose(arr)
        ReDim Preserve arr(1 To 6, 1 To r - 4)
        arr = Application.Transpose(arr)
        GetInfo = arr
    End Function
    

    Standard module 1

    Option Explicit
    Public searchTermFound As Boolean
    
    Public Sub GetInfo()
        Dim wsLinks As Worksheet, links(), link As Long, http As clsHTTP
        Dim lastRow As Long, html As HTMLDocument, newURL As String
        Set wsLinks = ThisWorkbook.Worksheets("CIK_Links")
        Set http = New clsHTTP
        Set html = New HTMLDocument
        With wsLinks
            lastRow = GetLastRow(wsLinks, 3)
            If lastRow = 2 Then
                ReDim links(1, 1)
                links(1, 1) = .Range("C2").Value
            Else
                links = .Range("C2:C" & lastRow).Value
            End If
        End With
        Dim results(), arr(), i As Long, j As Long
        ReDim results(1 To UBound(links, 1))
        For link = LBound(links, 1) To UBound(links, 1)
    
            If InStr(links(link, 1), "https://www.sec.gov") > 0 Then
    
                html.body.innerHTML = http.GetString(links(link, 1), True)
    
                If searchTermFound Then
    
                    newURL = http.GetLink(html)
                    html.body.innerHTML = http.GetString(newURL, False)
                    arr = http.GetInfo(html)
    
                    If UBound(arr, 1) > 1 Then
                        i = i + 1
                        results(i) = arr
                    End If
                End If
            End If
        Next
    
        Dim wsOut As Worksheet
        Set wsOut = ThisWorkbook.Worksheets("Parsed_Tables")
    
        For j = 1 To i
            arr = results(j)
            With wsOut
                 .Cells(GetLastRow(wsOut, 1), 1).Resize(UBound(arr, 1), UBound(arr, 2)) = arr
            End With
        Next
    End Sub
    
    Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
        With ws
            GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
        End With
    End Function