vbasplitweb-scrapingresponsetext

Got stuck scraping certain fields from a site


I've written a script in vba using which I can parse "Company Name", "Phone", "Fax" and "Email" from a specific site but in case of scraping "Address", "Web" and "Name" I got stuck. I've written the script using responsetext and split method in vba. Hope there is someone to show me a workaround.

Here is what i tried with:

str = Split(http.responseText, " class=""contact-details block dark"">")
y = UBound(str)
    For i = 1 To y
        Cells(x, 1) = Split(Split(str(i), "Company Name:")(1), "<")(0)
        Cells(x, 2) = Split(Split(str(i), "Phone:")(1), "<")(0)
        Cells(x, 3) = Split(Split(str(i), "Fax:")(1), "<")(0)
        Cells(x, 4) = Split(Split(str(i), "mailto:")(1), ">")(0)
        x = x + 1
    Next i

Here goes the html element stuff:

<div class="contact-details block dark">
                <h3>Contact Details</h3><p>Company Name: PPEHeads Australia<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Web: <a target="_blank" href="http://www.ppeheads.com.au">http://www.ppeheads.com.au</a></p><h4>Address</h4><p>Unit 2 / 4 Reaghs Farm Road<br>MINTO<br>NSW<br>2566</p><h4>Contact</h4><p>Name: Alan Hadfield<br>Phone: +61 2 9824 5520<br>Fax: +61 2 9824 5526<br>Email: <a href="mailto:alan@ppeheads.com.au">alan@ppeheads.com.au</a></p>
            </div>

Solution

  • Please provide the rest of your code next time, because the problem might not be where you think it is. Luckily I found your previous post here

    If you take a closer look there are 3 p tags within your html element:

    1st one is for Contact Company Details which you can get by

    Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)
    

    2nd one is for Address Details which you can get by

    Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)
    

    3rd one is for Contact Person Details which you can get by

    Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)
    

    Notice (0), (1), (2) changes at the end of code which gives you the appearance order of p tag.

    I amended your previous code and commented the changes so you can see the difference:

    Sub RestData()
    Dim http As New MSXML2.XMLHTTP60
    Dim html As New HTMLDocument
    Dim ele, ele2, ele3 As Object, post As Object 'declare
    Dim TypeDetails() As String
    Dim TypeDetails3() As String 'declare
    Dim TypeDetail() As String
    Dim i As Long, r As Long
    With CreateObject("MSXML2.serverXMLHTTP")
        .Open "GET", "http://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG0120000508&folderid=1736", False
        .send
        html.body.innerHTML = .responseText
    End With
    
    'get all the p elements
    Set ele = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(0)
    Set ele2 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(1)
    Set ele3 = html.getElementsByClassName("contact-details block dark")(0).getElementsByTagName("p")(2)
    r = 2
    
    'split from line feed
    TypeDetails() = Split(ele.innerText, Chr(10))
    TypeDetails3() = Split(ele3.innerText, Chr(10))
    
    'This part goes for Contact Company Details, notice the operator is ": ",not ":"
    For i = 0 To UBound(TypeDetails())
        TypeDetail() = Split(TypeDetails(i), ": ")
        Cells(r, 1) = VBA.Trim(TypeDetail(0))
        Cells(r, 2) = VBA.Trim(TypeDetail(1))
        r = r + 1
    Next i
    
    'This part goes for Address Details, replaced new line with " " for it to be in the same line
    Cells(r, 1) = "Address"
    Cells(r, 2) = Replace(ele2.innerText, vbLf, " ")
    r = r + 1
    
    'This part goes for Contact Person Details
    For i = 0 To UBound(TypeDetails3())
        TypeDetail() = Split(TypeDetails3(i), ": ")
        Cells(r, 1) = VBA.Trim(TypeDetail(0))
        Cells(r, 2) = VBA.Trim(TypeDetail(1))
        r = r + 1
    Next i
    
    Set html = Nothing: Set ele = Nothing: Set docs = Nothing
    End Sub
    

    I hope this helps