Tags: google-sheets, google-sheets-formula, gs-vlookup, google-sheets-vlookup

Pulling a URL with =VLOOKUP from a table


I'm trying to get the URL from this link: https://www.atanet.org/onlinedirectories/tsd_view.php?id=3856


I use the following formula:

    =VLOOKUP("Website", IMPORTXML(A1, "//table[@id='tableTSDContent']//tr"), 2, 0)

Unfortunately, it does not pull out the URL. I would really appreciate any help extracting the URL in question.


Solution

  • I tried importing the data with the APIPheny add-on. After the <h2>Online Directories Listing</h2> heading, I saw a cell that said "Google bot blocked", or something to that effect.

    I then went to the site's robots.txt file (https://www.atanet.org/robots.txt), which says:

    User-agent: *
    Disallow: /onlinedirectories/tsd_view.php*
    Disallow: /onlinedirectories/tsd_search.php*
    
    Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
    Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
    Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
    
    Disallow: /onlinedirectories/tsd_corp_listings/*
    
    Disallow: /bin
    Disallow: /division_calendar
    
    
    User-agent: Googlebot
    Disallow: /onlinedirectories/tsd_view.php*
    Disallow: /onlinedirectories/tsd_search.php*
    
    Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
    Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
    
    Disallow: /*division_calendar*
    
    Disallow: /*bin*
    
    Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
    
    User-agent: ITABot
    Disallow: /onlinedirectories
    
    

    I also think this means that Google Sheets fetches pages with the same user agent as the search engine (Googlebot). If that's the case, then you're out of luck with Google Sheets, because the tsd_view.php page you want is disallowed. Most likely that rule is there because the site didn't want Google (or any other search engine) to index people's contact information. Of course, a malicious web crawler could simply ignore robots.txt, but Googlebot is a nice bot.
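    The effect of those rules can be checked locally with Python's standard-library robots.txt parser. This is a minimal sketch, not part of the original answer; note that urllib.robotparser implements the original prefix-matching spec, so the site's non-standard trailing "*" wildcards are dropped here (they are redundant for prefix matching anyway):

    ```python
    from urllib.robotparser import RobotFileParser

    # The Googlebot section of the site's robots.txt, quoted above,
    # with the non-standard trailing "*" wildcards removed.
    rules = """\
    User-agent: Googlebot
    Disallow: /onlinedirectories/tsd_view.php
    Disallow: /onlinedirectories/tsd_search.php
    """

    parser = RobotFileParser()
    parser.parse(rules.splitlines())

    # The URL the question is trying to scrape.
    url = "https://www.atanet.org/onlinedirectories/tsd_view.php?id=3856"

    print(parser.can_fetch("Googlebot", url))  # False: the path is disallowed
    # A path with no matching Disallow rule is allowed by default
    # (this second URL is only an illustration, not from the question).
    print(parser.can_fetch("Googlebot", "https://www.atanet.org/certification/"))  # True
    ```

    If Google Sheets really does crawl as Googlebot, this is exactly the check the crawler performs before fetching, which would explain why IMPORTXML returns nothing for this page.
    
    
    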