Trying to get url from this link https://www.atanet.org/onlinedirectories/tsd_view.php?id=3856
I use the following formula: =VLOOKUP("Website",ImportXML(A1, "(//table[@id='tableTSDContent']//tr)"),2,0) But unfortunately, it does not pull out the url. I would really appreciate it if you could help me extract the url in question.
I tried using the APIPheny add on to import the data. After the <h2>Online Directories Listing</h2>
, I saw a cell that said "Google bot blocked" or something to that effect.
I then went to the site's robots.txt file (https://www.atanet.org/robots.txt), which says:
User-agent: *
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
Disallow: /onlinedirectories/tsd_corp_listings/*
Disallow: /bin
Disallow: /division_calendar
User-agent: Googlebot
Disallow: /onlinedirectories/tsd_view.php*
Disallow: /onlinedirectories/tsd_search.php*
Disallow: /onlinedirectories/tsd_listings/tsd_view.fpl*
Disallow: /onlinedirectories/tsd_listings/tsd_search.fpl*
Disallow: /*division_calendar*
Disallow: /*bin*
Disallow: http://www.atanet.org/bin/mpg.pl/28644.html
User-agent: ITABot
Disallow: /onlinedirectories
I also think this means that the Google Sheets user agent is the same as the same as the Search Engine (Googlebot). If this is the case, then with Google Sheets, you're out of luck here because the tsd_view.php
you want is disallowed. Likely, this was put there because they didn't want Google (or other search engines, for that matter) to index people's contact information. Of course, if you're a malicious webcrawler, you could ignore the robots.txt, but Googlebot is a nice bot.