
How to get data from a TreeView list


http://www.vliz.be/vmdcdata/mangroves/aphia.php?p=browser&id=235056&expand=true#ct (That's the information I am trying to scrape)

I want to scrape these detailed taxonomic trees so that I can manipulate them any way I like.

But there are a few problems in getting this tree data.

  1. I can't fully expand the taxonomic tree. When I expand some nodes, others collapse, as the instructions on the page indicate, so saving the full page as an HTML file doesn't solve my problem. I could repeat the process several times to get separate files and concatenate them, but that seems like an ugly approach.

  2. I am tired of clicking. There are so many "plus" signs, and I have to wait each time.

Is there a way to solve this using Python?


Solution

  • Use Selenium. The script below expands the tree by clicking the "plus" signs, then grabs the entire DOM with all its elements once it's done:

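    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import time

    browser = webdriver.Chrome()
    browser.get('http://www.vliz.be/vmdcdata/mangroves/aphia.php?p=browser&id=235301&expand=true#ct')

    # XPath matching the "plus" icons of nodes that are still collapsed
    PLUS_ICONS = ('.//*[@src="http://www.marinespecies.org/images/aphia/pnode.gif" '
                  'or @src="http://www.marinespecies.org/images/aphia/plastnode.gif"]')

    while True:
        try:
            # Index 1 skips the first match; an IndexError here means
            # fewer than two icons remain, which ends the loop
            elem = browser.find_elements(By.XPATH, PLUS_ICONS)[1]
            elem.click()
            time.sleep(2)  # give the child nodes time to load
        except Exception:
            break

    # Full HTML of the page with the tree expanded
    content = browser.page_source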
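
Once `content` holds the expanded page, the next step is to turn it into something you can manipulate. The sketch below is only an illustration: it assumes the tree is rendered as nested `<ul>`/`<li>` lists, which may not match the real markup of the page, so check the page source and adjust the tag names accordingly. The `NestedListParser` class, the `parse_tree` and `walk` helpers, and the sample HTML are all made up for this example; it uses only the standard library.

```python
from html.parser import HTMLParser


class NestedListParser(HTMLParser):
    """Collect nested <ul>/<li> markup into a tree of {'name', 'children'} dicts."""

    def __init__(self):
        super().__init__()
        self.root = {'name': None, 'children': []}
        self.stack = [self.root]  # stack[-1] is the node currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            node = {'name': '', 'children': []}
            self.stack[-1]['children'].append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag == 'li' and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Only the first non-empty text inside an <li> becomes its name
        if text and len(self.stack) > 1 and not self.stack[-1]['name']:
            self.stack[-1]['name'] = text


def parse_tree(html):
    """Return the list of top-level nodes found in the HTML string."""
    parser = NestedListParser()
    parser.feed(html)
    parser.close()
    return parser.root['children']


def walk(nodes, depth=0):
    """Yield (depth, name) pairs in document order."""
    for node in nodes:
        yield depth, node['name']
        yield from walk(node['children'], depth + 1)


# Made-up sample standing in for browser.page_source (assumed structure)
sample_html = (
    '<ul><li>Root taxon<ul>'
    '<li>Child A<ul><li>Grandchild A1</li></ul></li>'
    '<li>Child B</li>'
    '</ul></li></ul>'
)

tree = parse_tree(sample_html)
for depth, name in walk(tree):
    print('  ' * depth + name)
```

With the real page you would call `parse_tree(content)` instead of passing `sample_html`, after adapting the parser to whatever tags the site actually uses for its tree nodes.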