python selenium web-scraping imdb

How to distinguish two tables with the same relative XPath with Selenium in Python


I'm trying to scrape some data from IMDb (with Selenium in Python), but I have a problem. For each movie I have to fetch the directors and the writers. They are contained in two separate tables that have the same @class. I need to distinguish the two tables when I scrape, otherwise the program could sometimes fetch a writer as a director and vice versa.

I've tried using a relative XPath to find all the elements (tables) that match it, and then looping over them and trying to distinguish them through the table title (an h4 element) with the preceding-sibling axis. The code runs, but it does not find anything (it returns nan every time).

This is my code:

    counter = 1
    try:
        driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
        ssleep()
        tables = driver.find_elements(By.XPATH, '//table[@class="simpleTable simpleCreditsTable"]/tbody')
        counter = 1
        for table in tables:
            xpath_table = f'//table[@class="simpleTable simpleCreditsTable"]/tbody[{counter}]' 
            xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
            table_title = driver.find_element(By.XPATH, xpath_h4).text
            if table_title == "Directed by":
                rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
                for row in rows_director:
                    director = row.find_elements(By.CSS_SELECTOR, 'a')
                    director = [x.text for x in director]
                    if len(director) == 1:
                        director = ''.join(map(str, director))
                    else:
                        director = ', '.join(map(str, director))
                        director_list.append(director)
        counter += 1

    except NoSuchElementException:
        # director = np.nan
        director_list.append(np.nan)

Can any of you tell me why it doesn't work? Perhaps there is a better solution. I hope for your help.

(Here is an example of the page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)


Solution

  • To extract the names of the directors and writers of each movie from imdb.com, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use the following locator strategies:
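
As a rough illustration of that approach, here is a minimal sketch rather than a definitive locator. It assumes each credits table is the first table sibling that follows its h4 title, as the question describes, and that the title text starts with the given heading. The helper names credits_xpath and fetch_credits are hypothetical, and the XPath is a guess at the page structure, so it may need adjusting against the live page.

```python
from typing import List


def credits_xpath(heading: str) -> str:
    """Build an XPath for the name links in the table that follows an <h4> title."""
    # Assumption: the credits table is the first <table> sibling after its <h4>.
    # starts-with() tolerates trailing characters after the heading text.
    return (
        f'//h4[starts-with(normalize-space(.), "{heading}")]'
        '/following-sibling::table[1]//a'
    )


def fetch_credits(driver, heading: str, timeout: float = 10) -> List[str]:
    """Return the visible link texts under `heading`, or [] if none show up in time."""
    # Imported here so credits_xpath() stays usable without Selenium installed.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    locator = (By.XPATH, credits_xpath(heading))
    try:
        # Wait until every matching link is visible, as the answer suggests.
        links = WebDriverWait(driver, timeout).until(
            EC.visibility_of_all_elements_located(locator)
        )
    except TimeoutException:
        return []
    return [link.text for link in links if link.text]
```

With a driver already on the full credits page, fetch_credits(driver, "Directed by") would then target only the directors' table. The same call with the writers' heading (assumed here to start with "Writing Credits") would target the other one, so the two tables are never confused even though they share a class.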