I'm trying to scrape some data from IMDb (with Selenium in Python), but I have a problem. For each movie I have to fetch the directors and writers. Both are contained in tables that share the same @class, so I need to distinguish the two tables when I scrape; otherwise the program could sometimes fetch a writer as a director and vice versa.
I've tried using a relative XPath to find all elements (tables) with that class and then looping over them, distinguishing them by the table title (an h4 element) with the preceding-sibling axis. The code runs without errors, but it never finds anything (it returns nan every time).
This is my code:
counter = 1
try:
    driver.get('https://www.imdb.com/title/' + tt + '/fullcredits/?ref_=tt_cl_sm')
    ssleep()
    tables = driver.find_elements(By.XPATH, '//table[@class="simpleTable simpleCreditsTable"]/tbody')
    counter = 1
    for table in tables:
        xpath_table = f'//table[@class="simpleTable simpleCreditsTable"]/tbody[{counter}]'
        xpath_h4 = xpath_table + "/preceding-sibling::h4[1]/text()"
        table_title = driver.find_element(By.XPATH, xpath_h4).text
        if table_title == "Directed by":
            rows_director = table.find_elements(By.CSS_SELECTOR, 'tr')
            for row in rows_director:
                director = row.find_elements(By.CSS_SELECTOR, 'a')
                director = [x.text for x in director]
                if len(director) == 1:
                    director = ''.join(map(str, director))
                else:
                    director = ', '.join(map(str, director))
                director_list.append(director)
        counter += 1
except NoSuchElementException:
    # director = np.nan
    director_list.append(np.nan)
Can any of you tell me why it doesn't work? Perhaps there is a better solution. I hope for your help.
(here you can find an example of the page I need to scrape: https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_sm)
To extract the names of the directors and writers of each movie on imdb.com, you need to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:
Using CSS_SELECTOR:
driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#director +table > tbody tr > td > a")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "h4#writer +table > tbody tr > td > a")))])
Using XPATH:
driver.get("https://www.imdb.com/title/tt1877830/fullcredits?ref_=tt_cl_wr_sm")
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='director']//following::table[1]/tbody//tr/td/a")))])
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//h4[@id='writer']//following::table[1]/tbody//tr/td/a")))])
Console Output:
['Matt Reeves']
['Matt Reeves', 'Peter Craig', 'Bill Finger', 'Bob Kane']
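To feed these results back into a per-movie list such as the question's director_list, the CSS lookup above could be wrapped in a small function. This is only a sketch under a few assumptions: credit_css, join_names and fetch_credits are hypothetical names, the URL pattern is taken from the question, math.nan stands in for np.nan so the example needs no NumPy, and a timeout is treated as "no credits found".

```python
import math


def credit_css(section_id):
    """CSS selector for the name links in the table right after <h4 id=section_id>."""
    return f"h4#{section_id} + table > tbody tr > td > a"


def join_names(names):
    """Join the non-empty names with ', '; return NaN when there are none."""
    cleaned = [n.strip() for n in names if n and n.strip()]
    return ", ".join(cleaned) if cleaned else math.nan


def fetch_credits(driver, tt, section_id, timeout=20):
    """Return the names under one credits heading for IMDb title id `tt`, or NaN."""
    # Selenium is imported here so the two helpers above stay usable without it.
    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(f"https://www.imdb.com/title/{tt}/fullcredits")
    try:
        links = WebDriverWait(driver, timeout).until(
            EC.visibility_of_all_elements_located(
                (By.CSS_SELECTOR, credit_css(section_id))
            )
        )
    except TimeoutException:
        # No visible match within the timeout: record a missing value.
        return math.nan
    return join_names(link.text for link in links)
```

Inside the loop over movie ids this would be called as director_list.append(fetch_credits(driver, tt, "director")), and with "writer" for the writers.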
Note: You have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
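As for why the original loop found nothing, its XPath looks like the cause. The h4 title is a sibling of the table (that is what the "h4#director + table" selector relies on), not of the tbody, so tbody/preceding-sibling::h4 has nothing to match, and tbody[{counter}] asks for the n-th tbody inside each table although each table presumably has only one. In addition, an XPath ending in /text() selects a text node, which find_element cannot return; Selenium raises InvalidSelectorException for it, and in Selenium releases where that exception derives from NoSuchElementException the except branch would quietly append nan. The heading-then-table pairing can be illustrated offline with the standard-library html.parser, for example on driver.page_source. The markup below is an assumed, simplified version of the page, and CreditsParser is a hypothetical helper, not part of Selenium.

```python
from html.parser import HTMLParser


class CreditsParser(HTMLParser):
    """Collect link texts from the table that follows each <h4 id=...>."""

    def __init__(self):
        super().__init__()
        self.credits = {}       # h4 id -> list of names
        self._section = None    # id of the most recent <h4>
        self._in_table = False
        self._in_link = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4":
            self._section = dict(attrs).get("id")
        elif tag == "table" and self._section:
            self._in_table = True
            self.credits.setdefault(self._section, [])
        elif tag == "a" and self._in_table:
            self._in_link = True
            self._buf = []

    def handle_data(self, data):
        if self._in_link:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            name = "".join(self._buf).strip()
            if name:
                self.credits[self._section].append(name)
            self._in_link = False
        elif tag == "table":
            # The table is done; a later table needs its own heading.
            self._in_table = False
            self._section = None


# Assumed, simplified markup: each heading is immediately followed by its table.
SAMPLE = """
<h4 id="director">Directed by</h4>
<table class="simpleTable simpleCreditsTable"><tbody>
<tr><td><a href="#"> Matt Reeves</a></td></tr>
</tbody></table>
<h4 id="writer">Writing Credits</h4>
<table class="simpleTable simpleCreditsTable"><tbody>
<tr><td><a href="#"> Peter Craig</a></td></tr>
<tr><td><a href="#"> Bob Kane</a></td></tr>
</tbody></table>
"""

parser = CreditsParser()
parser.feed(SAMPLE)
print(parser.credits)
# {'director': ['Matt Reeves'], 'writer': ['Peter Craig', 'Bob Kane']}
```

Because both tables share the same class, the heading id is the only thing that tells them apart here, which is the same idea the CSS and XPath locators above use.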