I'm using Librewolf as my personal browser, and in my script I'm using the Firefox driver. Do I need to install Firefox on my machine for the driver to work "better"? I have a Python + Selenium app to get URL data from a website; it has 35 pages. The script worked for the first 2 pages, but on the third it gave me an error (class attribute missing). Do I need to give it more wait time? The attribute I'm trying to get is FS03250425_BCN03A20 from
<div class="costa-itinerary-tile FS03250425_BCN03A20" data-cc-cruise-id="FS03250425_BCN03A20">
Is there a better way to do it? This is the code I'm using:
select_url_css = "data-cc-cruise-id"
select_tile_css = "div.costa-itinerary-tile"
element = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, select_tile_css)))
elements = driver.find_elements(By.CSS_SELECTOR, select_tile_css)
for element in elements:
    url_raw = element.get_attribute(select_url_css)
You have asked two questions. I am answering the second one, which is about scraping the 35 pages.
Your Selenium script needs to navigate to each page and scrape the data. In the code below, I have used a while loop to click the Next Page button and scrape the data until the last page is reached.
NOTE: Keep in mind that Selenium is not the fastest way to achieve this, as it imitates human actions and scrapes page by page, which takes time.
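As for the "class attribute missing" error and whether you need more wait time: your snippet waits for only the first tile to become visible and then immediately reads the attribute from every tile, so tiles that have not finished rendering yet may come back without it. Rather than simply raising the timeout, it is usually more reliable to wait for all tiles before reading data-cc-cruise-id. A minimal sketch of that idea, reusing your selectors (driver is your existing Firefox WebDriver instance); the full script below uses the same pattern:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

select_url_css = "data-cc-cruise-id"          # attribute name, not a CSS selector
select_tile_css = "div.costa-itinerary-tile"

# Wait until every tile on the page is visible, not just the first one,
# then read the attribute from each of them.
elements = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located((By.CSS_SELECTOR, select_tile_css))
)
urls_raw = [element.get_attribute(select_url_css) for element in elements]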
Code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
import time

options = webdriver.FirefoxOptions()
# options.add_argument("--headless")  # Run headless if you wish
driver = webdriver.Firefox(options=options)
driver.get("https://www.costacruises.eu/cruises.html?page=1#occupancy_EUR_anonymous=A&guestAges=30&guestBirthdates=1995-04-10&%7B!tag=destinationTag%7DdestinationIds=ME&%7B!tag=embarkTag%7DembarkPortCode=BCN,ALC,VLC")
driver.maximize_window()
wait = WebDriverWait(driver, 10)

# The try/except below handles the cookie consent pop-up. If you are not getting this pop-up, you can remove it.
try:
    wait.until(EC.element_to_be_clickable((By.XPATH, "//button[text()='Accept']"))).click()
except Exception as e:
    print("Accept button not found or not clickable:", e)

cruise_ids = []
while True:
    try:
        # Wait for the cruise tiles on the current page
        elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.costa-itinerary-tile")))
        # Collect the cruise IDs on the current page
        for element in elements:
            cruise_id = element.get_attribute("data-cc-cruise-id")
            if cruise_id and cruise_id not in cruise_ids:
                cruise_ids.append(cruise_id)
        # Locate the next button (stored in a list) and click it only if it's enabled
        next_button = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@aria-label='Next page']")))
        # If the next button is disabled, break out of the loop
        if not next_button[0].is_enabled():
            break
        next_button[0].click()
        # Wait for the next page's content to load
        time.sleep(2)
    except Exception:
        break

print("Collected Cruise IDs:")
print(cruise_ids)
Output:
Collected Cruise IDs:
['FS04250428_BCN04A28', 'FS03250425_BCN03A20', 'FS03261103_BCN03A2W', 'TO03260409_BCN03A3D', 'FS03261020_BCN03306', 'FS03261027_BCN03A2U', 'SM07250427_BCN07A1B', 'SM07250504_BCN07A1B', 'SM07260126_BCN07A4U', 'PA05260401_VLC05A05', 'SM07250525_BCN07A3O', 'SM07251109_BCN07A3P', 'DI03250926_BCN03A2G', 'SM08251214_BCN08A0I', 'PA07251027_BCN07A45', 'SM07251102_BCN07A3P', 'TO07260419_BCN07A3M', 'TO07261025_BCN07A3M', 'PA07260520_VLC07A31', 'PA07260527_VLC07A32', 'PA07251006_VLC07A2Y', 'SM07261102_BCN07A4Y', 'TO07261018_BCN07A3M', 'PA07260603_ALC07A01', 'SM07250914_BCN07A3P', 'SM07250601_BCN07A3O', 'SM07260427_BCN07A4W', 'SM07260518_BCN07A4V', 'SM07260525_BCN07A4V', 'TO07260920_BCN07A3M', 'PA07250602_VLC07A2S', 'TO07250909_BCN07A4O']
Process finished with exit code 0
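One more note on the fixed time.sleep(2), which also relates to your "more wait time" question: instead of sleeping, you can wait for a tile from the previous page to go stale after clicking Next page, and then wait for the new tiles to appear. Below is a sketch of that idea (the go_to_next_page helper name is just for illustration, and it assumes the site detaches the old tiles on every page change; if it only updates them in place, the staleness wait may time out):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def go_to_next_page(driver, timeout=10):
    # Click "Next page" and wait for the new tiles instead of sleeping.
    # Returns the new page's tiles, or None when there is no next page.
    wait = WebDriverWait(driver, timeout)
    old_tiles = driver.find_elements(By.CSS_SELECTOR, "div.costa-itinerary-tile")
    buttons = driver.find_elements(By.XPATH, "//a[@aria-label='Next page']")
    if not buttons or not buttons[0].is_enabled():
        return None
    buttons[0].click()
    if old_tiles:
        # Wait until a tile from the previous page is detached from the DOM ...
        wait.until(EC.staleness_of(old_tiles[0]))
    # ... then wait for the new page's tiles to be rendered.
    return wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div.costa-itinerary-tile")))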