python pandas selenium-webdriver google-translation-api

How to use a data frame to scrape URL links from a website using Python, and translate the scraped text using Google Translate?


My current script works: it scrapes the course titles from the website. Now I am struggling to put the results into a data frame and to write functions that count how many URL links the page has. After that, the text content from the website must be translated from English into Hindi. Can anyone help me with this issue?

`# scraping the course links from the classcentral.com website
# this application uses the Selenium driver to access the web pages
#

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

url = "https://www.classcentral.com/collection/top-free-online-courses"
driver = webdriver.Chrome()
driver.get(url)
time.sleep(2)

all_courses = driver.find_element(by=By.CLASS_NAME, value='catalog-grid__results')
course_titles = all_courses.find_elements(by=By.CSS_SELECTOR, value='[class="color-charcoal course-name"]')

for title in course_titles:
    print(title.text)
`

Solution

  • I'm not sure I understand correctly, but if you want to load all courses, you'll have to click "Load more" until the button is no longer available. You can get the URL of each course via the `href` attribute:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions
    import pandas as pd
    import time
    
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument("--window-size=1920,1080")
    # Selenium 4 takes the options object via `options`; the older `chrome_options` keyword is deprecated
    driver = webdriver.Chrome(options=chrome_options)
    url = "https://www.classcentral.com/collection/top-free-online-courses"
    driver.get(url)
    
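    try:
        while True:
            # wait up to 1 second for the "Load more" button to become clickable, then click it
            WebDriverWait(driver, 1).until(
                expected_conditions.element_to_be_clickable((By.XPATH, "//button[@data-name='LOAD_MORE']"))
            ).click()
            # short pause so the newly loaded courses can render
            time.sleep(0.5)
    except Exception:
        # the wait times out once the button is gone, i.e. all courses are loaded
        pass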
    
    all_courses = driver.find_element(by=By.CLASS_NAME, value='catalog-grid__results')
    courses = all_courses.find_elements(by=By.CSS_SELECTOR, value='[class="color-charcoal course-name"]')
    
    df = pd.DataFrame([[course.text, course.get_attribute('href')] for course in courses],
                        columns=['Title (eng)', 'Link'])
    

    Output:

                                               Title (eng)                                               Link
    0                        Medical Parasitology | 医学寄生虫学  https://www.classcentral.com/course/edx-medica...
    1    Understanding Medical Research: Your Facebook ...  https://www.classcentral.com/course/medical-re...
    2    An Introduction to Interactive Programming in ...  https://www.classcentral.com/course/interactiv...
    3                                        Mountains 101  https://www.classcentral.com/course/mountains-...
    4                       Quantum Mechanics for Everyone  https://www.classcentral.com/course/edx-quantu...
    ..                                                 ...                                                ...
    260                          Web Security Fundamentals  https://www.classcentral.com/course/edx-web-se...
    261  Viral Marketing and How to Craft Contagious Co...  https://www.classcentral.com/course/wharton-co...
    262                              Introduction to Linux  https://www.classcentral.com/course/edx-introd...
    263            Bitcoin and Cryptocurrency Technologies  https://www.classcentral.com/course/bitcointec...
    264  Machine Learning Foundations: A Case Study App...  https://www.classcentral.com/course/ml-foundat...
    
    [265 rows x 2 columns]
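
The question also asks for a link count and an English-to-Hindi translation, which the code above does not cover. Here is a minimal sketch of both steps. The helper names `count_links`, `translate_titles` and `make_google_translator` are hypothetical, not part of any library. The sketch assumes the translation backend is the `google-cloud-translate` package (v2 client) with credentials already configured. Any other function that maps a string to a string can be passed in its place.

```python
from typing import Callable, Dict, Iterable, List


def count_links(links: Iterable[str]) -> Dict[str, int]:
    """Count the non-empty links, in total and without duplicates."""
    # get_attribute('href') can return None, so skip anything that is not a non-empty string
    cleaned = [link for link in links if isinstance(link, str) and link.strip()]
    return {"total": len(cleaned), "unique": len(set(cleaned))}


def translate_titles(titles: Iterable[str],
                     translate: Callable[[str], str]) -> List[str]:
    """Translate every title, calling `translate` only once per distinct title."""
    cache: Dict[str, str] = {}
    translated = []
    for title in titles:
        if title not in cache:
            cache[title] = translate(title)
        translated.append(cache[title])
    return translated


def make_google_translator(target: str = "hi") -> Callable[[str], str]:
    """Build a translate function (assumption: google-cloud-translate is installed)."""
    # imported lazily so the helpers above work without the package
    from google.cloud import translate_v2

    client = translate_v2.Client()

    def translate(text: str) -> str:
        response = client.translate(
            text, source_language="en", target_language=target, format_="text"
        )
        return response["translatedText"]

    return translate
```

Both helpers accept any iterable of strings, so they should work directly on the columns of the data frame above: `count_links(df['Link'])` returns the counts, and `df['Title (hi)'] = translate_titles(df['Title (eng)'], make_google_translator())` adds a Hindi column. Passing the translator in as an argument keeps the scraping code independent of the translation service and makes it easy to try the logic with a dummy function first.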