python html web-scraping svg beautifulsoup

How to scrape an SVG element from a website using Beautiful Soup?


from bs4 import BeautifulSoup
import requests

id_url = "https://codeforces.com/profile/akash77"
id_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"}
id_page = requests.get(id_url, headers=id_headers)
id_soup = BeautifulSoup(id_page.content, 'html.parser')

id_soup = id_soup.find('svg')
print(id_soup)

I'm getting None as the output for this.

If I parse the <div> element in which this <svg> tag is contained, the contents of the <div> element are not getting printed. The find() works for all HTML tags except the SVG tag.


Solution
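
Beautiful Soup itself has no trouble with <svg> tags; it returns None only when the tag is missing from the HTML it was given. The minimal sketch below illustrates this. It assumes, hypothetically, that the server sends an empty userActivityGraph container and that the browser fills it in later. Both HTML strings are made-up stand-ins, not the real Codeforces markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins: what requests might receive vs. what the browser ends up with.
STATIC_HTML = '<div id="userActivityGraph"></div>'
RENDERED_HTML = (
    '<div id="userActivityGraph">'
    '<svg width="20" height="10"><rect x="0" y="0" width="10" height="10"></rect></svg>'
    '</div>'
)


def find_svg(html):
    """Return the first <svg> tag in html, or None if there is none."""
    return BeautifulSoup(html, "html.parser").find("svg")


static_result = find_svg(STATIC_HTML)
rendered_result = find_svg(RENDERED_HTML)

print(static_result)         # None, the same symptom as in the question
print(rendered_result.name)  # svg
```

If the first case matches what requests returns for the real page, the fix is to obtain the rendered HTML rather than to change the parser.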

  • The page builds the <svg> with JavaScript after it loads, so the HTML that requests downloads does not contain it, which is why find('svg') returns None. You will need Selenium to get the rendered page.

    First, install the libraries:

    pip install selenium
    pip install webdriver-manager
    

    Then you can use Selenium to load the fully rendered page:

    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    from selenium.webdriver.common.by import By
    
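    # Download a matching ChromeDriver binary and start Chrome with it
    s = Service(ChromeDriverManager().install())
    driver = webdriver.Chrome(service=s)
    driver.maximize_window()
    # Load the profile page in the browser so its JavaScript runs
    driver.get('https://codeforces.com/profile/akash77')
    # Find every element whose id is "userActivityGraph" (returns a list)
    elements = driver.find_elements(By.XPATH, '//*[@id="userActivityGraph"]')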
    

    elements is a list of Selenium WebElement objects, so we need to extract the HTML from each one.

    svg = [element.get_attribute('innerHTML') for element in elements]
    

    This gives you a list of HTML strings containing the <svg> and all the elements inside it.


    Sometimes you need to run the browser in headless mode (without opening a Chrome window). To do that, pass the 'headless' option to the driver.

    from selenium.webdriver.chrome.options import Options
    
    options = Options()
    options.add_argument('headless')
    
    # then pass options to the driver
    
    driver = webdriver.Chrome(service=s, options=options)
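
Because the graph is drawn by JavaScript, it may not exist yet at the moment driver.get() returns. The sketch below combines the steps above with an explicit wait and hands the result back to Beautiful Soup. It is one possible arrangement, not the only one. The function names fetch_svg_markup and parse_svg, the 15-second timeout, and the CSS selector "#userActivityGraph svg" are assumptions made for illustration; only the userActivityGraph id comes from the answer above.

```python
from bs4 import BeautifulSoup


def fetch_svg_markup(url, timeout=15):
    """Load url in headless Chrome and return the outer HTML of the graph's <svg>."""
    # Selenium is imported lazily so that parse_svg() below also works without it.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()), options=options
    )
    try:
        driver.get(url)
        # Assumption: the graph is an <svg> nested somewhere inside #userActivityGraph.
        locator = (By.CSS_SELECTOR, "#userActivityGraph svg")
        # Block until the element exists in the DOM, or raise TimeoutException.
        svg_element = WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located(locator)
        )
        return svg_element.get_attribute("outerHTML")
    finally:
        driver.quit()  # always close the browser, even if the wait times out


def parse_svg(markup):
    """Parse an HTML/SVG string and return the first <svg> tag, or None."""
    return BeautifulSoup(markup, "html.parser").find("svg")


# Usage (starts a real browser, so it is left commented out):
# svg = parse_svg(fetch_svg_markup("https://codeforces.com/profile/akash77"))
# print(svg.prettify())
```

Requesting outerHTML of the <svg> itself, rather than innerHTML of its container, means the returned string starts with the <svg> tag, so the find('svg') call from the question works on it unchanged.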