
Extract SVG Elements and Title from HTML using BeautifulSoup


I have a chunk of HTML, extracted using BeautifulSoup, which contains an SVG path and a title.

I am wondering if there is a way to extract the title text and the coordinates from this chunk.

<g class="myclass"><title>mytitle</title><path d="M -5, 0 a 5,5 0 1,0 10,0 a 5,5 0 1,0 -10,0" fill="rgba(255,255,255, 0.1" stroke="rgba(10, 158, 117, 0.8)" stroke-width="3" transform="translate(178.00000000000003, 201)"></path></g>

Below is the function that returns the above chunk (rows):

from bs4 import BeautifulSoup

def scrape_spatial_data(page):
    html = page.inner_html("body")
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("g.myclass")
    return rows

Is there a cleaner way of extracting this using BeautifulSoup?


Solution

  • A reasonable approach is to return a dictionary that holds the title text plus every attribute of the path tag, keyed by attribute name.

    I recommend using lxml (if installed) as the parser, since it is generally faster than the built-in html.parser.

    Try this:

    from playwright.sync_api import Page
    from bs4 import BeautifulSoup as BS
    from functools import cache
    
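    @cache
    def get_parser():
        # Prefer lxml when it is installed; otherwise fall back to the stdlib parser.
        try:
            import lxml  # noqa: F401 (imported only to check availability)
            return "lxml"
        except ImportError:
            pass
        return "html.parser"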
    
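    def scrape_spatial_data(page: Page) -> dict[str, str]:
        """Return the title text and path attributes of the first g.myclass element."""
        html = page.inner_html("body")
        soup = BS(html, get_parser())
        result = {}
        # select_one returns None when nothing matches, so each step is guarded
        if row := soup.select_one("g.myclass"):
            if t := row.select_one("title"):
                result["title"] = t.text
            if p := row.select_one("path"):
                # copy every attribute of the path tag (d, fill, stroke, transform, ...)
                result.update(p.attrs)
        return result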
    

    For the data shown in the question this would return:

    {
      "title": "mytitle",
      "d": "M -5, 0 a 5,5 0 1,0 10,0 a 5,5 0 1,0 -10,0",
      "fill": "rgba(255,255,255, 0.1",
      "stroke": "rgba(10, 158, 117, 0.8)",
      "stroke-width": "3",
      "transform": "translate(178.00000000000003, 201)"
    }