python html beautifulsoup

Python BeautifulSoup - How to crawl links <a> inside values in <td>


I'm learning web scraping and am trying to scrape data from the page linked below. My code already extracts the text of each td cell. Is there a way to also extract the link (the a tag's href) inside each td?

The website link: http://eecs.qmul.ac.uk/postgraduate/programmes/

Here's what I have so far:

from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

table_list = []
rows = soup.find_all('tr')

# For every row in the table, find each cell element and add it to the list
for row in rows:
    row_td = row.find_all('td')
    row_cells = str(row_td)
    row_cleantext = BeautifulSoup(row_cells, "lxml").get_text()
    table_list.append((row_cleantext))

print(table_list)

Solution

  • Fetch the page and collect every td tag:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = "http://eecs.qmul.ac.uk/postgraduate/programmes/"
    html = urlopen(url)
    soup = BeautifulSoup(html, 'lxml')
    main_data = soup.find_all("td")
    

    Iterate over main_data to visit each td tag, call find("a") to get the link inside it, and use .get("href") to extract the URL. For a td that contains no a tag, find returns None, so calling .get on it raises AttributeError; the try-except skips those cells.

    for data in main_data:
        try:
            link=data.find("a").get("href")
            print(link)
        except AttributeError:
            pass
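
    As an alternative to the try-except loop above, a CSS selector can match only the a tags that sit inside a td and carry an href, so there is nothing to catch. The following is a minimal sketch, not the answer's own method: the function name extract_td_links, the sample HTML, and the example.org URLs are made up for illustration, and it uses the built-in "html.parser" so it runs without lxml. Unlike find("a"), select returns every link in a cell, not just the first one.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_td_links(html, base_url):
    """Return (text, absolute_url) pairs for every <a href> inside a <td>."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # "td a[href]" matches <a> tags with an href attribute nested in a <td>
    for a in soup.select("td a[href]"):
        text = a.get_text(strip=True)
        # urljoin turns relative hrefs into absolute URLs; absolute ones pass through
        links.append((text, urljoin(base_url, a["href"])))
    return links


# Hypothetical sample markup, only loosely shaped like the real page
SAMPLE_HTML = """
<table>
  <tr>
    <td>Example Programme</td>
    <td><a href="/courses/example-msc/">X123</a></td>
    <td>no link here</td>
  </tr>
</table>
"""

print(extract_td_links(SAMPLE_HTML, "https://www.example.org/"))
```

    For the real page, the same function could be called with the bytes returned by urlopen(url).read() and the page URL as base_url.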
    

    To see what is being extracted, this version prints the link text along with the href:

    main_data=soup.find_all("td")
    for data in main_data:
        try:
            link=data.find("a")
            print(link.text)
            print(link.get("href"))
        except AttributeError:
            pass
    

    Output:

    H60C
    https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
    H60A
    https://www.qmul.ac.uk/postgraduate/taught/coursefinder/courses/advanced-electronic-and-electrical-engineering-msc/
    
    ..
    

    To put the results in a table, you can use the pandas module:

    import pandas as pd

    main_data=soup.find_all("td")
    dict1={}  # maps link text -> href
    for data in main_data:
        try:
            link=data.find("a")
            dict1[link.text]=link.get("href")
        except AttributeError:
            # this td has no a tag, so find returned None
            pass
    df=pd.DataFrame(list(dict1.items()),columns=["Text","Link"])
    

    Output:

        Text    Link
    0   H60C    https://www.qmul.ac.uk/postgraduate/taught/cou...
    1   H60A    https://www.qmul.ac.uk/postgraduate/taught/cou...
    2   I4U2    https://www.qmul.ac.uk/postgraduate/taught/cou...
    ..
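
    The dictionary approach above keeps only the link text and href, so the programme name from the same row is lost, and two links with the same text would overwrite each other. A possible sketch that walks the table row by row instead is shown below. It assumes the first td of each row holds the programme name (as the table on the page appears to); rows_with_links, the column names, and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def rows_with_links(html):
    """Return one record per link, tagged with the first cell of its row."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if not cells:
            continue  # rows made only of <th> cells have no <td>
        # Assumption: the first cell in a row is the programme name
        programme = cells[0].get_text(strip=True)
        for a in row.select("td a[href]"):
            records.append(
                {
                    "Programme": programme,
                    "Code": a.get_text(strip=True),
                    "Link": a["href"],
                }
            )
    return records


# Hypothetical sample markup for illustration only
SAMPLE_HTML = """
<table>
  <tr><th>Programme</th><th>Part-time</th><th>Full-time</th></tr>
  <tr>
    <td>Example Programme</td>
    <td><a href="https://www.example.org/pt/">X123</a></td>
    <td><a href="https://www.example.org/ft/">X124</a></td>
  </tr>
</table>
"""

records = rows_with_links(SAMPLE_HTML)
for record in records:
    print(record)
```

    The resulting list of dictionaries can be passed straight to pd.DataFrame(records) if a table is wanted.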
    

    To read the table directly from the website (cell text only, without the links), use pandas.read_html:

    import pandas as pd
    data=pd.read_html("http://eecs.qmul.ac.uk/postgraduate/programmes/")
    df=data[0]
    df
    

    Output:

        Postgraduate degree programmes                  Part-time(2 year)  Full-time(1 year)
    0   Advanced Electronic and Electrical Engineering  H60C               H60A
    1   Artificial Intelligence                         I4U2               I4U1
    .....
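
    Note that read_html as called above returns only the cell text. If the installed pandas is version 1.5 or newer, read_html also accepts an extract_links argument that keeps each href next to its text. The sketch below assumes such a version and an installed HTML parser (lxml, as used earlier); the sample HTML and example.org URL are made up for illustration.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample markup for illustration only
SAMPLE_HTML = """
<table>
  <tr><th>Programme</th><th>Part-time</th></tr>
  <tr>
    <td>Example Programme</td>
    <td><a href="https://www.example.org/pt/">X123</a></td>
  </tr>
</table>
"""

# With extract_links="body", every body cell becomes a (text, href) tuple;
# href is None for cells that contain no link. Requires pandas >= 1.5.
tables = pd.read_html(StringIO(SAMPLE_HTML), extract_links="body")
df = tables[0]

# Split the tuples of one column into separate text and link columns
df["Part-time code"] = df["Part-time"].map(lambda cell: cell[0])
df["Part-time link"] = df["Part-time"].map(lambda cell: cell[1])

print(df[["Part-time code", "Part-time link"]])
```

    For the real page, the URL string from the answer could be passed in place of StringIO(SAMPLE_HTML).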