python html web-scraping url beautifulsoup

Extracting text under a certain section from the source code of a URL using BeautifulSoup in Python


I'm a beginner in Python and don't really have any experience with HTML. I just saw a YouTube video (https://www.youtube.com/watch?v=kEItYHtqQUg&ab_channel=edureka%21) about web scraping and got interested in extracting text from a URL in Python.

I tried to practice on links from a random database. This is the URL and the code I used: https://rtk.rjifuture.org/rmp/facility/100000028301

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "https://rtk.rjifuture.org/rmp/facility/100000028301"
html = urlopen(url)

soup = BeautifulSoup(html, "html.parser")

# Collect every <div class="col"> on the page (find_all is the current name for findAll)
all_links = soup.find_all('div', {'class': 'col'})
# Turn the list into one string, strip the tags, then split the text on commas
str_cells = str(all_links)
cleartext = BeautifulSoup(str_cells, "html.parser").get_text().split(',')

Let's say I want to extract the Address under Location. With the code above, I can get the address with print(cleartext[7]).

But when I tried the same thing with another link in the same database, such as https://rtk.rjifuture.org/rmp/facility/100000083214, it didn't work as well, because the first part of the webpage (the section right under the name of the facility) was structured slightly differently, so the address was no longer at index 7. It also broke whenever one of the fields before the address contained a comma.

Is there a way to target the Address under the Location section and extract the text from it?


Solution

  • For URL 1, first find all div elements with the given class and select the Location section by its index. Then find the address div inside it and extract its text with the get_text() method:

    import requests
    from bs4 import BeautifulSoup

    res = requests.get("https://rtk.rjifuture.org/rmp/facility/100000028301")
    soup = BeautifulSoup(res.text, "html.parser")

    # Index 1 selects the Location section; its first "col" div holds the address
    soup.find_all("div", class_="container-fluid rmp-section")[1].find("div", class_="col").get_text(strip=True)
    

    Output:

    '308 Timmons StreetSnow Hill, MD 21863'
    

    URL 2:

    import requests
    from bs4 import BeautifulSoup

    res = requests.get("https://rtk.rjifuture.org/rmp/facility/100000083214")
    soup = BeautifulSoup(res.text, "html.parser")

    # Same lookup as for URL 1; only the URL changes
    soup.find_all("div", class_="container-fluid rmp-section")[1].find("div", class_="col").get_text(strip=True)
    

    Output:

    '2.5 miles E of Hwy 59 on Co. Rd VKit Carson, CO 80825'
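
The index-based lookup above still depends on the Location section always being the second rmp-section container, and get_text(strip=True) joins the street line and the city line with nothing in between (hence "StreetSnow Hill"). As a rough sketch of a more defensive variant, the function below picks the section by its heading text instead of its position and passes a separator to get_text(). The sample markup, the h2 heading, the "Location" label, the col-3 class, and the name find_address are all assumptions for illustration; only the rmp-section and col class names come from the code above, so check the real page source before relying on this.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on the class names used in the answer.
# The real page's headings and nesting may differ.
SAMPLE_HTML = """
<div class="container-fluid rmp-section">
  <h2>Summary</h2>
  <div class="row"><div class="col">Some text, with a comma</div></div>
</div>
<div class="container-fluid rmp-section">
  <h2>Location</h2>
  <div class="row">
    <div class="col-3">Address</div>
    <div class="col">308 Timmons Street<br>Snow Hill, MD 21863</div>
  </div>
</div>
"""


def find_address(html, section_title="Location"):
    """Return the text of the first "col" div in the section whose heading
    matches section_title, or None if no such section is found."""
    soup = BeautifulSoup(html, "html.parser")
    # class_="rmp-section" matches any div that has this class among its classes
    for section in soup.find_all("div", class_="rmp-section"):
        heading = section.find(["h1", "h2", "h3", "h4"])
        if heading is None or heading.get_text(strip=True) != section_title:
            continue
        col = section.find("div", class_="col")
        if col is not None:
            # separator keeps the two address lines from being fused together
            return col.get_text(separator=", ", strip=True)
    return None


print(find_address(SAMPLE_HTML))
```

With the real site you would pass requests.get(url).text instead of SAMPLE_HTML. Because the lookup no longer counts commas or positions, a comma in an earlier field or an extra section above Location should not shift the result, provided the heading assumption holds.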