python web-scraping beautifulsoup

Extract text that is not part of an inner div


I have this code that extracts too much text. I am trying to extract only the title from top-content.

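from bs4 import BeautifulSoup
import requests

r = requests.get("https://education.maharashtra.gov.in/saral/27230500360")
data = r.text
# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(data, "html.parser")
# Returns the whole top-content div, including the text of its inner divs.
soup.find("div", {"class": "top-content"})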

How do I extract only the name of the school, which is not part of the inner div? Expected output:

BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE (27230500360) 

Update:

Is it possible to save the text as a dict, keyed by the school code?

{27230500360 : "BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE"} 

Solution

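  • Try this. It selects the element with id "logo", collapses the whitespace in its text, and uses the last token (the code in parentheses) as the key and everything before it as the value: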

    from bs4 import BeautifulSoup
    import requests
    
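    req = requests.get("https://education.maharashtra.gov.in/saral/27230500360")
    soup = BeautifulSoup(req.text, "lxml")  # needs the lxml package installed
    for item in soup.select("#logo"):
        # Collapse newlines and repeated spaces into single spaces.
        data = ' '.join(item.text.split())
        # Last token is the code in parentheses; the rest is the school name.
        item_dict = {data.split(" ")[-1]:' '.join(data.split(" ")[:-1])}
        print(item_dict)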
    

    Output:

    {'(27230500360)': 'BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL AND JR COLLEGE'}
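
The key above still carries the parentheses, whereas the question asked for the bare number. As a rough sketch of one way to handle that, the title text can be parsed with a regular expression once it has been extracted. The helper name title_to_dict is made up for this illustration, and the sample string is copied from the expected output in the question rather than fetched from the site; it assumes the text always ends with a numeric code in parentheses.

```python
import re

def title_to_dict(title):
    """Split 'SCHOOL NAME (CODE)' into {CODE: 'SCHOOL NAME'} (hypothetical helper)."""
    # Collapse runs of whitespace, as the answer's ' '.join(text.split()) does.
    cleaned = " ".join(title.split())
    # Assumption: the text ends with a numeric code wrapped in parentheses.
    match = re.fullmatch(r"(.+?)\s*\((\d+)\)", cleaned)
    if match is None:
        raise ValueError(f"unexpected title format: {cleaned!r}")
    name, code = match.groups()
    # Convert the code to int so the key matches the dict shown in the question.
    return {int(code): name}

# Sample text taken from the question, with extra whitespace added on purpose.
sample = "  BHARATI VIDYAMANDIR HINDI NIGHT HIGH SCHOOL\n AND JR COLLEGE (27230500360) "
print(title_to_dict(sample))
```

In the loop above, this could be called as title_to_dict(item.text) in place of building item_dict by hand. Raising on an unexpected format is a design choice for the sketch; returning an empty dict would work equally well if some pages lack a code.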