python web-scraping beautifulsoup python-requests html-content-extraction

Extract information from a PDF embedded in a web page using Python and requests


I am trying to extract some information from a PDF embedded in a web page using Python and requests. This is exactly the sentence I want to reach: « Sciences de la vie et de l’environnement ».

image

Here is the code I wrote:

import requests
from bs4 import BeautifulSoup

# Page to scrape
url = "https://fs.uit.ac.ma/avis-de-soutenance-dune-these-de-doctorat-mme-achachi-hind/"

with requests.Session() as s:
    # Fetch the page (certificate verification is disabled)
    html_content = s.get(url, verify=False)
    # Parse the HTML content
    soup = BeautifulSoup(html_content.content, "html.parser")
    # Take the src of the first iframe, which embeds the PDF viewer
    url2 = soup.iframe["src"]
    # Fetch the viewer page and print its HTML
    html_doc = s.get(url2, verify=False).text
    print(html_doc)

Here is part of what print(html_doc) outputs:

Print result

When I compare the two pictures, I can't see what is inside this element in the last picture:

<div id="viewer" class="pdfViewer"></div>

The text I want should be inside this element:

The line I want to reach


Solution

  • You can access the PDF directly (https://fs.uit.ac.ma/wp-content/uploads/2022/02/AVIS-DE-SOUTENANCE-ACHACHI-HIND.pdf). Its URL appears in the iframe and in the request the viewer makes. If there is no way to get the URL from the source code, you have to capture the network requests (e.g. with BrowserMob).
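
A minimal sketch of that approach, under stated assumptions: the iframe src either points straight at a .pdf file or passes the document in a `file` query parameter (the PDF.js viewer convention; this is not confirmed for this site), and the third-party pypdf package is used to read the text. The helper names `pdf_url_from_iframe_src` and `extract_pdf_text` are hypothetical, introduced only for illustration.

```python
import io
from urllib.parse import urljoin, urlparse, parse_qs


def pdf_url_from_iframe_src(iframe_src, page_url):
    """Return the PDF URL referenced by an iframe src, or None if none is found.

    Assumption: the src is either a direct link to a .pdf file or a viewer URL
    that carries the document in a `file` query parameter (PDF.js convention).
    """
    absolute_src = urljoin(page_url, iframe_src)
    parsed = urlparse(absolute_src)
    if parsed.path.lower().endswith(".pdf"):
        return absolute_src
    file_values = parse_qs(parsed.query).get("file")  # values are percent-decoded
    if file_values:
        return urljoin(absolute_src, file_values[0])
    return None


def extract_pdf_text(page_url):
    """Download the PDF embedded in page_url and return its text (needs network)."""
    # Third-party packages, imported here so the helper above stays stdlib-only.
    import requests
    from bs4 import BeautifulSoup
    from pypdf import PdfReader

    with requests.Session() as session:
        # verify=False mirrors the question; it disables certificate checks.
        response = session.get(page_url, verify=False)
        soup = BeautifulSoup(response.content, "html.parser")
        pdf_url = pdf_url_from_iframe_src(soup.iframe["src"], page_url)
        if pdf_url is None:
            raise ValueError("no PDF URL found in the iframe src")
        pdf_bytes = session.get(pdf_url, verify=False).content

    reader = PdfReader(io.BytesIO(pdf_bytes))
    # extract_text() may return an empty result for pages without a text layer.
    return "\n".join(pdf_page.extract_text() or "" for pdf_page in reader.pages)
```

Calling `extract_pdf_text(url)` with the page URL from the question and then checking whether "Sciences de la vie et de l’environnement" occurs in the returned string would be the intended use. If the PDF is a scanned image with no text layer, the extraction will likely return nothing useful and OCR would be needed instead.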