python html pdf web-scraping pdftotext

How to convert Web PDF to Text


I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf, and many more like it, into text without saving them to my PC. Thousands of such announcements come up daily, so I don't want to store the files locally. Is there a Python solution for this? Thanks.


Solution

  • There are different ways to do this, but the simplest is to download the PDF locally and then use a Python module to extract the text.

    Here is a simple code example for that (using pdfplumber):

    from urllib.request import urlopen
    import pdfplumber

    url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

    # Download the PDF and save it locally (the file is overwritten on every run)
    with urlopen(url) as response, open("img.pdf", 'wb') as file:
        file.write(response.read())

    try:
        pdf = pdfplumber.open('img.pdf')
    except Exception:
        # Some files are not PDFs (annexes we don't want), or the PDF is damaged
        print('Error. Are you sure this is a PDF?')
    else:
        # pdfplumber text extraction (first page only)
        with pdf:
            page = pdf.pages[0]
            text = page.extract_text()
    

    EDIT: My bad, I just realised you asked "without saving it to my PC". That said, I also scrape a lot of PDFs (thousands as well), but I save them all as "img.pdf", so each download replaces the previous one and I only ever end up with one PDF file on disk. I don't have a solution for extracting the text without saving the file at all. Sorry about that :'(
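
For what it's worth, one possible way to avoid the disk write entirely is to keep the downloaded bytes in memory and hand them to pdfplumber through an `io.BytesIO` buffer. This is only a minimal sketch, not something tested against the NSE site: it assumes `pdfplumber.open` accepts a file-like object (its documentation says it takes a path or a file object), and the helper names `looks_like_pdf`, `pdf_bytes_to_text` and `pdf_url_to_text` are made up for illustration.

```python
import io
from urllib.request import urlopen


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF files normally start with the '%PDF-' marker."""
    return data.startswith(b"%PDF-")


def pdf_bytes_to_text(data: bytes) -> str:
    """Extract the text of every page from a PDF held in memory."""
    if not looks_like_pdf(data):
        raise ValueError("Downloaded data does not look like a PDF")
    # Third-party dependency, imported here so the check above needs only the stdlib.
    import pdfplumber

    # BytesIO wraps the bytes in a file-like object, so nothing touches the disk.
    with pdfplumber.open(io.BytesIO(data)) as pdf:
        # extract_text() may return None for a page with no text layer.
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def pdf_url_to_text(url: str) -> str:
    """Download a PDF and return its text without saving a file."""
    with urlopen(url) as response:
        return pdf_bytes_to_text(response.read())
```

Calling `pdf_url_to_text(url)` with the URL from the question would then return the text as a single string. Scanned PDFs without a text layer would still come back empty, since this does no OCR, and a server may need extra request headers that this sketch does not set.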