
Unable to collect all the lines under transactions from a pdf file


I'm trying to extract all the lines under the transactions table from this PDF file. The script I've written only captures the first line under the first and last headers. How can I collect all the transaction lines from that page?

import os
import io
import re
import requests
import pdfplumber

pdf_url = 'https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf'

response = requests.get(pdf_url)

with io.BytesIO(response.content) as f:
    with pdfplumber.open(f) as pdf:
        text_content = ""
        for page in pdf.pages:
            text_content += page.extract_text()

pattern = r'(?:iD owner asset transaction Date notification amount cap\.\s*type Date gains >\s*\$200\?\s*|iD owner asset transaction Date notification(?: amount)?\s*type Date\s*)\s*([^\n]+)'
matches = re.findall(pattern, text_content, re.IGNORECASE | re.DOTALL)
for match in matches:
    print(match.strip())

Current output:

JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000
FIlINg STATuS: New
u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000

For your reference, this is the type of line I'm interested in:

Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000


Solution

  • Perhaps you can use a simpler strategy: keep every line that contains a $:

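    import io
    
    import pdfplumber
    import requests
    
    pdf_url = "https://disclosures-clerk.house.gov/public_disc/ptr-pdfs/2016/20005444.pdf"
    
    response = requests.get(pdf_url)
    
    with io.BytesIO(response.content) as f:
        with pdfplumber.open(f) as pdf:
            out = []
            for page in pdf.pages:
                # extract_text() may yield nothing for a page without text
                for line in (page.extract_text() or "").splitlines():
                    # transaction rows in this document contain a dollar amount
                    if "$" in line:
                        # drop the owner code; str.removeprefix needs Python 3.9+
                        out.append(line.removeprefix("JT "))
    
    print(out)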
    

    Prints:

    [
        "Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000",
        "Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "barrick gold Corporation (AbX) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Eldorado gold Corporation Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "First Trust ISE-Revere Natural gas S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $15,001 - $50,000",
        "goldcorp Inc. (gg) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Kinross gold Corporation (KgC) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Newmont Mining Corporation (NEM) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "North American Palladium, ltd. (PAl) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Pan American Silver Corp. (PAAS) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Pilot gold, Inc Ordinary Shares (PlgTF) S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Pinetree Capital ltd Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Rare Element Resources ltd. Ordinary S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "Silver Wheaton Corp Common Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
        "SPdR S&P International dividend ETF P 07/1/2016 07/1/2016 $1,001 - $15,000",
        "u.S. global Jets ETF (JETS) P 07/1/2016 07/1/2016 $1,001 - $15,000",
        "Yamana gold Inc. Ordinary Shares S 06/29/2016 06/30/2016 $1,001 - $15,000",
    ]
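
If you also need the individual fields rather than whole lines, one possible follow-up is to run a regular expression over each collected line. This is only a sketch based on the lines printed above: the `parse_transaction` helper and the `TRANSACTION_RE` pattern are hypothetical names, the owner codes (`JT`, `SP`, `DC`) and single-letter transaction types (`P`, `S`, `E`) are assumptions about this form, and amounts are assumed to look like `$1,001 - $15,000`. Lines that do not fit the pattern return `None` instead of raising.

```python
import re

# Assumed layout: [owner] asset type transaction-date notification-date amount
# Owner codes and type letters are guesses based on the sample output.
TRANSACTION_RE = re.compile(
    r"^(?:(?P<owner>JT|SP|DC)\s+)?"                    # optional owner code
    r"(?P<asset>.+?)\s+"                               # asset name (lazy)
    r"(?P<type>[PSE])\s+"                              # transaction type letter
    r"(?P<transaction_date>\d{1,2}/\d{1,2}/\d{4})\s+"
    r"(?P<notification_date>\d{1,2}/\d{1,2}/\d{4})\s+"
    r"(?P<amount>\$[\d,]+(?:\s*-\s*\$[\d,]+)?)"        # "$1,001 - $15,000"
)


def parse_transaction(line):
    """Return a dict of fields for a transaction line, or None if it doesn't match."""
    match = TRANSACTION_RE.match(line.strip())
    return match.groupdict() if match else None


sample = "JT Agnico Eagle Mines limited (AEM) S 06/29/2016 06/30/2016 $15,001 - $50,000"
print(parse_transaction(sample))
```

Because the dates anchor the pattern, a line such as `FIlINg STATuS: New` is rejected, so this could also replace the `"$" in line` check if that filter turns out to be too loose on other filings.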