python string web-scraping beautifulsoup numbers

Find Location of All Numbers with a Comma


I have been scraping some HTML pages with Beautiful Soup, trying to extract some updated financial data. I only care about numbers that contain a comma, e.g. 100,000 or 12,000,000, but not 450. The goal is to find the location of the comma-separated numbers within a string, and then extract the entire sentence each one appears in.

I moved the entire scrape into a list of strings, and from that list I want to extract all numbers that contain a comma.

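import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
text = soup.find_all(text=True)  # every text node in the page
strings = []
for i in range(len(text)):
    text_s = str(text[i])   # convert the text node to a plain string
    strings.append(text_s)  # append the string itself, not the whole list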

I thought about the following re code, but I am not sure whether it will extract all instances, i.e. a single string in the list may contain several comma-separated numbers.

number  = re.sub('[^>0-9,]', "", text)

Any thoughts would be a huge help! Thank you


Solution

  • You can use:

    from bs4 import BeautifulSoup
    import requests, re
    
    url = 'https://www.sec.gov/Archives/edgar/data/354950/000035495020000024/hd-2020proxystatement.htm'
    soup = BeautifulSoup(requests.get(url).text, "html5lib")
    for el in soup.find_all(True):  # loop over every element in the page
        if re.search(r"(?=\d+,\d+).*", el.text):
            print(el.text)
            # print("END OF ELEMENT\n") # debug only
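    If you also need the position of each number and only the sentence it sits in (rather than the whole element text), something along these lines might work. This is only a rough sketch: the names COMMA_NUMBER, find_comma_numbers and sentences_with_comma_numbers and the sample text are made up for illustration, the pattern assumes commas are used as thousands separators (groups of three digits), and the sentence split is naive (it breaks at ".", "!" or "?" followed by whitespace, so abbreviations like "Inc." will split too early).

```python
import re

# Assumption: a "comma number" uses commas as thousands separators,
# e.g. 100,000 or 12,000,000. A plain 450 has no comma and is skipped.
COMMA_NUMBER = re.compile(r"\b\d{1,3}(?:,\d{3})+\b")


def find_comma_numbers(text):
    """Return a (start, end, number) tuple for every comma-grouped number."""
    return [(m.start(), m.end(), m.group()) for m in COMMA_NUMBER.finditer(text)]


def sentences_with_comma_numbers(text):
    """Return the sentences that contain at least one comma-grouped number."""
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if COMMA_NUMBER.search(s)]


# Hypothetical sample text standing in for one scraped string.
sample = (
    "The company opened 450 stores. Revenue rose to 12,000,000 dollars "
    "and 100,000 shares were granted. No figures here."
)

for start, end, number in find_comma_numbers(sample):
    print(start, end, number)

for sentence in sentences_with_comma_numbers(sample):
    print(sentence)
```

    re.finditer gives you a match object per hit, so multiple numbers in the same string are all reported, each with its own start and end offsets. To run it over the page, you could pass el.text (or each entry of your strings list) to these functions instead of the sample.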