I am trying to scrape the address from a 10-Q filing document in HTML: https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm
The page has many nested div elements, and I want to scrape the address, which sits inside a span.
Expected output:
1600 Amphitheatre parkway
I have tried a few things, like the code below:
from requests_html import HTMLSession
s = HTMLSession()
r = s.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
r
add1 = r.html.find_all('div')
add1
However, if you inspect the page, it has many layers. I am new to HTML and Python. Please help.
You could do it like this, but I'm not sure it's very robust, or applicable to many examples given how the ids look...
from requests_html import HTMLSession
from bs4 import BeautifulSoup
# Download the filing
session = HTMLSession()
page = session.get('https://www.sec.gov/Archives/edgar/data/1652044/000165204419000032/goog10-qq32019.htm')
# Parse the HTML and look up the element that holds the address by its id
soup = BeautifulSoup(page.content, 'html.parser')
content = soup.find(id="d92517213e644-wk-Fact-0B11263160365DBABCF89969352EE602")
print(content.text)
Output:
1600 Amphitheatre Parkway
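If the generated id is the fragile part, one possible alternative is to match on the element's name attribute instead. This is only a sketch under an assumption I have not verified against this filing: that the address is tagged as an inline XBRL fact whose name is dei:EntityAddressAddressLine1. The sample markup, the FactFinder class, and the find_fact helper below are made up for illustration, and the code uses only the standard library so it runs without extra packages. With BeautifulSoup the equivalent lookup would be soup.find(attrs={"name": "dei:EntityAddressAddressLine1"}).

```python
from html.parser import HTMLParser

# Hypothetical fragment imitating inline XBRL markup; the real filing may differ.
SAMPLE_HTML = """
<div><div><span>
<ix:nonNumeric name="dei:EntityAddressAddressLine1" id="some-generated-id">1600 Amphitheatre Parkway</ix:nonNumeric>
</span></div></div>
"""


class FactFinder(HTMLParser):
    """Collect the text of the first element whose name attribute matches."""

    def __init__(self, concept):
        super().__init__()
        self.concept = concept
        self.tag = None    # tag name of the matching element, once found
        self.depth = 0     # > 0 while we are inside the matching element
        self.done = False  # True once the matching element has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.tag is None:
            if dict(attrs).get("name") == self.concept:
                self.tag = tag
                self.depth = 1
        elif tag == self.tag:
            # Same tag nested inside the match: track it so we stop at the right close
            self.depth += 1

    def handle_endtag(self, tag):
        if self.done or self.tag is None or tag != self.tag:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def find_fact(html, concept="dei:EntityAddressAddressLine1"):
    """Return the text of the first element tagged with `concept`, or None."""
    finder = FactFinder(concept)
    finder.feed(html)
    text = "".join(finder.parts).strip()
    return text or None


print(find_fact(SAMPLE_HTML))
```

To try it on the real page, pass page.text from the snippet above to find_fact. It returns None when nothing matches, so you can tell a missing tag apart from an empty one.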
Edit: I didn't see @baduker's answer, and I didn't know there was an API. He is right: use the API.