I am trying to use Beautiful Soup to grab the text in the Properties section of a 10-K SEC filing on EDGAR.
I can find the Properties section header and work my way up its parent nodes, but from there next_sibling does not give me the next sibling element (which in this case I believe contains the first paragraph of text in the section). Can someone tell me why this is not working and how to fix it?
Code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459020004475/tsla-10k_20191231.htm'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
properties_header = soup.find_all('p', text="PROPERTIES")[0]
print(properties_header.parent.parent.parent.parent.next_sibling)
Expected Result:
<p style="margin-top:4pt;margin-bottom:0pt;text-indent:5.24%;font-family:Times New Roman;font-size:10pt;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">We are headquartered in Palo Alto, California. Our principal facilities include a large number of properties in North America, Europe and Asia utilized for manufacturing and assembly, warehousing, engineering, retail and service locations, Supercharger sites, and administrative and sales offices. Our facilities are used to support both of our reporting segments, and are suitable and adequate for the conduct of our business. We primarily lease such facilities with the exception of some manufacturing facilities. The following table sets forth the location of our primary owned and leased manufacturing facilities.</p>
The first next_sibling is a NavigableString, not a tag: text between two elements counts as a sibling node too. Chain a second next_sibling to reach the following p.
print(properties_header.parent.parent.parent.parent.next_sibling.next_sibling)
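If you would rather not count text nodes, find_next_sibling skips them and returns the next matching tag. Here is a minimal sketch on a made-up snippet (the HTML below is a hypothetical stand-in, not the real filing, and it uses the built-in html.parser so it runs without lxml):

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical miniature document standing in for the filing's markup.
html = """<div><div id="header"><p>PROPERTIES</p></div>
<p>We are headquartered in Palo Alto, California.</p></div>"""

soup = BeautifulSoup(html, "html.parser")
header_div = soup.find("p", string="PROPERTIES").parent

# next_sibling returns the very next node, which here is the text
# (a newline) sitting between the two tags.
first = header_div.next_sibling
print(type(first).__name__, repr(first))

# find_next_sibling only considers tags, so the text node is skipped.
para = header_div.find_next_sibling("p")
print(para.get_text())
```

In the original code the same idea would be properties_header.parent.parent.parent.parent.find_next_sibling('p'), assuming the paragraph you want really is a sibling at that level.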