I'm trying to redact phone numbers from an HTML file. I can identify all of the phone numbers easily enough, but I can't figure out why I am unable to replace the ones that contain parentheses. Sample below:
import re
from bs4 import BeautifulSoup
text = '''<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (555) 111-1111 | myemail@gmail.com
Some Category / Some Name: 555-222-2222 | Record Number#:
</html>'''
soup = BeautifulSoup(text, 'html.parser')
def find_phone_numbers(text):
    phones = re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", text)
    return phones
phones = find_phone_numbers(str(soup))
print(phones)
for i in phones:
    target = soup.find_all(text=re.compile(i, re.I))
    try:
        for v in target:
            v.replace_with(v.replace(i,'(XXX) XXX-XXXX'))
    except TypeError:
        pass
print(soup)
These are my results from running the above:
['(555) 555-5555', '(555) 111-1111', '555-222-2222']
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (555) 111-1111 | myemail@gmail.com
Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
</div></body></html>
You can use .find_all(text=True) to obtain all text content from the HTML soup, and then replace it with re.sub (that way, you preserve all tags, including <li>):
for content in soup.find_all(text=True):
    s = re.sub(r'(\(?\d{3}\)?)([\s.-]*)(\d{3})([\s.-]*)(\d{4})', '(XXX) XXX-XXXX', content)
    content.replace_with(s)
print(soup)
Prints:
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (XXX) XXX-XXXX</li>
</ul><b>Category:</b> <ul><li>Title 2 </li><li>Fake Info</li></ul>
City, MO 11111 | (XXX) XXX-XXXX | myemail@gmail.com
Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
</div></body></html>
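As for why the original loop skipped the numbers with parentheses: the most likely cause is that each found number was passed to re.compile() unescaped, so "(555)" was read as a capture group rather than as literal parentheses. The sketch below reproduces that with the re module alone. The redact() helper and its mask parameter are hypothetical names introduced for illustration, not part of the original code.

```python
import re

phone = "(555) 555-5555"
line = "Phones: (555) 555-5555"

# Unescaped: the parentheses form a capture group, so the pattern
# effectively looks for "555 555-5555" and never matches a literal "(".
unescaped = re.compile(phone)
print(unescaped.search(line))  # None

# Escaped: re.escape() turns every metacharacter into a literal.
escaped = re.compile(re.escape(phone))
print(escaped.search(line) is not None)  # True


def redact(text, numbers, mask="(XXX) XXX-XXXX"):
    """Replace each literal phone number in text with mask (hypothetical helper)."""
    for number in numbers:
        text = re.sub(re.escape(number), mask, text)
    return text


print(redact("City, MO 11111 | (555) 111-1111 | myemail@gmail.com",
             ["(555) 111-1111"]))
# City, MO 11111 | (XXX) XXX-XXXX | myemail@gmail.com
```

Under the same assumption, changing the original call to re.compile(re.escape(i), re.I) should let soup.find_all() locate the parenthesized numbers as well. The single re.sub pass shown above avoids the problem entirely, because it never builds a pattern from matched text.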