python html beautifulsoup phone-number redaction

Beautiful Soup Can't Redact Phone Number with Parentheses


I'm trying to redact phone numbers from an HTML file. I can identify all of the phone numbers easily enough, but I can't figure out why I am unable to replace the ones that have parentheses in them. Sample below:

import re
from bs4 import BeautifulSoup

text = '''<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2    </li><li>Fake Info</li></ul>

 City, MO 11111 | (555) 111-1111 | myemail@gmail.com

 Some Category / Some Name: 555-222-2222 | Record Number#: 

 </html>'''

soup = BeautifulSoup(text, 'html.parser')

def find_phone_numbers(text):
    phones = re.findall(r"((?:\d{3}|\(\d{3}\))?(?:\s|-|\.)?\d{3}(?:\s|-|\.)\d{4})", text)
    return phones

phones = find_phone_numbers(str(soup))

print(phones)

for i in phones:
    target = soup.find_all(text=re.compile(i, re.I))
    try:
        for v in target:
            v.replace_with(v.replace(i,'(XXX) XXX-XXXX'))
    except TypeError:
        pass

print(soup)

These are my results from running the above:

['(555) 555-5555', '(555) 111-1111', '555-222-2222']
<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Big Title</title>
<style type="text/css">
.parsed {font-size: 75%; color: #474747;}
</style>
</head>
<body>
<div class="parsed">
<h1>Redacted Redacted</h1>
<h2> Contact Info</h2>
<ul>
<li>Position Title: My Fake Title</li>
<li>Email: Redacted@gmail.com</li>
<li>Phones: (555) 555-5555</li>
</ul><b>Category:</b> <ul><li>Title 2    </li><li>Fake Info</li></ul>

 City, MO 11111 | (555) 111-1111 | myemail@gmail.com


 Some Category / Some Name: (XXX) XXX-XXXX | Record Number#: 

 </div></body></html>

Solution

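  • The original loop fails because each found number is passed to re.compile unescaped, so the parentheses in (555) 555-5555 are read as a regex group rather than as literal characters, and the compiled pattern never matches the text. To avoid this, you can use .find_all(text=True) to obtain every text node in the soup (newer Beautiful Soup versions call this argument string) and then substitute with re.sub. That way you preserve all tags, including <li>: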

    for content in soup.find_all(text=True):
        s = re.sub(r'(\(?\d{3}\)?)([\s.-]*)(\d{3})([\s.-]*)(\d{4})', '(XXX) XXX-XXXX', content)
        content.replace_with(s)
    
    print(soup)
    

    Prints:

    <html>
    <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
    <title>Big Title</title>
    <style type="text/css">
    .parsed {font-size: 75%; color: #474747;}
    </style>
    </head>
    <body>
    <div class="parsed">
    <h1>Redacted Redacted</h1>
    <h2> Contact Info</h2>
    <ul>
    <li>Position Title: My Fake Title</li>
    <li>Email: Redacted@gmail.com</li>
    <li>Phones: (XXX) XXX-XXXX</li>
    </ul><b>Category:</b> <ul><li>Title 2    </li><li>Fake Info</li></ul>
    
     City, MO 11111 | (XXX) XXX-XXXX | myemail@gmail.com
    
     Some Category / Some Name: (XXX) XXX-XXXX | Record Number#:
    
     </div></body></html>
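
To see why the numbers with parentheses were skipped in the question's loop, here is a minimal sketch using only the re module. The variable names (phone, line, unescaped, escaped, redacted) are illustrative and not from the original post. It compares the pattern compiled as-is with the same pattern passed through re.escape, which is one possible minimal fix rather than the approach used in the answer above.

```python
import re

phone = "(555) 555-5555"          # one of the strings returned by find_phone_numbers
line = "Phones: (555) 555-5555"   # the text it was found in

# Unescaped, "(555)" is a capture group around the digits 555, so the
# pattern looks for "555 555-5555" and never matches the literal parentheses.
unescaped = re.compile(phone)
print(unescaped.search(line))            # None

# re.escape backslash-escapes the metacharacters so they match literally.
escaped = re.compile(re.escape(phone))
print(escaped.search(line).group(0))     # (555) 555-5555

redacted = escaped.sub("(XXX) XXX-XXXX", line)
print(redacted)                          # Phones: (XXX) XXX-XXXX
```

In the question's loop, the equivalent change would be soup.find_all(text=re.compile(re.escape(i))); the v.replace(i, ...) call already treats i as a literal string, so it needs no change.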