I'm using Tika to parse IP addresses from a PDF file. Below is my code:
import tika
from tika import parser
import re
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    tika.initVM()
    # opening pdf file
    parsed_pdf = parser.from_file("static_hosts.pdf")
    text = parsed_pdf["content"]
    regex = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
    match = re.findall(regex, text)
    print(match)
I tested the regex online and it works properly there. I also tried these variants, but none of them work:
regex = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
regex = ^'(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
regex = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
Could you please show me what I missed?
Thank you. Huy
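Try this code. The most likely culprit is the $ anchor at the end of your pattern: without re.MULTILINE it matches only at the very end of the whole text (or just before a final newline), so re.findall returns nothing unless an IP address happens to be the last thing in the document. Online testers often enable the multiline flag by default, which would explain why the pattern worked there. The version below replaces $ with word boundaries (\b) so that addresses are matched anywhere in the text: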
import tika
from tika import parser
import re
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    tika.initVM()
    # opening pdf file
    parsed_pdf = parser.from_file("static_hosts.pdf")
    text = parsed_pdf["content"]
    regex = r'\b((?:\d{1,3}\.){3}\d{1,3})\b' # modified regex with capturing group
    matches = re.finditer(regex, text)
    ip_addresses = [match.group(0) for match in matches]
    print(ip_addresses)
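
If you want to check the behaviour without Tika, here is a minimal sketch. It assumes the extracted text looks roughly like sample_text below, which is made up for illustration, and is_valid_ipv4 is a hypothetical helper name, not part of Tika or re. It compares the $-anchored pattern with and without re.MULTILINE, then uses the standard ipaddress module to drop candidates such as 999.1.1.1 that the regex alone would accept.

```python
import ipaddress
import re

# Made-up stand-in for the text Tika extracts from the PDF (assumption).
sample_text = "host-a 192.168.1.10\nhost-b 10.0.0.5\nbogus 999.1.1.1\n\n"

anchored = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
bounded = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

# Without re.MULTILINE, $ matches only at the end of the whole string
# (or just before a single trailing newline), so nothing is found here.
no_matches = re.findall(anchored, sample_text)

# With re.MULTILINE, $ also matches at the end of every line.
multiline_matches = re.findall(anchored, sample_text, re.MULTILINE)

# Word boundaries match anywhere in the text, no flag needed.
candidates = re.findall(bounded, sample_text)


def is_valid_ipv4(candidate):
    """Return True if candidate parses as an IPv4 address (hypothetical helper)."""
    try:
        ipaddress.IPv4Address(candidate)
        return True
    except ValueError:
        return False


# Keep only candidates whose octets are in range.
valid = [ip for ip in candidates if is_valid_ipv4(ip)]

print(no_matches)
print(multiline_matches)
print(valid)
```

Whether you need the ipaddress filter depends on your PDF: if it can contain version strings or other dotted numbers, the regex on its own will pick those up too.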