python regex pdf tika-python

Can't parse IP addresses from a PDF file: no error, just an empty list


I'm using Tika to parse IP addresses from a PDF file. Below is my code:

import tika
from tika import parser
import re

 
# Press the green button in the gutter to run the script.
if __name__ == '__main__':
    tika.initVM()
    # opening pdf file
    parsed_pdf = parser.from_file("static_hosts.pdf")
    text = parsed_pdf["content"]
    regex = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
    match = re.findall(regex, text)
    print(match)

I tested the regex online and it works there. I also tried these variants, but none of them work:

regex = '(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
regex = '^(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
regex = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'

Could you please show me what I missed?

Thank you. Huy


Solution

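  • The $ at the end of your pattern anchors the match to the end of the whole string, so re.findall only returns something if the extracted text happens to end with an IP address. Online testers often run with the multiline flag enabled, which is probably why the pattern looked fine there. Replace the anchor with word boundaries (\b) so the pattern can match anywhere in the text. Try this code: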

    import tika
    from tika import parser
    import re
    
    # Press the green button in the gutter to run the script.
    if __name__ == '__main__':
        tika.initVM()
        # opening pdf file
        parsed_pdf = parser.from_file("static_hosts.pdf")
        text = parsed_pdf["content"]
        regex = r'\b((?:\d{1,3}\.){3}\d{1,3})\b'  # \b word boundaries replace the trailing $ anchor
        matches = re.finditer(regex, text)
        ip_addresses = [match.group(0) for match in matches]
        print(ip_addresses)
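
To see why the original pattern came back empty, here is a minimal sketch that runs without Tika. The sample_text string and the extract_ipv4 helper are made up for illustration and are not part of the question or of tika-python. The octet check with the standard-library ipaddress module is an optional extra on top of the answer above, since \d{1,3} also accepts values such as 999. The helper also returns an empty list for None, on the assumption that parsed_pdf["content"] may be None when no text could be extracted.

```python
import ipaddress
import re

# Hypothetical stand-in for the text Tika extracts from the PDF.
sample_text = (
    "host-a 192.168.1.10\n"
    "host-b 10.0.0.5 gateway\n"
    "bogus 999.1.1.1 (typo)\n"
)

ANCHORED = r'(?:[0-9]{1,3}\.){3}[0-9]{1,3}$'
BOUNDED = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'

# Without flags, $ only matches at the very end of the string
# (or just before a final newline), so nothing is found.
print(re.findall(ANCHORED, sample_text))                # []

# With re.MULTILINE, $ matches at the end of every line, so only
# addresses that end a line are found.
print(re.findall(ANCHORED, sample_text, re.MULTILINE))  # ['192.168.1.10']

# Word boundaries match anywhere in the text, valid octets or not.
print(re.findall(BOUNDED, sample_text))
# ['192.168.1.10', '10.0.0.5', '999.1.1.1']


def extract_ipv4(text):
    """Return IPv4 addresses found in text, dropping out-of-range octets."""
    if not text:  # assumption: the extracted content may be None or empty
        return []
    valid = []
    for candidate in re.findall(BOUNDED, text):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError if invalid
        except ValueError:
            continue
        valid.append(candidate)
    return valid


print(extract_ipv4(sample_text))  # ['192.168.1.10', '10.0.0.5']
```

In the script above, this would be called as extract_ipv4(parsed_pdf["content"]) in place of the re.finditer lines.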