python regex hex snort

Convert strings with an unknown number of hex strings embedded in them to strings using regex


So I have a list of strings (content from Snort rules), and I am trying to convert the hex portions of them to UTF-8/ASCII, so I can send the content over netcat.

The method I have now works fine for strings with single hex bytes (e.g. 3A), but breaks when there's a series of hex bytes (e.g. 3A 4B 00 FF).

My current solution is:

import re
import codecs

def convert_hex(match):
  string = match.group(1)
  string = string.replace(" ", "")
  decode_hex = codecs.getdecoder("hex_codec")
  try:
    result = decode_hex(string)[0]
  except:
    result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
  return result.decode("utf-8")


strings = ['|0A|Referer|3A| res|3A|/C|3A|', 'RemoteNC Control Password|3A|', '/bbs/search.asp', 'User-Agent|3A| Mozilla/4.0 |28|compatible|3B| MSIE 5.0|3B| Windows NT 5.0|29|']

converted_strings = []

for string in strings:
    for i in range(len(string)):
        string = re.sub(r"\|(.{2})\|", convert_hex, string)
    converted_strings.append(string)

For the strings in strings, this works, but for a string like:

|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|

it breaks.

I tried changing the regex to:

string = re.sub(r"\|.*([A-Fa-f0-9]{2}).*\|", convert_hex, string)

but that only converts the last hex.
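
For reference, here is a minimal reproduction of both attempts on that string. This is only a sketch: `sample`, `original` and `second` are names picked for illustration, and it uses `re.search` rather than `re.sub` so the match itself can be inspected.

```python
import re

sample = "|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|"

# First pattern: exactly two characters between the pipes, so a run of
# space-separated pairs never matches and re.sub leaves the string untouched.
original = re.search(r"\|(.{2})\|", sample)

# Second pattern: the first .* is greedy and swallows as much as it can,
# so only the final pair is left for the capturing group.
second = re.search(r"\|.*([A-Fa-f0-9]{2}).*\|", sample)

print(original)         # None
print(second.group(1))  # 0E
```

So the first pattern silently skips the whole block, and the second one matches the whole block but only hands its last pair to convert_hex.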

I need this solution to work for strings like Hello|3A|World, |3A 00 FF|, and Hello|3A 00|World

I know it's an issue with the regexp, but I'm not sure what exactly.

Any help would be much appreciated.


Solution

  • It looks like the text between | symbols is either entirely hex, i.e. it matches (?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2}, or not hex at all?

    This works:

    for string in strings:
        # A single re.sub call already replaces every match in the string.
        string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
        converted_strings.append(string)
    

    (The outer parentheses form capturing group 1, which convert_hex reads. You could leave out that pair of parentheses and change your function to act on group(0) instead.)

    But it breaks on your example |08 00 00 00 27 C7 CC 6B C2 FD 13 0E|, because those bytes are not valid UTF-8. The resulting error:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc7 in position 5: invalid continuation byte
    

    However, a valid UTF-8 encoded multi-byte string like '|74 65 73 74 20 f0 9f 98 80|' works just fine:

    import re
    import codecs
    
    def convert_hex(match):
      string = match.group(1)
      string = string.replace(" ", "")
      decode_hex = codecs.getdecoder("hex_codec")
      try:
        result = decode_hex(string)[0]
      except:
        result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
      return result.decode("utf-8")
    
    
    strings = ['|74 65 73 74 20 f0 9f 98 80|']
    
    converted_strings = []
    
    for string in strings:
        # A single re.sub call already replaces every match in the string.
        string = re.sub(r"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string)
        converted_strings.append(string)
    
    print(converted_strings)
    

    Result:

    ['|test 😀|']
    

    If you don't really need a printable representation of the data, you could just have your function return the bytes object and only apply the function to matching parts - instead of constructing a new string.

    Based on what @Selcuk was saying, perhaps a result with byte-strings makes more sense - this works on all three types of input:

    import re
    import codecs
    
    def convert_hex(match):
      string = match.group(1)
      string = string.replace(b" ", b"")
      decode_hex = codecs.getdecoder("hex_codec")
      try:
        result = decode_hex(string)[0]
      except:
        result = bytes.fromhex((lambda s: ("%s%s00" * (len(s)//2)) % tuple(s))(string)).decode('utf-16-le')
      return result
    
    
    strings = ['|0A|Referer|3A| res|3A|/C|3A|', '|74 65 73 74 20 f0 9f 98 80|', '|08 00 00 00 27 C7 CC 6B C2 FD 13 0E|']
    
    converted_strings = []
    
    for string in strings:
        string = re.sub(rb"(?<=\|)((?:[A-Fa-f0-9]{2}\s)*[A-Fa-f0-9]{2})(?=\|)", convert_hex, string.encode())
        converted_strings.append(string)
    
    print(converted_strings)
    

    Result:

    [b'|\n|Referer|:| res|:|/C|:|', b'|test \xf0\x9f\x98\x80|', b"|\x08\x00\x00\x00'\xc7\xcck\xc2\xfd\x13\x0e|"]
    

    No decoding errors, because the result is never decoded back to text. (Note that I didn't attempt to change convert_hex too much - there's some encoding juggling in there that you may need to look at, I just got it to work for bytes)
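
One thing to note about the lookaround pattern above: it leaves the | delimiters in the output, and literal text that happens to look like hex and sits between two hex blocks would be converted as well. As a rough sketch of an alternative (not part of the code above), the pattern can consume the pipes instead. This assumes every |...| block contains only hex pairs separated by single spaces, and it ignores any escaping rules Snort may define. `HEX_BLOCK`, `decode_content` and `printable` are names made up for this example.

```python
import re

# Assumption: a |...| block holds only hex pairs, optionally separated by
# single spaces; everything outside such a block is literal text.
HEX_BLOCK = re.compile(rb"\|((?:[0-9A-Fa-f]{2} ?)+)\|")


def decode_content(content):
    """Turn a content string into the raw bytes it describes."""
    def hex_to_bytes(match):
        # bytes.fromhex ignores the spaces between the pairs.
        return bytes.fromhex(match.group(1).decode("ascii"))

    # The pipes are part of the match, so they are dropped from the result.
    return HEX_BLOCK.sub(hex_to_bytes, content.encode("utf-8"))


def printable(payload):
    # Text form for logging only; undecodable bytes show up as escapes.
    return payload.decode("utf-8", errors="backslashreplace")


for content in ["Hello|3A|World", "|3A 00 FF|", "Hello|3A 00|World"]:
    print(decode_content(content))
```

The bytes returned by decode_content can be written out unchanged, for example with sys.stdout.buffer.write(...) piped into netcat, while printable(...) is only meant for inspecting a payload, since arbitrary bytes such as C7 CC are not guaranteed to decode as UTF-8.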