[SOLVED] Remove EDIFACT messages from string in Python

A sample EDIFACT message looks like this:

UNB+AHBI:1+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
IFT+3+XYZCOMPANY AVAILABILITY'
ERC+A7V:1:AMD'
IFT+3+NO MORE FLIGHTS'
ODI'
TVL+240493:1000::1220+FRA+JFK+DL+400+C'
PDI++C:3+Y::3+F::1'
!ERC+21198:EC'
APD+74C:0:::6++++++6X'
TVL+240493:1740::2030+JFK+MIA+DL+081+C'
PDI++C:4'
APD+EM2:0:1630::6+++++++DA'
UNT+13+1'
UNZ+1+1'

I need to create a regex that removes this type of EDIFACT message from strings. It should not drop any of the surrounding text, as that may contain important information. For example, the EDIFACT message can be embedded in text like:

After discussing with team we found that wrong org segment sent in edifact message. Can you please investigate further why wrong ORG segment is sent. [EDIFACT MESSAGE]
Update information as quickly as possible

Can anybody help create a regex for that?

Solution

Going over an EDIFACT format description, the UNA part is optional and the UNB is mandatory, so either may indicate the start of a message. The UNZ part is a mandatory footer. Considering a file that contains

First
UNA:+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
...
UNZ+1+1'
Message
Second
UNB+AHBI:1+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
...
UNZ+1+1'
Message

where each ... stands for the remaining segments of your full example, here's some Python 3 code:

import re
import sys

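# Optional UNA header, mandatory UNB header, everything up to UNZ and its line ending.
regex = re.compile(r'(?:UNA.*?)?UNB.*?UNZ.*?(?:\r\n|\r|\n)', flags=re.DOTALL)
# Read all of stdin, drop every match, and print the rest unchanged.
print(regex.sub('', sys.stdin.read()), end='')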

Here I assume that the UNZ part continues until the end of the line, even though that may be inaccurate: the UNZ segment appears to have a fixed format that one could model more precisely.
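As a hedged sketch of what a more precise model of the trailer could look like: assuming the default separators (+ between elements, ' ending a segment) and a trailer of the form UNZ+<count>+<reference>', the match can stop at the closing apostrophe instead of at the end of the line. The names EDIFACT_RE and strip_edifact are made up for illustration, and a UNA header that redefines the separators is not honoured.

```python
import re

# Assumption: default EDIFACT separators ("+" between elements, "'" ending a
# segment). A UNA header that redefines them is not honoured here.
EDIFACT_RE = re.compile(
    r"(?:UNA.{6}\s*)?"       # optional UNA: six service characters, then whitespace
    r"UNB\+.*?"              # interchange header, then anything (lazy)
    r"UNZ\+[^+']*\+[^+']*'"  # trailer: UNZ+<count>+<reference>'
    r"(?:\r\n|\r|\n)?",      # swallow one trailing line break, if present
    flags=re.DOTALL,
)


def strip_edifact(text):
    """Return text with every UNA/UNB...UNZ block removed."""
    return EDIFACT_RE.sub("", text)
```

Because the match ends at the apostrophe, this variant also copes with a message embedded in the middle of a line, though it leaves the surrounding spaces untouched.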

The run-down of the regex used in the script:

(?:UNA.*?)? is an optional UNA part; whatever comes after UNA may have any size or format, but is matched lazily, i.e. kept as short as possible.
UNB.*? is a mandatory UNB part; this marks the beginning of the EDIFACT message and continues for as long as it has to until the first occurrence of UNZ.
UNZ.*?(?:\r\n|\r|\n) is a mandatory UNZ part; it is followed by as few characters as it takes to reach the end of the line. Since this appears to be a rather old format, being conservative about the type of line ending is probably a good thing (\r\n is Windows, and a lot of network protocols use it for compatibility reasons; \r alone is classic Mac OS; \n is Unix).
The flags=re.DOTALL part tells Python's regex engine to let "." match newlines as well, so a single .*? can span several lines.

Running this script on the file above gives:

First
Message
Second
Message
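The run-down of the regex can also be written into the pattern itself with re.VERBOSE, which ignores whitespace inside the pattern and allows # comments. This is only a sketch of the same expression in annotated form; the name EDIFACT_BLOCK and the shortened sample string are made up for illustration.

```python
import re

EDIFACT_BLOCK = re.compile(
    r"""
    (?:UNA.*?)?        # optional UNA header, as short as possible
    UNB.*?             # mandatory UNB header, lazily up to the first UNZ
    UNZ.*?             # mandatory UNZ trailer, lazily up to the line ending
    (?:\r\n|\r|\n)     # Windows, classic Mac OS, or Unix line ending
    """,
    flags=re.DOTALL | re.VERBOSE,
)

# Shortened, hypothetical input: one EDIFACT block between two text lines.
sample = "First\nUNB+IATB:1'\nUNH+1'\nUNZ+1+1'\nMessage\n"
print(EDIFACT_BLOCK.sub("", sample), end="")
```

Like the original, this version needs a line ending after the UNZ segment, so a message that ends the input without a final newline is left in place.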