A sample EDIFACT message looks like this:
UNB+AHBI:1+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
IFT+3+XYZCOMPANY AVAILABILITY'
ERC+A7V:1:AMD'
IFT+3+NO MORE FLIGHTS'
ODI'
TVL+240493:1000::1220+FRA+JFK+DL+400+C'
PDI++C:3+Y::3+F::1'
!ERC+21198:EC'
APD+74C:0:::6++++++6X'
TVL+240493:1740::2030+JFK+MIA+DL+081+C'
PDI++C:4'
APD+EM2:0:1630::6+++++++DA'
UNT+13+1'
UNZ+1+1'
I need to create a regex that removes this type of EDIFACT message from strings. It must not remove any of the surrounding text, as that may contain important information. For example, the EDIFACT message can be embedded in text like:
After discussing with team we found that wrong org segment sent in edifact message. Can you please investigate further why wrong ORG segment is sent. [EDIFACT MESSAGE]
Update information as quickly as possible
Can anybody help create a regex for that?
Going over an EDIFACT format description, the UNA part is optional and the UNB is mandatory, so either may indicate the start of a message. The UNZ part is a mandatory footer. Considering a file that contains
First
UNA:+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
...
UNZ+1+1'
Message
Second
UNB+AHBI:1+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
...
UNZ+1+1'
Message
with each "..." standing in for the remaining segments of your full example, here's some Python 3 code:
import re
import sys
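# Optional UNA part, mandatory UNB header, then everything up to the end of the UNZ line.
regex = re.compile(r'(?:UNA.*?)?UNB.*?UNZ.*?(?:\r\n|\r|\n)', flags=re.DOTALL)
# Remove every match from standard input and print whatever text remains.
print(regex.sub('', sys.stdin.read()), end='')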
Here I assume that the UNZ part continues until the end of the line, even though that may be inaccurate: it appears to have a fixed format that one could model more precisely.
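As a rough illustration of what modelling the footer more precisely could look like, here is a stricter sketch. It rests on assumptions the sample does not confirm: that the segment terminator is always ', that UNA is followed by exactly six service characters, and that the footer has the shape UNZ+<count>+<reference>'. The names STRICT_EDIFACT and strip_edifact_strict are made up for this example.

```python
import re

# Assumptions (not confirmed by the sample): the segment terminator is "'",
# UNA is followed by exactly six service characters, and the footer has the
# shape UNZ+<count>+<reference>'.
STRICT_EDIFACT = re.compile(
    r"(?:\bUNA.{6}\s*)?"        # optional UNA plus its six service characters
    r"\bUNB\+.*?"               # UNB header, then as little as possible up to ...
    r"\bUNZ\+\d+\+[^'+]*'"      # ... the UNZ footer segment
    r"[ \t]*(?:\r\n|\r|\n)?",   # trailing blanks and at most one line ending
    flags=re.DOTALL,
)


def strip_edifact_strict(text: str) -> str:
    """Remove every UNA/UNB ... UNZ interchange, keeping all other text."""
    return STRICT_EDIFACT.sub("", text)
```

Requiring the literal + after UNB and UNZ keeps ordinary words such as UNBELIEVABLE from being treated as the start of a message, and the footer no longer has to be followed by a line break. Since a UNA segment can redefine the separator characters, this is still only an approximation. The explanation that follows describes the simpler pattern from the script above.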
The run-down of the regex itself:
(?:UNA.*?)?
is an optional UNA part; the part that comes after UNA may have any size or format, but should be as small as possible.
UNB.*?
is a mandatory UNB part; this marks the beginning of the EDIFACT message and continues for as long as it has to until the first occurrence of UNZ.
UNZ.*?(?:\r\n|\r|\n)
is a mandatory UNZ part; it is followed by as many characters as it takes to reach the end of the line. Since this appears to be a rather old format, being conservative about the type of line endings is probably a good thing (\r\n is Windows, and a lot of network protocols honor this for compatibility reasons; \r alone is really old Macs; and \n is Unix).
The flags=re.DOTALL part tells Python's regex engine to include newlines as part of ".".
Running this script here gives:
First
Message
Second
Message
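Since the question is about strings rather than files, the same pattern can also be wrapped in a function. This is a minimal sketch: strip_edifact and EDIFACT_RE are hypothetical names, the sample text is a shortened version of the example in the question, and the extra \Z alternative is my addition so that a message at the very end of a string, with no trailing line break, is removed as well.

```python
import re

# Same pattern as in the script, plus \Z so that a UNZ line ending the
# string without a trailing newline still matches (an added assumption).
EDIFACT_RE = re.compile(
    r"(?:UNA.*?)?UNB.*?UNZ.*?(?:\r\n|\r|\n|\Z)", flags=re.DOTALL
)


def strip_edifact(text: str) -> str:
    """Return text with every UNA/UNB ... UNZ block removed."""
    return EDIFACT_RE.sub("", text)


sample = (
    "Can you please investigate further why wrong ORG segment is sent.\n"
    "UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'\n"
    "UNH+1+PAORES:93:1:IA'\n"
    "UNZ+1+1'\n"
    "Update information as quickly as possible"
)
print(strip_edifact(sample))
```

The two sentences around the message come out untouched, one per line.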