python regex edifact

Remove EDIFACT messages from string in Python


A sample EDIFACT message looks like this:

UNB+AHBI:1+.? '
UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
UNH+1+PAORES:93:1:IA'
MSG+1:45'
IFT+3+XYZCOMPANY AVAILABILITY'
ERC+A7V:1:AMD'
IFT+3+NO MORE FLIGHTS'
ODI'
TVL+240493:1000::1220+FRA+JFK+DL+400+C'
PDI++C:3+Y::3+F::1'
!ERC+21198:EC'
APD+74C:0:::6++++++6X'
TVL+240493:1740::2030+JFK+MIA+DL+081+C'
PDI++C:4'
APD+EM2:0:1630::6+++++++DA'
UNT+13+1'
UNZ+1+1'

I need to create a regex that removes EDIFACT messages of this type from strings. It must not remove any of the surrounding text, which may contain important information. For example, an EDIFACT message can be embedded in text like this:

After discussing with team we found that wrong org segment sent in edifact message. Can you please investigate further why wrong ORG segment is sent. [EDIFACT MESSAGE]
Update information as quickly as possible

Can anybody help create a regex for that?


Solution

  • According to the EDIFACT format description, the UNA segment is optional and the UNB segment is mandatory, so either may indicate the start of a message. The UNZ segment is a mandatory footer. Considering a file that contains

    First
    UNA:+.? '
    UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
    UNH+1+PAORES:93:1:IA'
    MSG+1:45'
    ...
    UNZ+1+1'
    Message
    Second
    UNB+AHBI:1+.? '
    UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'
    UNH+1+PAORES:93:1:IA'
    MSG+1:45'
    ...
    UNZ+1+1'
    Message
    

    where each ... stands for the remaining segments of your full example, here's some Python 3 code:

    import re
    import sys
    
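    # UNA is optional, UNB starts the message, UNZ ends it; DOTALL lets . span lines.
    regex = re.compile(r'(?:UNA.*?)?UNB.*?UNZ.*?(?:\r\n|\r|\n)', flags=re.DOTALL)
    # Delete every matched message and echo the remaining text unchanged.
    print(regex.sub('', sys.stdin.read()), end='')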
    

    Here I assume that the UNZ segment runs to the end of its line, which may be inaccurate: UNZ appears to have a fixed format that could be modeled more precisely.

    The run-down of the regex itself:
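
    Spelled out with re.VERBOSE, each piece of the pattern can carry its own comment. This is a sketch that should behave the same as the one-liner above; the name EDIFACT_RE and the shortened sample string are mine, for illustration only.

```python
import re

# Same pattern as the one-liner, written with re.VERBOSE so that
# whitespace is ignored and each part can be commented.
# EDIFACT_RE is an illustrative name, not part of the original script.
EDIFACT_RE = re.compile(
    r"""
    (?:UNA.*?)?        # optional UNA segment, consumed lazily up to UNB
    UNB                # mandatory interchange header: start of the message
    .*?                # everything up to the first UNZ (DOTALL: . crosses newlines)
    UNZ                # mandatory interchange trailer
    .*?                # rest of the UNZ line, as few characters as possible
    (?:\r\n|\r|\n)     # the line break ending the UNZ line, in any convention
    """,
    flags=re.DOTALL | re.VERBOSE,
)

# Shortened stand-in for the sample file.
sample = "First\nUNA:+.? '\nUNB+IATB:1'\nUNH+1'\nUNZ+1+1'\nMessage\n"
print(EDIFACT_RE.sub("", sample), end="")
```

    Note that the final line break is required, so a message whose UNZ line ends the input without a newline would not be matched by this pattern.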

    Running this script on the sample file above gives:

    First
    Message
    Second
    Message
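
    If the UNZ trailer is to be modeled more tightly, one option is to end the match at the segment terminator instead of at the end of the line. A minimal sketch, assuming the default ' terminator and a trailer of the form UNZ+count+reference'; strip_edifact and EDIFACT_BLOCK are hypothetical names, and the function works on an in-memory string as in the question rather than on stdin:

```python
import re

# Assumptions (not from the original answer): the default "'" segment
# terminator is in use, and UNZ carries exactly two "+"-separated fields.
EDIFACT_BLOCK = re.compile(
    r"(?:UNA.{6}\s*)?"         # optional UNA plus its six service characters
    r"UNB\+.*?"                # interchange header and everything after it
    r"UNZ\+\d+\+[^'+]*'"       # trailer: control count, reference, terminator
    r"[ \t]*(?:\r\n|\r|\n)?",  # trailing line break, if there is one
    flags=re.DOTALL,
)


def strip_edifact(text: str) -> str:
    """Return text with every UNA/UNB ... UNZ block removed."""
    return EDIFACT_BLOCK.sub("", text)


text = (
    "Can you please investigate why wrong ORG segment is sent. "
    "UNB+IATB:1+6XPPC:ZZ+LHPPC:ZZ+940101:0950+1'\n"
    "UNH+1+PAORES:93:1:IA'\n"
    "UNT+13+1'\n"
    "UNZ+1+1'\n"
    "Update information as quickly as possible"
)
print(strip_edifact(text))
```

    Because the trailing line break is optional here, a message that ends the string without a newline is removed too. Input that declares a different terminator in its UNA segment would need the pattern adjusted.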