I'm not sure if using regex is the correct way to go about this here, but I wanted to try solving this with regex first (if it's possible)
I have an edifact file, where the data (in bold) in certain fields in some segments need to be substituted (with different dates, same format)
UNA:+,? '
UNB+UNOC:3+000000000+000000000+20190801:1115+00001+DDMP190001'
UNH+00001+BRKE:01+00+0'
INV+ED Format 1+Brustkrebs+19880117+E000000001+**20080702**+++1+0'
FAL+087897044+0000000++name+000000000+0+**20080702**++1+++J+N+N+N+N+N+++0'
INL+181095200+385762115+++0'
BEE+20080702++++0'
BAA+++J+J++++++J+++++++J++0'
BBA++++++++J++++++J+J++++++J+++++J+++J+J++++++++J+0'
BHP+J+++++J+++++J+++++0'
BLA+++J+++++++++0'
BFA++++++++++++J++0'
BSA++J+++J+J+++0'
BAT+20190801+0'
DAT+**20080702**++++0'
UNT+000014+00001'
UNZ+00001+00001'
at first I was able to match those fields using a positive lookahead and a lookbehind (I had different expressions for matching each date).
Here, for example is the expression I intially used to match the date in the "FAL" segment: (?<=\+[\d]{1}\+)\d{8}(?=\+\+)
, but then i saw that this date is sometimes preceeded by 9 digits, and sometimes by 1 (based on version) and followed by a either ++ or a + and a date so I added a logiacl OR like this: (?<=\+[\d]{9}\+|\+[\d]{1}\+)\d{8}(?=\+[\d]{8}\+|\+\+)
and quickly realized it's not sustainable because I saw that these edifact files vary (far beyond only either 9 and 1 digits)
(I have 6 versions for each type, and i have 6 types total)
Because I have a scheme/map indicating what each version should be built like and I know on what position (based on the + separator) the date is written in each version, I thought about maybe matching the date based on the +, so after the 7th occurence (say in the FAL segment) of plus in a certain line, match the next 8 digits.
is this possible to achieve with regex? and if yes, could someone please tell me how?
I suggest using a pattern like
^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)
where {7}
can be adjusted to the value you need for each type of segments, and replace with the backreference to Group 1. In Python, it is \g<1>20200101
(where 20200101
is your new date), in PHP/.NET, it is ${1}20200101
. In JS, it will be just $1
.
To run on a multiline text, use m
flag. In Python regex, you may embed it like (?m)^((?:[^+\n]*\+){7})\d{8}(?=\+(?:\d{8})?\+)
.
See the Python regex demo
Details
^
- start of string/line((?:[^+\n]*\+){7})
- Group 1: 7 repetitions of any chars other than +
and newline, and then a +
\d{8}
- 8 digits(?=\+(?:\d{8})?\+)
- that are followed with +
, and optional chunk of 8 digits and a +
.