java regex data-processing edifact

How to keep empty strings when two delimiters appear one after another with String.split()


I'm quite new to regex and I have to split EDI files for a loader I'm developing. If you are not familiar with the format, here is an example of 2 segments (modified to show every case, so it's not real data):

APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'

Lines (segments) end with ', unless it is preceded by the escape character, which is the question mark: ?' must not be treated as the end of a line, for example. + and : are the main delimiters (: is used when the data is composite, like an address).

The split for the segments works fine, but I have issues with the other delimiters. I would like to have a String[] with all the elements, even the empty ones, because I need to process it afterwards (insert into a DB). With the example above, I would like to get an array like this:

APD+EM2:0:16?'30::6+++++++DA

would transform into:

{"APD","EM2","0","16?'30","","6","","","","","","","DA"}

Currently, with my code, I get an array like this:

{"APD","EM2","0","16?'30","6","DA"}

Can I please have some help with my regex? Making it match ++ and :: is beyond my skills for now. I need to remove the escaping characters as well, but I'll work on that on my own.

BTW, I need to process a lot of data (300 GB of raw text), so if what I'm doing is bad performance-wise, don't hesitate to tell me, for example about splitting on both + and : at the same time.

The EDIFACT format is not something discussed a lot around here, and the few examples I found were not working for me.

Current code:

private final String DATA_ELEMENT_DELIMITER = "(?<!\\?)\\+";
private final String DATA_COMPOSITE_ELEMENT_DELIMITER = "(?<!\\?):";

private String[] split (String segments){       
    return Stream.of(segments)
            .flatMap(Pattern.compile(DATA_ELEMENT_DELIMITER)::splitAsStream)
            .flatMap(Pattern.compile(DATA_COMPOSITE_ELEMENT_DELIMITER)::splitAsStream)
            .toArray(String[]::new);
}

EDIT: The code I'm running (BTW, I'm on Java 8, not sure it makes a difference though):

import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Stream;
public class Split {

    public static void main(String[] args) {
        Split s = new Split();
        System.out.println(
                Arrays.toString(
                    s.split("APD+EM2:0:16?'30::6+++++++DA'")
                )
            );
    }
    
    
    private static final Pattern DATA_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?)\\+");
    private static final Pattern DATA_COMPOSITE_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?):");
    
    private String[] split (String segments){       
        return Stream.of(segments)
                .flatMap(DATA_ELEMENT_DELIMITER::splitAsStream)
                .flatMap(DATA_COMPOSITE_ELEMENT_DELIMITER::splitAsStream)
                .toArray(String[]::new);
    }
}

Here is the output I get:

[APD, EM2, 0, 16?'30, , 6, DA']

EDIT EDIT

When I run this code in an online Java 11 compiler, the output is correct, but on Java 8 it is not.


Solution

  • My first note is that for improved performance, you definitely want to compile the Patterns once and reuse the instance:

    private static final Pattern DATA_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?)\\+");
    private static final Pattern DATA_COMPOSITE_ELEMENT_DELIMITER = Pattern.compile("(?<!\\?):");
    // ...
    .flatMap(DATA_ELEMENT_DELIMITER::splitAsStream)
    .flatMap(DATA_COMPOSITE_ELEMENT_DELIMITER::splitAsStream)
    

    Second, as @user15244370 mentioned, running your code does produce the output you are looking for. I ran it like this:

    System.out.println(
        Arrays.toString(
            split("APD+EM2:0:16?'30::6+++++++DA'APD+EM2:0:1630::6+++++++DA'")
        )
    );
    

    and got the output:

    [APD, EM2, 0, 16?'30, , 6, , , , , , , DA'APD, EM2, 0, 1630, , 6, , , , , , , DA']
    

    Assuming there is some difference between what you have posted and what you are actually running, the documentation for splitAsStream mentions:

    Trailing empty strings will be discarded and not encountered in the stream.


    Are you doing any additional processing after the call to split? And how are you printing the array? Is it possible that the method you are using to print the String[] is removing empty strings? As far as I can tell, your implementation should function as you intend.
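
The Java 8 output in the question is consistent with splitAsStream producing no elements at all when its input is an empty string on that version. If so, the empty strings from the first split on + are lost when they pass through the second flatMap, while the empty string between :: survives because it comes from splitting a non-empty string. One way to sidestep this is to avoid the stream and call Pattern.split with a negative limit, which keeps every empty string, trailing ones included. The sketch below also merges both delimiters into one character class so the input is scanned once. The names EdiSplitter and splitElements are made up for illustration, and the code assumes a delimiter is escaped only when it directly follows a single ?; it does not handle an escaped ? followed by a real delimiter.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class EdiSplitter {

    // Matches '+' or ':' unless the character just before it is the escape character '?'.
    // Assumption: an escaped '?' followed by a real delimiter is not handled here.
    private static final Pattern DELIMITER = Pattern.compile("(?<!\\?)[+:]");

    private EdiSplitter() {
    }

    // Splits one segment (without its terminating ') into all of its elements.
    static String[] splitElements(String segment) {
        // A negative limit keeps every empty string, including trailing ones.
        return DELIMITER.split(segment, -1);
    }
}
```

Calling EdiSplitter.splitElements("APD+EM2:0:16?'30::6+++++++DA") should return 13 elements, with an empty string for each pair of adjacent delimiters. Like the original code, this flattens data elements and composite components into a single array, so if you need to know which delimiter separated two values you would have to split in two passes instead.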