java nio java-io

Efficient way to read a file and parse each line


I have a text file in the following format: each line starts with a string, followed by a sequence of numbers. The number of values on a line is not known in advance (anywhere from 0 to 1000).

string_1 3 90 12 0 3
string_2 49 0 12 94 13 8 38 1 95 3
.......
string_n 9 43

Afterwards I must process each line with a handleLine method that accepts two arguments: the string name and the set of numbers (see the code below).

How can I read the file and process each line with handleLine efficiently?

My current approach:

  1. Read the file line by line with the Java 8 stream API, Files.lines. Is it blocking?
  2. Split each line with a regular expression
  3. Convert each line into a header string and a set of numbers

I think this is pretty inefficient because of the 2nd and 3rd steps. The 1st step means that Java first converts the file's bytes to a String, and then in the 2nd and 3rd steps I convert that String again into a String and a Set<Integer>. Does that affect performance much? If yes - how can I do better?

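import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Compiled once instead of on every parse() call.
// Note: this pattern splits a name such as "string_1" into the tokens "string" and "1".
private static final Pattern TOKEN = Pattern.compile("[0-9]+|[a-z]+|[A-Z]+");

public void handleFile(String filePath) {
    // try-with-resources closes the underlying file once the stream has been consumed
    try (Stream<String> stream = Files.lines(Paths.get(filePath))) {
        stream.forEach(this::handleLine);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

private void handleLine(String line) {
    List<String> resultList = this.parse(line);
    if (resultList.isEmpty()) {
        return; // blank line: nothing to handle
    }
    String string_i = resultList.remove(0);
    Set<Integer> numbers = resultList.stream().map(Integer::valueOf).collect(Collectors.toSet());
    handleLine(string_i, numbers); // The final computation, which must be done with only the string_i and numbers arguments
}

private List<String> parse(String str) {
    List<String> output = new LinkedList<>();
    Matcher match = TOKEN.matcher(str);
    while (match.find()) {
        output.add(match.group());
    }
    return output;
}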

Solution

  • Regarding your first question, it depends on how you use the Stream. Streams are inherently lazy and do no work until a result is actually requested. For example, the call to Files.lines doesn't read any lines from the file until you invoke a terminal operation on the Stream.

    From the Javadoc:

    Read all lines from a file as a Stream. Unlike readAllLines, this method does not read all lines into a List, but instead populates lazily as the stream is consumed

    The forEach(Consumer<? super T>) call is a terminal operation; at that point, the lines of the file are read one by one and passed to the method you supplied to forEach.
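
The laziness described above can be observed with a small experiment. This is only a sketch: the class name LazyLinesDemo, the method countLinesSeen, and the temporary file contents are made up for illustration. It counts how many lines have passed through the pipeline before and after the terminal operation. Note that Files.lines does open the file when it is called; it is only the reading of lines that is deferred.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

final class LazyLinesDemo {

    // Returns {lines seen before the terminal operation, lines seen after it}.
    static int[] countLinesSeen() {
        try {
            // Hypothetical sample data written to a temporary file.
            Path file = Files.createTempFile("lazy-lines", ".txt");
            try {
                Files.write(file,
                        Arrays.asList("string_1 3 90 12", "string_2 49 0", "string_3 9 43"),
                        StandardCharsets.UTF_8);
                AtomicInteger seen = new AtomicInteger();
                try (Stream<String> lines = Files.lines(file)) {
                    // peek is an intermediate operation: it only registers the callback.
                    Stream<String> counted = lines.peek(line -> seen.incrementAndGet());
                    int before = seen.get();
                    // forEach is the terminal operation that actually pulls the lines.
                    counted.forEach(line -> { });
                    int after = seen.get();
                    return new int[] {before, after};
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int[] counts = countLinesSeen();
        System.out.println("seen before terminal operation: " + counts[0]);
        System.out.println("seen after terminal operation: " + counts[1]);
    }
}
```

With the three-line sample file, the first count is 0 and the second is 3, since no line reaches peek until forEach runs.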

    Regarding your other comments, there isn't really a concrete question here. What are you trying to measure or minimize? Just because something takes multiple steps doesn't mean it performs poorly. Even if you wrote a whiz-bang one-liner that converts the file's bytes directly into your String and Set, you would probably just be doing the same intermediate mapping anonymously, or calling something that performs it internally anyway.
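
If the regular-expression step turns out to be the part you want to minimize, one possible alternative is to split on whitespace and build the set in a single pass. The sketch below is an assumption-laden illustration, not a recommendation: the class LineParserSketch and the helper parseLine are hypothetical names, it assumes tokens are separated by whitespace and that every token after the first is a valid integer, and whether it is actually faster than the regex version is something you would have to measure on your own data.

```java
import java.util.AbstractMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;

final class LineParserSketch {

    // Hypothetical helper: splits "name n1 n2 ..." into the name and its set of numbers.
    // Returns null for a blank line.
    static Map.Entry<String, Set<Integer>> parseLine(String line) {
        // With no delimiter argument, StringTokenizer splits on whitespace.
        StringTokenizer tokens = new StringTokenizer(line);
        if (!tokens.hasMoreTokens()) {
            return null;
        }
        String name = tokens.nextToken();
        Set<Integer> numbers = new HashSet<>();
        while (tokens.hasMoreTokens()) {
            // Throws NumberFormatException if a token is not an integer.
            numbers.add(Integer.parseInt(tokens.nextToken()));
        }
        return new AbstractMap.SimpleImmutableEntry<>(name, numbers);
    }

    public static void main(String[] args) {
        Map.Entry<String, Set<Integer>> parsed = parseLine("string_1 3 90 12 0 3");
        System.out.println(parsed.getKey() + " -> " + parsed.getValue());
    }
}
```

The result could then be passed on as handleLine(parsed.getKey(), parsed.getValue()). Unlike the pattern in the question, this keeps a name such as string_1 as one token.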