swift macos performance file parsing

Fastest way to parse a TXT line by line


I'm trying to parse a large TXT file line by line (6 million lines, 200 MB) using if statements with the String.contains(String) method. At the moment it is very slow. Is there a way to improve the speed?

I know there's also String.firstIndexOf but that seems to be slower. Regex is probably slower too.

Importing the TXT and splitting lines:

    let content = try String(contentsOfFile: path, encoding: .ascii)
    print("LOADED 0")
    return content.components(separatedBy: "\n")

Parsing:

    if line.contains("<TAG1>") {
        let thisline = line
            .replacingOccurrences(of: "<TAG1>", with: "")
            .replacingOccurrences(of: "</TAG1>", with: "")
        text = "\(text)\n\(thisline): "
    } else if line.contains("<TAG2>") {
        let thisline = line
            .replacingOccurrences(of: "<TAG2>", with: "")
            .replacingOccurrences(of: "</TAG2>", with: "")
        text = "\(text) - \(thisline) "
    }

There will probably be more if statements, which will likely slow down the parsing even more.

It would be awesome if the speed could be improved. It currently takes approx. 5-10 minutes on my MacBook (depending on the file size).

Edit: It seems like string + " \n " + string2 is faster than "\(string) \n \(string2)", but it doesn't help much.

Edit2: I've added a progress bar to the application, and it seems to start fast and get slower towards the end.


Solution

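  • Building up your final text variable the way you do copies an ever-growing string (plus a small addition) for every line and then assigns the result back to text. The total work therefore grows roughly quadratically with the number of lines, which matches your observation that parsing starts fast and gets slower towards the end.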

    // Slow
    text = "\(text)\n\(thisline): "
    

    Appending just the addition to the original variable will be much quicker:

    // Fast(er)
    text.append("\n\(thisline): ")
    

    Depending on the required level of sophistication (and whether this is a one-time transformation or something that will happen frequently), you may want to look into @rmaddy's suggestion of using a proper parser.
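
Putting the append-based approach together, the whole loop might look roughly like the sketch below. It assumes the tag names from the question; the function name buildText(from:) and the sample lines are made up for illustration and are not part of the original code.

```swift
import Foundation

// Hypothetical helper (the name is an assumption): builds the output text
// by appending to a single String instead of re-creating it for every line.
func buildText(from lines: [String]) -> String {
    var text = ""
    for line in lines {
        if line.contains("<TAG1>") {
            // Strip the opening and closing tags, keeping only the content.
            let value = line
                .replacingOccurrences(of: "<TAG1>", with: "")
                .replacingOccurrences(of: "</TAG1>", with: "")
            // Append only the new piece; the existing text is not rebuilt.
            text.append("\n\(value): ")
        } else if line.contains("<TAG2>") {
            let value = line
                .replacingOccurrences(of: "<TAG2>", with: "")
                .replacingOccurrences(of: "</TAG2>", with: "")
            text.append(" - \(value) ")
        }
        // Lines without a known tag are skipped, as in the question.
    }
    return text
}

// Made-up sample input, only to show the call.
let sample = ["<TAG1>Title</TAG1>", "ignored line", "<TAG2>Detail</TAG2>"]
let result = buildText(from: sample)
print(result)
```

Because each iteration only appends a short piece, the cost per line no longer depends on how much text has already been collected. The if/else chain itself is unchanged from the question.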