cperformancescanfstrsep

Use fscanf to read in variable numbers of integer


I have over 100,000 csv files in the below format:

1,1,5,1,1,1,0,0,6,6,1,1,1,0,1,0,13,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,1,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,2,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,3,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,4,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,5,6,4,1,0,1,0,1,0,4,8,18,20,,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,6,6,5,1,1,1,0,1,0,4,7,8,18,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,7,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,1,0,8,6,5,1,1,1,0,1,0,13,4,7,8,20,,,,,,,,,,,,,,,,,,,,,,,
1,1,5,1,1,2,0,0,12,12,1,2,4,1,1,0,13,4,7,8,18,20,21,25,27,29,31,32,,,,,,,,,,,,,,,,

All I need is field 10 and field 17 onward, field 10 is the counter indicate how many integer stored start from field 17 i.e. what I need is:

6,13,4,7,8,18,20
5,4,7,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
4,4,8,18,20
5,4,7,8,18,20
5,13,4,7,8,20
5,13,4,7,8,20
12,13,4,7,8,18,20,21,25,27,29,31,32

Max number of integer need to read is 28. I can easily achieve this by Getline in C++, however, from my previous experience, since I need to handle over 100,000 such files and each files may have 300,000~400,000 such lines. Therefore using Getline to read in the data and build a vector> may have serious performance issue for me. I tried to use fscanf to achieve this:

while (!feof(stream)){
 fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%*d,%d",&MyCounter);
 fscanf(fstream,"%*d,%*d,%*d,%*d,%*d,%*d"); // skip to column 17
 for (int i=0;i<MyCounter;i++){
  fscanf(fstream,"%d",&MyIntArr[i]);
 }
 fscanf(fstream,"%*s"); // to finish the line
}

However, this will call fscanf multiple times and may also create performance issue. Is there any way to read in variable number of integer at 1 call with fscanf ? Or I need to read into a string and then strsep/stoi it ? Compare to fscanf, which is better from performance point of view?


Solution

  • In order to maximize performance, you should map the files in memory with mmap or equivalent and parse the file with ad hoc code, typically scanning one character at a time with a pointer, checking for '\n' and/or '\r' for end of record and converting the numbers on the fly for storage to your arrays. The tricky parts are: