c fopen fseek

Reading the same file using multiple file streams for multithread applications


Context of the problem: I have a large binary file containing data with a unique structure. A unit of this data is called an "event". Each event is 32016 bytes, and a single file holds about 400000 events, making the file ~12 GB. I'm writing a program to process the events and am trying a multithreaded approach in which several threads read different segments of the file, each thread using its own file stream.

The problem is that fseek fails to seek to the correct position in the file. The following is a minimal reproducible example. The program opens a binary file with 473797 events and plans to use 20 threads, each with a different file stream.

#include <iostream>
#include <cstdio>
#include <errno.h>
#include <string.h>

using namespace std;

int main(){

        FILE *segment[20];
        int ret=0;
        int eventsPerThread=473797/20;
        int eventSize=8004;
        for(int k=0;k<20;++k){
                segment[k]=fopen("Datafile_367.bin","rb");
                if(segment[k]==NULL){
                        std::cout<<"file stream is NULL!"<<k<<"\n";

                }

                ret=fseek(segment[k],eventsPerThread*eventSize*4*k,SEEK_SET);
                std::cout<<ret<<":::"<<strerror(errno)<<"\n";


        }
                return 0;
}

The following is the output. fseek sometimes succeeds and returns 0, and at other times fails with error code 22 (Invalid argument).

0:::Success
0:::Success
0:::Success
-1:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
0:::Invalid argument
0:::Invalid argument
0:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
0:::Invalid argument
0:::Invalid argument
0:::Invalid argument
-1:::Invalid argument
-1:::Invalid argument
0:::Invalid argument
0:::Invalid argument
0:::Invalid argument

Any explanations for this behavior of the fseek() function?

(Note that the minimal reproducible example is single-threaded. Multithreading will only come into play once the program starts to read the events.)


Solution

  • The error is an overflow in your offset calculation. All operands are int, which is apparently 4 bytes wide on your system, so INT_MAX is 2147483647.

    Let's see:

    k  eventsPerThread * eventSize * 4 * k  overflowed int  return value of fseek()
    0                                    0               0  0
    1                            758427024       758427024  0
    2                           1516854048      1516854048  0
    3                           2275281072     -2019686224  -1
    4                           3033708096     -1261259200  -1
    5                           3792135120      -502832176  -1
    6                           4550562144       255594848  0
    7                           5308989168      1014021872  0
    :                                    :               :  :

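    For k = 3, 4, and 5 the resulting int becomes negative because of the overflow, and fseek() is not happy with a negative offset relative to SEEK_SET. For k = 6 and 7 the wrapped value happens to be positive again, so fseek() returns 0 but seeks to the wrong position. The "Invalid argument" printed next to those 0s is stale: a successful call does not reset errno, so it is only meaningful right after a call that reported failure. (Strictly speaking, signed integer overflow is undefined behavior; the wraparound shown here is just what typically happens in practice.)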

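    First, make sure long is more than 4 bytes wide on your platform, because fseek() takes its offset as a long. Then change at least one operand of your multiplication to long, for example: eventsPerThread * eventSize * 4L * k. Note that the multiplication is evaluated left to right, so eventsPerThread * eventSize is still computed in int; that is fine here because 189606756 fits.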

    Final note: Consider using more whitespace to make your code more readable.
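
As a rough sketch of the fix (not the asker's actual code), the offset can be computed in 64-bit arithmetic and range-checked before it is handed to fseek(). The helper names segmentOffset, wrapToInt32, and seekToOffset are hypothetical. wrapToInt32 only mimics the typical 32-bit wraparound so the table above can be reproduced, and the sketch assumes each event is 8004 * 4 = 32016 bytes as in the question.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <limits>

// Byte offset of the first event handled by thread k (hypothetical helper).
// All operands are 64-bit, so the product cannot wrap for a ~12 GB file.
std::int64_t segmentOffset(std::int64_t eventsPerThread,
                           std::int64_t eventSizeBytes,
                           int k) {
    assert(k >= 0);
    return eventsPerThread * eventSizeBytes * k;
}

// Mimics what a 32-bit int typically ends up holding after the overflow:
// the low 32 bits of the true value, reinterpreted as a signed number.
std::int64_t wrapToInt32(std::int64_t value) {
    // Conversion to an unsigned type is well-defined modulo 2^32.
    const std::uint32_t low = static_cast<std::uint32_t>(value);
    const std::int64_t asSigned = static_cast<std::int64_t>(low);
    return low > 0x7FFFFFFFu ? asSigned - 4294967296LL : asSigned;
}

// Seeks from the start of the file, but only if the offset is non-negative
// and representable as long, the type fseek() accepts. Returns -1 otherwise.
int seekToOffset(std::FILE* stream, std::int64_t offset) {
    if (offset < 0 || offset > std::numeric_limits<long>::max()) {
        return -1;  // would not survive the conversion to long
    }
    return std::fseek(stream, static_cast<long>(offset), SEEK_SET);
}
```

With the question's numbers, segmentOffset(23689, 32016, 3) gives 2275281072, whereas wrapToInt32 of that value gives -2019686224, matching the row for k = 3. In the loop, the call would look like seekToOffset(segment[k], segmentOffset(eventsPerThread, 32016, k)). If long is only 4 bytes wide on the target platform, seekToOffset refuses large offsets instead of seeking to a wrong position; a 64-bit seek function such as POSIX fseeko() or Microsoft's _fseeki64() would be needed there.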