Tags: c, multithreading, chunks, file-read

How to read a file with JSON-like objects in C with multithreading


I am trying to read a file in C with multiple threads, but since I divide it into chunks based on file size, some chunks may start or end in the middle of a line. I was trying to adjust the chunk boundaries when that happens. The lines are not all the same length. The file I am reading is very large, about 100 MB.

First of all, is this a good approach, or should multithreading only be used for other tasks, such as processing the lines?

My approach was to shift the start and the end of a chunk forward until each lands at the end of a line. That way, if a chunk happens to fall entirely inside a line (which can occur when the number of threads exceeds the number of lines), its start and end coincide and I do not start that thread. Most of the time it reads just fine, but sometimes one line is not read, a line is read twice by the same thread, or a small portion of a line is read by another thread.

Is there any issue when different threads read the same file simultaneously?

long chunk_size = file_size / num_threads;

pthread_t threads[num_threads];
ThreadData thread_data[num_threads];

long last_end = 0;

for (uint32_t i = 0; i < num_threads; ++i)
{
    thread_data[i].stats = stats;
    thread_data[i].thread_tweets = NULL;
    thread_data[i].failed = 0;

    thread_data[i].file = file;
    thread_data[i].start = i * chunk_size;
    thread_data[i].end = (i == num_threads - 1) ? file_size : (i + 1) * chunk_size;
    
    if (i > 0)
    {
        if (thread_data[i].end < thread_data[i - 1].start)
        {
            thread_data[i].failed = 1;
            continue;
        }
    }
    int ch;
    // Adjust start position to the beginning of the next line
    if (!is_start_at_line_boundary(file, thread_data[i].start))
    {
        fseek(file, thread_data[i].start, SEEK_SET);
        while ((ch = fgetc(file)) != '\n' && ch != EOF);
        thread_data[i].start = ftell(file);
    }

    // Adjust end position to the end of the line
    fseek(file, thread_data[i].end, SEEK_SET);
    while ((ch = fgetc(file)) != '\n' && ch != EOF);
    thread_data[i].end = ftell(file);
    if (ch != '\n' && ch != EOF)
    {
        thread_data[i].end++;
    }
    // If they coincide, the chunk was inside a line and the thread shouldn't run
    if (thread_data[i].end == thread_data[i].start)
    {
        thread_data[i].failed = 1;
        continue;
    }
    if (i > 0)
    { 
        thread_data[i].start = last_end;
    }
    if (pthread_create(&threads[i], NULL, read_file_chunk, &thread_data[i]))
    {
        fprintf(stderr, "Error creating thread\n");
        exit(EXIT_FAILURE);
    }
    last_end = thread_data[i].end;
}
int is_start_at_line_boundary(FILE *file, long start)
{
    if (start == 0)
    {
        return 1; // Start of the file
    }
    fseek(file, start - 1, SEEK_SET);
    if (fgetc(file) == '\n')
    {
        return 1; // Start is at the beginning of a line
    }
    return 0;
}

The function read_file_chunk uses fseek() to go to the start of the chunk and reads the whole chunk line by line with fgets(), calling a parsing function for each line. Each line contains an individual JSON-like object that uses ; instead of , as the separator, e.g.:

{"created_at": "2020-01-14 12:00:00"; "hashtags": ["A", "B"]; "id": 546542; "uid": 1500}

Should I use a JSON library instead of assuming each line is a JSON object? Would that be more efficient and sensible?


Solution

  • I am trying to read a very large file of about 100 MB. First of all, is this a good approach, or should multithreading only be used for other tasks, such as processing the lines?

    Very roughly, a typical hard drive read speed is 150 MB/s; for an SSD, it might be 400 MB/s. Many factors affect actual throughput, but I think it's safe to assume that a single-threaded sequential read of a 100 MB file shouldn't take more than a second (100 MB ÷ 150 MB/s ≈ 0.7 s). It could be substantially faster if the file is "hot" (cached by the filesystem or OS).

    Both HDDs and SSDs are sequential devices, so even if you have multiple threads trying to read different chunks of a file at the same time, each thread will likely have to wait for other threads' reads.

    In my experimentation, if you need to read an entire file into memory, the simplest way is also the fastest: Issue one read for the entire file.
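
    For illustration, a minimal single-read version might look like the following (read_whole_file is just an illustrative name, and error handling is abbreviated):

    #include <stdio.h>
    #include <stdlib.h>

    // Read an entire file into one heap buffer with a single fread() call.
    // Returns a NUL-terminated buffer the caller must free(), or NULL on error.
    char *read_whole_file(const char *path, long *out_size)
    {
        FILE *file = fopen(path, "rb");
        if (!file)
            return NULL;

        fseek(file, 0, SEEK_END);
        long size = ftell(file);
        fseek(file, 0, SEEK_SET);

        char *buf = malloc(size + 1);
        if (buf != NULL && fread(buf, 1, (size_t)size, file) == (size_t)size)
        {
            buf[size] = '\0'; // lets the parser treat it as one big string
            *out_size = size;
        }
        else
        {
            free(buf);
            buf = NULL;
        }
        fclose(file);
        return buf;
    }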

    (Mapping a file into memory can be a good idea if you expect that you might not need to bring the entire file into memory and/or you need to read the file out of order. Neither seems to apply to your case.)
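
    (If you did want to go that route, a POSIX mmap() sketch would look roughly like this; map_whole_file is an illustrative name, and error handling is abbreviated.)

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Map a file read-only into memory. Returns MAP_FAILED on error;
    // the caller releases the mapping with munmap(ptr, size).
    char *map_whole_file(const char *path, size_t *out_size)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return MAP_FAILED;

        struct stat st;
        if (fstat(fd, &st) != 0)
        {
            close(fd);
            return MAP_FAILED;
        }

        char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd); // the mapping remains valid after the fd is closed
        *out_size = (size_t)st.st_size;
        return data;
    }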

    Beware of premature optimization. Start with a working single-threaded implementation.

    If it's not fast enough, then you have to measure to figure out where the bottleneck is. In your case, it might be the time it takes to read the file into memory or it might be the JSON parsing.
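
    One way to measure, assuming a POSIX system (now_seconds is an illustrative helper), is to timestamp each phase separately:

    #include <stdio.h>
    #include <time.h>

    // Wall-clock seconds from a monotonic clock (POSIX clock_gettime()).
    static double now_seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    // Usage: bracket each phase and print the differences.
    // double t0 = now_seconds();
    // ... read the file ...
    // double t1 = now_seconds();
    // ... parse the data ...
    // double t2 = now_seconds();
    // printf("read: %.3f s, parse: %.3f s\n", t1 - t0, t2 - t1);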

    If reading the file is the bottleneck, I'd look for solutions other than multithreading.

    If parsing the JSON is the bottleneck, multithreading might be one possible solution to explore. In that case, I'd initially leave the single-threaded I/O as is and apply multithreading to the parsing of the data, as sketched below.
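
    A rough sketch of that shape (assuming the whole file is already in buf, e.g. via the read_whole_file helper above, and that parse_line stands in for your per-line parser): split the in-memory buffer at newline boundaries and hand each slice to a thread.

    #include <pthread.h>
    #include <string.h>

    typedef struct {
        char *begin; // first byte of this thread's slice
        char *end;   // one past the last byte of the slice
    } ParseJob;

    static void *parse_slice(void *arg)
    {
        ParseJob *job = arg;
        char *line = job->begin;
        while (line < job->end)
        {
            char *nl = memchr(line, '\n', (size_t)(job->end - line));
            if (!nl)
                nl = job->end;
            // parse_line(line, nl - line); // your per-line parser goes here
            line = nl + 1;
        }
        return NULL;
    }

    // Split buf[0 .. size) into num_threads slices that end on '\n'
    // and parse each slice in its own thread. Error checking omitted.
    void parse_in_parallel(char *buf, long size, int num_threads)
    {
        pthread_t threads[num_threads];
        ParseJob jobs[num_threads];
        char *pos = buf;

        for (int i = 0; i < num_threads; ++i)
        {
            char *guess = buf + size * (long)(i + 1) / num_threads;
            if (guess < pos) // a long line swallowed this slice's share
                guess = pos;
            // Extend the split point to the next newline so no line is cut.
            char *nl = (i == num_threads - 1)
                           ? NULL
                           : memchr(guess, '\n', (size_t)(buf + size - guess));
            jobs[i].begin = pos;
            jobs[i].end = nl ? nl + 1 : buf + size;
            pos = jobs[i].end;
            pthread_create(&threads[i], NULL, parse_slice, &jobs[i]);
        }
        for (int i = 0; i < num_threads; ++i)
            pthread_join(threads[i], NULL);
    }

    Because the slices are computed up front in a single thread and each worker only touches its own region of memory, no thread ever seeks a shared FILE * while another is reading it, which is one hazard of the chunked-read approach.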

    (I'm not denying that there are some circumstances where you might want to break the read up into portions, but I wouldn't start there.)