cmmapsystems-programming

Why is sequentially reading a large file row by row with mmap and madvise sequential slower than fgets?


Overview

I have a program bounded significantly by IO and am trying to speed it up. Using mmap seemed to be a good idea, but it actually degrades the performance relative to just using a series of fgets calls.

Some demo code

I've squeezed down demos to just the essentials, testing against an 800mb file with about 3.5million lines:

With fgets:

char buf[4096];
FILE * fp = fopen(argv[1], "r");

while(fgets(buf, 4096, fp) != 0) {
    // do stuff
}
fclose(fp);
return 0;

Runtime for 800mb file:

[juhani@xtest tests]$ time ./readfile /r/40/13479/14960 

real    0m25.614s
user    0m0.192s
sys 0m0.124s

The mmap version:

struct stat finfo;
int fh, len;
char * mem;
char * row, *end;
if(stat(argv[1], &finfo) == -1) return 0;
if((fh = open(argv[1], O_RDONLY)) == -1) return 0;

mem = (char*)mmap(NULL, finfo.st_size, PROT_READ, MAP_SHARED, fh, 0);
if(mem == (char*)-1) return 0;
madvise(mem, finfo.st_size, POSIX_MADV_SEQUENTIAL);
row = mem;
while((end = strchr(row, '\n')) != 0) {
    // do stuff
    row = end + 1;
}
munmap(mem, finfo.st_size);
close(fh);

Runtime varies quite a bit, but never faster than fgets:

[juhani@xtest tests]$ time ./readfile_map /r/40/13479/14960

real    0m28.891s
user    0m0.252s
sys 0m0.732s
[juhani@xtest tests]$ time ./readfile_map /r/40/13479/14960

real    0m42.605s
user    0m0.144s
sys 0m0.472s

Other notes

Questions


Solution

  • POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.

    The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entierly, whereas fgets has the IO entirely done in kernel space and just copies the pages into a buffer that doesn't change.

    This also has more potential for overlap, since the IO is done by some kernel thread.

    You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads reading the first byte of each page. This (or these) thread then would have all the page faults and the time your application thread would come at a particular page it would already be loaded.