c++file-iofstreammmap

mmap() vs. reading blocks


I'm working on a program that will be processing files that could potentially be 100GB or more in size. The files contain sets of variable length records. I've got a first implementation up and running and am now looking towards improving performance, particularly at doing I/O more efficiently since the input file gets scanned many times.

Is there a rule of thumb for using mmap() versus reading in blocks via C++'s fstream library? What I'd like to do is read large blocks from disk into a buffer, process complete records from the buffer, and then read more.

The mmap() code could potentially get very messy since mmap'd blocks need to lie on page sized boundaries (my understanding) and records could potentially lie across page boundaries. With fstreams, I can just seek to the start of a record and begin reading again, since we're not limited to reading blocks that lie on page sized boundaries.

How can I decide between these two options without actually writing up a complete implementation first? Any rules of thumb (e.g., mmap() is 2x faster) or simple tests?


Solution

  • I was trying to find the final word on mmap / read performance on Linux and I came across a nice post (link) on the Linux kernel mailing list. It's from 2000, so there have been many improvements to IO and virtual memory in the kernel since then, but it nicely explains the reason why mmap or read might be faster or slower.

    However,

    The discussion of mmap/read reminds me of two other performance discussions:

    Conclusion: Use memory maps if you access data randomly, keep it around for a long time, or if you know you can share it with other processes (MAP_SHARED isn't very interesting if there is no actual sharing). Read files normally if you access data sequentially or discard it after reading. And if either method makes your program less complex, do that. For many real world cases there's no sure way to show one is faster without testing your actual application and NOT a benchmark.

    (Sorry for necro'ing this question, but I was looking for an answer and this question kept coming up at the top of Google results.)