c++ performance io ifstream

Fast text file reading in C++


I am currently writing a program in C++ that has to read lots of large text files. Each has around 400,000 lines, with 4,000 or more characters per line in extreme cases. Just for testing, I read one of the files using ifstream and an implementation I found on cplusplus.com. On my machine it took around 60 seconds, which is quite long considering my workflow. Is there a straightforward way to improve the reading speed?

The code I am using is more or less this:

string tmpString;
ifstream txtFile(path);
if(txtFile.is_open())
{
    // test getline's result directly; looping on good() can count one line too many
    while(getline(txtFile, tmpString))
    {
        m_numLines++;
    }
    txtFile.close();
}

Notes:


After lots of back-and-forth discussion, a big thank-you to sehe for putting so much time into this. I appreciate it a lot!


Solution

  • Updates: Be sure to check the (surprising) updates below the initial answer


    Memory-mapped files have served me well [1]:

    #include <boost/iostreams/device/mapped_file.hpp> // for mmap
    #include <cstdint>    // for uintmax_t
    #include <cstring>    // for memchr
    #include <iostream>   // for std::cout
    
    int main()
    {
        boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
        auto f = mmap.const_data();
        auto l = f + mmap.size();
    
        uintmax_t m_numLines = 0;
        while (f && f!=l)
            if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
                m_numLines++, f++;
    
        std::cout << "m_numLines = " << m_numLines << "\n";
    }
    

    This should be rather quick.
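
    Counting newlines is rarely the end goal, so here is a minimal sketch of how the same memchr scan could hand out each line for processing without copying it. It assumes C++17 (for std::string_view); the name for_each_line is hypothetical and not part of Boost or the standard library. It works on any [first, last) character range, such as the one obtained from the mapped file.

```cpp
#include <cstddef>
#include <cstring>      // std::memchr
#include <string_view>

// Hypothetical helper: calls fn(std::string_view) once per line in [first, last).
// The views point into the caller's buffer, so no line is copied.
template <typename Fn>
void for_each_line(const char* first, const char* last, Fn&& fn)
{
    while (first != last) {
        const char* nl = static_cast<const char*>(
            std::memchr(first, '\n', static_cast<std::size_t>(last - first)));
        const char* end = nl ? nl : last;   // the last line may lack a trailing '\n'
        fn(std::string_view(first, static_cast<std::size_t>(end - first)));
        first = nl ? nl + 1 : last;         // skip past the newline, or stop
    }
}
```

    With the mapped file above this could be called as for_each_line(f, l, callback). The views are only valid while the mapping is alive, so copy anything that has to outlive it.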

    Update

    In case it helps you test this approach, here's a version that uses mmap directly instead of Boost (see it live on Coliru):

    #include <cstdint>    // uintmax_t
    #include <cstdio>     // perror
    #include <cstdlib>    // exit
    #include <cstring>    // memchr
    #include <iostream>
    
    // for mmap:
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    
    const char* map_file(const char* fname, size_t& length);
    
    int main()
    {
        size_t length;
        auto f = map_file("test.cpp", length);
        auto l = f + length;
    
        uintmax_t m_numLines = 0;
        while (f && f!=l)
            if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
                m_numLines++, f++;
    
        std::cout << "m_numLines = " << m_numLines << "\n";
    }
    
    void handle_error(const char* msg) {
        perror(msg); 
        exit(255);
    }
    
    const char* map_file(const char* fname, size_t& length)
    {
        int fd = open(fname, O_RDONLY);
        if (fd == -1)
            handle_error("open");
    
        // obtain file size
        struct stat sb;
        if (fstat(fd, &sb) == -1)
            handle_error("fstat");
    
        length = sb.st_size;
    
        const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
        if (addr == MAP_FAILED)
            handle_error("mmap");
    
        // TODO close fd at some point in time, call munmap(...)
        return addr;
    }
    

    Update

    The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken by the memory-mapped version above:

    static uintmax_t wc(char const *fname)
    {
        static const auto BUFFER_SIZE = 16*1024;
        int fd = open(fname, O_RDONLY);
        if(fd == -1)
            handle_error("open");
    
        /* Advise the kernel of our access pattern.  */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);  // use the named constant; its numeric value is platform-specific
    
        char buf[BUFFER_SIZE + 1];
        uintmax_t lines = 0;
    
        ssize_t bytes_read;
        while ((bytes_read = read(fd, buf, BUFFER_SIZE)) != 0)
        {
            if (bytes_read == -1)
                handle_error("read failed");
    
            // count the newlines in the bytes just read
            for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
                ++lines;
        }
    
        close(fd);
        return lines;
    }
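
    For platforms without the POSIX calls, the same chunked idea can be sketched with standard streams only. This is a swapped-in variant, not the code from wc: count_lines_chunked is a hypothetical name, the 16 KiB default simply mirrors the buffer size above, and whether it reaches the speed of the read() version would have to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>      // std::memchr
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical portable counterpart of the wc-style loop: read fixed-size
// chunks with std::ifstream::read and count '\n' bytes with memchr.
std::uintmax_t count_lines_chunked(const std::string& path,
                                   std::size_t buffer_size = 16 * 1024)
{
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::vector<char> buf(buffer_size);
    std::uintmax_t lines = 0;

    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        // read() sets failbit on a short final chunk, so rely on gcount()
        const std::streamsize got = in.gcount();
        if (got <= 0)
            break;

        const char* p = buf.data();
        const char* end = p + got;
        while ((p = static_cast<const char*>(
                    std::memchr(p, '\n', static_cast<std::size_t>(end - p))))) {
            ++lines;
            ++p;    // continue the scan after the newline just found
        }
    }
    return lines;
}
```

    Like the versions above, this counts newline characters, so a final line without a trailing '\n' is not counted.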
    

    [1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?