c++ winapi readfile warc

Half of read buffer is corrupt when using ReadFile


Half of the buffer used with ReadFile is corrupt. Regardless of the size of the buffer, half of it is filled with the same corrupted character. I have looked for anything that could be causing the read to stop early, but found nothing. If I increase the size of the buffer, I see more of the file, so it is not failing on a particular part of the file.

Visual Studio 2019. Windows 10.

#define MAXBUFFERSIZE 1024
DWORD bufferSize = MAXBUFFERSIZE;
_int64 fileRemaining;

HANDLE hFile;
DWORD  dwBytesRead = 0;
//OVERLAPPED ol = { 0 };
LARGE_INTEGER dwPosition;

TCHAR* buffer;

hFile = CreateFile(
    inputFilePath,         // file to open
    GENERIC_READ,          // open for reading
    FILE_SHARE_READ,       // share for reading
    NULL,                  // default security
    OPEN_EXISTING,         // existing file only
    FILE_ATTRIBUTE_NORMAL, // normal file    | FILE_FLAG_OVERLAPPED
    NULL);                 // no attr. template

if (hFile == INVALID_HANDLE_VALUE)
{
    DisplayErrorBox((LPWSTR)L"CreateFile");
    return 0;
}

LARGE_INTEGER size;
GetFileSizeEx(hFile, &size);

_int64 fileSize = (__int64)size.QuadPart;
double gigabytes = fileSize * 9.3132e-10;
sendToReportWindow(L"file size: %lld bytes (%.1f gigabytes)\n", fileSize, gigabytes);

if(fileSize > MAXBUFFERSIZE)
{
    buffer = new TCHAR[MAXBUFFERSIZE];
}
else
{
    buffer = new TCHAR[fileSize];
}
fileRemaining = fileSize;

sendToReportWindow(L"file remaining: %lld bytes\n", fileRemaining);

while (fileRemaining)                                       // outer loop. while file remaining, read file chunk to buffer
{
    sendToReportWindow(L"fileRemaining:%lld\n", fileRemaining);

    if (bufferSize > fileRemaining)                         // as fileremaining gets smaller as file is processed, it eventually is smaller than the buffer
        bufferSize = fileRemaining;

    if (FALSE == ReadFile(hFile, buffer, bufferSize, &dwBytesRead, NULL))
    {
        sendToReportWindow(L"file read failed\n");
        CloseHandle(hFile);
        return 0;
    }

    fileRemaining -= bufferSize;

 // bunch of commented out code (verified that it does not cause the corruption)
}
delete [] buffer;

Debugger html view (512 byte buffer): [screenshot]

Debugger html view (1024 byte buffer): [screenshot]. This shows that the file is probably not the source of the corruption.

Misc notes: I have been told that memory mapping the file does not provide an advantage since I am processing the file sequentially. Another advantage of reading in chunks is that when I detect particular recurring tags in the WARC file, I can skip ahead ~500 bytes and resume processing, which improves speed.


Solution

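  • The reason is that the buffer is an array of TCHAR, and in a Unicode build TCHAR is wchar_t, which is 2 bytes on Windows. ReadFile counts in bytes, not in array elements, so a call with bufferSize = 1024 writes 1024 bytes, and those bytes are packed two per TCHAR element into the first 512 elements only.

    The actual size of the allocation is sizeof(TCHAR) * MAXBUFFERSIZE bytes (or sizeof(TCHAR) * fileSize for a small file), which is twice what ReadFile fills. The second half of the array is never written, so it shows up in the debugger as the "corrupted" half.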
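
    The mismatch can be reproduced without the Windows API. The following is only a minimal sketch under stated assumptions: WideUnit (a char16_t alias) stands in for a 2-byte TCHAR, readBytes is a hypothetical stand-in for ReadFile that copies a given number of bytes, and 0xCD is an arbitrary sentinel marking elements that were never written. None of these names come from the question's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumption: char16_t models a Unicode-build TCHAR (2 bytes on Windows).
using WideUnit = char16_t;
static_assert(sizeof(WideUnit) == 2, "this sketch assumes a 2-byte wide unit");

// Hypothetical stand-in for ReadFile: copies byteCount BYTES into dest.
inline void readBytes(void* dest, const std::vector<unsigned char>& src,
                      std::size_t byteCount) {
    assert(byteCount <= src.size());
    std::memcpy(dest, src.data(), byteCount);
}

// Allocates n 2-byte elements, "reads" n bytes into them, and returns
// how many elements still hold the sentinel (were never written).
std::size_t untouchedWideElements(std::size_t n) {
    const WideUnit sentinel = 0xCDCD;
    std::vector<WideUnit> buffer(n, sentinel);
    std::vector<unsigned char> file(n, 'A');   // fake file contents
    readBytes(buffer.data(), file, n);         // n bytes, not n elements
    std::size_t untouched = 0;
    for (WideUnit u : buffer) {
        if (u == sentinel) ++untouched;
    }
    return untouched;
}

// Same experiment with a byte buffer: n elements, n bytes read.
std::size_t untouchedByteElements(std::size_t n) {
    const unsigned char sentinel = 0xCD;
    std::vector<unsigned char> buffer(n, sentinel);
    std::vector<unsigned char> file(n, 'A');
    readBytes(buffer.data(), file, n);
    std::size_t untouched = 0;
    for (unsigned char b : buffer) {
        if (b == sentinel) ++untouched;
    }
    return untouched;
}
```

    With n = 1024, untouchedWideElements returns 512 (half of the array is never written) while untouchedByteElements returns 0. Applied to the code in the question, this suggests declaring the buffer as a byte type such as char or BYTE instead of TCHAR, so that one array element corresponds to one byte read. It is probably also safer to subtract dwBytesRead rather than bufferSize from fileRemaining, since ReadFile reports the number of bytes actually read there.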