Half of the buffer used with ReadFile is corrupt. Regardless of the size of the buffer, half of it has the same corrupted character. I have look for anything that could be causing the read to stop early, etc. If I increase the size of the buffer, I see more of the file so it is not failing on a particular part of the file.
Visual Studio 2019. Windows 10.
#define MAXBUFFERSIZE 1024
DWORD bufferSize = MAXBUFFERSIZE;
_int64 fileRemaining;
HANDLE hFile;
DWORD dwBytesRead = 0;
//OVERLAPPED ol = { 0 };
LARGE_INTEGER dwPosition;
TCHAR* buffer;
hFile = CreateFile(
inputFilePath, // file to open
GENERIC_READ, // open for reading
FILE_SHARE_READ, // share for reading
NULL, // default security
OPEN_EXISTING, // existing file only
FILE_ATTRIBUTE_NORMAL, // normal file | FILE_FLAG_OVERLAPPED
NULL); // no attr. template
if (hFile == INVALID_HANDLE_VALUE)
{
DisplayErrorBox((LPWSTR)L"CreateFile");
return 0;
}
LARGE_INTEGER size;
GetFileSizeEx(hFile, &size);
_int64 fileSize = (__int64)size.QuadPart;
double gigabytes = fileSize * 9.3132e-10;
sendToReportWindow(L"file size: %lld bytes \(%.1f gigabytes\)\n", fileSize, gigabytes);
if(fileSize > MAXBUFFERSIZE)
{
buffer = new TCHAR[MAXBUFFERSIZE];
}
else
{
buffer = new TCHAR[fileSize];
}
fileRemaining = fileSize;
sendToReportWindow(L"file remaining: %lld bytes\n", fileRemaining);
while (fileRemaining) // outer loop. while file remaining, read file chunk to buffer
{
sendToReportWindow(L"fileRemaining:%d\n", fileRemaining);
if (bufferSize > fileRemaining) // as fileremaining gets smaller as file is processed, it eventually is smaller than the buffer
bufferSize = fileRemaining;
if (FALSE == ReadFile(hFile, buffer, bufferSize, &dwBytesRead, NULL))
{
sendToReportWindow(L"file read failed\n");
CloseHandle(hFile);
return 0;
}
fileRemaining -= bufferSize;
// bunch of commented out code (verified that it does not cause the corruption)
}
delete [] buffer;
Debugger html view (512 byte buffer)
Debugger html view (1024 byte buffer). This shows that file is probably not the source of the corruption.
Misc notes: I have been told that memory mapping the file does not provide an advantage since I am sequentially processing the file. Another advantage to this method is that when I detect particular and reoccurring tags in the WARC file I can skip ahead ~500 bytes and resume processing. This improves speed.
The reason is that you use a buffer array of type TCHAR
, and the size of TCHAR
type is 2 bytes. So the bufferSize set when you call the ReadFile function is actually filled into the buffer array every 2 bytes.
But the actual size of the buffer is sizeof(TCHAR) * fileSize
, so half of the buffer array you see is "corrupted"