I am writing a console program that has to process a bunch of configuration files every time it is run. In some scenarios this is measured to be a significant portion of the entire execution time (more than half for big configs and small tasks), so I decided to cache the preprocessed text configs in a binary format that is much faster to load.
To guard against a stale cache after someone updates the configs, I compare the std::filesystem::last_write_time
of the binary cache and of the original file. The logic can be summarized in the following minimal program:
#include <algorithm>
#include <cstdio>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
int main(int argc, const char** argv) {
    if (argc < 2) {
        std::printf("Usage: %s <NAME>\n", argc ? argv[0] : "");
        return 1;
    }
    const auto path = std::filesystem::path{ argv[1] };
    auto file = std::ifstream{ path };
    if (!file.good()) {
        // path::c_str() is wchar_t-based on Windows; go through string() for %s
        std::printf("File does not exist: '%s'\n", path.string().c_str());
        return 1;
    }
    // Don't worry about collisions with paths containing literal `#`,
    // real program handles it with proper escapes
    const auto cache_path = [](auto s) {
        std::replace(s.begin(), s.end(), '/', '#');
        return std::filesystem::temp_directory_path() / s;
    }(path.string());
    auto cache_file = std::ifstream{ cache_path };
    // Cache counts as fresh if it is strictly newer than the original file
    if (cache_file.good()
        && std::filesystem::last_write_time(cache_path)
               > std::filesystem::last_write_time(path)) {
        const auto cache_content
            = std::string{ std::istream_iterator<char>{ cache_file },
                           std::istream_iterator<char>{} };
        std::printf("Cache hit: %s\n", cache_content.c_str());
    } else {
        std::printf("Cache miss\n");
    }
    // Real program does not reprocess on cache hit,
    // here we do it to show the bug
    auto chars = std::unordered_map<char, size_t>{};
    std::for_each(
        std::istream_iterator<char>{ file },
        std::istream_iterator<char>{},
        [&chars](char c) { chars[c]++; }
    );
    const auto result = chars.size();
    std::printf("Processed: %zu\n", result);
    std::ofstream{ cache_path } << result;
}
The above test program works in the base case of writing directly to the file, but fails if another file is moved over it instead:
$ echo aaa > test.txt
$ ./alphabet test.txt
Cache miss
Processed: 1
$ ./alphabet test.txt
Cache hit: 1
Processed: 1
$ echo abc > test.txt
$ echo abcdef > test2.txt
$ ./alphabet test.txt
Cache miss
Processed: 3
$ ./alphabet test.txt
Cache hit: 3
Processed: 3
$ mv test2.txt test.txt
$ ./alphabet test.txt
Cache hit: 3
Processed: 6
The same problem occurs when an entire directory is swapped out:
$ mkdir a b
$ echo abc > a/test.txt
$ echo abcdef > b/test.txt
$ ./alphabet a/test.txt
$ rm -rf a/
$ mv b/ a/
$ ./alphabet a/test.txt
How can I properly protect against this? I can think of a couple of options, none of which I particularly like:
- Compare the file's inode to the one stored in the cache. While quick and robust, there is no portable way to get it, and it may not be supported by all filesystems.
- Watch the files for changes through OS facilities. Requires #ifdef'ing per OS or pulling in external libraries, plus some daemon or other server running between program invocations, and I am not sure it won't fail in some scenarios.
- Add an --invalidate-cache option and leave invalidation to the user. Half-joking.

I decided to go with a combination of the inode comparison and the cache-invalidation command, plus the file size check suggested by @Osyotr. The user can select between "quick" and "strict" modes, and there is a command to invalidate the cache.
The cache file is named original.filesize + '-' + encode(original.fullname), where encode
escapes any non-alphanumeric characters to ensure a valid filename. Storing this information in the filename allows quick mode to determine cache validity without reading either file.

With the example program, here are some test cases and how they are handled in the two modes:
- echo a > test.txt - new file, no cache, both modes work
- echo b > test.txt - modifies last write time, both modes work
- echo abc > test2.txt && mv test2.txt test.txt - last write time unmodified but size changes, both modes work
- echo xyz > test2.txt && mv test2.txt test.txt - last write time and size unmodified, quick mode fails but strict mode works