php, filesystems, ext3

How scalable is this file-based DB approach?


I have a simple PHP script that calculates some things about a given string input. It caches the results to a database and we occasionally delete entries that are older than a certain number of days.

Our programmers implemented this database as:

function cachedCalculateThing($input) {
  $cacheFile = 'cache/' . sha1($input) . '.dat';
  if (file_exists($cacheFile)) {
    return json_decode(file_get_contents($cacheFile));
  }
  $retval = ... // the actual calculation
  file_put_contents($cacheFile, json_encode($retval));
  return $retval;
}
function cleanCache() {
  $stale = time() - 7*24*3600;
  foreach (new DirectoryIterator('cache/') as $fileInfo) {
    if ($fileInfo->isFile() && $fileInfo->getCTime() < $stale) {
      unlink($fileInfo->getRealPath());
    }
  }
}

We use Ubuntu LAMP and ext3. At what number of entries does cache lookup become non-constant or violate a hard limit?


Solution

  • While that particular code is not very "scalable"* at all, there are a number of things that can improve it:

    1. sha1() takes a string, so a non-string $input has to be serialized or json_encoded before the hash is calculated. Hash the serialized form rather than the raw variable to protect against unexpected inputs (see the first sketch after this list).
    2. Use crc32 instead of sha1; it's faster (see "Fastest hash for non-cryptographic uses?").
    3. The directory 'cache/' is relative to the current working directory, so as the page's working dir changes, so will the cache dir, and you'll see an artificially high number of cache misses.
    4. Every time you store a file in cachedCalculateThing(), also record the filename in an index in /dev/shm/indexedofcaches (or something like that), and check that index before calling file_exists(). ext3 is slow, and the cache files, along with the kernel's ext3 directory index, will get paged out, which means a directory scan is hit every time you ask whether the file exists. For a small cache that is fast enough, but for a big one you'll see a slowdown. (A sketch of such an index is shown after this list.)
    5. Writes will block, so a server load limit will be hit when the cache is empty, and collisions will occur on the cache filename when two or more PHP writers come along at the same time trying to write a previously non-existent cache file. So you may want to catch those errors and/or do lock-file testing (see the locking sketch after this list).
    6. We're considering this code in a somewhat virgin environment. The truth is that writes will also block for an indeterminate amount of time depending on current disk utilization. If your disk is a spinning one, or an older SSD, you may see very, very slow writes. Check iostat -x 4 and look at your current disk utilization: if it is already above, say, 25%, adding this disk-based cache will spike it to 100% at random times and slow all web service down, because requests to the disk have to be queued and serviced (generally) in order (not always, but don't bank on it).
    7. Depending upon the size of the cache files, maybe store them directly in /dev/shm/my_cache_files/. If they all fit comfortably in memory, you keep the disk entirely out of the service chain. You then need a cron job to check the overall cache size and make sure it doesn't eat all your memory (a sketch is included after this list). The disadvantage is that it's non-persistent, though you can schedule backups of it too.
    8. Do not call cleanCache() in the runtime/service code. That directory iteration scan is going to be super slow and block.
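
    A minimal sketch of points 1-3, assuming the cache directory lives next to the script; the helper name cacheFileFor() is just an illustration:

      // Hash the serialized form of $input so non-string inputs work too (point 1),
      // use crc32 since the hash is not security-sensitive (point 2), and anchor the
      // cache directory to the script location instead of the working dir (point 3).
      function cacheFileFor($input) {
        $key = crc32(json_encode($input));           // serialize first, then hash
        return __DIR__ . '/cache/' . $key . '.dat';  // absolute path, immune to chdir()
      }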
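
    For point 4, one way to keep an index in /dev/shm is a small JSON file consulted before touching the ext3 directory. This is only a sketch under my own assumptions; the index path and function names are made up for illustration:

      define('CACHE_INDEX', '/dev/shm/cache_index.json');   // tmpfs, so reads never hit the disk

      function cacheIndexHas($cacheFile) {
        $raw = @file_get_contents(CACHE_INDEX);
        $index = ($raw === false) ? array() : json_decode($raw, true);
        return isset($index[$cacheFile]);     // only fall back to file_exists() on an index miss
      }

      function cacheIndexAdd($cacheFile) {
        $raw = @file_get_contents(CACHE_INDEX);
        $index = ($raw === false) ? array() : json_decode($raw, true);
        $index[$cacheFile] = time();
        file_put_contents(CACHE_INDEX, json_encode($index), LOCK_EX);
      }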
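
    For point 5, a common pattern is to write to a temporary file and rename() it into place, holding a non-blocking lock so two PHP workers don't rebuild the same entry at once. Again just a sketch with assumed names:

      function writeCacheAtomically($cacheFile, $json) {
        $lock = fopen($cacheFile . '.lock', 'c');
        if ($lock === false || !flock($lock, LOCK_EX | LOCK_NB)) {
          return false;                      // another worker is already writing this entry
        }
        $tmp = $cacheFile . '.tmp.' . getmypid();
        file_put_contents($tmp, $json);
        rename($tmp, $cacheFile);            // atomic on the same filesystem
        flock($lock, LOCK_UN);
        fclose($lock);
        return true;
      }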
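
    For point 7, the cron-run size check could be as simple as the following; the path and the 256 MB budget are arbitrary assumptions:

      $dir   = '/dev/shm/my_cache_files';
      $limit = 256 * 1024 * 1024;            // keep the tmpfs cache under ~256 MB
      $files = array();
      $total = 0;
      foreach (new DirectoryIterator($dir) as $f) {
        if ($f->isFile()) {
          $files[] = array($f->getMTime(), $f->getSize(), $f->getRealPath());
          $total  += $f->getSize();
        }
      }
      sort($files);                          // oldest mtime first
      foreach ($files as $f) {
        if ($total <= $limit) break;
        unlink($f[2]);                       // evict until we're back under budget
        $total -= $f[1];
      }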

    *Scalability is usually defined in terms of linear request speed or parallel server resources. That code:

    1. (-) Depending upon when/where that cleanCache() function is run, it effectively blocks on the directory indexing until every item in the cache dir has been scanned. So it should go into a cron job. If it runs from a cron/shell job, there are much faster ways to delete expired caches, for instance: find ./cache -type f -mtime +7 -exec rm -f "{}" \;
    2. (-) You are right to mention ext3 -- its indexing and lookup speed for small files and very large directory contents is relatively poor. Google noatime for the index, and if you can move the cache directory to a separate volume, you can turn off the journal, avoiding double-writes, or use a different filesystem type. Also check whether the dir_index feature is enabled on that filesystem (tune2fs -l will list it). Here is a benchmark link: http://fsi-viewer.blogspot.com/2011/10/filesystem-benchmarks-part-i.html
    3. (+) Directory cache entries are much easier to distribute to other servers with rsync than, say, database replication.
    4. (+/-) Really it depends on how many different cache items you will be storing and how frequently accessed. For small numbers of files, say 10-100, less than 100K, frequent hits, then the kernel will keep the cache-files paged in memory and you'll see no serious slowdown at all (if properly implemented).

    The main takeaway is that achieving real scalability and good performance from this caching system takes more consideration than the short block of code shows. There may be more limits than the ones I've enumerated, but even these depend on variables such as size, number of entries, requests per second, current disk load, filesystem type, etc. -- things that are external to the code. That is to be expected, because a cache persists outside of the code. The code as listed can perform for a small boutique set of caching with low request numbers, but may not for the bigger sizes one comes to need caching for.

    Also, are you running Apache in threaded or prefork mode? It is going to affect how PHP blocks on its reads and writes.

    -- Um, I probably should have added that you want to track both your object and its key/hash. If $input is already a string, it is in its base form / has already been computed, retrieved, serialized, etc. If $input is the key, then file_put_contents() needs to be given something else to store (the actual variable/contents). If $input is the object to look up (which could be a long string, or even a short one), then it needs a lookup key, otherwise no computation is being bypassed or saved. (A combined sketch follows below.)
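
    To make that key/value distinction concrete, a revised cachedCalculateThing() might look roughly like this, combining the sketches above; calculateThing() stands in for the elided computation and is purely hypothetical:

      function cachedCalculateThing($input) {
        // $input is the object being looked up; the hash of its serialized form is the key.
        $cacheFile = cacheFileFor($input);
        if (cacheIndexHas($cacheFile) || file_exists($cacheFile)) {
          $cached = @file_get_contents($cacheFile);
          if ($cached !== false) {
            return json_decode($cached);      // cache hit: skip the computation
          }
        }
        $retval = calculateThing($input);     // the actual (expensive) computation
        if (writeCacheAtomically($cacheFile, json_encode($retval))) {
          cacheIndexAdd($cacheFile);
        }
        return $retval;
      }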