I'm tweaking a garbage collector class whose job is to iterate through a disk cache of files in subdirs, find any that are "out of date" and remove them. Further, if it finds any empty subdirectories it should remove those too. The first half is working but I'm struggling to implement the directory removal portion.
The cache on disk is structured along these lines, with the subdirectory names being hashes in this example:
/path/to/cache
-> 7a68fba56
----> some_file.jpg
----> another_file.png
----> some_stuff.svg
-> b12a43293
----> some_picture.webp
----> selfie.avif
----> afile.png
----> profit.jpg
----> A-team-crew.jpg
-> 0fe2ba852
----> ...
Here's the code so far:
class myGarbageCollector
{
    protected $timeout = 1209600; // two weeks: 60 * 60 * 24 * 14

    public function __construct(array $directories)
    {
        foreach ($directories as $directory) {
            $this->deleteStaleFiles($directory);
        }
    }

    private function deleteStaleFiles($path)
    {
        $now = time();
        $dirs = new \RecursiveDirectoryIterator($path, \FilesystemIterator::SKIP_DOTS);
        $iterator = new \RecursiveIteratorIterator($dirs, \RecursiveIteratorIterator::SELF_FIRST);

        foreach ($iterator as $file) {
            $filepath = $file->getRealPath();

            // Delete the file if:
            // * it's an image, and
            // * it's a dead link, or was last accessed more than $timeout seconds ago
            if ($file->isFile()
                && strpos(mime_content_type($filepath), 'image/') === 0
                && (
                    ($file->isLink() && !$file->isReadable())
                    || ($now - $file->getATime()) > $this->timeout
                )
            ) {
                unlink($filepath);
            }

            // Remove empty directories too.
            if ($file->isDir()) {
                // Need to somehow count the files in this subdir and
                // rmdir($filepath) if there are zero files inside
            }
        }
    }
}
I think I'm coming unstuck with the recursive iterator. It traverses into the directory, checks the files and deletes them, but I can't see a neat way of accessing the 'current' directory of the iterator to call getChildren() on it (or whatever), because $file is an instance of SplFileInfo, which exposes limited information about the current item.
I appreciate that, on any given sweep, it may remove all files in a subdirectory and, by that time, it's "gone past" the directory so won't remove it. That's fine, because on the next sweep - providing more files haven't been added to that subdirectory in the meantime - it will remove the empty directory. I don't mind that (although if the subdirectory deletion can be done in the same sweep, that's even better).
Maybe I could instantiate a new DirectoryIterator() using the current path? But that negates the point of the Recursive Iterator, surely?
Am I better to treat this as two separate things and use a regular DirectoryIterator to traverse the subdirs immediately below cache and then, for each one, create a new DirectoryIterator to check for files inside it?
I'm not married to Iterators so if something like scandir() or glob() is more efficient then that's cool. There could be thousands of files in the cache, so I'm looking for the most efficient solution.
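For reference, the two-pass idea floated above might look something like this rough sketch (names are mine; it uses FilesystemIterator rather than DirectoryIterator because it skips the `.` and `..` entries by default, which makes the emptiness check trivial):

```php
// Sketch: walk the first-level subdirectories of the cache, then
// test each one for emptiness with a second, throwaway iterator.
function removeEmptySubdirs(string $cachePath): void
{
    foreach (new \FilesystemIterator($cachePath) as $entry) {
        if (!$entry->isDir()) {
            continue;
        }
        // FilesystemIterator skips dot entries, so valid() is false
        // exactly when the directory has no real children.
        $children = new \FilesystemIterator($entry->getPathname());
        if (!$children->valid()) {
            rmdir($entry->getPathname());
        }
    }
}
```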
You can specify CHILD_FIRST instead of SELF_FIRST, so that each folder is visited after the files inside it. That way you can delete the stale files and then check whether the folder has any children left, all in the same sweep.
$iterator = new \RecursiveIteratorIterator($dirs, \RecursiveIteratorIterator::CHILD_FIRST);

foreach ($iterator as $file) {
    // ... the file-deletion logic from above ...

    // Remove the directory once its (now-deleted) contents have been visited.
    // Note: RecursiveIteratorIterator has no getChildren() of its own;
    // callGetChildren() forwards to the inner RecursiveDirectoryIterator
    // for the current entry.
    if ($file->isDir() && iterator_count($iterator->callGetChildren()) === 0) {
        rmdir($file->getRealPath());
    }
}