I am working on a number of websites with files dating back to 2000. These sites have grown organically over time, resulting in large numbers of orphaned web pages, include files, images, CSS files, JavaScript files, etc. These orphaned files cause a number of problems, including poor maintainability, possible security holes, and a poor customer experience, and they drive OCD/GTD freaks like myself crazy.
These files number in the thousands, so a completely manual solution is not feasible. Ultimately, the cleanup process will require a fairly large QA effort to ensure we have not inadvertently deleted needed files, but I am hoping to develop a technological solution to help speed up the manual effort. Additionally, I hope to put processes and utilities in place to help prevent this state of disorganization from recurring.
Environment Considerations:
Before I start I would like to get some feedback from others who have successfully navigated a similar process.
Specifically I am looking for:
I am not looking for:
Step 1: Establish a list of pages on your site which are definitely visible. One intelligent way to create this list is to parse your log files for pages people visit.
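As a rough sketch of the log-parsing idea in Step 1, the following pulls the successfully served paths out of access-log lines. It assumes the logs are in Common or Combined Log Format (your server may be configured differently), and `visited_paths` is a made-up helper name, not part of any tool mentioned here.

```python
import re
from urllib.parse import urlsplit

# Matches the request and status fields of a Common/Combined Log Format line,
# e.g. ... "GET /about.html?x=1 HTTP/1.1" 200 512
# Assumption: your server writes this format; adjust the pattern if not.
REQUEST_RE = re.compile(r'"(?:GET|HEAD|POST) (\S+) [^"]*" (\d{3}) ')


def visited_paths(log_lines):
    """Return the set of URL paths that were served successfully (2xx or 304)."""
    paths = set()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue  # not a request line we recognise
        target, status = match.group(1), int(match.group(2))
        if 200 <= status < 300 or status == 304:
            # Drop the query string so /page?x=1 and /page count as one page.
            paths.add(urlsplit(target).path)
    return paths
```

You could feed it an open log file directly, e.g. `visited_paths(open("access.log"))`, and then write the resulting paths into the seed page used in the next step.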
Step 2: Run a tool that recursively discovers the site topology, starting from a specially written page (one you create on your site) that links to every page from step 1. One tool that can do this is Xenu's Link Sleuth. It is intended for finding dead links, but it lists live links as well. It runs externally, so there are no security concerns about installing 'weird' software on your server. Keep an eye on the crawl while it runs: a bug or a set of dynamically generated links can make the site appear to have an infinite number of pages, in which case the crawl will never finish on its own.
Step 3: Run a tool that recursively maps your hard disk, starting from your site's web directory. I can't think of one off the top of my head, but writing one should be trivial, and a script you wrote yourself is safer to run on your server than third-party software.
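A minimal sketch of such a disk mapper, using only the standard library. It assumes that a file's location under the web root maps directly to its URL path, which will not hold if the server uses rewrites or aliases; `files_on_disk` is a made-up name.

```python
import os


def files_on_disk(web_root):
    """Return every file under web_root as a site-relative URL path."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(web_root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), web_root)
            # Normalise Windows separators so paths compare equal to URL paths.
            found.add("/" + rel.replace(os.sep, "/"))
    return found
```

For example, `files_on_disk("/var/www/site")` (a placeholder path) would report a file at `/var/www/site/img/logo.gif` as `/img/logo.gif`.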
Step 4: Take the results of steps 2 and 3 and programmatically match #2 against #3. Anything in #3 that is not in #2 is potentially an orphan.
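The comparison in Step 4 is essentially a set difference. Here is one way it might look, assuming the crawl results are available as a plain list of URLs and the disk listing as site-relative paths such as `/img/logo.gif`. The function name and the `index_names` parameter are illustrative assumptions, and the default index file name should be changed to whatever your server actually serves for a directory.

```python
from urllib.parse import urlsplit, unquote


def potential_orphans(disk_paths, crawled_urls, index_names=("index.html",)):
    """Return disk paths that no crawled URL points at, sorted for review."""
    reachable = set()
    for url in crawled_urls:
        # Keep only the path: scheme, host, query string and fragment are dropped.
        path = unquote(urlsplit(url).path) or "/"
        reachable.add(path)
        if path.endswith("/"):
            # A directory URL is assumed to serve one of the index files.
            for name in index_names:
                reachable.add(path + name)
    return sorted(set(disk_paths) - reachable)


disk = {"/index.html", "/about.html", "/old/promo.html"}
crawled = ["http://example.com/", "http://example.com/about.html?ref=1"]
print(potential_orphans(disk, crawled))  # prints ['/old/promo.html']
```

Treat the output as a list of candidates for manual review rather than a delete list, since files referenced only from server-side includes or scripts will also show up here.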
Note: This technique works poorly with password-protected stuff, and also works poorly with sites relying heavily on dynamically generated links (dynamic content is fine if the links are consistent).