filesystems, dvd-burning, iso9660

How can I predict the size of an ISO 9660 filesystem?


I'm archiving data to DVD, and I want to pack the DVDs full. I know the names and sizes of all the files I want on the DVD, but I don't know how much space is taken up by metadata. I want to get as many files as possible onto each DVD, so I'm using a Bubblesearch heuristic with greedy bin-packing: I try 10,000 alternative orderings and keep the best one. Because I know the sizes of all the files but not how files are stored in an ISO 9660 filesystem, I currently add a lot of slop for metadata. I'd like to cut down the slop.
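For concreteness, the packing loop I'm describing is roughly the following sketch: greedy first-fit over a descending-size order, plus randomized near-greedy reorderings in the style of Bubblesearch. The capacity constant and the perturbation probability are illustrative only, and the sizes here are raw file sizes with no metadata allowance (which is exactly the slop problem):

```python
import random

DVD_CAPACITY = 4_700_000_000  # nominal single-layer DVD, bytes (illustrative)

def first_fit(sizes):
    """Greedy first-fit: place each item in the first bin with room."""
    bins = []
    for size in sizes:
        for b in bins:
            if b['used'] + size <= DVD_CAPACITY:
                b['used'] += size
                b['items'].append(size)
                break
        else:
            bins.append({'used': size, 'items': [size]})
    return bins

def perturb(order, p):
    """Near-greedy order: scan the greedy order, taking each item with
    probability 1-p and deferring it to a later pass otherwise."""
    remaining = list(order)
    result = []
    while remaining:
        deferred = []
        for item in remaining:
            if random.random() < p:
                deferred.append(item)  # skip for now, retry next pass
            else:
                result.append(item)
        remaining = deferred
    return result

def bubblesearch(sizes, tries=10_000, p=0.25):
    """Start from the greedy (descending) order, then try randomized
    near-greedy orderings and keep the packing using the fewest bins."""
    order = sorted(sizes, reverse=True)
    best = first_fit(order)
    for _ in range(tries):
        candidate = first_fit(perturb(order, p))
        if len(candidate) < len(best):
            best = candidate
    return best
```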

I could use genisoimage -print-size, except that it is too slow: given 40,000 files occupying 500MB, it takes about 3 seconds per run, and at 10,000 tries per DVD that works out to over 8 hours per DVD, which is not in the cards. I've modified the genisoimage source before and am really not keen to try to squeeze the algorithm out of the source code; I am hoping someone knows a better way to get an estimate, or can point me to a helpful specification.
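For reference, the timing above comes from a wrapper along these lines (a minimal sketch; I'm assuming -print-size writes the sector count to stdout once -quiet suppresses the progress chatter, and the -r flag just reflects my setup):

```python
import subprocess

def iso_size_bytes(paths):
    """Exact image size via genisoimage, without writing the image.
    -print-size reports the size as a count of 2048-byte sectors."""
    result = subprocess.run(
        ['genisoimage', '-quiet', '-print-size', '-r'] + list(paths),
        capture_output=True, text=True, check=True)
    return int(result.stdout.strip()) * 2048  # sectors are 2048 bytes
```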


Clarifying the problem and the question:


Note: these files are on a machine with 3TB of hard-drive storage. In all cases the average file size is at least 10MB; sometimes it is significantly larger. So it is possible that genisoimage will be fast enough after all, but I doubt it: it appears to work by writing the ISO image to /dev/null, and I can't imagine that will be fast enough when the image size approaches 4.7GB. I don't have access to that machine right now, nor did I when I posted the original question. When I do have access this evening, I will try to get better numbers for the question. But I don't think genisoimage is going to be a good solution, although it might be a good way to learn a model of the filesystem that tells me how it works. Knowing that the block size is 2KB is already helpful.
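Given the 2KB block size, a crude first-cut estimator might look like the following; note that the two overhead constants are pure guesses to be calibrated, not numbers taken from the ISO 9660 spec:

```python
def naive_estimate(file_sizes, per_file_overhead=256,
                   fixed_overhead=1024 * 2048):
    """Crude estimate: every file occupies whole 2048-byte blocks, plus a
    guessed per-file directory-record cost and a guessed fixed cost for
    volume descriptors and path tables. Both overhead defaults are
    placeholders to calibrate against genisoimage, not ISO 9660 facts."""
    block = 2048
    data = sum((size + block - 1) // block * block for size in file_sizes)
    return data + per_file_overhead * len(file_sizes) + fixed_overhead
```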

It may also be useful to know that files in the same directory are burned to the same DVD, which simplifies the search. I want access to the files directly, which rules out tar-before-burning. (Most files are audio or video, which means there's no point in trying to hit them with gzip.)


Solution

  • Thanks for the detailed update. I'm satisfied that your current bin-packing strategy is pretty efficient.

    As to the question, "Exactly how much overhead does an ISO 9660 filesystem pack on for n files totalling b bytes?" there are only 2 possible answers:

    1. Someone has already written an efficient tool for measuring exactly this. A quick Google search turned up nothing, however, which is discouraging. It's possible someone on SO will respond with a link to their homebuilt tool, but if you get no more responses for a few days then that's probably out too.
    2. You need to read the readily available ISO 9660 specs and build such a tool yourself.

    Actually, there is a third answer:

    (3) You don't really care about using every last byte on each DVD. In that case, grab a small representative handful of files of different sizes (say 5), pad them till they are multiples of 2048 bytes, and put all 2^5 possible subsets through genisoimage -print-size. Then fit the equation nx + y = iso_size - total_input_size on that dataset, where n is the number of files in a given run, to find x, the number of bytes of overhead per file, and y, the constant amount of overhead (the size of an ISO 9660 filesystem containing no files); see the sketch below. Round x and y up and use that formula to estimate your ISO filesystem sizes for a given set of files. For safety, make sure you use the longest filenames that appear anywhere in your collection for the test filenames, and put each one under a separate directory hierarchy that is as deep as the deepest hierarchy in your collection.
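    A minimal sketch of that calibration, reusing the iso_size_bytes wrapper from the question above. It fits x and y by least squares over the nonempty subsets (the empty subset would give genisoimage nothing to image); the function name and structure are mine, not from any existing tool:

    ```python
    import itertools
    import math

    def calibrate(sample_files, sample_sizes, iso_size_bytes):
        """Fit per-file overhead x and constant overhead y by least squares
        on overhead = n*x + y, measured over every nonempty subset of the
        sample files (sizes already padded to multiples of 2048 bytes)."""
        points = []  # (n files in subset, measured overhead in bytes)
        indices = range(len(sample_files))
        for r in range(1, len(sample_files) + 1):  # skip the empty subset
            for subset in itertools.combinations(indices, r):
                files = [sample_files[i] for i in subset]
                total = sum(sample_sizes[i] for i in subset)
                points.append((r, iso_size_bytes(files) - total))
        n_mean = sum(n for n, _ in points) / len(points)
        o_mean = sum(o for _, o in points) / len(points)
        x = (sum((n - n_mean) * (o - o_mean) for n, o in points)
             / sum((n - n_mean) ** 2 for n, _ in points))
        y = o_mean - x * n_mean
        return math.ceil(x), math.ceil(y)  # round up for safety
    ```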