filesystems fat

Why did Windows use the FAT structure instead of a conventional linked list with a next pointer for each data block of a file?


Instead of storing the references to the next nodes in a table, why couldn't they just be stored as in a conventional linked list, that is, with a next pointer in each block?


Solution

  • This is due to alignment. FAT (and just about any other file system) stores file data in one or more whole sectors of the underlying storage. Because the underlying storage can only read and write whole sectors, this allocation allows efficient access to the contents of a file.

    Issues with interleaving

    When a program wants to store something in a file, it provides a buffer, say 1 MB of data. If the file's data sectors also had to hold pointers to their next sector, this pointer information would need to be interleaved with the actual user data. The file system would have to build another buffer (slightly larger than the provided 1 MB), copy into each output sector some of the user data plus the corresponding next pointer, and hand this new buffer to the storage. This would be somewhat inefficient. Unless the file system always writes file data to new sectors (and most usually don't), rewriting these next pointers would also be redundant.
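    The extra copy described above can be sketched roughly as follows. This is only an illustration of a hypothetical linked-list layout, not any real file system: the 512-byte sector, the 8-byte little-endian pointer, the `END_OF_CHAIN` marker and the `interleave` function are all assumptions made for the example.

```python
import struct
from typing import Sequence

SECTOR_SIZE = 512                          # assumed sector size in bytes
POINTER_SIZE = 8                           # assumed size of an in-sector next pointer
PAYLOAD_SIZE = SECTOR_SIZE - POINTER_SIZE  # user bytes that fit in one sector
END_OF_CHAIN = 0xFFFFFFFFFFFFFFFF          # hypothetical end-of-file marker


def interleave(user_data: bytes, sector_numbers: Sequence[int]) -> bytes:
    """Build the buffer a linked-list file system would have to write.

    Each output sector holds PAYLOAD_SIZE bytes of user data followed by
    the number of the next sector in the file.
    """
    needed = -(-len(user_data) // PAYLOAD_SIZE)  # ceiling division
    if len(sector_numbers) < needed:
        raise ValueError("not enough sectors allocated")
    out = bytearray()
    for i in range(needed):
        chunk = user_data[i * PAYLOAD_SIZE:(i + 1) * PAYLOAD_SIZE]
        chunk = chunk.ljust(PAYLOAD_SIZE, b"\0")  # pad the final sector
        nxt = sector_numbers[i + 1] if i + 1 < needed else END_OF_CHAIN
        out += chunk + struct.pack("<Q", nxt)
    return bytes(out)
```

    Under these assumptions, a 1 MiB user buffer needs 2081 sectors instead of 2048, and every byte of it has to be copied once more before it reaches the storage.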

    The bigger problem would arise when reading the file. Files would now behave like tape devices: with only the location of the first sector known from the file's primary metadata, reaching sector 1000 would require reading all the sectors before it in order: read sector 0, find the address of sector 1 from the loaded next pointer, read sector 1, and so on. With typical seek times of around 10 ms per random I/O (assuming a hard disk drive), reaching sector 1000 would take about 10 seconds. Even if the sectors are laid out sequentially, while the file system driver processes sector N's data the disk head will already be flying over the next sector, and by the time the read for sector N+1 is issued it may be too late, requiring the disk to rotate an entire revolution (8.3 ms for a 7200 RPM drive) before it can read that sector. The on-disk cache can and will help with that, though.
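    The numbers above can be checked with a small simulation. The in-memory `disk` dictionary, the `locate` function and the `SEEK_MS` constant are assumptions made for this sketch; the 10 ms figure is the one used in the paragraph above, not a measurement.

```python
SEEK_MS = 10.0  # assumed cost of one random I/O on a hard disk, in milliseconds


def locate(disk, first_sector, index):
    """Follow in-sector next pointers to find the sector holding block `index`.

    `disk` maps a sector number to a (payload, next_sector) tuple.
    Returns (sector_number, sectors_read_on_the_way).
    """
    sector = first_sector
    reads = 0
    for _ in range(index):
        _payload, sector = disk[sector]  # one full sector read per hop
        reads += 1
        if sector is None:
            raise IndexError("file is shorter than the requested index")
    return sector, reads


def revolution_ms(rpm):
    """Duration of one full platter revolution in milliseconds."""
    return 60_000.0 / rpm
```

    Locating block 1000 of a file takes 1000 dependent reads, so `reads * SEEK_MS / 1000` gives 10 seconds when every hop costs a seek, and `revolution_ms(7200)` gives roughly 8.3 ms for a missed sector.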

    Writing a single sector is usually an atomic operation (this depends on the hardware): reading the sector back after a power failure returns either its old content or the new one, with no intermediate states. Database applications usually need to know which writes are atomic. If the file system interleaved file data and metadata in the same sectors, it would need to report a sector size smaller than the actual one to the application: for example, 504 bytes instead of 512. But it can't do that, because applications usually assume the sector size is a power of 2. Furthermore, a file stored on such a file system would very likely be unusable if copied to another file system with a different reported sector size.

    Better approaches

    The FAT format is better because all next pointers are stored together in adjacent sectors. For FAT12, FAT16 and not very large FAT32 volumes, the entire table is small enough to fit in memory. FAT still records the blocks of a file as a linked list, so for efficient random access an implementation needs to cache the chain per file. On large enough volumes (which can hold large enough files), such a cache may no longer fit in memory.
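    A minimal sketch of that per-file chain cache, assuming the table is already loaded in memory as a mapping from cluster number to next cluster. The `END_OF_CHAIN` value, the 4096-byte cluster size and both function names are hypothetical; real FAT variants use reserved table values to mark the end of a chain.

```python
END_OF_CHAIN = -1  # hypothetical marker standing in for FAT's reserved values


def build_chain(fat, first_cluster):
    """Walk the in-memory table once and return the file's clusters in order."""
    chain = []
    seen = set()
    cluster = first_cluster
    while cluster != END_OF_CHAIN:
        if cluster in seen:
            raise ValueError("corrupt table: cluster chain contains a loop")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]  # table lookup, no data sector is read
    return chain


def cluster_for_offset(chain, offset, cluster_size=4096):
    """Random access: map a byte offset to its cluster via the cached chain."""
    return chain[offset // cluster_size]
```

    Building the chain costs one pass over the table entries of the file, after which any offset resolves with a single list index. The cached list grows with the file size, which is the memory concern mentioned above.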

    ext3 uses direct and indirect blocks. This simple format avoids the preprocessing that FAT requires and gets by with only a minimal number of additional reads per I/O when indirect blocks are needed. These additional reads are cached by the operating system, so their overhead is often negligible.
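    As a rough illustration of how few extra reads this needs, the sketch below computes how many indirect blocks sit between the inode and a given logical block. It assumes the classic ext2/ext3 layout of 12 direct pointers followed by one single, one double and one triple indirect pointer, with 4-byte block numbers; the function name and the default 4096-byte block size are choices made for the example.

```python
DIRECT_BLOCKS = 12  # direct pointers held in the inode (classic ext2/ext3 layout)
POINTER_SIZE = 4    # bytes per block number inside an indirect block


def indirection_level(logical_block, block_size=4096):
    """Return how many indirect blocks must be read to find `logical_block`.

    0 means the block number is stored directly in the inode.
    """
    if logical_block < 0:
        raise ValueError("logical block number must be non-negative")
    if logical_block < DIRECT_BLOCKS:
        return 0
    per_block = block_size // POINTER_SIZE  # pointers per indirect block
    remaining = logical_block - DIRECT_BLOCKS
    for level in (1, 2, 3):
        span = per_block ** level           # blocks reachable at this level
        if remaining < span:
            return level
        remaining -= span
    raise ValueError("block is beyond the range this layout can address")
```

    With 4096-byte blocks, the first 12 blocks need no extra read, the next 1024 need one, and even the largest files need at most three, all of which are good candidates for caching.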

    Other variants are also possible and used by various file systems.

    Random notes

    For the sake of completeness, some hard disk drives can be formatted with slightly larger sector sizes (say 520 bytes), so that the file system can pack 512 bytes of file data together with several bytes of metadata in the same sector. Yet because of the above, I don't believe anyone has used such formats for storing the address of the file's next sector. These additional bytes can be put to better use: additional checksums and timestamping come to mind. The timestamping, I believe, is used to improve the performance of some RAID systems. Still, such usage is rare, and most software can't work with these sector sizes at all.

    Some file systems can save the content of small enough files directly in the file metadata, without occupying distinct sectors. ReiserFS, for example, has its controversial tail packing. This is not important here: large files still benefit from having a proper mapping to storage sectors.