Random Access File Format for hierarchically organised binary and text files

A Windows desktop application, whose development team I'm part of uses the legacy MFC CArchive as its application file format to serialise both text files and binary files to and from disk. The app is used to localise strings contained in these text/binary files, and the CArchive encapsulates a translation "project," so it is generated as one monolithic file containing one or more of these sub-files.

This file format is showing its age in many ways and we are looking to change to something more modern. Our main concerns are that it is slow and takes up huge amounts of memory; it's not random access, so accessing an arbitrary file within the archive or even just generating a directory listing requires loading the entire thing into memory, so the space and time consumption of manipulating the archive is dependant on its size, and it's not feasible to do in-place updates of the archive either.

Finally, extending the format is painful, as it involves us littering our code with conditional statements that serialise certain fields (or not) to or from the archive depending on the value of the archive's version stamp.

I've spent some time looking around at the alternatives, and the ones that stand out are ZIP/7Z or SQLite, as ZIP already has most of the file management/indexing functionality built in, while SQLite would be ideal for the storage, retrieval and searching of strings, so I'm thinking some combination of the two technologies might be the way to go.

As far I can see, the trick would be organising or partitioning the SQLite DB in such a way that as it grows, it does not slow down, and searches can be restricted to individual files, either by creating one table per file or one DB per file, I'm not sure.

Has anyone else tried such a thing and if so, any advice?

Thanks

Solution

As a file-based database, SQLite can be used to implement an application file format.

If all you want to do is to store embedded files, you can put a bunch of blobs into a table (see sqlar for an example). But if you want to model the internal structure of those files, you can of course have more complex tables.

To restrict searches to files, you'd just have to store something to identify the file:

CREATE TABLE Strings (
    StringID  INTEGER PRIMARY KEY,
    FileID    REFERENCES FileTable(FileID),
    Value     TEXT,
    [...]
);

so that you can restrict your queries:

SELECT * FROM Strings WHERE Value = 'hello' AND FileID = 42;

If you do not want to search for entire strings but words inside them, consider using the full-text search extension.