I'm trying to design a privacy-first document management system, where the user's content should not even be readable by our team. The user should, however, be able to share documents with a specific set of users without compromising this level of privacy. When searching for information, the user should be able to find matching content that they themselves created or that was shared with them.
Assume 1000-10000 users. Content could often be < 1MB (mostly plain text files) but might have attachments like PDFs that could be multiple MBs.
Here's my solution, but I feel I might be missing something or trying to reinvent a pattern that already exists.
For each user, generate a secure key (let's call it K), encrypt it with the user's password, and save it in the database. We'll never store the user's password (even encrypted) in the DB and will require the user to enter it (or fetch it from a 3rd-party IdP after user login).
Index each document on the fly (we don't search through attachments), then encrypt the content using K and store it in the DB. We do not encrypt the index.
When a user searches and finds their own content, we decrypt it and serve it to the client.
When a user shares a document with others, we create an implicit group, generate a group key (say Kg), encrypt it with the owner's password, and save it against the owner. Then, for each recipient of the share, we generate a unique sharing password, encrypt Kg with it, and store the result against that user. The sharing password is encrypted using a server key (common across all users) and embedded in the sharing link sent to the recipient.
Does this approach seem secure? Are there loopholes here? The sharing password can be decrypted by someone who has already compromised the server key, but that seems a low-risk problem. Of course, the plain-text index isn't ideal: one could argue that an attacker can infer the contents of a document by finding all index entries pointing to it.
Is there a better approach?
First and foremost: If you are serious about privacy first, then you must generate any encryption key on the client side, so you as a service provider never ever see the key, and thus you must perform all encryption and decryption on the client, too.
In general, I recommend the hybrid encryption approach, which is commonly used:
The client generates an asymmetric key pair (used as a KEK = Key Encryption Key). The public key is published on your server, while the private key never leaves the client.
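To store content, the client generates a random symmetric key (the CEK = Content Encryption Key), encrypts the content with it, encrypts the CEK with its own public KEK, and uploads the encrypted content together with the encrypted CEK.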
Now the content can only be decrypted by the owner of the private KEK.
To share content with other users, the original client retrieves the other users' public KEK, encrypts the CEK with it and uploads the encrypted keys. Now the other users can download the encrypted CEK, decrypt it with their own KEK and thus access the CEK to decrypt the content.
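The storing and sharing steps described above can be sketched roughly as follows. This is only an illustration under assumptions, not a vetted design: it uses the third-party Python `cryptography` package, RSA-OAEP for the KEK and AES-GCM for the CEK are my choices rather than anything the scheme prescribes, and the function names (`new_kek`, `encrypt_document`, `share_cek`, `decrypt_document`) are hypothetical. In a real deployment this logic would run on the client, in whatever language the client uses.

```python
import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

# Assumption: RSA-OAEP with SHA-256 is used to wrap the CEK.
OAEP = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)


def new_kek():
    """Generate a user's asymmetric KEK; the private part stays on the client."""
    return rsa.generate_private_key(public_exponent=65537, key_size=2048)


def encrypt_document(plaintext: bytes, owner_public_kek):
    """Encrypt content with a fresh CEK, then wrap the CEK for the owner."""
    cek = AESGCM.generate_key(bit_length=256)
    nonce = os.urandom(12)  # must be unique per encryption under the same CEK
    ciphertext = AESGCM(cek).encrypt(nonce, plaintext, None)
    wrapped_cek = owner_public_kek.encrypt(cek, OAEP)
    # Everything returned here is safe to upload: the server never sees the CEK.
    return nonce, ciphertext, wrapped_cek


def share_cek(wrapped_cek: bytes, owner_private_kek, recipient_public_kek) -> bytes:
    """On the owner's client: unwrap the CEK and re-wrap it for a recipient."""
    cek = owner_private_kek.decrypt(wrapped_cek, OAEP)
    return recipient_public_kek.encrypt(cek, OAEP)


def decrypt_document(nonce: bytes, ciphertext: bytes, wrapped_cek: bytes, private_kek) -> bytes:
    """On any authorized client: unwrap the CEK, then decrypt the content."""
    cek = private_kek.decrypt(wrapped_cek, OAEP)
    return AESGCM(cek).decrypt(nonce, ciphertext, None)
```

Note that sharing only re-encrypts the 32-byte CEK; the (possibly multi-megabyte) content ciphertext stays on the server unchanged.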
Hybrid encryption has two big advantages over using asymmetric encryption alone:
If you really need to store the user's private KEKs on the server, then you must encrypt the key using the user's password on the client side before uploading it.
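As a rough sketch of that last point, the client could serialize the private KEK with password-based encryption before uploading it. The example assumes the third-party Python `cryptography` package and lets the library pick the password-based encryption scheme; the helper names `export_private_kek` and `import_private_kek` are hypothetical.

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa


def export_private_kek(private_kek, password: bytes) -> bytes:
    """Runs on the client: returns a password-encrypted PEM blob for upload."""
    return private_kek.private_bytes(
        encoding=serialization.Encoding.PEM,
        format=serialization.PrivateFormat.PKCS8,
        encryption_algorithm=serialization.BestAvailableEncryption(password),
    )


def import_private_kek(blob: bytes, password: bytes):
    """Runs on the client: decrypts the blob downloaded from the server."""
    return serialization.load_pem_private_key(blob, password=password)


# Usage: the server only ever stores `blob`, never the password or the raw key.
kek = rsa.generate_private_key(public_exponent=65537, key_size=2048)
blob = export_private_kek(kek, b"user password")
restored = import_private_kek(blob, b"user password")
```

The protection of the uploaded blob is then only as strong as the user's password, so a weak password still allows offline guessing by anyone who obtains the blob.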
You may consider transient use of encryption keys (i.e. using them "on the fly" but never storing them on the server) secure enough, but once they are on the server, anything can happen, and the user has lost control over them. This doesn't go well with privacy first.