I've been working on a wrapper for the Presidio API and I can't figure out how to encrypt identified text. Right now, if I send the following POST request to a Docker instance, I get a different result each time. I believe the key is being read properly: if I reduce the number of bytes in it, I get an error.
POST http://localhost:5001/anonymize HTTP/1.1
content-type: application/json

{
    "text": "My Name is Bruce Haack",
    "language": "en",
    "analyzer_results": [
        {
            "analysis_explanation": null,
            "end": 22,
            "entity_type": "PERSON",
            "recognition_metadata": {
                "recognizer_identifier": "SpacyRecognizer_140445899326704",
                "recognizer_name": "SpacyRecognizer"
            },
            "score": 0.85,
            "start": 11
        }
    ],
    "anonymizers": {
        "PERSON": {
            "type": "encrypt",
            "key": "WmZq4t7w!z%C&F)J"
        }
    }
}
Examples of the results:
"text": "My Name is SmcaK2vH-6jd3nG5PRMM1DAJUw3Bjw8J7dMuEDiE4WM=",
"text": "My Name is gYq6LuoOFqHaSDo5TW-W4oyEPj1PSHsNvkO_MRfp5pc=",
"text": "My Name is zx1lxGQD2jziBkR9m0yYp2tPW31Wa_YNHHXcC7aoJ9c="
Obviously, if I try to deanonymize/decrypt, I get different, invalid results. If anyone knows what I'm doing wrong, please let me know. Unless I am missing something, MS does not have an API example of the anonymize endpoint using encryption.
Thanks in advance,
an awful hack
TL;DR: The ciphertexts posted are fine.
The Presidio API applies AES in CBC mode for encryption/decryption, see here.
With CBC (like any scheme with an IV), a new, random 16-byte IV must be generated for each encryption for security reasons, so that a different ciphertext is produced each time (even with identical plaintext and key). This is intentional and not an error.
As the IV is required for decryption, it must be passed to the decrypting side together with the ciphertext. For this purpose, the IV and ciphertext are concatenated in this order (i.e. the first 16 bytes are the IV). During decryption, the IV and ciphertext are first separated so that all the data required for decryption is available.
The Presidio API does all this (generation of a random IV and concatenation of IV and ciphertext during encryption, separation of IV and ciphertext during decryption) under the hood.
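The separation step can be checked with the standard library alone. The following is a minimal sketch, assuming the values are URL-safe Base64 (the `-` and `_` characters in the posted results suggest this) and that the first 16 bytes are the IV as described above. The helper name `split_iv_and_ciphertext` is made up for illustration and is not part of Presidio.

```python
import base64

# The three results posted in the question.
tokens = [
    "SmcaK2vH-6jd3nG5PRMM1DAJUw3Bjw8J7dMuEDiE4WM=",
    "gYq6LuoOFqHaSDo5TW-W4oyEPj1PSHsNvkO_MRfp5pc=",
    "zx1lxGQD2jziBkR9m0yYp2tPW31Wa_YNHHXcC7aoJ9c=",
]

IV_SIZE = 16  # AES block size in bytes


def split_iv_and_ciphertext(token):
    """Hypothetical helper: Base64-decode and split into (IV, ciphertext)."""
    raw = base64.urlsafe_b64decode(token)
    return raw[:IV_SIZE], raw[IV_SIZE:]


parts = [split_iv_and_ciphertext(t) for t in tokens]
for iv, ciphertext in parts:
    print("IV:", iv.hex(), "ciphertext:", ciphertext.hex())
```

Each value decodes to 32 bytes: a 16-byte IV followed by a single 16-byte ciphertext block, which is what one would expect for the 11-byte plaintext `Bruce Haack` after padding. The three IVs differ, which is why the three results differ.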
The validity of the ciphertexts posted in the question can easily be proven with CyberChef, e.g. using the example of the first ciphertext, by first separating the IV and ciphertext (see CyberChef), and then decrypting the ciphertext using the IV and key (see CyberChef).
Incidentally, the successful decryption with CyberChef also proves that PKCS#7 is used as padding. This can be seen more directly if the padding is disabled so that the PKCS#7 padding bytes are recognizable at the end (see CyberChef).
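For reference, PKCS#7 padding itself is simple enough to sketch in a few lines. This is only an illustration of the padding rule, not Presidio's code; the function names `pkcs7_pad` and `pkcs7_unpad` are made up for this example.

```python
BLOCK_SIZE = 16  # AES block size in bytes


def pkcs7_pad(data, block_size=BLOCK_SIZE):
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad_len = block_size - len(data) % block_size
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded):
    """Strip PKCS#7 padding, raising ValueError if it is malformed."""
    pad_len = padded[-1]
    if pad_len < 1 or pad_len > len(padded):
        raise ValueError("invalid PKCS#7 padding")
    if padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid PKCS#7 padding")
    return padded[:-pad_len]


padded = pkcs7_pad(b"Bruce Haack")
print(padded)               # 11 bytes of text followed by five 0x05 bytes
print(pkcs7_unpad(padded))  # b'Bruce Haack'
```

With padding disabled during decryption, those five trailing `0x05` bytes are what should be visible at the end of the decrypted block.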
Or with the Presidio API (using the decryption code from the Presidio documentation, here):
from presidio_anonymizer import DeanonymizeEngine
from presidio_anonymizer.entities import OperatorResult, OperatorConfig

engine = DeanonymizeEngine()

# start/end are the offsets of the encrypted value inside the anonymized text
deanonymized_result = engine.deanonymize(
    text="My Name is SmcaK2vH-6jd3nG5PRMM1DAJUw3Bjw8J7dMuEDiE4WM=",
    entities=[OperatorResult(start=11, end=55, entity_type="PERSON")],
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": "WmZq4t7w!z%C&F)J"})},
)

print(deanonymized_result.text)  # My Name is Bruce Haack
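Note that `end=55` in the call above refers to the anonymized text, not to the original one (where the name ended at 22). A small sketch of how these offsets can be derived, using only the standard library and assuming the encrypted value occurs once in the text:

```python
text = "My Name is SmcaK2vH-6jd3nG5PRMM1DAJUw3Bjw8J7dMuEDiE4WM="
token = "SmcaK2vH-6jd3nG5PRMM1DAJUw3Bjw8J7dMuEDiE4WM="

# Offsets of the encrypted value within the anonymized text
start = text.index(token)
end = start + len(token)

print(start, end)  # 11 55
```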