
Python GNUPG Unknown system error when loading private key


Please note: Even though I mention Azure Databricks here, I believe this is a Python/GNUPG problem at heart, and as such, can be answered by anybody with Python/GNUPG encryption experience.


I have the following Python code in my Azure Databricks notebook:

%python

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, lit
from pyspark.sql.types import StringType
import os
import gnupg
from azure.storage.blob import BlobServiceClient, BlobPrefix
import hashlib
from pyspark.sql import Row
from pyspark.sql.functions import collect_list

# Initialize Spark session
spark = SparkSession.builder.appName("DecryptData").getOrCreate()

storage_account_name = "mycontainer"
storage_account_key = "<redacted>"
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)

clientsDF = spark.read.table("myapp.internal.Clients")
row = clientsDF.first()
clientsLabel = row["Label"]
encryptedFilesSource = f"wasbs://{clientsLabel}@mycontainer.blob.core.windows.net/data/*"

decryptedDF = spark.sql(f"""
SELECT
  REVERSE(SUBSTRING_INDEX(REVERSE(input_file_name()), '/', 1)) AS FileName,
  REPLACE(value, '"', '[Q]') AS FileData,
  '{clientsLabel}' as ClientLabel
FROM
  read_files(
    '{encryptedFilesSource}',
    format => 'text',
    wholeText => true
  )
""")

decryptedDF.show()
decryptedDF = decryptedDF.select("FileData")
encryptedData = decryptedDF.first()['FileData']

def decrypt_pgp_data(encrypted_data, private_key_data, passphrase):
    # Initialize GPG object
    gpg = gnupg.GPG()

    print("Loading private key...")

    # Load private key
    private_key = gpg.import_keys(private_key_data)
    if private_key.count == 1:
        keyid = private_key.fingerprints[0]
        gpg.trust_keys(keyid, 'TRUST_ULTIMATE')    
    print("Private key loaded, attempting decryption...")

    try:
        decrypted_data = gpg.decrypt(encrypted_data, passphrase=passphrase, always_trust=True)
    except Exception as e:
        print("Error during decryption:", e)
        return
    
    print("Decryption finished and decrypted_data is of type: " + str(type(decrypted_data)))

    if decrypted_data.ok:
        print("Decryption successful!")
        print("Decrypted Data:")
        print(decrypted_data.data.decode())
    else:
        print("Decryption failed.")
        print("Status:", decrypted_data.status)
        print("Error:", decrypted_data.stderr)
        print("Trust Level:", decrypted_data.trust_text)
        print("Valid:", decrypted_data.valid)


private_key_data = '''-----BEGIN PGP PRIVATE KEY BLOCK-----

<redacted>

-----END PGP PRIVATE KEY BLOCK-----'''

passphrase = '<redacted>'

encrypted_data = b'encryptedData'

decrypt_pgp_data(encrypted_data, private_key_data, passphrase)

As you can see, I am reading PGP-encrypted files from an Azure Blob Storage account container into a Dataframe, and then sending the first row (I'll change this notebook to work on all rows later) through a decrypter function that uses GNUPG.

When this runs it gives me the following output in the driver logs:

+--------+---------------+-----------+
|FileName|       FileData|ClientLabel|
+--------+---------------+-----------+
|fizz.pgp|���mIj�h�#{... |       acme|
+--------+---------------+-----------+

Decrypting: <redacted>
Loading private key...
WARNING:gnupg:gpg returned a non-zero error code: 2
Private key loaded, attempting decryption...
Decryption finished and decrypted_data is of type: <class 'gnupg.Crypt'>
Decryption failed.
Status: no data was provided
Error: gpg: no valid OpenPGP data found.
[GNUPG:] NODATA 1
[GNUPG:] NODATA 2
[GNUPG:] FAILURE decrypt 4294967295
gpg: decrypt_message failed: Unknown system error

Trust Level: None
Valid: False

Can anyone spot why decryption is failing, or help me troubleshoot it to pin down the culprit? Attaching a debugger is not an option because this runs inside a notebook. I'm thinking:

  1. Perhaps I'm using the GNUPG API completely wrong
  2. Perhaps there's something malformed or improperly formatted with the private key I'm reading in from an in-memory string variable
  3. Perhaps the encrypted data is malformed (I've seen some internet rumblings of endianness causing this type of error)
  4. Maybe GNUPG isn't trusting my private key for some reason
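
One way to narrow down hypotheses 2 and 3 without a debugger is to look at what is actually handed to gpg before calling decrypt. The sketch below is a hypothetical helper (looks_like_openpgp is not part of python-gnupg) that only checks whether a byte string starts like ASCII-armored or binary OpenPGP data. It is a rough heuristic, not a validator: it would not detect data that was altered by being read as text. It would, however, flag the value passed in the code above, which is the literal b'encryptedData' rather than the contents of the encryptedData variable.

```python
ARMOR_PREFIX = b"-----BEGIN PGP "


def looks_like_openpgp(data: bytes) -> bool:
    """Rough heuristic: True if data starts like OpenPGP armor or a binary packet."""
    if not data:
        return False
    # ASCII-armored data begins with a "-----BEGIN PGP ..." header line.
    if data.lstrip().startswith(ARMOR_PREFIX):
        return True
    # In binary OpenPGP data, the first octet of a packet header has its high bit set.
    return bool(data[0] & 0x80)


print(looks_like_openpgp(b"encryptedData"))                    # literal string, not a message
print(looks_like_openpgp(b"-----BEGIN PGP MESSAGE-----\n"))    # armored header
print(looks_like_openpgp(bytes([0x85, 0x01, 0x0C])))           # typical binary packet start
```

Calling this on the value right before gpg.decrypt (and printing its type and first few bytes) shows whether the input is plausible OpenPGP data at all, which is what the "no valid OpenPGP data found" message is complaining about.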

Can anyone spot where I'm going awry?


Solution

  • The underlying problem is that Python has no modern module or library that can perform PGP decryption without depending on the native gpg binary being installed and reachable from a shell.

    I ended up writing a Scala notebook that uses PGPainless, although I had to build a custom "fat" (shaded) JAR containing all of PGPainless's transitive dependencies, which would not be feasible for a developer who isn't comfortable with Java.

    TL;DR --> Python-based decryption from inside an Azure Databricks (ADB) notebook is not advisable.
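
Because python-gnupg drives the gpg executable rather than implementing PGP itself, a quick check of whether that binary is visible from the notebook's Python process can confirm the dependency described above. This is a minimal sketch using only the standard library; the helper name find_gpg_binary and the candidate names are illustrative assumptions, not part of any library.

```python
import shutil
from typing import Optional, Sequence


def find_gpg_binary(candidates: Sequence[str] = ("gpg", "gpg2")) -> Optional[str]:
    """Return the full path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


gpg_path = find_gpg_binary()
if gpg_path is None:
    print("No gpg binary found on PATH; python-gnupg has nothing to call.")
else:
    print(f"gpg found at {gpg_path}")
```

If a path is found, it can be passed explicitly via gnupg.GPG(gpgbinary=gpg_path) so there is no ambiguity about which executable is being used.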