Please note: Even though I mention Azure Databricks here, I believe this is a Python/GNUPG problem at heart, and as such, can be answered by anybody with Python/GNUPG encryption experience.
I have the following Python code in my Azure Databricks notebook:
%python
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, lit
from pyspark.sql.types import StringType
import os
import gnupg
from azure.storage.blob import BlobServiceClient, BlobPrefix
import hashlib
from pyspark.sql import Row
from pyspark.sql.functions import collect_list
# Initialize Spark session
spark = SparkSession.builder.appName("DecryptData").getOrCreate()
storage_account_name = "mycontainer"
storage_account_key = "<redacted>"
spark.conf.set(f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net", storage_account_key)
clientsDF = spark.read.table("myapp.internal.Clients")
row = clientsDF.first()
clientsLabel = row["Label"]
encryptedFilesSource = f"wasbs://{clientsLabel}@mycontainer.blob.core.windows.net/data/*"
decryptedDF = spark.sql(f"""
SELECT
REVERSE(SUBSTRING_INDEX(REVERSE(input_file_name()), '/', 1)) AS FileName,
REPLACE(value, '"', '[Q]') AS FileData,
'{clientsLabel}' as ClientLabel
FROM
read_files(
'{encryptedFilesSource}',
format => 'text',
wholeText => true
)
""")
decryptedDF.show()
decryptedDF = decryptedDF.select("FileData")
encryptedData = decryptedDF.first()['FileData']
def decrypt_pgp_data(encrypted_data, private_key_data, passphrase):
    # Initialize GPG object
    gpg = gnupg.GPG()
    print("Loading private key...")
    # Load private key
    private_key = gpg.import_keys(private_key_data)
    if private_key.count == 1:
        keyid = private_key.fingerprints[0]
        gpg.trust_keys(keyid, 'TRUST_ULTIMATE')
    print("Private key loaded, attempting decryption...")
    try:
        decrypted_data = gpg.decrypt(encrypted_data, passphrase=passphrase, always_trust=True)
    except Exception as e:
        print("Error during decryption:", e)
        return
    print("Decryption finished and decrypted_data is of type: " + str(type(decrypted_data)))
    if decrypted_data.ok:
        print("Decryption successful!")
        print("Decrypted Data:")
        print(decrypted_data.data.decode())
    else:
        print("Decryption failed.")
        print("Status:", decrypted_data.status)
        print("Error:", decrypted_data.stderr)
        print("Trust Level:", decrypted_data.trust_text)
        print("Valid:", decrypted_data.valid)
private_key_data = '''-----BEGIN PGP PRIVATE KEY BLOCK-----
<redacted>
-----END PGP PRIVATE KEY BLOCK-----'''
passphrase = '<redacted>'
encrypted_data = b'encryptedData'
decrypt_pgp_data(encrypted_data, private_key_data, passphrase)
As you can see, I am reading PGP-encrypted files from an Azure Blob Storage account container into a DataFrame, and then sending the first row (I'll change this notebook to work on all rows later) through a decrypter function that uses GNUPG.
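One thing worth ruling out with this approach: if the .pgp files are binary (not ASCII-armored), reading them with format => 'text' may not preserve their bytes. The sketch below is plain Python, not Spark, and the sample bytes are made up for illustration. It assumes the text reader replaces invalid UTF-8 sequences with U+FFFD, and shows that such a decode cannot be reversed.

```python
# Hypothetical stand-in for the first bytes of a binary (non-armored) PGP file.
raw = bytes([0x85, 0x01, 0x0C, 0x03, 0xFF, 0xC1, 0x6D, 0x49])

# Assumption: reading the file as text effectively does this, turning every
# invalid UTF-8 sequence into the replacement character U+FFFD.
as_text = raw.decode("utf-8", errors="replace")

# Encoding the text back to bytes does not restore the original payload,
# because the replaced bytes are gone for good.
round_tripped = as_text.encode("utf-8")

print("contains replacement characters:", "\ufffd" in as_text)
print("round trip preserved the bytes:", round_tripped == raw)
```

If that is what is happening here, reading the files as raw bytes rather than as text would be the thing to try.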
When this runs it gives me the following output in the driver logs:
+--------------------+--------------------+-------+
| FileName| FileData| ClientLabel |
+--------------------+--------------------+-------+
| fizz.pgp|���mIj�h�#{... | acme|
+--------------------+--------------------+-------+
Decrypting: <redacted>
Loading private key...
WARNING:gnupg:gpg returned a non-zero error code: 2
Private key loaded, attempting decryption...
Decryption finished and decrypted_data is of type: <class 'gnupg.Crypt'>
Decryption failed.
Status: no data was provided
Error: gpg: no valid OpenPGP data found.
[GNUPG:] NODATA 1
[GNUPG:] NODATA 2
[GNUPG:] FAILURE decrypt 4294967295
gpg: decrypt_message failed: Unknown system error
Trust Level: None
Valid: False
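Since gpg reports "no valid OpenPGP data found", a quick look at what is actually being passed to gpg.decrypt may help narrow things down. The following is a minimal sketch of such a check. The helper names looks_like_openpgp and describe_payload are hypothetical, and the test is only a heuristic: ASCII-armored data starts with a "-----BEGIN PGP " line, and binary OpenPGP packets start with a byte whose high bit is set.

```python
ARMOR_PREFIX = b"-----BEGIN PGP "


def looks_like_openpgp(data: bytes) -> bool:
    """Heuristic only: True if data could plausibly be OpenPGP input."""
    if not data:
        return False
    # ASCII-armored messages begin with a "-----BEGIN PGP ..." header line.
    if data.lstrip().startswith(ARMOR_PREFIX):
        return True
    # Binary OpenPGP packets begin with a tag byte whose high bit is set.
    return bool(data[0] & 0x80)


def describe_payload(data: bytes) -> str:
    """Summarize a payload so it can be printed right before decrypting."""
    return (
        f"type={type(data).__name__} len={len(data)} "
        f"head={data[:16]!r} openpgp_like={looks_like_openpgp(data)}"
    )


# Example with the value that the notebook passes to decrypt_pgp_data.
print(describe_payload(b'encryptedData'))
```

Note that b'encryptedData' is a bytes literal holding the 13 characters of the word itself, not the contents of the encryptedData variable, so this check reports openpgp_like=False for it. A passing check does not prove the data is intact, only that it is not obviously something else.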
Can anyone spot why decryption is failing, or help me troubleshoot it to pin down the culprit? Attaching a debugger is not an option since this is happening inside a notebook.
The problem at hand is that Python does not have any modern modules/libraries that can perform PGP decryption without a dependency on the gpg native binary installed and accessible from a shell.
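Whether that binary is actually reachable from the notebook's Python process can be checked with the standard library alone. This is a minimal sketch; the helper name find_gpg_binary and the candidate names "gpg" and "gpg2" are assumptions for illustration.

```python
import shutil


def find_gpg_binary(candidates=("gpg", "gpg2")):
    """Return the path of the first candidate executable found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


gpg_path = find_gpg_binary()
print(gpg_path or "no gpg binary found on PATH")
```

If a path is found, it can be handed to python-gnupg explicitly through the gpgbinary argument of gnupg.GPG. If nothing is found, every call into the wrapper will fail no matter how the key and data are prepared.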
I ended up writing a Scala notebook that uses PainlessGPG, although I had to create a custom "fat" (shaded) JAR for all of PainlessPGP's transitive dependencies, and this would not be feasible for any developer who isn't strong with Java.
TL;DR --> Python-based decryption from inside an ADB notebook is not advisable.