I am using Pydantic to validate and type an incoming S3 Event in an AWS Lambda function.
The event looks like this (only including relevant bits):
{
    "Records": [
        {
            "s3": {
                "bucket": {
                    "name": "my-bucket"
                },
                "object": {
                    "key": "MYKEY%28CSV%29/XXXX.CSV"
                }
            }
        }
    ]
}
I define my model like this to get the relevant information:
from pydantic import BaseModel

class ObjectInfo(BaseModel):
    key: str

class BucketInfo(BaseModel):
    name: str

class S3Schema(BaseModel):
    bucket: BucketInfo
    object: ObjectInfo

class Record(BaseModel):
    s3: S3Schema

class DeletionEvent(BaseModel):
    Records: list[Record]

def handler(event: dict, _):
    eventTyped = DeletionEvent(**event)
    return True
Now the problem is that the correct value for key is supposed to be MYKEY(CSV)/XXXX.CSV, not MYKEY%28CSV%29/XXXX.CSV. I usually fix this issue using urllib.parse.unquote_plus to decode the %XX bits representing special characters. I think I can define a custom decoder, but this seems like overkill.
Is there any way to get Pydantic to do this decoding for me? It has a bunch of classes for working with URLs, but I don't see anything about decoding URL-encoded strings by themselves.
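For reference, the manual decoding step mentioned above looks roughly like the sketch below. The second key is a made-up example (not from the event above) in which a space was encoded as +, to show where unquote and unquote_plus differ.

```python
from urllib.parse import unquote, unquote_plus

# Key taken from the event shown above
raw_key = "MYKEY%28CSV%29/XXXX.CSV"
decoded_key = unquote_plus(raw_key)
print(decoded_key)  # MYKEY(CSV)/XXXX.CSV

# Hypothetical key where a space was encoded as '+'
spaced_key = "my+report%28final%29.csv"
print(unquote(spaced_key))       # my+report(final).csv  ('+' is left alone)
print(unquote_plus(spaced_key))  # my report(final).csv  ('+' becomes a space)
```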
I took my own advice and looked into building a custom decoder. It still feels like Pydantic should have a better way. Here is the solution I've found:
from urllib.parse import unquote
from typing_extensions import Annotated

from pydantic import (
    BaseModel,
    EncodedStr,
    EncoderProtocol
)

# This is the class that will be used to "decode" my URL string
class MyEncoder(EncoderProtocol):
    @classmethod
    def decode(cls, data: bytes) -> bytes:
        # We have to use unquote rather than unquote_plus because
        # only unquote can work with bytes objects.
        # This may be a limitation if your URL string contains encoded spaces.
        return unquote(data).encode()

MyEncodedStr = Annotated[str, EncodedStr(encoder=MyEncoder)]

class ObjectInfo(BaseModel):
    key: MyEncodedStr

class BucketInfo(BaseModel):
    name: str

class S3Schema(BaseModel):
    bucket: BucketInfo
    object: ObjectInfo

class Record(BaseModel):
    s3: S3Schema

class DeletionEvent(BaseModel):
    Records: list[Record]

event = {
    "Records": [
        {
            "s3": {
                "bucket": {
                    "name": "my-bucket"
                },
                "object": {
                    "key": "MYKEY%28CSV%29/XXXX.CSV"
                }
            }
        }
    ]
}

eventTyped = DeletionEvent(**event)
This properly converts the URL-encoded string "MYKEY%28CSV%29/XXXX.CSV" to the normal string "MYKEY(CSV)/XXXX.CSV".
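If the limitation with encoded spaces noted in the code comment matters, one possible alternative is to run the decoding as a validator on the already-validated str, which allows unquote_plus to be used. This is only a sketch assuming Pydantic v2; the names url_decode and UrlDecodedStr are made up for illustration and are not part of Pydantic.

```python
from typing import Annotated
from urllib.parse import unquote_plus

from pydantic import AfterValidator, BaseModel


def url_decode(value: str) -> str:
    """Hypothetical helper: decode %XX escapes and '+' in a validated str."""
    return unquote_plus(value)


# Hypothetical alias: a str that is URL-decoded after normal str validation
UrlDecodedStr = Annotated[str, AfterValidator(url_decode)]


class ObjectInfo(BaseModel):
    key: UrlDecodedStr


info = ObjectInfo(key="MYKEY%28CSV%29/XXXX.CSV")
print(info.key)  # MYKEY(CSV)/XXXX.CSV
```

Because the validator receives a str rather than bytes, no round trip through bytes is needed. The trade-off is that this only decodes on validation; it does nothing on serialization.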
My understanding of how the EncodedStr approach works:
1. Pydantic first converts the str to bytes behind the scenes.
2. MyEncoder.decode is called on the bytes object.
3. urllib.parse.unquote is used to decode the URL string and returns a str.
4. Pydantic expects decode to return a bytes object, so the str is encoded back to bytes before it is returned.
5. Pydantic converts the bytes object back to a str.
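
The steps above can be traced by hand in plain Python. This is only a sketch of the flow as I understand it, not Pydantic's actual internals; the standalone decode function and the variable names are illustrative. Passing bytes to unquote requires Python 3.9 or later.

```python
from urllib.parse import unquote


def decode(data: bytes) -> bytes:
    # Same body as MyEncoder.decode: unquote returns a str, so encode it again
    return unquote(data).encode()


raw = "MYKEY%28CSV%29/XXXX.CSV"

as_bytes = raw.encode()      # the str is converted to bytes
decoded = decode(as_bytes)   # decode is called on bytes and returns bytes
result = decoded.decode()    # the bytes are converted back to a str

print(result)  # MYKEY(CSV)/XXXX.CSV
```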