python urllib pydantic urldecode

Decode URL strings with Pydantic


I am using Pydantic to validate and type an incoming S3 Event in an AWS Lambda function.

The event looks like this (only including relevant bits):

{
  "Records": [
    {
      "s3": {
        "bucket": {
          "name": "my-bucket"
        },
        "object": {
          "key": "MYKEY%28CSV%29/XXXX.CSV"
        }
      }
    }
  ]
}

I define my models like this to extract the relevant information:

from pydantic import BaseModel

class ObjectInfo(BaseModel):
    key: str


class BucketInfo(BaseModel):
    name: str


class S3Schema(BaseModel):
    bucket: BucketInfo
    object: ObjectInfo


class Record(BaseModel):
    s3: S3Schema


class DeletionEvent(BaseModel):
    Records: list[Record]

def handler(event: dict, _):
    eventTyped = DeletionEvent(**event)
    return True

The problem is that the correct value for key is MYKEY(CSV)/XXXX.CSV, not MYKEY%28CSV%29/XXXX.CSV. I usually fix this with urllib.parse.unquote_plus, which decodes the %XX escapes that represent special characters. I could probably define a custom decoder, but that seems like overkill.
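For context, the manual fix I mean is roughly the following sketch. The trimmed-down event and the variable names raw_key and decoded_key are only for illustration.

```python
from urllib.parse import unquote_plus

# Trimmed-down sample event with the same shape as the one above.
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "MYKEY%28CSV%29/XXXX.CSV"},
            }
        }
    ]
}

raw_key = event["Records"][0]["s3"]["object"]["key"]
decoded_key = unquote_plus(raw_key)  # %28 -> "(", %29 -> ")"
print(decoded_key)  # MYKEY(CSV)/XXXX.CSV
```

This works, but it happens outside the model, which is what I would like to avoid.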

Is there any way to get Pydantic to do this decoding for me? It has a bunch of classes for working with URLs, but I don't see anything for decoding URL-encoded strings on their own.


Solution

  • I took my own advice and looked into building a custom decoder. It still feels like Pydantic should have a better way. Here is the solution I've found:

    from urllib.parse import unquote
    from typing_extensions import Annotated
    
    from pydantic import (
        BaseModel,
        EncodedStr,
        EncoderProtocol
    )
    
    # This is the class that will be used to "decode" my URL string
    class MyEncoder(EncoderProtocol):
        @classmethod
        def decode(cls, data: bytes) -> bytes:
            # unquote accepts bytes directly (Python 3.9+), whereas unquote_plus only accepts str.
            # Note that unquote does not turn "+" into a space, which may be a limitation
            # if your string encodes spaces as "+".
            return unquote(data).encode()
    
    MyEncodedStr = Annotated[str, EncodedStr(encoder=MyEncoder)]
    
    class ObjectInfo(BaseModel):
        key: MyEncodedStr
    
    
    class BucketInfo(BaseModel):
        name: str
    
    
    class S3Schema(BaseModel):
        bucket: BucketInfo
        object: ObjectInfo
    
    
    class Record(BaseModel):
        s3: S3Schema
    
    
    class DeletionEvent(BaseModel):
        Records: list[Record]
    
    event = {
      "Records": [
        {
          "s3": {
            "bucket": {
              "name": "my-bucket"
            },
            "object": {
              "key": "MYKEY%28CSV%29/XXXX.CSV"
            }
          }
        }
      ]
    }
    
    eventTyped = DeletionEvent(**event)
    

    This properly converts the URL-encoded string "MYKEY%28CSV%29/XXXX.CSV" to the plain string "MYKEY(CSV)/XXXX.CSV".
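On the caveat about encoded spaces in the code comment: the difference between unquote and unquote_plus only shows up when the string contains "+", which I believe is how S3 event notifications encode spaces in keys. The sketch below uses a made-up key (not from the event above) and a hypothetical helper, decode_with_plus, to show one way a decode method could handle "+" by converting the bytes to str first.

```python
from urllib.parse import unquote, unquote_plus

# Hypothetical key with a space encoded as "+" (not taken from the event above).
raw = "MY+KEY%28CSV%29/XXXX.CSV"

plain = unquote(raw)            # "+" is left alone
with_plus = unquote_plus(raw)   # "+" becomes a space

print(plain)      # MY+KEY(CSV)/XXXX.CSV
print(with_plus)  # MY KEY(CSV)/XXXX.CSV


def decode_with_plus(data: bytes) -> bytes:
    """Possible body for a decode method that also handles "+"."""
    # unquote_plus needs a str, so go bytes -> str, decode, then str -> bytes.
    return unquote_plus(data.decode()).encode()


print(decode_with_plus(raw.encode()))  # b'MY KEY(CSV)/XXXX.CSV'
```

If your keys can contain spaces, swapping this logic into the decoder is probably the safer choice.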

    My understanding:

    1. Pydantic first converts the str to bytes behind the scenes.
    2. MyEncoder.decode is called on the bytes object.
    3. urllib.parse.unquote decodes the URL-encoded bytes and returns a str.
    4. Pydantic expects decode to return a bytes object, so the str is encoded back to bytes.
    5. Pydantic converts the bytes object back to a str.
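The steps above can be mimicked in plain Python without Pydantic. This is only a sketch of my understanding, not Pydantic's actual code: simulate_encoded_str is a hypothetical stand-in, and my_decode repeats the logic of MyEncoder.decode. It assumes Python 3.9+, where unquote accepts bytes.

```python
from urllib.parse import unquote


def my_decode(data: bytes) -> bytes:
    """Same logic as MyEncoder.decode: bytes in, bytes out."""
    # Step 3: unquote returns a str. Step 4: encode it back to bytes.
    return unquote(data).encode()


def simulate_encoded_str(value: str) -> str:
    """Hypothetical stand-in for the round trip described in the steps."""
    as_bytes = value.encode()      # step 1: str -> bytes
    decoded = my_decode(as_bytes)  # steps 2-4: decode is called on bytes
    return decoded.decode()        # step 5: bytes -> str


result = simulate_encoded_str("MYKEY%28CSV%29/XXXX.CSV")
print(result)  # MYKEY(CSV)/XXXX.CSV
```

The double conversion is why decode has to call encode at the end even though the field itself is a str.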