python yaml pydantic pyyaml

How to export a Pydantic model instance as YAML with URL type as string


I have a Pydantic model with a field of type AnyUrl. When exporting the model to YAML, the AnyUrl is serialized as individual field slots, instead of a single string URL (perhaps due to how the AnyUrl.__repr__ method is implemented).

For example:

from pydantic import BaseModel, AnyUrl
import yaml

class MyModel(BaseModel):
    url: AnyUrl


data = {'url': 'https://www.example.com'}
model = MyModel.parse_obj(data)

y = yaml.dump(model.dict(), indent=4)
print(y)

Produces:

url: !!python/object/new:pydantic.networks.AnyUrl
    args:
    - https://www.example.com
    state: !!python/tuple
    - null
    -   fragment: null
        host: www.example.com
        host_type: domain
        password: null
        path: null
        port: null
        query: null
        scheme: https
        tld: com
        user: null

Ideally, I would like the serialized YAML to contain https://www.example.com instead of individual fields.

I have tried to override the __repr__ method of AnyUrl to return the AnyUrl object itself, as it extends the str class, but no luck.


Solution

  • Unfortunately, the pyyaml documentation is just horrendous, so seemingly elementary things like customizing (de-)serialization are a pain to figure out properly. But there are essentially two ways you could solve this.

    Option A: Subclass YAMLObject

    You had the right idea of subclassing AnyUrl, but the __repr__ method is irrelevant for YAML serialization. For that you need to do three things:

    1. Inherit from YAMLObject,
    2. define a custom yaml_tag, and
    3. override the to_yaml classmethod.

    Then pyyaml will serialize this custom class (that inherits from both AnyUrl and YAMLObject) in accordance with what you define in to_yaml.

    The to_yaml method always receives exactly two arguments:

    1. A yaml.Dumper instance with built-in capabilities to serialize standard types (via methods like represent_str for example) and
    2. the actual data to be serialized.

    To avoid adding/overriding additional methods, you can leverage the fact that AnyUrl inherits from str and that the underlying str.__new__ method receives the full URL during construction. Therefore the str.__str__ method will return that URL "as is".

    from pydantic import AnyUrl, BaseModel
    from yaml import Dumper, ScalarNode, YAMLObject, dump, safe_load
    
    
    class Url(AnyUrl, YAMLObject):
        yaml_tag = "!Url"
    
        @classmethod
        def to_yaml(cls, dumper: Dumper, data: str) -> ScalarNode:
            return dumper.represent_str(str.__str__(data))
    
    
    class MyModel(BaseModel):
        foo: int = 0
        url: Url
    
    
    obj = MyModel.parse_obj({"url": "https://www.example.com"})
    print(obj)
    
    serialized = dump(obj.dict()).strip()
    print(serialized)
    
    deserialized = MyModel.parse_obj(safe_load(serialized))
    print(deserialized == obj and isinstance(deserialized.url, Url))
    

    Output:

    foo=0 url=Url('https://www.example.com', scheme='https', host='www.example.com', tld='com', host_type='domain')
    
    foo: 0
    url: https://www.example.com
    
    True
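
    One thing to be aware of: by default, YAMLObject registers the representer only with the regular Dumper, so yaml.safe_dump would not know about the class. A minimal sketch of how that could be changed, assuming you prefer the safe dumper: YAMLObject reads the yaml_dumper and yaml_loader class attributes when the subclass is created. PlainUrl is a hypothetical str subclass standing in for the Url class from this option, so the snippet runs without pydantic.

```python
import yaml


class PlainUrl(str, yaml.YAMLObject):
    """Hypothetical stand-in for a Url(AnyUrl, YAMLObject) class."""

    yaml_tag = "!Url"
    # Register with the safe dumper/loader instead of the default ones.
    yaml_dumper = yaml.SafeDumper
    yaml_loader = yaml.SafeLoader

    @classmethod
    def to_yaml(cls, dumper: yaml.SafeDumper, data: str) -> yaml.ScalarNode:
        # Emit a plain string scalar; the "!Url" tag is not written out.
        return dumper.represent_str(str.__str__(data))


serialized = yaml.safe_dump({"foo": 0, "url": PlainUrl("https://www.example.com")})
print(serialized)
```

    Since to_yaml emits an ordinary string scalar, loading the result gives back a plain str, and it is the model validation that turns it into a URL object again.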
    

    Option B: Register a representer function for AnyUrl

    You can avoid defining your own subclass and instead globally register a function that defines how instances of AnyUrl should be serialized, by using the yaml.add_representer function.

    That function takes two mandatory arguments:

    1. The class for which you want to define your custom serialization behavior and
    2. the representer function that defines that serialization behavior.

    The representer function essentially has to have the same signature as the YAMLObject.to_yaml classmethod presented in option A, i.e. it takes a Dumper instance and the data to be serialized as arguments.

    from pydantic import AnyUrl, BaseModel
    from yaml import Dumper, ScalarNode, add_representer, dump, safe_load
    
    
    def url_representer(dumper: Dumper, data: AnyUrl) -> ScalarNode:
        return dumper.represent_str(str.__str__(data))
    
    
    add_representer(AnyUrl, url_representer)
    
    
    class MyModel(BaseModel):
        foo: int = 0
        url: AnyUrl
    
    
    obj = MyModel.parse_obj({"url": "https://www.example.com"})
    print(obj)
    
    serialized = dump(obj.dict()).strip()
    print(serialized)
    
    deserialized = MyModel.parse_obj(safe_load(serialized))
    print(deserialized == obj and isinstance(deserialized.url, AnyUrl))
    

    Output is the same as with the code from option A.
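
    A caveat with add_representer is that pyyaml looks the representer up by the exact type of the object, so subclasses (such as HttpUrl, which derives from AnyUrl in the pydantic version used here) are not covered. A sketch of how yaml.add_multi_representer could handle that, since it matches along the class hierarchy. FakeAnyUrl and FakeHttpUrl are hypothetical stand-ins, so the snippet needs only pyyaml.

```python
import yaml


class FakeAnyUrl(str):
    """Hypothetical stand-in for pydantic's AnyUrl."""


class FakeHttpUrl(FakeAnyUrl):
    """Hypothetical stand-in for a subclass such as HttpUrl."""


def url_representer(dumper: yaml.Dumper, data: FakeAnyUrl) -> yaml.ScalarNode:
    return dumper.represent_str(str.__str__(data))


data = {"url": FakeHttpUrl("https://www.example.com")}

# Exact-type registration: matched against type(data) only, so the subclass misses it.
yaml.add_representer(FakeAnyUrl, url_representer)
exact_only = yaml.dump(data)

# Multi-representer: matched along the MRO, so subclasses are covered too.
yaml.add_multi_representer(FakeAnyUrl, url_representer)
with_subclasses = yaml.dump(data)

print(exact_only)
print(with_subclasses)
```

    The first dump still contains a python/object tag for the subclass, while the second one prints the URL as a plain string.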

    The benefit of option B over option A is that it involves less code and avoids potential namespace collisions between the two parent classes.

    A potential drawback is that it modifies a global setting for the entire runtime of the program. This can become less transparent as your application grows, and it is something to be aware of in case you decide to serialize AnyUrl objects differently at some point.
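
    If the global registration is a concern, one possible way around it is to register the representer on your own Dumper subclass and pass that subclass explicitly when dumping. This is only a sketch: UrlDumper and FakeUrl are hypothetical names, with FakeUrl standing in for AnyUrl so that the snippet needs only pyyaml.

```python
import yaml


class FakeUrl(str):
    """Hypothetical stand-in for pydantic's AnyUrl."""


class UrlDumper(yaml.SafeDumper):
    """Dumper subclass that carries its own representer table."""


def url_representer(dumper: yaml.SafeDumper, data: FakeUrl) -> yaml.ScalarNode:
    return dumper.represent_str(str.__str__(data))


# add_representer is a classmethod; it copies the inherited table on first use,
# so SafeDumper itself is left untouched.
UrlDumper.add_representer(FakeUrl, url_representer)

data = {"foo": 0, "url": FakeUrl("https://www.example.com")}
scoped = yaml.dump(data, Dumper=UrlDumper)
print(scoped)
```

    Only calls that pass Dumper=UrlDumper use the custom behavior, so other parts of the program can still serialize URL objects differently.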