python list performance duplicates pydantic

Removing duplicates from a list of pydantic objects


I am trying to remove duplicates from a list of pydantic objects. The hash-based approaches fail because the objects are unhashable, and the only method that works is very slow.

Is there a faster way to remove duplicates than my method?

Code:

Pydantic model (a.py)

from pydantic import BaseModel


class Photo(BaseModel):
    title: str
    url: str

Main file (b.py)

from collections import OrderedDict
from a import Photo

#  3 objects; a_obj and c_obj are identical
a_obj = {
    'title': 'SOME TITLE v1',
    'url': 'http://some.url'
}
b_obj = {
    'title': 'SOME TITLE v2',
    'url': 'http://different.url'
}
c_obj = {
    'title': 'SOME TITLE v1',
    'url': 'http://some.url'
}

#  Creating list of pydantic objects
pd_obj_list = [Photo(**a_obj), Photo(**b_obj), Photo(**c_obj)]

#  My Attempts to Remove Duplicates

#  Using OrderedDict.fromkeys
final_list_0 = list(OrderedDict.fromkeys(pd_obj_list))
#  raises TypeError: unhashable type: 'Photo'

#  Using set
final_list_1 = list(set(pd_obj_list))
#  raises TypeError: unhashable type: 'Photo'

#  Using enumerate
final_list_2 = [i for n, i in enumerate(pd_obj_list) if i not in pd_obj_list[:n]]
#  It works, but it is too slow when the list has ~10k objects

Solution

  • Key an OrderedDict on the fields that identify a duplicate:

    pd_obj_list = [Photo(**a_obj), Photo(**b_obj), Photo(**c_obj)]
    final_list_0 = list(OrderedDict(((photo.title, photo.url), photo) for photo in pd_obj_list).values())
    print(final_list_0)
    

    Output

    [Photo(title='SOME TITLE v1', url='http://some.url'), Photo(title='SOME TITLE v2', url='http://different.url')]
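
The same idea can be wrapped in a small reusable helper. This is only a minimal sketch: `dedupe_by_key` and `PhotoStandIn` are hypothetical names, and `PhotoStandIn` is a plain dataclass used in place of the pydantic `Photo` model so that the snippet runs without pydantic installed. Like `Photo`, it is unhashable by default. The helper makes a single pass over the list, so it avoids the repeated slice scans of the `enumerate` attempt.

```python
from dataclasses import dataclass


# Stand-in for the pydantic Photo model (assumption, for illustration only).
# A dataclass with eq=True and no frozen=True is unhashable, just like Photo.
@dataclass
class PhotoStandIn:
    title: str
    url: str


def dedupe_by_key(items, key):
    """Return items without duplicates, keeping the first occurrence of each key."""
    unique = {}
    for item in items:
        # setdefault stores the item only if its key has not been seen yet
        unique.setdefault(key(item), item)
    # dicts preserve insertion order (guaranteed since Python 3.7)
    return list(unique.values())


photos = [
    PhotoStandIn('SOME TITLE v1', 'http://some.url'),
    PhotoStandIn('SOME TITLE v2', 'http://different.url'),
    PhotoStandIn('SOME TITLE v1', 'http://some.url'),
]
unique_photos = dedupe_by_key(photos, key=lambda p: (p.title, p.url))
print(len(unique_photos))  # 2
```

With the real model, the call would presumably look the same: `dedupe_by_key(pd_obj_list, key=lambda p: (p.title, p.url))`.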
    

    If Photo is immutable (so its hash cannot change while it is used as a dict key), you could define __hash__ as follows:

    from collections import OrderedDict
    
    from pydantic import BaseModel
    
    
    class Photo(BaseModel):
        title: str
        url: str
    
        def __hash__(self):
            return hash((self.title, self.url))
    
    
    #  3 objects; a_obj and c_obj are identical
    a_obj = {
        'title': 'SOME TITLE v1',
        'url': 'http://some.url'
    }
    b_obj = {
        'title': 'SOME TITLE v2',
        'url': 'http://different.url'
    }
    c_obj = {
        'title': 'SOME TITLE v1',
        'url': 'http://some.url'
    }
    
    pd_obj_list = [Photo(**a_obj), Photo(**b_obj), Photo(**c_obj)]
    final_list_0 = list(OrderedDict.fromkeys(pd_obj_list))
    print(final_list_0)
    

    Output

    [Photo(title='SOME TITLE v1', url='http://some.url'), Photo(title='SOME TITLE v2', url='http://different.url')]
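
As an alternative to writing `__hash__` by hand, pydantic can generate it when the model is declared frozen. The sketch below assumes pydantic 1.8 or later, where the `frozen` config option is available. It uses the class-based `Config`, which pydantic 2 still accepts but reports as deprecated in favour of `model_config = ConfigDict(frozen=True)`. The plain `dict.fromkeys` call relies on dicts preserving insertion order (Python 3.7+) and is a swapped-in replacement for `OrderedDict.fromkeys`.

```python
from pydantic import BaseModel


class Photo(BaseModel):
    title: str
    url: str

    class Config:
        # frozen=True makes instances immutable and lets pydantic generate __hash__
        # (assumes pydantic >= 1.8; pydantic 2 prefers model_config = ConfigDict(frozen=True))
        frozen = True


pd_obj_list = [
    Photo(title='SOME TITLE v1', url='http://some.url'),
    Photo(title='SOME TITLE v2', url='http://different.url'),
    Photo(title='SOME TITLE v1', url='http://some.url'),
]

# Equal frozen models hash equally, so they collapse to a single dict key;
# the first occurrence of each one is kept, in its original position.
final_list = list(dict.fromkeys(pd_obj_list))
print(final_list)
```

Because the model is frozen, assigning to `photo.title` after construction raises an error, which is what keeps the hash stable.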