sql postgresql euclidean-distance large-data-volumes

Optimizing a large SQL query for Euclidean distance calculations between faces

I am calculating the Euclidean distance between faces and want to store the results in a table.

Current setup:

Each face is stored in the objects table, and the distances between faces are stored in the face_distances table.
The objects table has, among others, the following columns: objects_id, face_encodings, description
The face_distances table has the following columns: face_from, face_to, distance

In my data set I have around 22 231 face objects, which results in 494 217 361 ordered pairs of faces (22 231 squared), although I understand this could be roughly halved because

distance(face_from, face_to) = distance(face_to, face_from)
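As a quick sanity check on those counts, the arithmetic can be run directly in Postgres. This is only a sketch using the face count quoted above (22 231); the column aliases are made up for illustration and no tables are involved.

```sql
-- Pair counts for n = 22231 faces (plain integer arithmetic)
select 22231 * 22231     as ordered_pairs_with_self,  -- n^2       = 494217361
       22231 * 22230     as ordered_pairs_no_self,    -- n(n-1)    = 494195130
       22231 * 22230 / 2 as unordered_pairs;          -- n(n-1)/2  = 247097565
```

The last figure is the number of rows actually needed if each pair is stored once.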

The database is Postgres 12.

The query below inserts the pairs of faces that are not yet in the table (without performing the distance calculation), but its execution time is very long: it started 4 days ago and is still not done. Is there a way to optimize it?

-- public.objects definition

-- Drop table

-- DROP TABLE public.objects;

CREATE TABLE public.objects 
(
  objects_id int4 NOT NULL DEFAULT nextval('objects_in_image_objects_id_seq'::regclass),
  filefullname varchar(2303) NULL,
  bbox varchar(255) NULL,
  description varchar(255) NULL,
  confidence numeric NULL,
  analyzer varchar(255) NOT NULL DEFAULT 'object_detector'::character varying,
  analyzer_version int4 NOT NULL DEFAULT 100,
  x int4 NULL,
  y int4 NULL,
  w int4 NULL,
  h int4 NULL,
  image_id int4 NULL,
  derived_from_object int4 NULL,
  object_image_filename varchar(2023) NULL,
  face_encodings _float8 NULL,
  face_id int4 NULL,
  face_id_iteration int4 NULL,
  text_found varchar NULL COLLATE "C.UTF-8",
  CONSTRAINT objects_in_image_pkey PRIMARY KEY (objects_id),
  CONSTRAINT objects_in_images FOREIGN KEY (objects_id) REFERENCES public.objects(objects_id)
);

CREATE TABLE public.face_distances 
(
  face_from int8 NOT NULL,
  face_to int8 NOT NULL,
  distance float8 NULL,
  CONSTRAINT face_distances_pk PRIMARY KEY (face_from, face_to)
);


-- public.face_distances foreign keys

ALTER TABLE public.face_distances ADD CONSTRAINT face_distances_fk
  FOREIGN KEY (face_from) REFERENCES public.objects(objects_id);
ALTER TABLE public.face_distances ADD CONSTRAINT face_distances_fk_1
  FOREIGN KEY (face_to) REFERENCES public.objects(objects_id);

-- Indexes

CREATE UNIQUE INDEX objects_in_image_pkey ON public.objects USING btree (objects_id);
CREATE INDEX objects_description_column ON public.objects USING btree (description);
CREATE UNIQUE INDEX face_distances_pk ON public.face_distances USING btree (face_from, face_to);

-- Query to add all pairs of faces that are not already in the table

insert into face_distances (face_from,face_to) 
select t1.face_from , t1.face_to 
from (
   select f_from.objects_id face_from,
   f_from.face_encodings face_from_encodings, 
   f_to.objects_id face_to, 
   f_to.face_encodings face_to_encodings 
   from objects f_from, 
        objects f_to 
   where f_from.description = 'face' 
   and f_to.description = 'face' ) as t1
left join face_distances on (
t1.face_from= face_distances.face_from 
and t1.face_to = face_distances.face_to )
where face_distances.face_from is null;

Solution

Try this simplified query. It took only 5 minutes on my Apple M1 with SQL Server for 22231 'face' objects, and it generated 247 097 565 pairs, which is exactly C(22231, 2). The syntax is compatible with PostgreSQL.

Optimizations: an explicit join instead of the old comma-separated join syntax; ranking functions to remove duplicate permutations, since (A,B) = (B,A); and the final left join on face_distances removed. Recomputing into an empty table is a lot faster than checking for existence, because an index lookup would be performed for each key pair.

insert into face_distances (face_from,face_to)
select f1,f2
from(
    select --only needed fields here as this will fill temporary tables
         f1.objects_id f1
        ,f2.objects_id f2
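        -- rank1: position of f1 among all faces
        ,dense_rank() over (order by f1.objects_id) rank1
        -- rank2: position of f1 among the faces paired with this f2 (f2 itself is excluded)
        ,rank() over (partition by f2.objects_id order by f1.objects_id) rank2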
    from objects f1
           -- generates all permutations
    join objects f2 on f2.objects_id <> f1.objects_id and f2.description = 'face'
    where f1.description = 'face'
  )a
where rank2 >= rank1   -- removes duplicate permutations: true only when f1 < f2
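
Because the ranking filter keeps exactly the rows where f1.objects_id is smaller than f2.objects_id, the same set of pairs can probably be produced without window functions by putting that comparison directly in the join condition. The following is a minimal sketch under that assumption, not something timed on the data set above. Like the query above, it assumes face_distances starts out empty.

```sql
-- Sketch: one row per unordered pair of faces, smaller objects_id first.
-- Assumes face_distances is empty before running.
insert into face_distances (face_from, face_to)
select f1.objects_id, f2.objects_id
from objects f1
join objects f2
  on f2.objects_id > f1.objects_id   -- keeps (A,B), drops (B,A) and (A,A)
 and f2.description = 'face'
where f1.description = 'face';
```

For 22231 faces this should insert 247 097 565 rows. If the table is not empty, appending `on conflict (face_from, face_to) do nothing` (available since PostgreSQL 9.5) would skip pairs that are already stored, at the cost of a primary key lookup per inserted row.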