postgresqlpg-trgm

How to group by similar values with pg_trgm


I have the following table

id error
-  ----------------------------------------
1  Error 1234eee5, can not write to disk
2  Error 83457qwe, can not write to disk
3  Error 72344ee, can not write to disk
4  Fatal barier breach on object 72fgsff
5  Fatal barier breach on object 7fasdfa
6  Fatal barier breach on object 73456xcc5

I want to be able to get a result that counts by similarity, where similarity of > 80% means two errors are equal. I've been using pg_trgm extension, and its similarity function works perfectly for me, the only thing I can figure out how to produce the grouping result below.

Error                                  Count
-------------------------------------  ------
Error 1234eee5, can not write to disk, 3
Fatal barier breach on object 72fgsff, 3

Solution

  • Basically you could join a table with itself to find similar strings, however this approach will end in a terribly slow query on a larger dataset. Also, using similarity() may cause inaccuracy in some cases (you need to find the appropriate limit value).

    You should try to find patterns. For example, if all variable words in strings begin with a digit, you can mask them using regexp_replace():

    select id, regexp_replace(error, '\d\w+', 'xxxxx') as error
    from errors;
    
     id |                error                
    ----+-------------------------------------
      1 | Error xxxxx, can not write to disk
      2 | Error xxxxx, can not write to disk
      3 | Error xxxxx, can not write to disk
      4 | Fatal barier breach on object xxxxx
      5 | Fatal barier breach on object xxxxx
      6 | Fatal barier breach on object xxxxx
    (6 rows)    
    

    so you can easily group the data by error message:

    select regexp_replace(error, '\d\w+', 'xxxxx') as error, count(*)
    from errors
    group by 1;
    
                    error                | count 
    -------------------------------------+-------
     Error xxxxx, can not write to disk  |     3
     Fatal barier breach on object xxxxx |     3
    (2 rows)
    

    The above query is only an example as the specific solution depends on the data format.

    Using pg_trgm

    The solution based on the OP's idea (see the comments below). The limit 0.8 for similarity() is certainly too high. It seems that it should be somewhere about 0.6.

    The table for unique errors (I've used a temporary table but it also be a regular one of course):

    create temp table if not exists unique_errors(
        id serial primary key, 
        error text, 
        ids int[]);
    

    The ids column is to store id of rows of the base table which contain similar errors.

    do $$
    declare
        e record;
        found_id int;
    begin
        truncate unique_errors;
        for e in select * from errors loop
            select min(id)
            into found_id
            from unique_errors u
            where similarity(u.error, e.error) > 0.6;
            if found_id is not null then
                update unique_errors
                set ids = ids || e.id
                where id = found_id;
            else
                insert into unique_errors (error, ids)
                values (e.error, array[e.id]);
            end if;
        end loop;
    end $$;
    

    The final results:

    select *, cardinality(ids) as count
    from unique_errors;
    
     id |                 error                 |   ids   | count 
    ----+---------------------------------------+---------+-------
      1 | Error 1234eee5, can not write to disk | {1,2,3} |     3
      2 | Fatal barier breach on object 72fgsff | {4,5,6} |     3
    (2 rows)