sqlpostgresqlstatistics

PostgreSQL solution to work easier on a specific row with multiple columns


I have a table where I want to do some statistics on each rows different columns. For example:

id NW2020 NW2021 NW2022 NW2023 NW2024 CA2020 ...
1 -299 4000 21 -325 2544 55
2 54 244 -5 -54 325 874

How can I have a table with the frequency of negative values of each Net Worth? Is there a way that I can ask for "NW%" columns? (a kind of GROUP BY but by columns and not by rows)

The only (UGLY) solution I had in mind was this one :

SELECT *, 
((CASE NW2020 < 0 THEN 1 ELSE 0 END) + (CASE WHEN NW2021 < 0 THEN 1 ELSE 0 END) + (CASE WHEN NW2022 < 0 THEN 1 ELSE 0 END) + ... )::double precision/
    (CASE WHEN NW2020 is not null THEN 1 ELSE 0 END) + (CASE WHEN NW2021 is not null THEN 1 ELSE 0 END) + (CASE WHEN NW2022  is not null THEN 1 ELSE 0 END) + ... )as freq_negative
FROM a

(I know coalesce but it doesn't short it enough for the amount of columns I have.

I had the same issue calculating the average and the standard deviation.

Is my DB and table organization bad?

Thanks in advance


Solution

  • If improving the database structure is not a viable solution for some reason and you are absolutely forced to do this (don't do it if you are not forced by some requirement or obstacle), then you can create a stored function, let's fall it f which:

    Then your expression would simplify to

    SELECT *, 
    ((f(NW2020)) + (f(NW2021)) + (f(NW2022)) + ... )::double precision/
        (g(NW2020)) + (g(NW2021)) + (g(NW2022)) + ... )as freq_negative
    FROM a
    

    where g is another stored function created similarly. It would look much better, but still not ideal because the database is not planned well.

    Here's how you can create a stored function in PostgreSQL.

    But, the solution above is only a bandage and the solution would be to fix your database schema. NW2020, NW2021, NW2022, ... already breaks 1NF, particularly the principle of

    There are no repeating groups or arrays in any row.

    Because NW is something in essence and having a field for each year makes it redundant and the structure volatile. You've encountered this issue from the selection perspective, but as years pass if you leave this unchanged, you will always need to alter the tables add fields and adjusts all selections that want all years.

    Instead, create an NW table like:

    NW(y, ...)
    

    where y represents the year. And one for CA too

    CA(y, ...)
    

    and whatever your a is, don't store repetitive values inside of it. Because this is a many-to-many relation, that is, an a has multiple references to MW and CA, whilst an MW and a CA is being references by multiple a records. So you need mapping tables, like

    a_MW(id, aid, y, value)
    

    and

    a_CA(id, aid, y, value)
    

    and then you can do this:

    SELECT SUM(f(a_MW.value)) AS first_field, SUM(g(a.MW.value)) AS second_field
    FROM a
    LEFT JOIN a_MW
    ON a.id = a_MW.aid
    LEFT JOIN MW
    ON a_MW.y = MW.y
    GROUP BY a.id
    

    The SUM function is an aggregation and since these are similar values, you can aggregate them as you please.