sqlpostgresqlpostgres-crosstab

Postgres query that shows the results from 2 columns with the highest counts per month


I have a database with about 2500 results from 2022. I'm trying to build queries that will show me the top industry and country results from each month throughout the entire year. The preferred output would include the top X (number) of each category. I'm fine with breaking this out into two queries, one for each category, if needed.

The table looks like this:

post_id int
post_date date
post_victim text
threat_group text
victim_industry text
victim_country text

The idea here is that I can use these queries to put together data to brief teammates or generate graphs for visualizing the resulting data.

Here's an example of what I'd like to output via a SQL query (shortened to just Jan-Apr, but the final result should cover the entire year):

Industry Jan Feb Mar Apr
Healthcare 95 79 58 12
Manufacturing 45 90 72 65
Finance 31 48 21 73
Education 63 70 46 92
Technology 86 34 99 25
Construction 53 73 16 29
Retail 70 90 95 73
Government 95 73 21 58
Agriculture 34 53 61 99
Transportation 60 89 22 38
Hospitality 70 38 96 79
Legal 89 67 92 43

I tried using crosstab and specifying the following as ct (...:

SELECT *
FROM crosstab(
  $$
  SELECT to_char(post_date, 'YYYY-MM') as month, victim_industry, count(*)
  FROM ransomware_posts
  WHERE post_date >= '2022-01-01' AND post_date < '2023-01-01'
  GROUP BY month, victim_industry
  ORDER BY month, count DESC
  $$
) as ct (month text, industry_1 text, count_1 int, industry_2 text, count_2 int)

I think the issue may have something to do with the fact that there are more than 2 results per month, so the output is receiving more results than it's expecting, but I don't know for sure as I've never built a functioning crosstab query.

Here's the error I receive:

ERROR:  return and sql tuple descriptions are incompatible
SQL state: 42601

Solution

  • The error message comes from a gap between the structure resulting ffrom the internal query and the structure returned by the crosstab function.

    This query should work :

    SELECT *
    FROM crosstab($$
      SELECT victim_industry, to_char(post_date, 'YYYY-MM') :: text as month, count(*) :: int
      FROM ransomware_posts
      WHERE post_date >= '2021-01-01' AND post_date < '2022-01-01'
      GROUP BY victim_industry, month
      ORDER BY victim_industry, month $$
    ) as ct (industry text, Jan int, Feb int, Mar int, Apr int, May int, Jun int, Jul int, Aug int, Sep int, Oct int, Nov int, Dec int)
    

    UPDTATE

    crosstab doesn't manage well the gaps if any.

    There is another solution which relies on a composite type and the standard function jsonb_populate_record :

    First creating a dedicated composite type :

    CREATE TYPE months AS (jan int, feb int, mar int, apr int, jun int, jul int, aug int, sep int, oct int, nov int, dec int)
    

    Then the following query provides the expected result :

    SELECT r.victim_industry, (jsonb_populate_record(null :: months, jsonb_object_agg(lower(r.date), r.count))).*
      FROM 
         ( SELECT victim_industry, to_char(post_date, 'Mon') AS date, count(*) AS count
             FROM ransomware_posts
            WHERE post_date >= '2021-01-01' AND post_date < '2022-01-01'
            GROUP BY victim_industry, to_char(post_date, 'Mon')
        ) AS r
    GROUP BY r.victim_industry
    

    see dbfiddle