sqlpostgresqlpostgresql-9.1crosstabgenerate-series

Transform long rows to wide, filling all cells


I have long format data on businesses, with a row for each occurrence of a move to a different location, keyed on business id -- there can be several move events for any one business establishment.

I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc module.

+-------------+-----------+---------+---------+
| business_id | year_move |  long   |   lat   |
+-------------+-----------+---------+---------+
|   001013580 |      1991 | 71.0557 | 42.3588 |
|   001015924 |      1993 | 71.0728 | 42.3504 |
|   001015924 |      1996 | -122.28 | 37.654  |
|   001020684 |      1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+

Then I transform like so:

SELECT longbyyear.*
FROM crosstab($$
    SELECT 
    business_id, 
    year_move, 
    max(longitude::float)
    from business_moves
    where year_move::int between 1991 and 2010 
    group by business_id, year_move
    order by business_id, year_move;
    $$
) 
AS longbyyear(biz_id character varying, "long91" float,"long92" float,"long93" float,"long94" float,"long95" float,"long96" float,"long97" float, "long98" float, "long99" float,"long00" float,"long01" float,
"long02" float,"long03" float,"long04" float,"long05" float, 
"long06" float, "long07" float, "long08" float, "long09" float, "long10" float);

And it --mostly-- gets me to the desired output.

+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id  |  long91  |  long92  |  long93  | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … |        |        |        |
| 1000678 | 118.224  |          |          |        | … |        |        |        |
| 1002158 | 121.98   |          |          |        | … |        |        |        |
| 1004092 | 71.2384  |          |          |        | … |        |        |        |
| 1007801 | 118.0312 |          |          |        | … |        |        |        |
| 1007855 | 71.1769  |          |          |        | … |        |        |        |
| 1008697 | 71.0394  | 71.0358  |          |        | … |        |        |        |
| 1008986 | 71.1013  |          |          |        | … |        |        |        |
| 1009617 | 119.9965 |          |          |        | … |        |        |        |
+---------+----------+----------+----------+--------+---+--------+--------+--------+

The only snag is that I would ideally have populated values for each year and not just have values in move years. Thus all fields would be populated, with a value for each year, with the most recent address carrying over to the next year. I could hack this with manual updates if each is blank, use the previous column, I just wondered if there was a clever way to do it either with the crosstab() function, or some other way, possibly coupled with a custom function.


Solution

  • In order to get the current location for each business_id for any given year you need two things:

    1. A parameterized query to select the year, implemented as a SQL language function.
    2. A dirty trick to aggregate on year, group by the business_id, and leave the coordinates untouched. That is done by a sub-query in a CTE.

    The function then looks like this:

    CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
      WITH last_move AS (
        SELECT business_id, MAX(year_move) AS yr
        FROM business_moves
        WHERE year_move <= $1
        GROUP BY business_id)
      SELECT lm.business_id, $1::int AS yr, longitude, latitude
      FROM business_moves bm, last_move lm
      WHERE bm.business_id = lm.business_id
      AND bm.year_move = lm.yr;
    $$ LANGUAGE sql;
    

    The sub-query selects only the most recent moves for every business location. The main query then adds the longitude and latitude columns and put the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need to have a record in this table that gives the establishment and initial location of each business_id or it will not show up until after it has moved somewhere else.

    Call this function with the usual SELECT * FROM business_location_in_year_x(1997). See also the SQL fiddle.

    If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab() function.