I have long format data on businesses, with a row for each occurrence of a move to a different location, keyed on business id -- there can be several move events for any one business establishment.
I wish to reshape to a wide format, which is typically cross-tab territory per the tablefunc
module.
+-------------+-----------+---------+---------+
| business_id | year_move | long | lat |
+-------------+-----------+---------+---------+
| 001013580 | 1991 | 71.0557 | 42.3588 |
| 001015924 | 1993 | 71.0728 | 42.3504 |
| 001015924 | 1996 | -122.28 | 37.654 |
| 001020684 | 1992 | 84.3381 | 33.5775 |
+-------------+-----------+---------+---------+
Then I transform like so:
SELECT longbyyear.*
FROM crosstab($$
SELECT
business_id,
year_move,
max(longitude::float)
from business_moves
where year_move::int between 1991 and 2010
group by business_id, year_move
order by business_id, year_move;
$$
)
AS longbyyear(biz_id character varying, "long91" float,"long92" float,"long93" float,"long94" float,"long95" float,"long96" float,"long97" float, "long98" float, "long99" float,"long00" float,"long01" float,
"long02" float,"long03" float,"long04" float,"long05" float,
"long06" float, "long07" float, "long08" float, "long09" float, "long10" float);
And it --mostly-- gets me to the desired output.
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| biz_id | long91 | long92 | long93 | long94 | … | long08 | long09 | long10 |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
| 1000223 | 121.3784 | 121.3063 | 121.3549 | 82.821 | … | | | |
| 1000678 | 118.224 | | | | … | | | |
| 1002158 | 121.98 | | | | … | | | |
| 1004092 | 71.2384 | | | | … | | | |
| 1007801 | 118.0312 | | | | … | | | |
| 1007855 | 71.1769 | | | | … | | | |
| 1008697 | 71.0394 | 71.0358 | | | … | | | |
| 1008986 | 71.1013 | | | | … | | | |
| 1009617 | 119.9965 | | | | … | | | |
+---------+----------+----------+----------+--------+---+--------+--------+--------+
The only snag is that I would ideally have populated values for each year and not just have values in move years. Thus all fields would be populated, with a value for each year, with the most recent address carrying over to the next year. I could hack this with manual updates if each is blank, use the previous column, I just wondered if there was a clever way to do it either with the crosstab()
function, or some other way, possibly coupled with a custom function.
In order to get the current location for each business_id for any given year you need two things:
The function then looks like this:
CREATE FUNCTION business_location_in_year_x (int) RETURNS SETOF business_moves AS $$
WITH last_move AS (
SELECT business_id, MAX(year_move) AS yr
FROM business_moves
WHERE year_move <= $1
GROUP BY business_id)
SELECT lm.business_id, $1::int AS yr, longitude, latitude
FROM business_moves bm, last_move lm
WHERE bm.business_id = lm.business_id
AND bm.year_move = lm.yr;
$$ LANGUAGE sql;
The sub-query selects only the most recent moves for every business location. The main query then adds the longitude and latitude columns and put the requested year in the returned table, rather than the year in which the most recent move took place. One caveat: you need to have a record in this table that gives the establishment and initial location of each business_id or it will not show up until after it has moved somewhere else.
Call this function with the usual SELECT * FROM business_location_in_year_x(1997)
. See also the SQL fiddle.
If you really need a crosstab then you can tweak this code around to give you the business location for a range of years and then feed that into the crosstab()
function.