postgresqlloopsgtfs

Getting amount of active trips via postgresql loop


I am searching for a way to get the amount of active trips in a GTFS feed for each minute of a day through a postgresql query.

For every trip I have the start and endtimes (in seconds) in a denormalized table. It looks something like this:

denormalized_trips table

The query to give me trips that are active for a given timeframe (e.g. here 43000 to 43600 seconds) looks like this:

SELECT
COUNT(trips.trip_id)
FROM denormalized_trips AS trips

LEFT JOIN gtfs_calendar_dates AS calendar_dates
ON calendar_dates.service_id = trips.service_id
AND calendar_dates.agency_key = trips.agency_key
AND date = '2017-07-03'
AND exception_type = 1

INNER JOIN gtfs_calendar AS calendar
  ON trips.service_id = calendar.service_id
  AND calendar.agency_key = trips.agency_key
  AND calendar.wednesday = 1

WHERE (
  trip_start_time BETWEEN 46800 AND 47100
  AND '2017-07-03' BETWEEN calendar.start_date AND calendar.end_date
)
AND NOT EXISTS (
  SELECT 0
  FROM gtfs_calendar_dates AS date_exceptions
  WHERE date = '2017-07-03'
  AND date_exceptions.agency_key = trips.agency_key
  AND date_exceptions.service_id = calendar.service_id
  AND exception_type = 2
);

This will result in 12 trips that will start between 13:00pm and 13:05pm.

Now I want to do that for the whole day. I want to get the amount of trips that get active in an interval of e.g. 1 minute or maybe more like 5 minutes. I tried it with a loop but this seems to just give me 1 result back. Here is what I came up with for now:

CREATE OR REPLACE FUNCTION get_active_trips(n int)
RETURNS INTEGER AS
$BODY$
DECLARE 
 count INTEGER;
BEGIN 
  FOR counter IN 43130..50000 BY 60 LOOP
    SELECT
    COUNT(trips.trip_id)
    INTO count
    FROM denormalized_trips AS trips

    LEFT JOIN gtfs_calendar_dates AS calendar_dates
    ON calendar_dates.service_id = trips.service_id
    AND calendar_dates.agency_key = trips.agency_key
    AND date = '2017-07-03'
    AND exception_type = 1

    INNER JOIN gtfs_calendar AS calendar
      ON trips.service_id = calendar.service_id
      AND calendar.agency_key = trips.agency_key
      AND calendar.wednesday = 1

    WHERE (
      trip_start_time BETWEEN counter AND counter + 60
      AND '2017-07-03' BETWEEN calendar.start_date AND calendar.end_date
    )
    AND NOT EXISTS (
      SELECT 0
      FROM gtfs_calendar_dates AS date_exceptions
      WHERE date = '2017-07-03'
      AND date_exceptions.agency_key = trips.agency_key
      AND date_exceptions.service_id = calendar.service_id
      AND exception_type = 2
    );
  END LOOP;
  RETURN count;
END; 
$BODY$ LANGUAGE plpgsql STABLE;

The result of calling SELECT get_active_trips(1);

sum of entries

Now I would like to get something like a table or an array of results back instead of just 1 entry. How would I do that?

Any help is highly appreciated.


Solution

  • There are two syntaxes to create function returning set of values: returns setof <type> and returns table(<columns definition>). Documentation.

    There are also several ways to return those values from the plpgsql function: return next or return query. Documentation.

    So if you want just get the series of integers you could to rewrite your function in that way:

    CREATE OR REPLACE FUNCTION get_active_trips(n int)
    RETURNS SETOF INTEGER AS
    $BODY$
    DECLARE 
     count INTEGER;
    BEGIN 
      FOR counter IN 43130..50000 BY 60 LOOP
        SELECT
        COUNT(trips.trip_id)
        INTO count
        FROM denormalized_trips AS trips
        <rest of query here>
        ;
        RETURN NEXT count;
      END LOOP;
      RETURN;
    END; 
    $BODY$ LANGUAGE plpgsql STABLE;
    

    or, using RETURN QUERY:

    CREATE OR REPLACE FUNCTION get_active_trips(n int)
    RETURNS SETOF INTEGER AS
    $BODY$
    BEGIN 
      FOR counter IN 43130..50000 BY 60 LOOP
        RETURN QUERY
          SELECT
          COUNT(trips.trip_id)
          FROM denormalized_trips AS trips
          <rest of query here>
          ;
      END LOOP;
      RETURN;
    END; 
    $BODY$ LANGUAGE plpgsql STABLE;
    

    In case if you want to return more then one column:

    CREATE OR REPLACE FUNCTION get_active_trips(n int)
    RETURNS TABLE (counter_value int, active_trips_count int) AS
    $BODY$
    BEGIN 
      FOR counter IN 43130..50000 BY 60 LOOP
        SELECT
        COUNT(trips.trip_id)
        INTO active_trips_count
        FROM denormalized_trips AS trips
        <rest of query here>
        ;
        counter_value := counter;
        RETURN NEXT; -- There is no parameters, current values of counter_value and active_trips_count will be returned
      END LOOP;
      RETURN;
    END; 
    $BODY$ LANGUAGE plpgsql STABLE;
    

    or, using RETURN QUERY:

    CREATE OR REPLACE FUNCTION get_active_trips(n int)
    RETURNS TABLE (counter_value int, active_trips_count int) AS
    $BODY$
    BEGIN 
      FOR counter IN 43130..50000 BY 60 LOOP
        RETURN QUERY
          SELECT
            counter,
            COUNT(trips.trip_id)
          FROM denormalized_trips AS trips
          <rest of query here>
          ;
      END LOOP;
      RETURN;
    END; 
    $BODY$ LANGUAGE plpgsql STABLE;
    

    Finally there is alternative declaration for returns table:

    CREATE OR REPLACE FUNCTION get_active_trips(
      n int,
      out counter_value int, 
      out active_trips_count int)
    RETURNS SETOF RECORD AS
    

    Update

    But(!) I feeling that it is possible to simplify your task using single query, without loop.

    Consider the following query (i'v used your simplified query from sqlfiddle):

    select
      counter,
      count(trips.trip_id)
    from
      generate_series(43130, 50000, 60) as counter left join
        denormalized_trips as trips on (trip_start_time between counter and counter + 60) 
    group by counter 
    order by counter;