I am using a GTFS feed for an app I am working on. I am attempting to list all of the stops for a chosen route. Currently, I am attempting to order the list by stop_sequence, but this is not working properly since some trips do not go to every stop and the data I have received increments the stop_sequence by 1 per stop per trip. The significance of this is that the stop_sequence does not account for other trips that might have more or less stops.
Here's an example:
This is the order of the stops for a route, (ignoring the fact that not every trip will stop at each stop)
Stop A
Stop B
Stop C
Stop D
Stop E
Now here are some example trips for the route:
Trip 1: A, B, C, D
Trip 2: A, B, E
What my data is doing:
For Trip 1:
Stop A: stop_sequence = 1
Stop B: stop_sequence = 2
Stop C: stop_sequence = 3
Stop D: stop_sequence = 4
For Trip 2:
Stop A: stop_sequence = 1
Stop B: stop_sequence = 2
Stop E: stop_sequence = 3
So when I try to order all potential stops for a route I end up with this:
Stop A
Stop B
Stop C
Stop E
Stop D
which clearly is incorrect.
Does anyone know of any other potential ideas to correctly order the stops, perhaps using other data that comes with the GTFS Feed?
UPDATED with a real world example
Here is the example output of a database query that gets all of the stops for route 915. This is for the AM schedule.
+---------+---------+---------------+------------------------------------------------+
| stop_id | trip_id | stop_sequence | stop_name |
+---------+---------+---------------+------------------------------------------------+
| 11771 | 1269287 | 1 | LOTTE PLAZA US 40 & US 29 |
| 11772 | 1269280 | 1 | HARPER'S FARM RD & CEDAR LA eb |
| 11773 | 1269280 | 2 | LITTLE PATUXENT & GRAY STAR wb |
| 11774 | 1269280 | 3 | LITTLE PATUXENT & WHITE CORD WAY wb |
| 11775 | 1269280 | 4 | LITTLE PATUXENT & BRIGHT PASSAGE eb |
| 11776 | 1269280 | 5 | LITTLE PATUXENT & HICKORY RID nb |
| 11777 | 1269280 | 6 | LITTLE PATUXENT & CEDAR LA eb |
| 11778 | 1269280 | 7 | LITTLE PATUXENT & HARPER'S FARM opp eb |
| 11779 | 1269280 | 8 | COLUMBIA MALL & SOUTH RING RD eb |
| 11782 | 1269280 | 9 | BROKEN LAND & HICKORY RIDGE sb |
| 11780 | 1269289 | 9 | LITTLE PATUXENT & GOV WARFIELD nb |
| 11783 | 1269280 | 10 | BROKEN LAND PARK & RIDE |
| 11781 | 1269289 | 10 | LITTLE PATUXENT & VANTAGE PT nb |
| 11784 | 1269280 | 11 | SCAGGSVILLE PARK & RIDE |
| 11785 | 1269280 | 12 | BURTONSVILLE PARK & RIDE |
| 11786 | 1269280 | 13 | COLESVILLE RD & FENTON ST sb |
| 11787 | 1269280 | 14 | SILVER SPRING METRO STATION |
| 11788 | 1269280 | 15 | WALTER REED HOSP & 16TH ST NW |
| 11789 | 1269280 | 16 | 16TH ST & P ST NW |
| 11790 | 1269280 | 17 | 16TH ST & M ST NW |
| 11718 | 1269280 | 18 | K ST & 16TH ST NW fs eb |
| 11719 | 1269280 | 19 | K ST & 14TH ST NW eb |
| 11791 | 1269280 | 20 | 13TH ST & H ST NW sb |
| 11759 | 1269280 | 21 | PENNSYLVANIA AVE & 12TH ST NW eb |
| 11793 | 1269280 | 22 | CONSTITUTION AVE & 10TH ST NW fs eb |
| 12046 | 1269280 | 23 | 7TH ST NW & CONSTITUTION AVE eb |
| 11650 | 1269280 | 24 | INDEPENDENCE AVE & 7/6 ST SW mid eb |
| 11601 | 1269280 | 25 | INDEPENDENCE AVE & 4TH/3RD ST SW eb |
| 13627 | 1269280 | 26 | M ST & 1st ST SE (NAVY YARD) sb |
| 13628 | 1269280 | 27 | M ST & 4th ST SE (SOUTHEAST FEDERAL CENTER) eb |
| 11569 | 1269280 | 28 | M ST & ISAAC HALL AVE SE eb |
| 11795 | 1269280 | 29 | M ST & 8/9TH STS mid eb |
+---------+---------+---------------+------------------------------------------------+
and here is the link to the pdf of the schedule that a lot of commuters are currently using. The first instance of where the two lists differ is after "COLUMBIA MALL & SOUTH RING RD eb"
http://mta.maryland.gov/sites/default/files/915May2011B.pdf
I am trying to make this app commuter friendly as possible, but when the stops are out of order when compared to what commuters usually use, it might cause a lot of confusion.
UPDATE 2:
I still do not see how topological sorting can be used to get the correct sequence. Yes it might give a valid sequence, but it is not guaranteed to be the correct sequence that a commuter will easily recognize. Let's look at another example using the pdf I provided. We will look at Trips 1 and 5 and up until the stop "Columbia Mall". I would create the following edges:
Edges created from Trip 1
Cedar Lane --> Gray Star Way
Gray Star Way --> White Cord Way
...
Harpers Farm Rd --> Columbia Mall
Edges created from Trip 5
Lotte Plaza --> Columbia Mall
The only thing that a topological sorting ensures is
for every directed edge uv from vertex u to vertex v, u comes before v in the ordering
That means that there are multiple valid orderings, but only one is the actual correct one that I want(but there is no way for me to progromatically choose this one over other valid orderings, at least not that I can think of).
A valid ordering might be (this is also the correct one):
Lotte Plaza,
Cedar Lane
Gray Star
...
Columbia Mall
or even
Cedar Lane
Gray Star
...
Lotte Plaza
Columbia Mall
As you can see, according to a topological sort, both of these are valid, but only one of them is the one I want. I cannot think of a way to consistently choose the correct sequence based upon the data provided by the GTFS feed.
Please let me know if I am looking at this the wrong way.
You could construct a directed graph (DAG) where each stop belonging to a route is a node and each transition between two stops in a trip is an edge. Then, you could perform a topological sorting of the graph (http://en.wikipedia.org/wiki/Topological_sorting) to get an ordering of the stops. Note that topological sorting only works for graphs that have no cycles, but some trips do in fact have cycles, so you would not want to add an edge if it created a cycle.
This happens to be the algorithm used by the OneBusAway application suite for ordering stops: https://github.com/OneBusAway/onebusaway-application-modules/blob/master/onebusaway-transit-data-federation/src/main/java/org/onebusaway/transit_data_federation/impl/beans/RouteBeanServiceImpl.java#L281
Note that sometimes routes will have forks or branches, where there are two sets of stops (one for each branch) that don't interact with each other. A naive topological sort might arbitrarily interleave these stops, but the OBA code uses the following two heuristics to get a more natural ordering:
1) Group stops in the same branch together.
2) When ordering two branches relative to each other, put the branch closer in distance to the branch point first.