
OneBusAway GtfsReader.setAgencies() does exactly the opposite of what would be expected


What we want to achieve:

We need to parse GTFS files but are only interested in a few agencies inside of that GTFS file. Since parsing the GTFS file takes quite a long time (depending on how many agencies/routes/trips are included in that GTFS file), it would be very helpful to specify which agencies we are interested in before parsing the whole file.

What we tried:

Using the onebusaway-gtfs-modules one can parse a GTFS file like this:

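GtfsReader reader = new GtfsReader();
File gtfsFile = gtfsResourceVetterAndThuesac.getFile(); // a GTFS file containing three agencies
reader.setInputLocation(gtfsFile); // tell the reader which feed to parse
GtfsDaoImpl store = new GtfsDaoImpl();
reader.setEntityStore(store); // parsed entities end up in this in-memory store
reader.run(); // blocking; throws IOException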

The reader also offers a method called void setAgencies(List<Agency> agencies), which is not documented but sounds a lot like what we want to achieve.

I created a GTFS file that only contains three agencies:

agency.txt:

agency_id,agency_name,agency_url,agency_timezone,agency_lang
00786,THÜSAC,https://www.nasa.de/vu/,Europe/Berlin,de
00846,Vetter Verkehrsbetriebe,https://www.nasa.de/vu/,Europe/Berlin,de
00847,Vetter GmbH,https://www.nasa.de/vu/,Europe/Berlin,de

When I set the reader's agencies to just "00786" via setAgencies() and then run it, I get the exact opposite of what I wanted: the reader loads every agency except the one I specified.

Is this the intended behavior, or is it a bug in the OneBusAway reader? Is there another way (preferably using Java methods, no CLI calls) to achieve what we want?


Solution

  • That method definitely won't help you achieve your stated goal. The setAgencies() method is really an internal method that was maybe added to support some edge-cases of loading multiple GTFS feeds with overlapping agencies.

    For your request, I'm assuming that when you say "only interested in a few agencies", you really mean you are interested in specific agencies plus the routes, trips, stop-times, stops, etc. associated with those agencies, yes?

    If so, there is no specific way to do that at GTFS read time using the base reader methods in the OneBusAway GTFS library. As for your larger goal of reading only the data you care about instead of the entire GTFS file, that is hard to avoid: you will likely have to process each line in each file of the feed no matter what, just to determine which entities are associated with the agencies you care about. You can discard the non-relevant entities as you go, which saves some memory, but the read itself still covers the whole feed.

    I know you said "no CLI" but I did want to mention the OneBusAway GTFS Transformer tool. This is a general library for doing procedural modifications to GTFS files. It includes a "retain" operation that basically allows you to "retain" an agency and all data associated with it, producing a minimal GTFS feed. You could use this tool to create a minimal GTFS feed that would speed up loading later, at the expense of loading now. While the tool is designed as a command-line application, it can also be embedded as a library.
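To make the "every line has to be looked at" point concrete, here is a minimal sketch of the cascade such a filter has to follow: agency IDs select routes, and the selected route IDs select trips. It uses only the Java standard library and is not the OneBusAway API; the class AgencyFilterSketch and its select method are hypothetical names, the sample rows are made up, and the CSV splitting is deliberately naive (it assumes no quoted fields containing commas).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of onebusaway-gtfs-modules. */
final class AgencyFilterSketch {

    /**
     * Returns the values of column {@code wanted} for every row whose
     * {@code keyColumn} value is contained in {@code keys}.
     * Note that every row is inspected, whether it matches or not.
     * Assumption: plain comma-separated rows without quoted commas.
     */
    static Set<String> select(List<String> lines, String keyColumn,
                              Set<String> keys, String wanted) {
        Set<String> result = new HashSet<>();
        if (lines.isEmpty()) {
            return result;
        }
        List<String> header = Arrays.asList(lines.get(0).split(",", -1));
        int keyIdx = header.indexOf(keyColumn);
        int wantedIdx = header.indexOf(wanted);
        if (keyIdx < 0 || wantedIdx < 0) {
            throw new IllegalArgumentException("missing column");
        }
        for (String line : lines.subList(1, lines.size())) {
            String[] cells = line.split(",", -1);
            boolean complete = cells.length > Math.max(keyIdx, wantedIdx);
            if (complete && keys.contains(cells[keyIdx])) {
                result.add(cells[wantedIdx]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-ins for routes.txt and trips.txt (made-up rows).
        List<String> routes = List.of(
            "route_id,agency_id,route_short_name",
            "R1,00786,1",
            "R2,00846,2",
            "R3,00786,3");
        List<String> trips = List.of(
            "route_id,service_id,trip_id",
            "R1,S1,T1",
            "R2,S1,T2",
            "R3,S1,T3");

        Set<String> agencies = Set.of("00786");
        Set<String> routeIds = select(routes, "agency_id", agencies, "route_id");
        Set<String> tripIds = select(trips, "route_id", routeIds, "trip_id");
        System.out.println(routeIds + " " + tripIds);
    }
}
```

For a real feed the lines would come from the files themselves (for example via Files.readAllLines), and the chain would continue from trips through stop_times.txt to stops.txt, which is why stops cannot be filtered without first scanning the stop times.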