performance, architecture, microservices, scalability, high-load

High-load data update architecture


I'm developing a parcel tracking system and thinking about how to improve its performance.

Right now we have one table in Postgres named parcels, containing things like id, last known position, etc.

Every day about 300,000 new parcels are added to this table. The parcel data is taken from an external API. We need to track all parcel positions as accurately as possible and reduce the time between API calls for a specific parcel.

Given such requirements what could you suggest about project architecture?

Right now the only solution I can think of is the producer-consumer pattern: one process selecting all records from the parcels table in an infinite loop and then distributing the data-fetching tasks with something like Celery.
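
To make that concrete, here is a minimal sketch of how such a loop might look with Celery. The broker URL, the delivered/last_position columns, and the external API helper are assumptions for illustration, not part of the real system.

```python
# Hypothetical producer-consumer sketch with Celery.
# Broker URL, column names, and the external API helper are assumptions.
import time

import psycopg2
from celery import Celery

app = Celery("parcels", broker="redis://localhost:6379/0")


def external_api_get_position(parcel_id):
    """Placeholder for the real carrier/tracking API call (assumed)."""
    raise NotImplementedError("replace with the actual API client")


@app.task
def fetch_parcel_position(parcel_id):
    """Consumer: call the external API and store the latest position."""
    position = external_api_get_position(parcel_id)
    with psycopg2.connect("dbname=parcels") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "UPDATE parcels SET last_position = %s WHERE id = %s",
                (position, parcel_id),
            )


def producer_loop():
    """Producer: repeatedly enqueue one fetch task per undelivered parcel."""
    while True:
        with psycopg2.connect("dbname=parcels") as conn:
            with conn.cursor() as cur:
                cur.execute("SELECT id FROM parcels WHERE delivered = false")
                for (parcel_id,) in cur:
                    fetch_parcel_position.delay(parcel_id)
        time.sleep(60)  # sweep the whole table once a minute
```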

Major downsides of this solution are:


Solution

  • This is a very broad topic, but I can give you a few pointers. Once you reach the limits of vertical scaling (picking more powerful machines), you have to scale horizontally (adding more machines to the same task). So to design scalable architectures you have to learn about distributed systems. Here are some topics to look into:

    For your specific problem with packages I would recommend considering a key-value store for your position data. Such stores can scale to billions of insertions and retrievals per day (when querying by key).
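
    As a rough illustration, the latest position could live under one key per parcel. Redis is just one example of such a store; the key scheme and value layout below are assumptions.

```python
# Sketch: keep the latest position per parcel in a key-value store.
# Redis is one option; the key naming scheme is an assumption.
import json
import time

import redis

r = redis.Redis(host="localhost", port=6379, db=0)


def save_position(parcel_id, lat, lon):
    """Overwrite the single key holding the parcel's latest position."""
    r.set(
        f"parcel:{parcel_id}:position",
        json.dumps({"lat": lat, "lon": lon, "ts": time.time()}),
    )


def load_position(parcel_id):
    """Read back the latest position, or None if the parcel is unknown."""
    raw = r.get(f"parcel:{parcel_id}:position")
    return json.loads(raw) if raw else None


# Example usage:
# save_position(42, 52.52, 13.405)
# print(load_position(42))
```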

    It also sounds like your data is somewhat temporary and could be kept in in-memory hot storage while the package is not yet delivered (and archived afterwards). A distributed in-memory DB could scale even further in terms of insertions and queries.
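
    One possible shape of that hot/cold split, assuming Redis as the hot store and a Postgres archive table (the parcel_archive table and key names are made up for the sketch):

```python
# Sketch: move a parcel from in-memory hot storage to a cold archive
# once it is delivered. Table and key names are assumptions.
import json

import psycopg2
import redis

r = redis.Redis()


def archive_on_delivery(parcel_id):
    """Copy the final position into the archive table, then drop the hot key."""
    key = f"parcel:{parcel_id}:position"
    raw = r.get(key)
    if raw is None:
        return
    with psycopg2.connect("dbname=parcels") as conn:
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO parcel_archive (id, last_position) VALUES (%s, %s)",
                (parcel_id, raw.decode("utf-8")),
            )
    r.delete(key)  # free hot storage once the parcel is archived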

    Also, you probably want to decouple data extraction (through your API) from processing and persistence. For that you could consider introducing a stream processing system.
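
    A minimal sketch of that decoupling, using Kafka as one possible backbone (the topic name, broker address, and message shape are assumptions): the extraction side only publishes raw updates, while a separate consumer persists them at its own pace.

```python
# Sketch: decouple API extraction from persistence via a message stream.
# Kafka is one option; topic name and message format are assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer


def publish_update(parcel_id, position):
    """Extraction side: push the raw API result onto a topic and move on."""
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("parcel-positions", {"id": parcel_id, "position": position})
    producer.flush()


def consume_updates():
    """Processing side: read updates independently and persist them."""
    consumer = KafkaConsumer(
        "parcel-positions",
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    for message in consumer:
        update = message.value
        # persist update["id"] / update["position"] to the store of choice
        print(update)
```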