node.js design-patterns cassandra microservices distributed-computing

How to reliably implement fan out write pattern?

I'm trying to RELIABLY implement that pattern. For practical purposes, assume we have something similar to a twitter clone (in cassandra and nodejs).

So, user A has 500k followers. When user A posts a tweet, we need to write to 500k feeds/timelines.

Conceptually this is easy, fetch followers for user A, for each one: write tweet to his/her timeline. But this is not "atomic" (by atomic I mean that, at some point, all of the writes will succeed or none will).

async function updateFeeds(userId, tweet) {

  let followers = await fetchFollowersFor(userId)
  for(let f of followers) {
    await insertIntoFeed(f, tweet)
  }

}

This seems like a DoS attack:


async function updateFeeds(userId, tweet) {

  let followers = await fetchFollowersFor(userId)
  await Promise.all(followers.map(f => insertIntoFeed(f, tweet)))

}

How do I keep track of the process? How do I resume in case of failure? I'm not asking for a tutorial or anything like that, just point me in the right direction (keywords to search for) if you can please.

Solution

I would start by setting up a message broker (like Kafka), and write all the tweets into a topic.

Then develop an agent that consumes the messages. For each message, the agent fetches a batch of users that are followers but have not yet the tweet into their feed, and insert the tweet into the feed of each user. When there are no more users that are followers but have not the tweet, the agent commits the message and process the following messages. The reason for proceeding this way is resilience : if for any reason the agent is restarted, it will resume from where it left.

Configure the topic with a lot of partitions, in order to be able to scale up the processing of the messages. If you have ONE partition, you can have ONE agent to process the messages. If you have N partitions, you can have up to N agents to process the messages in parallel.

To keep track of the overall processing, you can watch the "lag" into the message broker, which is the number of messages yet to be processed into the topic. If it is too high for too long, then you have to scale up the number of agents.

If you want to keep track of the processing of a given message, the agent can query how many users are still to be processed before processing a batch of users. Then the agent can log this number, or expose it through its API, or expose it as a Prometheus metric...