
Simple Popularity Algorithm


Summary

As Ted Jaspers wisely pointed out, the methodology I described in the original proposal back in 2012 is actually a special case of an exponential moving average. The beauty of this approach is that it can be calculated recursively: you only need to store a single popularity value with each object and adjust that value whenever an event occurs. There's no need to record every event.

This single popularity value represents all past events (within the limits of the data type being used), but older events begin to matter exponentially less as new events are factored in. This algorithm will adapt to different time scales and will respond to varying traffic volumes. Each time an event occurs, the new popularity value can be calculated using the following formula:

(a * t) + ((1 - a) * p)

Here, p is the popularity value currently stored for the object, t is the timestamp of the event, and a is a smoothing factor between 0 and 1.

Reasonable values for a will depend on your application. A good starting place is a=2/(N+1), where N is the number of events that should significantly affect the outcome. For example, on a low-traffic website where the event is a page view, you might expect hundreds of page views over a period of a few days, so N=100 (a≈0.02) would be a reasonable choice. For a high-traffic website, you might expect millions of page views over a period of a few days, in which case N=1000000 (a≈0.000002) would be more reasonable. The value for a will likely need to be gradually adjusted over time.
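
As a rough illustration of the update rule and the a=2/(N+1) rule of thumb, here is a minimal Python sketch. The function names (smoothing_factor, update_popularity, event_weight) are hypothetical and chosen only for this example. The third function unrolls the recursion to show the weight an event still carries after k newer events, which is where the exponential decay comes from.

```python
def smoothing_factor(n_events: int) -> float:
    """Rule of thumb from the text: a = 2 / (N + 1)."""
    if n_events < 1:
        raise ValueError("n_events must be at least 1")
    return 2.0 / (n_events + 1)


def update_popularity(p: float, t: float, a: float) -> float:
    """Apply one event at timestamp t to the stored popularity p."""
    return (a * t) + ((1.0 - a) * p)


def event_weight(a: float, k: int) -> float:
    """Weight of an event's timestamp after k newer events have been applied.

    Unrolling the recursion gives a * (1 - a) ** k, so each newer event
    shrinks the weight of every older one by a factor of (1 - a).
    """
    return a * (1.0 - a) ** k
```

With N=100, smoothing_factor returns roughly 0.0198, in line with the a≈0.02 quoted above.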

To illustrate how simple this popularity algorithm is, here's an example of how it can be implemented in Craft CMS in 2 lines of Twig markup:

{% set popularity = (0.02 * date().timestamp) + (0.98 * entry.popularity) %}
{% do entry.setFieldValue("popularity", popularity) %}

Notice that there's no need to create new database tables or store endless event records in order to calculate popularity.

One caveat to keep in mind is that exponential moving averages have a spin-up interval, so it takes a few recursions before the value can be considered accurate. This means the initial condition is important. For example, if the popularity of a new item is initialized using the current timestamp, the item immediately becomes the most popular item in the entire set before eventually settling down into a more accurate position. This might be desirable if you want to promote new content. Alternatively, you may want content to work its way up from the bottom, in which case you could initialize it with the timestamp of when the application was first launched. You could also find a happy medium by initializing the value with an average of all popularity values in the database, so it starts out right in the middle.
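
The three initialization options described above could be sketched in Python roughly as follows. The function name, the strategy labels, and the fallback used when no items exist yet are all assumptions made for this example, not part of the original proposal.

```python
from statistics import fmean
from typing import Sequence


def initial_popularity(strategy: str, now: float, launch_time: float,
                       existing: Sequence[float]) -> float:
    """Pick a starting popularity for a new item.

    "newest"  -> current timestamp (new item starts at the top)
    "oldest"  -> application launch timestamp (starts at the bottom)
    "average" -> mean of existing popularity values (starts in the middle)
    """
    if strategy == "newest":
        return now
    if strategy == "oldest":
        return launch_time
    if strategy == "average":
        # Assumption: with no existing items, fall back to the current time.
        return fmean(existing) if existing else now
    raise ValueError(f"unknown strategy: {strategy!r}")
```

Whichever option is used, the starting value fades at the same exponential rate as any other past event, so the choice mostly affects where an item sits during the spin-up interval.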


Original Proposal

There are plenty of suggested algorithms for calculating popularity based on an item's age and the number of votes, clicks, or purchases an item receives. However, the more robust methods I've seen often require overly complex calculations and multiple stored values which clutter the database. I've been contemplating an extremely simple algorithm that doesn't require storing any variables (other than the popularity value itself) and requires only one simple calculation. It's ridiculously simple:

p = (p + t) / 2

Here, p is the popularity value stored in the database and t is the current timestamp. When an item is first created, p must be initialized. There are two possible initialization methods:

  1. Initialize p with the current timestamp t
  2. Initialize p with the average of all p values in the database

Note that initialization method (1) gives recently added items a clear advantage over historical items, thus adding an element of relevance. On the other hand, initialization method (2) treats new items as equals when compared to historical items.

Let's say you use initialization method (1) and initialize p with the current timestamp. When the item receives its first vote, p becomes the average of the creation time and the vote time. Thus, the popularity value p still represents a valid timestamp (assuming you round to the nearest integer), but the actual time it represents is abstracted.

With this method, only one simple calculation is required and only one value needs to be stored in the database (p). This method also prevents runaway values, since a given item's popularity can never exceed the current time.
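
To make the behavior concrete, here is a small Python sketch of initialization method (1) followed by a series of votes. The function name replay_votes and the sample timestamps are made up for illustration.

```python
from typing import Iterable


def replay_votes(created: float, vote_times: Iterable[float]) -> float:
    """Initialize p with the creation time, then apply p = (p + t) / 2 per vote."""
    p = created
    for t in vote_times:
        p = (p + t) / 2.0
    return p
```

For an item created at t=1000 that receives votes at t=1100 and t=1300, p moves from 1000 to 1050 and then to 1175. As long as votes arrive in chronological order, p stays at or below the time of the most recent vote.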

An example of the algorithm at work over a period of 1 day: http://jsfiddle.net/q2UCn/
An example of the algorithm at work over a period of 1 year: http://jsfiddle.net/tWU9y/

If you expect votes to steadily stream in at sub-second intervals, then you will need to use a microsecond timestamp, such as the PHP microtime() function. Otherwise, a standard UNIX timestamp will work, such as the PHP time() function.
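
For readers not using PHP, a comparable choice exists in Python, where time.time() returns seconds since the epoch as a float with a fractional part. The helper below is a hypothetical sketch of switching between whole-second and sub-second timestamps; the actual sub-second precision depends on the platform.

```python
import time


def current_timestamp(sub_second: bool = False) -> float:
    """Return the current UNIX time, whole seconds or with a fractional part."""
    now = time.time()  # float seconds since the epoch
    return now if sub_second else float(int(now))
```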

Now for my question: do you see any major flaws with this approach?


Solution

  • The proposed algorithm is a good approach, and is a special case of an Exponential Moving Average where alpha=0.5:

    p = alpha*p + (1-alpha)*t = 0.5*p + 0.5*t = (p+t)/2    //(for alpha = 0.5)

    Because alpha=0.5 tends to favor recent votes (as noted by daniloquio), one way to tune the algorithm is to choose a higher value for alpha (e.g. 0.9 or 0.99). Note that alpha here weights the stored value p, so it corresponds to 1 - a in the Summary's formula. Applying a higher alpha to the test case proposed by daniloquio does not work, however, because as alpha increases the algorithm needs more 'time' to settle (so the arrays should be longer, which is often true in real applications).

    Thus: