I have a loop of instructions, such as (pseudocode):
for i = 1 to 1000000
// Process the ith input
doSomething(input[i])
end
This takes a long time to complete. I would like to output some kind of progress and more importantly remaining time estimate to the user, so that they can decide whether they should just sit there twiddling their thumbs, go get some coffee, go for a walk, or go on a weeklong vacation to Europe while the algorithm crunches its numbers.
To simplify matters, you can assume the number of iterations will be large (for instance, larger than 100, so you can print progress at every percentile).
A common algorithm is to simply measure the time the last iteration took, then multiply that by the number of iterations left and give that as output. This breaks down if each iteration can vary a lot in how long it takes to execute.
Another approach is to divide the time passed since the first iteration, by the number of iterations completed, and then multiply that by remaining iterations. This breaks down if durations of iterations are not evenly distributed. For example, if the first few inputs are "difficult" and get easier towards the end of the input array, the algorithm will overestimate remaining time until it is almost finished (at which point it will overestimate very slightly).
So how can one get a better estimate of remaining time when the time each iteration will take is a non-straightforward, arbitrary function (such that simply analytically deriving and implementing the time-to-complete of each iteration is impractical) of the iteration ordinate?
Two ideas I can imagine might be fruitful avenues of research, but am unable to fully explore myself at this time:
Why computationally intensive solutions (like fitting equations) are okay:
Firstly, for truly large tasks where this discussion is worthwhile, run time may be measured in hours or days. Complicated mathematical operations these days take milliseconds, so the added burden would not be great - in my above example, clearly doSomething
takes so long as to dwarf the cost of doing some math, otherwise I would not care so much about precisely estimating remaining time in the first place.
Second, it is possible to, for instance, bin iterations into percentiles. Then, instead of operating on a dataset of "iterations complete vs. time taken" the estimator would operate on a dataset of "percent complete vs. time taken" which has at most 100 data points. This affords further complexity: Say your task takes a day or more to complete. Estimating remaining time only once each percent of it is complete means 100 evaluations of the estimator function. When you already take a day, an extra minute and a half for estimating remaining time isn't a big deal, but that already gives you a 1 second window for fitting equations and what not - 1 second is a lot of time for doing math on a modern system. As such, I welcome computationally intensive solutions.
tl;dr: How to over-engineer an accurate remaining time estimator function for very lengthy tasks.
In addition to Penguino's algorithm: instead of fitting n and f(n) you might want to fit log(n) and log(f(n)). As long as your complexity is polynomial this will work.