I am looking for an algorithm that determines percentiles for live data capture.
For example, consider the development of a server application.
The server might have response times as follows: 17 ms 33 ms 52 ms 60 ms 55 ms etc.
It is useful to report the 90th percentile response time, 80th percentile response time, etc.
The naive algorithm is to insert each response time into a list. When statistics are requested, sort the list and get the values at the proper positions.
Memory usages scales linearly with the number of requests.
Is there an algorithm that yields "approximate" percentile statistics given limited memory usage? For example, let's say I want to solve this problem in a way that I process millions of requests but only want to use say one kilobyte of memory for percentile tracking (discarding the tracking for old requests is not an option since the percentiles are supposed to be for all requests).
Also require that there is no a priori knowledge of the distribution. For example, I do not want to specify any ranges of buckets ahead of time.
I believe there are many good approximate algorithms for this problem. A good first-cut approach is to simply use a fixed-size array (say 1K worth of data). Fix some probability p. For each request, with probability p, write its response time into the array (replacing the oldest time in there). Since the array is a subsampling of the live stream and since subsampling preserves the distribution, doing the statistics on that array will give you an approximation of the statistics of the full, live stream.
This approach has several advantages: it requires no a-priori information, and it's easy to code. You can build it quickly and experimentally determine, for your particular server, at what point growing the buffer has only a negligible effect on the answer. That is the point where the approximation is sufficiently precise.
If you find that you need too much memory to give you statistics that are precise enough, then you'll have to dig further. Good keywords are: "stream computing", "stream statistics", and of course "percentiles". You can also try "ire and curses"'s approach.