Online (as opposed to bulk processed) data mining packages

By "bulk processed" I mean a static data set of facts (as in a CSV) processed all at once to extract knowledge. While "online", it uses a live backing store: facts are added as they happen ("X buys Y") and queries happen on this live data ("what would you reccomend to a person who is looking at y right now?").

I have (mis)used the term real-time, but I dont mean that results must come within a fixed time. ('''Edit: replaced real-time with online above''')

I have in mind a recommendation engine which uses live data. However all online resources (such as SO questions) I encountered make no distinction between real-time and bulk processing data mining packages. I had to search individually:

Carrot2 which reads from Lucene/Solr and other live datasets (online)
Knime which does scheduled execution on static files (bulk)
Mahout which runs on Hadoop (and Pregel-based Giraph in future) (online?)
a commercial package that integrates with Cassandra (online?)

What are the online data-mining packages?

Is there a reason why the literature makes no distinction between online and bulk processing packages? Or is all practical data-mining actually bulk operation in nature?

Solution

For some algorithms, there are online versions available. For example for LOF, the local outlier factor, there is an online variant. I believe there are also online variants of k-means (and in fact, the original MacQueen version can be seen as "online", although most people turn it into an offline version by reiterating it until convergence), but see below for the problem with the k parameter.

However, online operation often comes at a significant performance cost. Up to the point where it is faster to run the full algorithm on a snapshow every hour instead of continuously updating the results. Think of internet search engines. Most large-scale search engines still do not allow "online" queries, but instead you query the last index that was built, probably a day or more ago.

Plus, online operation needs a significant amount of additional work. It's easy to compute a distance matrix, it is much harder to online update it by adding and removing columns, and synchronize all dependant results. In general, most data-mining results are just too complex to perform this. It's easy to compute the mean of a data stream, for example. But '''often there is just no known solution on updating the results without rerunning the - expensive - process'''. In other situations, you will even need to change the algorithm paramters. So at some point, a new cluster may form. k-means however is not meant to have new clusters appear. So essentially, you can't just write an online version of k-means. It will be a different algorithm, as it needs to dynamically modify the input parameter "k".

So usually, the algorithms will already be either online or offline. And a software package will not be able to turn offline algorithms into online algorithms.