DataFu has a nice implementation of HyperLogLog for estimating cardinality. However, it's implemented as an Accumulator, which means it runs only at the reducer and not in the combiner (although, unlike a normal EvalFunc, it never loads the entire set into memory). Why couldn't DataFu implement it as Algebraic, filling the registers at every combiner and then merging and reducing the result?
Am I missing something here?
Fixed in 1.3.0, and now it does use Algebraic.
See https://issues.apache.org/jira/browse/DATAFU-91
For details on how this improves a task from 10 minutes to 2 minutes, see: https://docs.google.com/spreadsheets/d/1oVYSCh22kufgQ49pgsuboKOMxDgz8N5yBtRpxuo69Lk/edit#gid=0
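
For illustration, the reason HyperLogLog fits the Algebraic model is that its registers can be merged by taking the element-wise maximum, so each combiner can build a partial sketch and the reducer only has to merge them. Below is a minimal Java sketch of that idea. It is not DataFu's actual code: the class `HllSketch`, its method names, the MD5-based hash, and the precision range are all assumptions made for this example, and the comments only loosely map the methods onto the initial, intermediate, and final stages of an Algebraic UDF.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

/** Illustrative HyperLogLog sketch; hypothetical class, not DataFu's implementation. */
final class HllSketch {
    private final int p;            // number of index bits
    private final byte[] registers; // m = 2^p registers

    HllSketch(int p) {
        if (p < 7 || p > 16) throw new IllegalArgumentException("p must be in [7, 16]");
        this.p = p;
        this.registers = new byte[1 << p];
    }

    /** "Initial" stage: fold one value into the local registers. */
    void add(String value) {
        long h = hash64(value);
        int idx = (int) (h >>> (64 - p));   // top p bits pick the register
        long rest = h << p;                 // remaining bits, left-aligned
        int rank = Math.min(Long.numberOfLeadingZeros(rest), 64 - p) + 1;
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    /** "Intermediate" stage: merge another partial sketch by element-wise max. */
    void merge(HllSketch other) {
        if (other.p != p) throw new IllegalArgumentException("precision mismatch");
        for (int i = 0; i < registers.length; i++) {
            if (other.registers[i] > registers[i]) registers[i] = other.registers[i];
        }
    }

    /** "Final" stage: turn the registers into a cardinality estimate. */
    double estimate() {
        int m = registers.length;
        double sum = 0.0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2.0, -r);
            if (r == 0) zeros++;
        }
        double alpha = 0.7213 / (1.0 + 1.079 / m); // bias constant, valid for m >= 128
        double raw = alpha * m * m / sum;
        // Small-range correction (linear counting) while some registers are still empty.
        if (raw <= 2.5 * m && zeros > 0) return m * Math.log((double) m / zeros);
        return raw;
    }

    /** True if both sketches hold identical register contents. */
    boolean sameRegisters(HllSketch other) {
        return Arrays.equals(registers, other.registers);
    }

    // Assumption for this example: first 8 bytes of MD5 used as a 64-bit hash.
    private static long hash64(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(d).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two "combiners" see overlapping slices of the same 10,000 distinct values.
        HllSketch combinerA = new HllSketch(12);
        HllSketch combinerB = new HllSketch(12);
        for (int i = 0; i < 6000; i++) combinerA.add("user-" + i);
        for (int i = 4000; i < 10000; i++) combinerB.add("user-" + i);
        combinerA.merge(combinerB); // the "reducer" only merges fixed-size registers
        System.out.println("estimated distinct count: " + combinerA.estimate());
    }
}
```

Because merging is an element-wise maximum, the merged registers are identical to those of a single sketch fed every value, regardless of how the input was split or how much the splits overlap. The data shipped from each combiner is also a fixed-size register array rather than the raw values, which is where the shuffle savings would come from.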