
Why does DataFu implement HyperLogLog as an Accumulator and not as Algebraic?


DataFu has a nice implementation of HyperLogLog for estimating cardinality.

However, it is implemented as an Accumulator, which means it runs only in the reducer and not in the combiner (although, unlike a plain EvalFunc, it never loads the entire set into memory). Why couldn't DataFu implement it as Algebraic, filling the registers in every combiner and then merging the partial results in the reducer? Am I missing something here?


Solution

  • This was fixed in 1.3.0, and the UDF now implements Algebraic. See https://issues.apache.org/jira/browse/DATAFU-91

    For details of how this cuts a task's run time from 10 minutes to 2 minutes, see: https://docs.google.com/spreadsheets/d/1oVYSCh22kufgQ49pgsuboKOMxDgz8N5yBtRpxuo69Lk/edit#gid=0
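
To illustrate why an Algebraic implementation is possible at all, here is a minimal Python sketch of the idea from the question: each combiner fills its own register array, and the partial arrays are merged with an element-wise maximum. This is not DataFu's code. The function names (`initial`, `merge`, `estimate`), the register count, and the SHA-1 based hash are all assumptions chosen for illustration; they loosely mirror the Initial / Intermed / Final stages of Pig's Algebraic interface.

```python
import hashlib
import math

P = 10          # index bits (assumed value, for illustration only)
M = 1 << P      # number of registers


def _hash64(item):
    # Hypothetical hash choice: first 8 bytes of SHA-1, as a 64-bit integer.
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def add(registers, item):
    h = _hash64(item)
    idx = h >> (64 - P)                        # top P bits pick the register
    rest = h & ((1 << (64 - P)) - 1)           # remaining 64 - P bits
    rank = (64 - P) - rest.bit_length() + 1    # 1-based position of leftmost 1-bit
    registers[idx] = max(registers[idx], rank)


def initial(items):
    # "Initial" stage: what one mapper/combiner would do on its own chunk.
    registers = [0] * M
    for item in items:
        add(registers, item)
    return registers


def merge(a, b):
    # "Intermed" stage: element-wise max is associative and commutative,
    # so partial results can be combined in any order.
    return [max(x, y) for x, y in zip(a, b)]


def estimate(registers):
    # "Final" stage: standard HyperLogLog estimate with small-range correction.
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Four overlapping chunks, as if they were processed by four combiners.
chunks = [range(0, 4000), range(2000, 6000), range(5000, 8000), range(7000, 10000)]
merged = [0] * M
for chunk in chunks:
    merged = merge(merged, initial(chunk))

single = initial(range(10000))    # everything in one pass, as in the reducer
print(merged == single, round(estimate(merged)))
```

Because merging partial register arrays yields exactly the same registers as a single pass over all the data, the estimate is unaffected by where the work is done, which is what lets the combiner shrink the data before it reaches the reducer.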