google-app-engine mapreduce tipfy

Calculating unique elements from huge list in Google App Engine


I have a web widget that gets 15,000,000 hits per month, and I log every session. When I generate a report, I'd like to know how many unique IPs there are. In normal SQL that would be easy, as I'd just run:

SELECT COUNT(DISTINCT IP) FROM SESSIONS;

But since that's not possible on App Engine, I'm now looking for another way to do it. It doesn't need to be fast.

One solution I was considering is to start with an empty Unique-IP table, then run a MapReduce job over all session entities: if an entity's IP is not in the table, I add it and increment a counter. A second MapReduce job would then clear the table. Would this be crazy? If so, how would you do it?

Thanks!


Solution

  • The mapreduce approach you suggest is exactly what you want. Don't forget to use transactions when updating the record in your task queue task, so that it can safely run in parallel across many mappers.

    In the future, reduce support will make this possible with a single, straightforward mapreduce, with no need to hack around with your own transactions and models.
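To illustrate why the check-and-insert step has to be atomic, here is a minimal sketch of the approach in plain Python. It does not use the App Engine datastore or the mapreduce library: `InMemoryStore`, `add_if_new`, `map_shard`, and `count_unique_ips` are hypothetical names, a lock stands in for the transaction, and threads stand in for parallel mapper tasks.

```python
import threading


class InMemoryStore:
    """Hypothetical stand-in for the datastore: a Unique-IP table plus a counter."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of a transaction
        self.unique_ips = set()        # the "Unique-IP table" from the question
        self.unique_count = 0          # the counter from the question

    def add_if_new(self, ip):
        """Atomically add ip if absent and bump the counter; return True if added."""
        with self._lock:
            if ip in self.unique_ips:
                return False
            self.unique_ips.add(ip)
            self.unique_count += 1
            return True

    def clear(self):
        """The second job from the question: empty the table for the next report."""
        with self._lock:
            self.unique_ips.clear()
            self.unique_count = 0


def map_shard(store, sessions):
    """Mapper: visit each session entity in one shard and record its IP."""
    for session in sessions:
        store.add_if_new(session["ip"])


def count_unique_ips(sessions, num_mappers=4):
    """Split the sessions into shards and run one mapper thread per shard."""
    store = InMemoryStore()
    shards = [sessions[i::num_mappers] for i in range(num_mappers)]
    threads = [
        threading.Thread(target=map_shard, args=(store, shard))
        for shard in shards
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.unique_count


# Made-up sample data: 10,000 sessions spread over 50 distinct IPs.
sessions = [{"ip": "10.0.0.%d" % (i % 50)} for i in range(10000)]
print(count_unique_ips(sessions))
```

Without the lock, two mappers could both see the same IP as missing and both increment the counter, which is the race the transaction is meant to prevent. On App Engine itself, one would presumably key each Unique-IP entity by the IP string so that the existence check is a lookup by key; that detail is an assumption here, not something stated in the answer.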