algorithmgoogle-bigqueryhyperloglog

Using HLL_COUNT.MERGE outside of SQL


I can use the following query to general all the HLL sketches of the distinct counts:

SELECT category, count(distinct city), HLL_COUNT.INIT(city) FROM `table`
GROUP BY category

And I get something like this:

enter image description here

While I would normally use the HLL_COUNT.merge(...) function to get the total count, for example:

select 'all -- hll', HLL_COUNT.MERGE(x), null from (select category, count(distinct city), HLL_COUNT.INIT(city) x from `datadocs-163219.010ff92f6a62438aa47c10005fe98fc9.inv` group by category) _

enter image description here

For various reasons, I need to do the MERGE outside of SQL/BigQuery. Is there some sort of library/open source library where I could do something like the following:

>>> hll_set
>>> {'CHAQMBgCIAuCBz8QFBgPIBQyN8hxlqEBvMMBnLMBgWnD5gTB3AH+ROgD/YMEpM8Jr70C6Q2LwwfZlQ3QMNu8AYDSBKf7AbOSqgE=',  'CHAQDhgCIAuCBxwQBxgPIBQyFP3PBMBtibMR3sgC77oViasKwfMF', 'CHAQJxgCIAuCBzIQEBgPIBQyKshxlqEBvMMBzfECh6gJxJABoNwF/rEGwf0PgYYFvOoFmzjJPZwg2y3nbw==', 'CHAQBBgCIAuCBw4QAhgPIBQyBpSJAfapKA==', 'CHAQBRgCIAuCBxEQAxgPIBQyCbaJBfqsH57tBw==', 'CHAQGBgCIAuCBykQDRgPIBQyId6SAtNvwJ0XgO8Ct/EFlvUOskG1E87ZA7/OApwg2y3nbw==', 'CHAQZhgCIAuCB2MQIxgPIBQyW5SJAcqJAbzDAcvcAoIV2xSMFsTyA42IAYkl+Wvj/AHqdJxRlEGbywG/WNjoAqS9BP3CAuPrBNSFAfdDt+YEoeIBr+ICmIYF6CL/MaLNAqKdA8k9rxntBrPVrAE=', 'CHAQEBgCIAuCByQQChgPIBQyHN6SAqjtArAJ/esCj9wSg+8KiVKNygHrpgXIogU=', 'CHAQpgkYAiALggfZAhChARgPIBQyzwKPBMwRkAzxP+wPogyqC8qJAeBo8BHsSOypAbAJriL+MYYR/1jnKqIyzR3wJIkI/QXkecNH7WCzQZgMuDvxFLh+xkboA7QB12akDhu5E+4+3KgBjAZ4nxLBRMw0xRWvIPZYszt+v1gnz2a0BZoF4wzQggHqOewsJeAxgguGErUCjGG3KuhKgUyfCtItkjOMZZwCpi3phgHlA+wRknEhwiq1Os4slgmhELEWl1f1rgH+B6e4AdCtAdkE4R7fK/gihHSRFqipAbYY9BmqP5oBgqsBvhrvEKGRAcpj7XHEVaAUrY8BylLRDgWn1wGpT6IS6irPHewb/AbKHqgQjQPyAeU82zuSHpgQ04UBzwqkFIADiBD4X6ABjBihFsIy6wmovgHNKssPsQOvGcADrQOQevMQvxKMBtANizqbP7l21+kB0UDxY92rVYCBMcD5H8CiEA=='}

>>> hll_merge_method(hll_set)
>>> 193

Is it possible to do this in any way using a library outside of BQ with the hash generated from it?


Solution

  • That's a feature request you might already find in the issue tracker: the current hash is Google proprietary, but one day BigQuery could use an open one. Vote that request up.

    There might be news soon, and subscribing to the issue will keep you updated.


    2019 Update: Find the open source version of BigQuery's HyperLogLog++ at: