python, random, data-mining, data-stream, bigdata

Implementing the Alon-Matias-Szegedy Algorithm For The Second Moment Stream Approximation


I'm trying to recreate a function in python for an estimate of the second moment of a data stream.

As stated by the Ullman Book, "Mining of Massive Datasets", a second moment:

Is the sum of the squares of the m_i's. It is sometimes called the surprise number, since it measures how uneven the distribution of elements in the stream is.

Here m_i denotes the number of occurrences of the i-th distinct element in the stream.

For example, take this toy data stream:

a, b, c, b, d, a, c, d, a, b, d, c, a, a, b

We calculate the second moment like this:

5^2 + 4^2 + 3^2 + 3^2 = 59

(because 'a' occurs 5 times in the data stream, 'b' 4 times, and so on)
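This by-hand computation can be checked with a few lines of Python, using collections.Counter to tally the occurrences:

```python
from collections import Counter

# Toy data stream from above
stream = ["a", "b", "c", "b", "d", "a", "c", "d", "a", "b", "d", "c", "a", "a", "b"]

counts = Counter(stream)  # Counter({'a': 5, 'b': 4, 'c': 3, 'd': 3})
second_moment = sum(v ** 2 for v in counts.values())
print(second_moment)      # prints 59
```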

Because we cannot store the whole data stream in memory, we can use an algorithm to estimate the second moment:

The Alon-Matias-Szegedy algorithm (AMS algorithm), which estimates the second moment using this formula:

E(n *(2 * X.value − 1))

Here X is a distinct element of the stream, selected at a uniformly random position, and X.value is a counter that is incremented by 1 each time we encounter another occurrence of X from the moment we selected it onward (starting at 1 for the occurrence at the selected position).

n is the length of the data stream, and E denotes the mean, taken over several such variables X.

Working through an example with the previous data stream, assume we selected "a" at the 13th position, "d" at the 8th, and "c" at the 3rd. We never selected "b".

a, b, c, b, d, a, c, d, a, b, d, c, a, a, b
1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
      x              x              x

Selected like this, we have:

X.element = "a"   X.value = 2
X.element = "c"   X.value = 3
X.element = "d"   X.value = 2

The estimate by the AMS algorithm is:

(15*(2 * 2 - 1) + 15*(2 * 3 - 1) + 15*(2 * 2 - 1))/3 = 55 

Which is pretty near the true value of the second moment calculated before (59).
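The arithmetic above is easy to verify directly, one n * (2 * X.value - 1) term per sampled variable:

```python
n = 15                  # length of the data stream
values = [2, 3, 2]      # X.value for the sampled elements "a", "c", "d"

estimate = sum(n * (2 * v - 1) for v in values) / len(values)
print(estimate)         # prints 55.0
```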

Now, focusing on my code: I have written this function to calculate the "true" second moment, simulating the data stream with a vector (1-D array) and a for loop:

def secondMoment(vector):
    mydict = dict()
    for el in vector:
        if el not in mydict:
            mydict[el] = 1
        else:
            mydict[el] += 1
    return (sum([pow(value, 2) for key, value in mydict.items()]))

and the AMS function, which calculates an estimate of the second moment:

def AMSestimate(vector):
    lenvect = len(vector)
    elements = dict()
    for el in vector:
        if el in elements:
            elements[el] += 1
        elif random.choice(range(0, 10)) == 0:
            elements[el] = 1
    # E(n * (2 * x.value - 1))
    lendict = len(elements)
    estimateM2 = 0
    for key, value in elements.items():
        estimateM2 += lenvect * ((2 * value) - 1)
    print(lendict)
    if lendict > 0:
        return estimateM2/lendict

The problem is that when I calculate the estimate for a small toy problem (like the one above), the values are roughly correct, but when I extend the vector to, say, 10000 elements, the true second moment and the estimate diverge considerably.

I think the problem is tied to the way I generate the data stream and to the way I select the X.element.

That is:

[random.choice(string.ascii_letters) for x in range(size)]

For the generation of a random vector/data stream

And

elif random.choice(range(0, 10)) == 0:
    elements[el] = 1

For the X.element selection (done in the code above, in the AMS function)

For the random vector/data stream generation, I thought the problem might be due to the lack of "variability" of the vector (string.ascii_letters has only 52 elements).


Solution

  • This is an interesting question.

    Say we start with

    import random
    import string
    
    size = 100000
    seq = [random.choice(string.ascii_letters) for x in range(size)]
    

    Then the first implementation is similar to yours (note the use of collections.Counter, though):

    from collections import Counter
    
    def secondMoment(seq):
        c = Counter(seq)
        return sum(v**2 for v in c.values())
    
    >>> secondMoment(seq)
    192436972
    

The second implementation differs more significantly from yours, though. Note that the random indices are chosen first; then an element is counted only from its first occurrence (if any) at one of those indices:

    from collections import defaultdict
    
    def AMSestimate(seq, num_samples=10):
        inds = list(range(len(seq)))
        random.shuffle(inds)
        inds = sorted(inds[: num_samples])
    
        d = {}
        for i, c in enumerate(seq):
            if i in inds and c not in d:
                d[c] = 0
            if c in d:
                d[c] += 1
        return int(len(seq) / float(len(d)) * sum((2 * v - 1) for v in d.values()))
    
    >>> AMSestimate(seq)
    171020000
    
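A single run uses only a handful of random positions, so the estimate has high variance; averaging several independent runs typically tightens it. A minimal sketch along those lines (the averaged_estimate helper and the fixed seed are my additions, not part of the original answer):

```python
import random
import string
from collections import Counter

def second_moment(seq):
    # Exact second moment: sum of squared counts.
    return sum(v ** 2 for v in Counter(seq).values())

def ams_estimate(seq, num_samples=10):
    # Same scheme as AMSestimate above: pick the sampled positions up front.
    inds = set(random.sample(range(len(seq)), num_samples))
    d = {}
    for i, c in enumerate(seq):
        if i in inds and c not in d:
            d[c] = 0              # counting starts at a sampled position
        if c in d:
            d[c] += 1
    return len(seq) / len(d) * sum(2 * v - 1 for v in d.values())

def averaged_estimate(seq, runs=20, num_samples=10):
    # Hypothetical helper: average several independent runs.
    return sum(ams_estimate(seq, num_samples) for _ in range(runs)) / runs

random.seed(0)                    # fixed seed for a reproducible demo
seq = [random.choice(string.ascii_letters) for _ in range(10000)]
true_m2 = second_moment(seq)
est = averaged_estimate(seq)
print(true_m2, est)               # the averaged estimate lands near the true value
```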

    Edit Regarding The Original Code In The Question

    In the code in the question, consider your loop

    for el in vector:
        if el in elements:
            elements[el] += 1
        elif random.choice(range(0, 10)) == 0:
            elements[el] = 1
    

(Minor) The sampling is problematic: an element is selected with a hard-coded probability of 0.1, re-tried at every occurrence until it is picked, rather than being selected at uniformly random stream positions.

    Also consider:

        estimateM2 += lenvect * ((2 * value) - 1)
    

This accumulates n * (2 * X.value - 1) per selected element. The final return estimateM2 / lendict does divide by the number of selected elements, so the averaging itself is fine; the discrepancy comes mainly from the biased selection described above, since frequent elements are almost always selected at or near their first occurrence and X.value then approaches their full count.
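To see concretely why selecting at first occurrences inflates the estimate: an element picked at its first occurrence accumulates its full stream count as X.value, whereas the algorithm needs the count from a uniformly random position onward. Plugging the full counts of the toy stream into the formula (a hypothetical worst case in which every element is selected at its first occurrence):

```python
n = 15
full_counts = {"a": 5, "b": 4, "c": 3, "d": 3}   # true counts in the toy stream

# If each element were selected at its first occurrence, X.value would equal
# its full count, and the resulting "estimate" would be:
biased = sum(n * (2 * m - 1) for m in full_counts.values()) / len(full_counts)
print(biased)   # prints 97.5, well above the true second moment of 59
```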