algorithmdata-mining

Why does BIDE use the semi-maximum period for search space pruning?


According to the article which defines BIDE:

BIDE: Efficient Mining of Frequent Closed Sequences

Theorem 2 (BackScan search space pruning):

Let the prefix sequence be an n-sequence, Sp=e1e2...en. If ∃i(1≤i≤n) and there exists an item e′ which appears in each of the i-th semi-maximum periods of the prefix Sp in SDB, we can safely stop growing prefix Sp.

Proof:

Because item e′ appears in each of the i-th semi-maximum periods of the prefix Sp in SDB, we can get a new prefix S′p=e1e2...ei−1e′ei...en (1<i≤n) or S′p=e′e1e2...en(i=1), and both (Sp ‎⊂ S′p) and (supSDB(Sp)=supSDB(S′p)) hold. Any locally frequent item e′′ w.r.t. prefix Sp is also a locally frequent item w.r.t. S′p, in the meantime (〈Sp,e′′〉⊂〈S′p,e′′〉) and (supSDB(〈Sp,e′′〉)=supSDB(〈S′p,e′′〉)) hold. This means there is no hope to mine frequent closed sequences with prefix Sp.

I understand that for example if I have an AB pattern, and I find an e', for example C, which is in the 2nd maximum period of the pattern, so between A and B for every sequence, which contains AB, then I can stop growing AB, because I could use backward extension to make ACB, which will have the same support, but it is longer. So any pattern I would get by extending AB forward, certainly won't be a closed pattern, because the C is missing from it. That's why I have to stop growing AB and wait until PrefixSpan grows A -> AC -> ACB with forward extension.

What I don't understand is why the maximum period must be constrained to the semi-maximum period in this context and why testing for the semi-maximum period is okay.

The article does not write a real explanation. Any idea? Can you write an example where we lose closed frequent patterns by using the maximum period instead of the semi-maximum period?


Solution

  • I think I got the answer. Here is an example which has new events in the max period, but not in the semi-max period:

    [AxByB,AqBtyB], AB
    2nd max period: [xBy,qBty] -> By
    2nd semi-max period: [x,q] -> 0
    

    According to it, I cannot stop growing AB, because there is no e' in the semi-max period with the same support. On the other hand there is e' in the max period with the same support, so we could reach A{By}B pattern with backward extension from AB.

    But I found that we can reach AB{yB} with forward extension from AB too. With more overlaps, for example by patterns like ABAB we will get the same conclusion; we can grow the initial AB with forward extension into AB{*A*B}, but we cannot add any character to the semi max period A{*}B, because we cannot reach that with forward extension from AB and PrefixSpan grows pattern with forward extension only.

    So for example if we find x in the 2nd semi max period of AB with the same support, then we have to stop growing AB and wait for PrefixSpan to grow AxB in the A -> Ax -> AxB way, and continue to grow the AxB pattern rather than the AB pattern. Of course, we are looking for closed frequent patterns here, not for frequent patterns, so we can prune the search space this way safely.

    I am not sure whether it is possible to design a frequent pattern mining algorithm, which does both forward and backward extensions. Maybe later I'll try to design such an algorithm just for the challenge, but for now BIDE is fine for what I am doing.