According to the article which defines BIDE:
BIDE: Efficient Mining of Frequent Closed Sequences
Theorem 2 (BackScan search space pruning):
Let the prefix sequence be an n-sequence,
Sp=e1e2...en
. If∃i(1≤i≤n)
and there exists an iteme′
which appears in each of the i-th semi-maximum periods of the prefixSp
inSDB
, we can safely stop growing prefixSp
.Proof:
Because item
e′
appears in each of the i-th semi-maximum periods of the prefixSp
inSDB
, we can get a new prefixS′p=e1e2...ei−1e′ei...en
(1<i≤n)
orS′p=e′e1e2...en(i=1)
, and both(Sp ⊂ S′p)
and(supSDB(Sp)=supSDB(S′p))
hold. Any locally frequent iteme′′
w.r.t. prefixSp
is also a locally frequent item w.r.t.S′p
, in the meantime(〈Sp,e′′〉⊂〈S′p,e′′〉)
and(supSDB(〈Sp,e′′〉)=supSDB(〈S′p,e′′〉))
hold. This means there is no hope to mine frequent closed sequences with prefixSp
.
I understand that for example if I have an AB
pattern, and I find an e'
, for example C
, which is in the 2nd maximum period of the pattern, so between A
and B
for every sequence, which contains AB
, then I can stop growing AB
, because I could use backward extension to make ACB
, which will have the same support, but it is longer. So any pattern I would get by extending AB
forward, certainly won't be a closed pattern, because the C
is missing from it. That's why I have to stop growing AB
and wait until PrefixSpan grows A -> AC -> ACB
with forward extension.
What I don't understand is why the maximum period must be constrained to the semi-maximum period in this context and why testing for the semi-maximum period is okay.
The article does not write a real explanation. Any idea? Can you write an example where we lose closed frequent patterns by using the maximum period instead of the semi-maximum period?
I think I got the answer. Here is an example which has new events in the max period, but not in the semi-max period:
[AxByB,AqBtyB], AB
2nd max period: [xBy,qBty] -> By
2nd semi-max period: [x,q] -> 0
According to it, I cannot stop growing AB
, because there is no e'
in the semi-max period with the same support. On the other hand there is e'
in the max period with the same support, so we could reach A{By}B
pattern with backward extension from AB
.
But I found that we can reach AB{yB}
with forward extension from AB
too. With more overlaps, for example by patterns like ABAB
we will get the same conclusion; we can grow the initial AB
with forward extension into AB{*A*B}
, but we cannot add any character to the semi max period A{*}B
, because we cannot reach that with forward extension from AB
and PrefixSpan grows pattern with forward extension only.
So for example if we find x
in the 2nd semi max period of AB
with the same support, then we have to stop growing AB
and wait for PrefixSpan to grow AxB
in the A -> Ax -> AxB
way, and continue to grow the AxB
pattern rather than the AB
pattern. Of course, we are looking for closed frequent patterns here, not for frequent patterns, so we can prune the search space this way safely.
I am not sure whether it is possible to design a frequent pattern mining algorithm, which does both forward and backward extensions. Maybe later I'll try to design such an algorithm just for the challenge, but for now BIDE is fine for what I am doing.