
PrefixSpan sequence extraction misunderstanding


I have a list of three-element tuples that represent windowed sequences. Using PySpark, I need to be able to get the third element of a tuple given the first two.

So I need it to build sequences of three elements based on how frequently they occur.

This is what I am doing:

from pyspark.mllib.fpm import PrefixSpan

data = [[['a','b','c'],['b','c','d'],['c','d','e'],['d','e','f'],['e','f','g'],['f','g','h'],['a','b','c'],['d','e','f'],['a','b','c'],['b','c','d'],['f','g','h'],['d','e','f'],['b','c','d']]]
rdd = spark.sparkContext.parallelize(data, 2)
rdd.cache()
# minSupport=0.2, maxPatternLength=3
model = PrefixSpan.train(rdd, 0.2, 3)

print(sorted(model.freqSequences().take(100)))

I would expect the sequences and their frequencies to follow alphabetical order, but they don't.

Instead, I am getting sequences like:

FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)
FreqSequence(sequence=[[u'g'], [u'c'], [u'c']], freq=1)

which do not appear among the ones I defined. Obviously there is a problem in the way I have structured my features, or I am missing something about the purpose and functionality of this algorithm.


Solution

  • First let's look at your input:

    rdd.count()
    
    1
    

    As you can see, you created a dataset with only one sequence, made up of 13 itemsets of three items each. It can be described as:

    <(abc)(bcd)(cde)(def)(efg)(fgh)(abc)(def)(abc)(bcd)(fgh)(def)(bcd)>
    

    So the patterns you get are indeed correct given the input. For example:

    FreqSequence(sequence=[[u'c'], [u'd'], [u'b']], freq=1)
    

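    corresponds to the following part of the sequence, taking c from the first (abc), d from (def), and b from the second (abc):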

    ...(abc)(def)(abc)...
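
    A pattern like this can be checked by hand with a small pure-Python helper. The sketch below is not part of Spark. The name contains_pattern is hypothetical, and it assumes the usual sequential-pattern rule: each itemset of the pattern must be a subset of some itemset of the sequence, and the matches must occur in order.

```python
def contains_pattern(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` as an ordered subsequence.

    Both arguments are lists of itemsets (lists of items). Each pattern itemset
    must be a subset of some sequence itemset, and matches must appear in order.
    """
    position = 0
    for wanted in pattern:
        wanted = set(wanted)
        # Greedily advance to the first remaining itemset that covers `wanted`.
        while position < len(sequence) and not wanted <= set(sequence[position]):
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern itemset must match a later itemset
    return True


# The single sequence from the question: 13 itemsets of three items each.
single_sequence = [
    ['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['d', 'e', 'f'],
    ['e', 'f', 'g'], ['f', 'g', 'h'], ['a', 'b', 'c'], ['d', 'e', 'f'],
    ['a', 'b', 'c'], ['b', 'c', 'd'], ['f', 'g', 'h'], ['d', 'e', 'f'],
    ['b', 'c', 'd'],
]

print(contains_pattern(single_sequence, [['c'], ['d'], ['b']]))  # True
print(contains_pattern(single_sequence, [['g'], ['c'], ['c']]))  # True
```

    Both of the patterns that looked wrong in the question are contained in the one long sequence, so each is reported with freq=1.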
    

    If each element of the dataset should instead represent an individual sequence, the data could have the following shape:

    rdd = sc.parallelize([
        [['a'], ['b'], ['c']], [['b'], ['c'], ['d']], [['c'], ['d'], ['e']],
        [['d'], ['e'], ['f']], [['e'], ['f'], ['g']], [['f'], ['g'], ['h']],
        [['a'], ['b'], ['c']], [['d'], ['e'], ['f']], [['a'], ['b'], ['c']],
        [['b'], ['c'], ['d']], [['f'], ['g'], ['h']], [['d'], ['e'], ['f']],
        [['b'], ['c'], ['d']]
    ])
    
    rdd.count()
    
    13
    
    rdd.first()
    
    [['a'], ['b'], ['c']]
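
    If the windows already exist as flat three-element lists, as in the question, they can be reshaped rather than retyped. This is a minimal sketch in plain Python, where the names windows and sequences are only for illustration. The result would then be passed to sc.parallelize.

```python
# The flat windows from the question, one list of three items per window.
windows = [
    ['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['d', 'e', 'f'],
    ['e', 'f', 'g'], ['f', 'g', 'h'], ['a', 'b', 'c'], ['d', 'e', 'f'],
    ['a', 'b', 'c'], ['b', 'c', 'd'], ['f', 'g', 'h'], ['d', 'e', 'f'],
    ['b', 'c', 'd'],
]

# Wrap every item in its own single-element itemset so that each window
# becomes one sequence of three ordered itemsets.
sequences = [[[item] for item in window] for window in windows]

print(len(sequences))  # 13
print(sequences[0])    # [['a'], ['b'], ['c']]
```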
    

    In this layout, each element of the RDD is one sequence made up of three single-item itemsets. With data structured like this:

    model = PrefixSpan.train(rdd, 0.2, 3)
    model.freqSequences().top(5, key=lambda x: len(x.sequence))
    
    [FreqSequence(sequence=[['d'], ['e'], ['f']], freq=3),
     FreqSequence(sequence=[['b'], ['c'], ['d']], freq=3),
     FreqSequence(sequence=[['a'], ['b'], ['c']], freq=3),
     FreqSequence(sequence=[['f'], ['g']], freq=3),
     FreqSequence(sequence=[['d'], ['f']], freq=3)]
    
    model.freqSequences().top(5, key=lambda x: x.freq)
    
    [FreqSequence(sequence=[['d']], freq=7),
     FreqSequence(sequence=[['c']], freq=7),
     FreqSequence(sequence=[['f']], freq=6),
     FreqSequence(sequence=[['b']], freq=6),
     FreqSequence(sequence=[['b'], ['c']], freq=6)]
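
    The freq values can be cross-checked without Spark by counting, for each pattern, how many sequences contain it. The sketch below is a brute-force check in plain Python, not how PrefixSpan works internally, and support is a hypothetical helper. With 13 sequences and a minimum support of 0.2, a pattern has to occur in at least 3 sequences (0.2 * 13 = 2.6), which is why ['f'], ['g'], ['h'] with 2 occurrences is not reported.

```python
windows = [
    ['a', 'b', 'c'], ['b', 'c', 'd'], ['c', 'd', 'e'], ['d', 'e', 'f'],
    ['e', 'f', 'g'], ['f', 'g', 'h'], ['a', 'b', 'c'], ['d', 'e', 'f'],
    ['a', 'b', 'c'], ['b', 'c', 'd'], ['f', 'g', 'h'], ['d', 'e', 'f'],
    ['b', 'c', 'd'],
]
# One sequence per window, each item in its own itemset.
sequences = [[[item] for item in window] for window in windows]


def support(pattern):
    """Count the sequences that contain `pattern` as an ordered subsequence."""
    count = 0
    for sequence in sequences:
        remaining = iter(sequence)
        # `any` consumes `remaining`, so each pattern itemset has to match
        # strictly after the previous one.
        if all(any(set(wanted) <= set(itemset) for itemset in remaining)
               for wanted in pattern):
            count += 1
    return count


print(support([['d']]))                # 7
print(support([['b'], ['c']]))         # 6
print(support([['d'], ['e'], ['f']]))  # 3
print(support([['f'], ['g'], ['h']]))  # 2
```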