I'm doing a project on categorizing users based on their surfing patterns on a site.
For this I need to find patterns in the data and then cluster them, but the clustering is a problem, since the clustering algorithms I tried (k-means, agglomerative and DBSCAN) don't allow lists as input data.
I have lists with pages visited, separated by session.
Example:
data = [[1, 2, 5],
[2, 4],
[2, 3],
[1, 2, 4],
[1, 3],
[2, 3],
[1, 3],
[7, 8, 9],
[9, 8, 7],
[1, 2, 3, 5],
[1, 2, 3]]
Each list represents a session with visited pages. Each number represents a part of the URL.
Example:
1 = '/home'
2 = '/blog'
3 = '/about-us'
...
I put the data through a pattern mining script.
Code:
import pyfpgrowth # pip install pyfpgrowth
data = [[1, 2, 5],
[2, 4],
[2, 3],
[1, 2, 4],
[1, 3],
[2, 3],
[1, 3],
[7, 8, 9],
[9, 8, 7],
[1, 2, 3, 5],
[1, 2, 3]]
patterns = pyfpgrowth.find_frequent_patterns(data, 2)
print(patterns)
rules = pyfpgrowth.generate_association_rules(patterns, 0.7)
print(rules)
Result:
# print(patterns)
{(1,): 6,
(1, 2): 4,
(1, 2, 3): 2,
(1, 2, 5): 2,
(1, 3): 4,
(1, 5): 2,
(2,): 7,
(2, 3): 4,
(2, 4): 2,
(2, 5): 2,
(4,): 2,
(5,): 2,
(7,): 2,
(8,): 2,
(9,): 2}
# print(rules)
{(1, 5): ((2,), 1.0),
(2, 5): ((1,), 1.0),
(4,): ((2,), 1.0),
(5,): ((1, 2), 1.0)}
According to a paper I'm using the next step would be to use the found patterns as input for the clustering algorithm (page 118 chapter 4.3), but as far as I know the clustering algorithms don't accept lists (with variable lengths) as inputs.
I have tried this, but it didn't work.
Code:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=4, random_state=0).fit(patterns)
test = [1, 8, 2]
print(kmeans.predict(test))
What should I do to let the k-means algorithm be able to predict the group to which the surfing pattern belongs to or is there another algorithm which is more suited for this?
Both HAC and DBSCAN could be used with lists.
You just need to compute the distance matrix yourself, because you obviously cannot use Euclidean distance on this data. Instead. You could consider Jaccard, for example.
K-means cannot be used. It needs continuous data in R^d.