I have a database of about 700k users along with items they have watched/listened to/read/bought/etc.
I would like to build a recommendation engine that recommends new items based on what users with similar taste in things have enjoyed, as well as actually finding people the user might want to be friends with on a social network I'm building (similar to last.fm).
My requirements are as follows:
- Majority of the "users" in my database aren't actually users of my website. They have been data mined from third-party sources. However, when recommending users, I would like to limit the search to people who are members of my website (while still taking advantage of the bigger data set).
- I need to take multiple items into consideration. Not "people who like this one item you enjoyed...", but "people who like most of the items you enjoyed...".
- I need to compute similarities between users and show them when viewing their profiles (taste-o-meter).
- Some items are rated, others are not. Ratings are from 1-10, not boolean values. In most cases it would be possible to deduct a rating value from other stats if it's not present (e.g. if the user has favourited an item, but hasn't rated it, I could just assume a rating of 9).
- It has to interact with Python code in one way or another. Preferably, it should use a seperate (possibly NoSQL) database and expose an API to use in my web back-end. The project I'm making uses Pyramid and SQLAlchemy.
- I would like to take item genres into account.
- I would like to display similar items on item pages based on both its genre (possibly tags) and what users who enjoyed the item liked (like Amazon's "people who bought this item" and Last.fm artist pages). Items from different genres should still be shown, but have a lower similarity value.
- I would prefer a well-documented implementation of an algorithm with some examples.
Please don't give an answer like "use pysuggest or mahout", since those implement a plethora of algorithms and I'm looking for one that's most suitable for my data/use. I've been interested in Neo4j and how it all could be expressed as a graph of connections between users and items.
Actually that is one of the sweetspots of a graph database like Neo4j.
So if your data model looks like this:
user -[:LIKE|:BOUGHT]-> item
You can easily get recommendations for an user with a cypher statement like this:
start user = node:users(id="doctorkohaku")
match user -[r:LIKE]->item<-[r2:LIKE]-other-[r3:LIKE]->rec_item
where r.stars > 2 and r2.stars > 2 and r3.stars > 2
return rec_item.name, count(*) as cnt, avg(r3.stars) as rating
order by rating desc, cnt desc limit 10
This can also be done using the Neo4j Core-API or the Traversal-API.
Neo4j has an Python API that is also able to run cypher queries.
Disclaimer: I work for Neo4j
There are also some interesting articles by Marko Rodriguez about collaborative filtering.