google-app-engine gae-search

Limiting GAE Search API results by user


We have a use case where users must be able to search only the content in Groups that they have access to. The search must span all of the Groups that they have access to.

Some details: A Group has many Posts, and a user may have access to hundreds of Groups and thousands of Posts within each Group. A search for "Foo" should return all Groups with "Foo" in the name and all Posts with "Foo" in the content, restricted to the Groups that the user has access to.

The way I thought of dealing with it is to store a list of user_ids on each document in the index and then include the user_id in the query string to verify that the user has access. Once the results are returned, we could do an additional check that the user has access to the content before returning the results.

The fields on each indexed document look something like this:

from google.appengine.api import search

fields = [
  search.TextField(name="data", value="some searchable stuff"),
  search.AtomField(name="post_id", value="id of post"),
  search.AtomField(name="group_id", value="id of group"),
  search.AtomField(name="user_id", value=user_id_1),
  search.AtomField(name="user_id", value=user_id_2),
  # ... add the thousand other users who have access to the group (done in a loop)
]

# The query run for user 123 would then be as follows:
results = index.search("data = Foo AND user_id = 123")

My concern with the above approach: Every new user who subscribes to a group would require the search index to be reindexed to include their user_id on each document.

Is there a better way of handling this use case?

Thanks, Rob


Solution

  • There is no simple answer to your question. You need to plan for (a) a typical use-case, and (b) extreme cases.

    If a typical user belongs to 1-3 groups, searching by group_id may be the best solution. You will do 1-2 extra searches, but you won't need to re-index every document every time a user joins or leaves a group, which is prohibitively expensive.
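The per-group search described above could be sketched roughly as follows. This is only a sketch under stated assumptions: the helper names `quote` and `queries_for_groups` are hypothetical, the field names `data` and `group_id` are taken from the question, and it assumes the query language accepts quoted values in the `field = value` form that the question uses. It only builds the query strings, one per group; each string would then be passed to `index.search(...)` and the results merged.

```python
def quote(value):
    """Wrap a value in double quotes, dropping embedded quotes and backslashes."""
    return '"%s"' % str(value).replace('\\', '').replace('"', '')


def queries_for_groups(keyword, group_ids):
    """Build one query string per group the user belongs to (hypothetical helper)."""
    return [
        'data = %s AND group_id = %s' % (quote(keyword), quote(group_id))
        for group_id in group_ids
    ]


# A user in two groups needs two searches for the same keyword.
example_queries = queries_for_groups("Foo", ["g1", "g2"])
```

Because membership lives outside the index, a user joining or leaving a group changes only the list of group ids passed in, not any indexed document.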

    You can have a separate implementation for extreme cases. If a user belongs to more than X groups, it may be more efficient to retrieve all results matching the keyword, and then filter them by group_id.
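For the many-groups case, the filtering step might look like the sketch below. It assumes each search result has already been reduced to a plain dict with a `group_id` key; that shape, and the name `filter_by_groups`, are illustrative assumptions rather than anything the Search API returns directly.

```python
def filter_by_groups(results, allowed_group_ids):
    """Keep only the results whose group_id is one the user can access."""
    allowed = set(allowed_group_ids)  # set lookup keeps the filter linear in len(results)
    return [doc for doc in results if doc["group_id"] in allowed]


# Stand-in for "all results matching the keyword", regardless of group.
all_matches = [
    {"post_id": "p1", "group_id": "g1"},
    {"post_id": "p2", "group_id": "g2"},
    {"post_id": "p3", "group_id": "g3"},
]
visible = filter_by_groups(all_matches, ["g1", "g3"])
```

The cost here depends on the number of keyword matches rather than on the number of groups the user belongs to, which is why it may suit users above the threshold X.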

    An alternative approach is to always retrieve all results regardless of group_id/user_id, and store them in Memcache. Then you can filter them in memory.

    Users tend to search using the same keywords - depending on your corpus, 1% of words may account for up to 99% of searches. If you have a lot of users - and a big enough cache - you will get a lot of cache hits. Note that 1GB of cache can fit tens or even hundreds of thousands of query results. An additional advantage of this approach is that it speeds up all queries, especially phrase or multi-keyword searches.
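The cache-then-filter idea could be sketched as follows. To keep the example self-contained, `DictCache` is a hypothetical in-memory stand-in that mimics only the `get`/`set` calls of a Memcache-style client, `run_search` is a hypothetical callable that performs the unrestricted keyword search, and the results are assumed to be plain dicts with a `group_id` key. The 300-second expiry is an arbitrary illustrative value.

```python
class DictCache(object):
    """In-memory stand-in exposing the get/set subset of a Memcache-style client."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value, time=0):
        self._data[key] = value  # expiry is ignored in this stand-in


def search_for_user(keyword, user_group_ids, run_search, cache):
    """Fetch unfiltered results for a keyword (cached), then filter in memory."""
    key = "search:" + keyword.strip().lower()  # normalise so "Foo" and "foo " share an entry
    results = cache.get(key)
    if results is None:
        results = run_search(keyword)  # cache miss: hit the search index once
        cache.set(key, results, time=300)
    allowed = set(user_group_ids)
    return [doc for doc in results if doc["group_id"] in allowed]


# Usage with a fake search function that records how often it is called.
calls = []

def fake_search(keyword):
    calls.append(keyword)
    return [{"post_id": "p1", "group_id": "g1"}, {"post_id": "p2", "group_id": "g2"}]

cache = DictCache()
first = search_for_user("Foo", ["g1"], fake_search, cache)
second = search_for_user("foo ", ["g2"], fake_search, cache)
```

Two users with different group memberships share one cached result set here, so the second request does not touch the search index at all.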