data-modelingkeen-io

How can I identify users who have a performed an action X times? [keen-io]


We plan to use Keen as our tracking core, but combining requests and extracting data is a big deal for us.

If we have for example one collection named pageviews, with a user as a property of it (like so : {"name": "pageviews", "properties":{"user":{"id":"4242"},"url":"https://keen.io/"}}), and another collection named purchases with the same user property inside :

  1. How can we get the number of unique users who have visited at least 3 times one url and that have completed one other event "purchases" ?

  2. How can we extract those persons ? Is it possible to do it with the Amazon S3 data replication ? (with Amazon Athena ?). The "extraction" feature does not seem to satisfy our extraction problematic as we can't apply "group by" statements and we can't combine multiple event for the extraction (am I wrong ?).

Our goal is to use Keen not only as a statistic analyser but also as a powerful data support for our segmentation/extraction use.


Solution

  • Here are a variety of options to solve this problem:

    Solution 1: Could you use a simpler criteria set? Keen's funnel analysis type has a natural syntax for a identifying a specific inventory of users who have done (or not done) action A, B, C, etc. It doesn't, however, have the ability to quickly filter based on the number of times the action has been done. How critical is that part of your criteria? Could you instead identify users who have:

    Funnel Steps

    1. viewed /keen.io/ and
    2. viewed /keen.io/products and
    3. made a purchase

    This simplification would instantly make your question answerable in a single funnel query.

    Solution 2: Two-step query process. In this solution, we'll run two separate queries to arrive at the result. First, we'll use select_unique to identify the users who have performed the more rare action (purchase). In the next query, we'll count all the views of that particular page and group_by user.id to get a count of how many times each user viewed that page. We'll use the results of query 1 as a part of our filters in query 2, so that we only query on the relevant users who have made a purchase. Then, we can pick out which users have purchased and viewed the page 3 or more times.

    var client = new Keen({
      projectId: "PROJECT_ID",
      readKey: "READ_KEY"
    }); 
    
    var usersWhoPurchased = []
    
    // Query 1
    var usersWhoPurchasedQuery = new Keen.Query("select_unique", {
      event_collection: "purchases",
      target_property: "user.id",
      timeframe: "this_7_days"
    });
    
    // Get Query 1 Results
    client.run(usersWhoPurchasedQuery, function(err, response){
      usersWhoPurchased = response['result']
    });
    
    
    // Query 2
    var activityCountsByUserQuery = new Keen.Query("count", {
      event_collection: "pageviews",
      group_by: "user.id",
      timeframe: "this_7_days",
      filters: [
      	{
      	  property_name: "url",
      	  operator: "eq",
      	  property_value: https://keen.io/
      	},
      	{
      	  property_name: "user.id",
      	  operator: "in",
      	  property_value: usersWhoPurchased
      	}
      ]
    });
        
    // Get Query 2 Results
    client.run(activityCountsByUserQuery, function(err, response){
      console.log(response)
      var countsByUser = response['result']
    });
    
    
    // countsByUser = [
    //   {
    //     "user.id": "A",
    //     "result": 1
    //   },
    //   {
    //     "user.id": "B",
    //     "result": 0
    //   },
    //   {
    //     "user.id": "C",
    //     "result": 3
    //   }
    // ]
    // Sort countsByUser to identfy those with >3

    This method has some considerations as you scale and start to max out the number of users you can include in a filter (hundreds or thousands depending on ID length). They can be broken down and queried in batches.

    Solution 3: Include counts on events. This method doesn't work in all situations, but when it does it can be really powerful and elegant. The idea is to run counts on the client side and include them as user properties on your events. For example, your user object on your pageview event might contain a property like:

    {  
       "collection_name":"pageviews",
       "properties":{  
          "user":{  
             "id":"4242"
          },
          "url":"https://keen.io/",
          "product_views_this_session":4
       }
    }
    

    With that kind of data, you can again use a simple funnel with filters to identify users who meet your criteria

    Funnel Steps

    1. viewed page 'https://keen.io' where product_views_this_session gt 4
    2. made a purchase

    Solution 4: Implement counters using S3 Streaming + AWS Lambda + RDS or DynamoDB

    If you don't already have it turned on, enable streaming of all of your raw Keen data to Amazon S3. Once it's there, you can do all sorts of additional steps on incoming data and data at rest.

    Your process could be something like this:

    1. Turn on Keen IO Streaming to S3
    2. Write an AWS Lambda job that reviews the incoming data and
    3. Counts how many times each user did each action
    4. Looks up each user in an RDS table
    5. Increments the count for that user and that action
    6. Within the same Lambda job, re-post to the event back to Keen in a new collection, with a new property, like the following.

    New event:

    {  
       "collection_name":"user_product_view_enriched",
       "properties":{  
          "user":{  
             "id":"4242"
          },
          "url":"https://store.io/productA45",
          "view_history":{  
             "product":"A45",
             "lifetime_views":5,
             "counting_since":"<timestamp>"
          }
       }
    }
    

    Putting the data back into Keen in this format allows you to use standard funnels as described in Solution 3:

    Funnel Steps

    1. users with events in user_product_view_enriched with product "A45" and lifetime_views > X
    2. users who made a purchase

    Limitation to this method is that you can't arbitrarily change when and how counters "start". For example, maybe you decide later that you only want to count views within the last 3 weeks rather than lifetime. You'd have to query the raw data in S3 to arrive at a new count like that, but you already have the data there so it's not that bigga deal.

    Solution 5: Implement more complex querying on raw S3 data using EMR and/or Athena

    Write a regular batch job to query your raw S3 data and identify the users that you are interested in. There are a variety of options here that I am not an expert on, but I know are possible.