amazon-web-services | amazon-redshift | amazon-kinesis-firehose | amazon-kinesis-analytics | clickstream

Website clickstream flow + customer 360 using AWS Kinesis Firehose


We are trying to implement a clickstream flow for our e-commerce site on AWS. The clickstream will capture all actions performed by anonymous users. Anonymous users are tracked via a UUID, generated during their first visit, that is stored in a cookie. We used the AWS example here to come up with a solution architecture like the diagram below:

[Architecture diagram: proposed Kinesis Firehose clickstream flow]

Now, two questions:

  1. Different pages in the e-commerce site produce different clickstream data. For example, on the item view page we would like to also send item-related info such as itemId, and on the checkout page we would like to attach a few order-related fields to the clickstream data. Should we have separate Firehose delivery streams for different pages to support custom clickstream data, or should we send a generic clickstream record (with possible null values for some attributes) to a single Firehose delivery stream? (A sketch of such a generic record follows these questions.)

  2. At some point our anonymous users become identified (e.g. they log in, so we know their User_ID). We would then like to link the UUID to the User_ID so we can build a customer 360 view. Should we consider a separate stream flow plus a separate S3 bucket for tracking UUID-to-User_ID mappings? Should we then use Athena to show aggregated reports for the customer 360? Should we aggregate the data and create a customer dimension in Redshift? What would be a good solution for this?
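For concreteness, here is a rough sketch (Python/boto3) of the kind of generic record we have in mind for question 1, including the identity event from question 2 that would tie the cookie UUID to the User_ID; the stream name, field names and helper function are purely illustrative:

```python
import datetime
import json
import uuid

import boto3  # assumes AWS credentials are configured in the environment

firehose = boto3.client("firehose")

STREAM_NAME = "clickstream-delivery-stream"  # hypothetical stream name


def send_event(anon_uuid, event_type, page_payload=None, user_id=None):
    """Send one generic clickstream record to Firehose. Page-specific
    attributes go into a nested payload; absent attributes stay null."""
    record = {
        "event_id": str(uuid.uuid4()),
        "anonymous_uuid": anon_uuid,  # cookie UUID from the first visit
        "user_id": user_id,           # null until the user logs in
        "event_type": event_type,     # e.g. "page_view", "item_view"
        "event_time": datetime.datetime.utcnow().isoformat() + "Z",
        "payload": page_payload or {},
    }
    firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
    )


# Item view page: item-related info rides along in the payload.
send_event("cookie-uuid-1234", "item_view", {"item_id": "SKU-123"})

# After login: an identity event links the cookie UUID to the User_ID.
send_event("cookie-uuid-1234", "identity_link", user_id="U-42")
```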

Regards, Lina

[Update]: Is the following diagram an acceptable solution for the question? [Updated architecture diagram]


Solution

  • You should make the decision based on how you intend to access the data. Given the rate at which clickstream data grows, if you want to generate any meaningful insights on the data with a reasonable response time and cost, you would want to make use of data partitioning. Read more about this here.

    To be able to do that reliably, you will have to use multiple Kinesis streams.
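    As an illustration of the partitioning idea, here is a sketch of a Firehose delivery stream whose custom S3 prefix writes Hive-style date partitions (bucket, role and stream names are placeholders); per-page splits would still come from separate streams as suggested above:

    ```python
    import boto3

    firehose = boto3.client("firehose")

    firehose.create_delivery_stream(
        DeliveryStreamName="clickstream-delivery-stream",
        DeliveryStreamType="DirectPut",
        ExtendedS3DestinationConfiguration={
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-clickstream-bucket",
            # Hive-style date partitions so Athena/Spectrum can prune by day.
            "Prefix": "clickstream/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/",
            # Required whenever the prefix contains expressions.
            "ErrorOutputPrefix": "clickstream-errors/!{firehose:error-output-type}/",
            "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
            "CompressionFormat": "GZIP",
        },
    )
    ```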

    The only scenario in which you would choose not to use multiple streams is cost. But since this is a clickstream application, and assuming the website has active users, the volume of incoming events should be high enough to use the shards effectively.

    Disclaimer: Personal opinion: I would suggest that you move this to Kinesis Firehose, just so that you have the flexibility to start loading the data into Redshift with minimal process changes at a later stage, while at the same time backing the data up to S3 for cold storage/backup. Given the volume, Athena might not be a good choice for performing analytical queries on the data. You can look at using Redshift external tables, wherein the data still lies on S3. As for the cost of the Redshift cluster itself, you can now pause the cluster. Read the announcement here.

    To address the updated architecture diagram that you have added: you can completely skip Glue. Kinesis can load the data directly to S3, and you can define external tables over it with Redshift Spectrum, as sketched below.
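    As a sketch of what such an external table could look like, using the Redshift Data API so no connection management is needed (this assumes an external schema, here called spectrum, was already created with CREATE EXTERNAL SCHEMA; all identifiers are placeholders):

    ```python
    import boto3

    client = boto3.client("redshift-data")

    # Table over the partitioned JSON files that Firehose writes to S3.
    # New partitions must be registered with ALTER TABLE ... ADD PARTITION.
    ddl = """
    CREATE EXTERNAL TABLE spectrum.clickstream (
        event_id        VARCHAR(64),
        anonymous_uuid  VARCHAR(64),
        user_id         VARCHAR(64),
        event_type      VARCHAR(32),
        event_time      TIMESTAMP,
        payload         VARCHAR(65535)
    )
    PARTITIONED BY (year CHAR(4), month CHAR(2), day CHAR(2))
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
    STORED AS TEXTFILE
    LOCATION 's3://my-clickstream-bucket/clickstream/';
    """

    client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="analytics",
        DbUser="admin",
        Sql=ddl,
    )
    ```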

    The general approach is to load the data into Redshift as well as back it up to S3. On Redshift you can then periodically delete old data (say, anything older than a year). This balances cost against performance, since queries are more performant on the data that lives inside Redshift.
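    The retention job can be as small as a scheduled statement like the following (identifiers are placeholders; Redshift's DATEADD/GETDATE give the one-year cutoff):

    ```python
    import boto3

    client = boto3.client("redshift-data")

    # Rolling one-year retention window; the full history stays in S3.
    client.execute_statement(
        ClusterIdentifier="my-redshift-cluster",
        Database="analytics",
        DbUser="admin",
        Sql="DELETE FROM clickstream WHERE event_time < DATEADD(year, -1, GETDATE());",
    )
    ```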

    As for transformations, you can make use of Lambdas directly with Kinesis Firehose; a minimal transformation handler is sketched below. Read more about this here.
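    A handler following the Firehose transformation contract (recordId in, base64 data in and out, result of Ok/Dropped/ProcessingFailed) might look like this; the filtering and enrichment logic is only an illustration:

    ```python
    import base64
    import json


    def lambda_handler(event, context):
        """Firehose data-transformation Lambda: decode each record,
        filter or enrich it, and return it with a result status."""
        output = []
        for record in event["records"]:
            payload = json.loads(base64.b64decode(record["data"]))

            # Illustrative filter: drop events without a cookie UUID.
            if not payload.get("anonymous_uuid"):
                output.append({"recordId": record["recordId"], "result": "Dropped"})
                continue

            # Illustrative enrichment: normalise the event type.
            payload["event_type"] = payload.get("event_type", "unknown").lower()

            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}
    ```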

    Edit 1: Added the opinion around using Redshift and why it will be useful and cost effective

    Edit 2: Added details around simplifying the new proposed architecture.