We use HBase and Hadoop as the event store for our Universal Recommender apps, which use PredictionIO internally. The data has grown very large, and after much thought we have decided it would be best to delete data older than 6 months. (Adding another machine as a data node is completely out of the question.)
After looking through the options multiple times, the only way I can find to delete events is to query the event server, collect the eventIds from the response, and send a delete request for each of those eventIds.
The problem is that, at random times, the event server responds with "Internal Server Error", which stops the deletion. When I send the same query from Postman, it sometimes returns events and sometimes responds with "The server was not able to produce a timely response to your request".
To rule out the possibility that no matching events exist, I checked HBase directly: there are indeed events older than the cutoff I ask for in the query.
The query is as follows: http://server:7070/events.json?accessKey=key&entityType=user&event=add_item&untilTime=2017-05-01T00:00:00.000Z&limit=2
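For reference, here is a simplified sketch of the deletion loop I am running (it assumes curl and jq are available, raises the batch limit above the test query's limit=2, and eventIds may need URL-encoding):

```sh
#!/bin/sh
# Sketch of the deletion loop. Host, access key, and cutoff are the
# same placeholders as in the query above.
EVENTS="http://server:7070/events.json?accessKey=key&entityType=user&event=add_item&untilTime=2017-05-01T00:00:00.000Z&limit=50"

while :; do
  # Fetch a batch of old events; an empty batch (or an HTTP error) ends the loop.
  ids=$(curl -sf "$EVENTS" | jq -r '.[].eventId')
  [ -z "$ids" ] && break
  for id in $ids; do
    # DELETE /events/<eventId>.json is the Event Server's single-event delete.
    curl -sf -X DELETE "http://server:7070/events/$id.json?accessKey=key"
  done
done
```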
I need help regarding how I can delete events in such a case.
From your question, I understand that you ultimately want to remove data older than 6 months. My suggestion for a clean and automated way of doing this is to use HBase TTL.
TTL is set per column family. If you set a TTL of 6 months on your column family, cells older than that stop being returned by reads, and HBase major compaction takes care of physically removing those records from disk. See the sketch below.
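As a rough sketch from the HBase shell (the table and column family names here are assumptions based on PredictionIO's default layout, i.e. a table like pio_event:events_<appId> with column family e; run describe on your table first to confirm):

```sh
# Confirm the table layout before changing anything (names below are assumptions).
describe 'pio_event:events_1'

# TTL is given in seconds: 6 months ~= 180 days = 15552000 seconds.
alter 'pio_event:events_1', {NAME => 'e', TTL => 15552000}

# Expired cells are filtered from reads immediately, but disk space is
# reclaimed only when a major compaction runs; you can trigger one manually.
major_compact 'pio_event:events_1'
```

Note that TTL is evaluated against each cell's write timestamp, so as soon as it is applied, everything written more than 6 months ago becomes eligible for removal.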