I need to clean out a few buckets in a massive Riak database. For the buckets where we maintained our own secondary indexes, I simply queried those indexes and deleted the keys. Now, however, I'm dealing with two buckets that don't have any indexes. As I've read repeatedly, I shouldn't use `keys?keys=true` or `keys?keys=stream` on a production system. Another way of getting all the keys is the special `$bucket` index, which the documentation suggests and does not warn against using in production. I believe this index was previously known as `$keys`; our system seems to work with either name.
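For reference, here is roughly the query I'm talking about, as a minimal sketch against the Riak HTTP API (the host, port, and bucket name are placeholders for our setup):

```python
import requests

RIAK = "http://localhost:8098"  # placeholder host/port
BUCKET = "my_bucket"            # placeholder bucket name

# Query the special $bucket index. Every object carries its own
# bucket name as the value of this index, so an exact match on the
# bucket name returns every key in the bucket.
resp = requests.get(f"{RIAK}/buckets/{BUCKET}/index/$bucket/{BUCKET}")
resp.raise_for_status()
print(resp.json()["keys"])
```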
However, just before running this in production, I experimented a bit and found that the `$bucket` index returns keys that have been deleted, just like `keys?keys=true`/`stream` does, whereas the indexes we maintained ourselves did not.
Is the `$bucket` index safe to use in production?
Note that our system runs on the LevelDB backend, which I've been told is bucket-scoped, so it would supposedly be safe to run even `keys?keys=true`/`stream` against it. Is this true?
The reality is that there are no guarantees that a `$bucket` or `$key` query will not impact the stability of the cluster, especially if you combine it with Map/Reduce. However, if you know the bucket holds a relatively small number of keys, or if you use `max_results` in your query to paginate, then `$bucket` should be relatively safe (certainly a lot safer than listing keys).
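As a sketch of the paginated approach over the HTTP API (hypothetical host and bucket; the page size is arbitrary), something like this keeps each individual query bounded:

```python
import requests

RIAK = "http://localhost:8098"  # placeholder host/port
BUCKET = "my_bucket"            # placeholder bucket name
PAGE_SIZE = 1000                # arbitrary; tune for your cluster

continuation = None
while True:
    params = {"max_results": PAGE_SIZE}
    if continuation:
        params["continuation"] = continuation
    resp = requests.get(
        f"{RIAK}/buckets/{BUCKET}/index/$bucket/{BUCKET}", params=params
    )
    resp.raise_for_status()
    body = resp.json()
    for key in body["keys"]:
        # DELETE writes a tombstone; the key disappears once reaped.
        requests.delete(f"{RIAK}/buckets/{BUCKET}/keys/{key}")
    # Riak returns a continuation token while more pages remain.
    continuation = body.get("continuation")
    if not continuation:
        break
```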
Riak KV 2.9.1 offers safer ways of erasing large numbers of keys, available regardless of backend (assuming Tictac AAE is enabled).
As for `$bucket` or `$key` queries returning deleted keys: I suspect this is indeed the case, because these are "internal" LevelDB queries, and within LevelDB there is no way to distinguish a live Riak object from a Riak tombstone. The improvements in Riak 2.9.x handle this situation and will not return deleted keys (unless you are explicitly looking for tombstones to reap).
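If you are stuck on a pre-2.9 release, one workaround (my suggestion, not a built-in mechanism) is to treat the `$bucket` result as a superset and verify each key with a plain GET before acting on it, since a tombstoned key answers a normal GET with 404:

```python
import requests

RIAK = "http://localhost:8098"  # placeholder host/port
BUCKET = "my_bucket"            # placeholder bucket name

def is_live(key):
    # A deleted (tombstoned) key returns 404 to an ordinary GET;
    # 200 means the object exists, 300 means it exists with siblings.
    r = requests.get(f"{RIAK}/buckets/{BUCKET}/keys/{key}")
    return r.status_code in (200, 300)
```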