I'm having an issue with high-volume reads from an Accumulo table. I understand the basics of Accumulo, but I am still learning some of the finer details.
I have two tables in an Accumulo database, one which holds the attributes of an object, and one that holds an index that points to the object for O(1) lookup times. I'll use a person object as an example.
Person Table
row_id  colf    colq     val
uuid    person  name     joe
uuid    person  age      25
uuid    person  country  usa
Person Index Table
row_id  colf    colq     val
joe     person  uuid
The way I currently get all people is to scan the index table and, for every entry in it, scan the person table with the key gathered from the index table, then construct the person object from what is pulled from the person table.
For low volumes this poses no issues, but when I scale up to 10k person records, the query takes ~3 seconds. At 100k person records, the query can go past 30 seconds.
My initial thought is that since I am querying the index table and then subsequently querying the person table, this takes roughly 2x as long as it normally would (though I cannot confirm this).
If I know I am going to want all objects in the person table, is there a faster way to query only that table? Like just scan, and when keys change, you know you are on the next object? Or is what I am currently doing the preferred method, and these kinds of queries take a while since they are so large? (I'm new to large-scale operations.)
Would it be recommended to just limit the query to say 5k records, and then re-query when I need to get the next 5k?
Any advice welcome!
If I know I am going to want all objects in the person table, is there a faster way to just only query that table?
If you're going to read all of the people, it's kind of pointless to use the index. Just scan your people table. If you only want certain attributes on each person, you can use the fetchColumn(Text, Text) method on Scanner/BatchScanner.
when keys change, you know you are on the next object
If you want to handle each row as a unit, you can try using the WholeRowIterator, which packs all of a row's columns into a single entry.
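
If you would rather do the grouping on the client, the "row changed, so this is the next person" idea can be sketched in plain Java. This is only an illustration under some assumptions: it does not call the Accumulo API, the Entry class is a hypothetical stand-in for the Key/Value pairs a Scanner returns, and the row ids uuid-1 and uuid-2 are made up. It relies on entries arriving sorted by row id, which is how a scan returns them.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RowGrouping {

    /** Hypothetical stand-in for one scan result (column family omitted for brevity). */
    static final class Entry {
        final String row;
        final String qualifier;
        final String value;

        Entry(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /**
     * Groups entries into one attribute map per row id.
     * Assumes the entries are already sorted by row id, so a change in
     * row id means the previous object is complete.
     */
    static Map<String, Map<String, String>> groupByRow(List<Entry> sortedEntries) {
        Map<String, Map<String, String>> people = new LinkedHashMap<>();
        String currentRow = null;
        Map<String, String> current = null;
        for (Entry e : sortedEntries) {
            if (!e.row.equals(currentRow)) {
                // Row id changed: start collecting the next person.
                currentRow = e.row;
                current = new LinkedHashMap<>();
                people.put(currentRow, current);
            }
            current.put(e.qualifier, e.value);
        }
        return people;
    }

    public static void main(String[] args) {
        // Simulated scan of the person table, in sorted order.
        List<Entry> scan = new ArrayList<>();
        scan.add(new Entry("uuid-1", "age", "25"));
        scan.add(new Entry("uuid-1", "country", "usa"));
        scan.add(new Entry("uuid-1", "name", "joe"));
        scan.add(new Entry("uuid-2", "age", "31"));
        scan.add(new Entry("uuid-2", "name", "ann"));

        Map<String, Map<String, String>> people = groupByRow(scan);
        for (Map.Entry<String, Map<String, String>> p : people.entrySet()) {
            System.out.println(p.getKey() + " -> " + p.getValue());
        }
    }
}
```

With a single pass like this, the person table is read once from start to finish instead of once per index entry.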