database, accumulo, nosql

Scanning high volumes of data in Accumulo


I'm having an issue with high-volume reads from an Accumulo table. I understand the basics of Accumulo, but I'm still learning some of the finer details.

I have two tables in an Accumulo database, one which holds the attributes of an object, and one that holds an index that points to the object for O(1) lookup times. I'll use a person object as an example.

Person Table

row_id    colf    colq    val
uuid      person  name    joe
uuid      person  age     25
uuid      person  country usa

Person Index Table

row_id    colf    colq    val
joe       person  uuid

To get all people, I currently scan the index table and, for every entry in it, scan the person table using the key taken from that index entry. I then construct the person object from what the person table returns.

For low volumes this poses no issues, but at 10k person records the query takes ~3 seconds, and at 100k person records it can go past 30 seconds.

My initial thought is that because I query the index table and then query the person table, the whole thing takes ~2x as long as it otherwise would (though I cannot confirm this).

If I know I am going to want all objects in the person table, is there a faster way to query only that table? Like just scan, and when keys change, you know you are on the next object? Or is what I am currently doing the preferred method, and these kinds of queries simply take a while because they are so large? (I'm new to large-scale operations.)

Would it be recommended to limit the query to, say, 5k records, and then re-query when I need the next 5k?

Any advice welcome!


Solution

  • If I know I am going to want all objects in the person table, is there a faster way to just only query that table?

    If you're going to read all of the people, it's kind of pointless to use the index. Just scan your person table directly. If you only want certain attributes on each person, you can use the fetchColumn(Text, Text) method on Scanner/BatchScanner to restrict the scan to those columns.

  • when keys change, you know you are on the next object

    If you just want to handle one row at a time, you can try using the WholeRowIterator, which packs all of a row's entries into a single key/value pair.
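
    To illustrate the "when keys change, you are on the next object" idea, here is a minimal sketch in plain Java with no Accumulo dependency. It assumes the entries arrive sorted by row ID, which is how a single-table scan returns them. The Entry class and the groupByRow method are hypothetical names made up for this example; in a real client you would iterate the Map.Entry<Key, Value> pairs from a Scanner instead, or let the WholeRowIterator do comparable grouping for you.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowGrouping {

    /** Hypothetical stand-in for one scanned key/value pair (row ID, column qualifier, value). */
    static final class Entry {
        final String row;
        final String qualifier;
        final String value;

        Entry(String row, String qualifier, String value) {
            this.row = row;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /**
     * Walks entries that are already sorted by row ID and starts a new
     * attribute map every time the row ID changes.
     */
    static List<Map<String, String>> groupByRow(List<Entry> sortedEntries) {
        List<Map<String, String>> people = new ArrayList<>();
        String currentRow = null;
        Map<String, String> current = null;

        for (Entry e : sortedEntries) {
            if (!e.row.equals(currentRow)) {
                // Row ID changed: the previous person is complete, start the next one.
                current = new LinkedHashMap<>();
                current.put("uuid", e.row);
                people.add(current);
                currentRow = e.row;
            }
            current.put(e.qualifier, e.value);
        }
        return people;
    }

    public static void main(String[] args) {
        // Simulated scan output: two people, entries sorted by row then qualifier.
        List<Entry> scan = new ArrayList<>();
        scan.add(new Entry("uuid-1", "age", "25"));
        scan.add(new Entry("uuid-1", "country", "usa"));
        scan.add(new Entry("uuid-1", "name", "joe"));
        scan.add(new Entry("uuid-2", "age", "31"));
        scan.add(new Entry("uuid-2", "name", "ann"));

        for (Map<String, String> person : groupByRow(scan)) {
            System.out.println(person);
        }
    }
}
```

    The point of the sketch is that one pass over the person table is enough: because entries for the same row are adjacent, you never need a second lookup per person.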