accumulo

Accumulo: How to index data for efficient queries?


I need an efficient indexing schema to store alerts. Each alert has the following attributes:

I need to perform the following queries:

My initial thought was to index a follows:

ROWID -> alertId_timestamp,
CF -> attribute (type, location),
CQ -> value

However I think the query performance will be poor for my #2 & #3 query use cases. How should I index my data?


Solution

  • If you know those are going to be your only queries, you could insert your data as such:

    RowID     |  CF    |  CQ    | Value 
    Timestamp | "TYPE" | <type> | alertID
    Timestamp | "LOC"  | <loc>  | alertID
    

    Yes, you are inserting the data 2x, but this is "Big Data", right, so storage space should not be a big concern. All of your queries are time based and are just doing counts, so you should put the timestamp as your RowID so that you can do Range over the rows and filter what you want on the CF/CQ with fetchColumnFamily or . You could also try combining the CF/CQ together such as "TYPE_Foo", and "LOC_Boston" if you know that you're only going to be looking for specific types/locations, but this will limit you if you're trying to find all/range of types/locations. Note that you'll also probably want to use some sort of iterator to do server side aggregation Does Accumulo support aggregation?