database-design, cassandra, database-normalization, cassandra-2.2, nosql

Cassandra denormalization vs normalization


Denormalized reality

In my database I have the following denormalized table, which perfectly suits my use case, and reads from it are very fast...

CREATE TABLE IF NOT EXISTS lp_webmap.link (
    drank int,
    prank int,
    title text,
    nofollow boolean,
    created timestamp,
    updated timestamp,  
    dst_ssl boolean,
    dst_www boolean,
    src_ssl boolean,
    src_www boolean,
    dst_domain_name1st text,
    dst_domain_name2nd text,
    dst_domain_name3rd text,
    src_domain_name1st text,
    src_domain_name2nd text,
    src_domain_name3rd text,
    dst_page text,
    src_page text,
    dst_page_title text,
    src_page_title text,
    src_domain_ownerreg text,
    PRIMARY KEY (
        (
            dst_domain_name1st, 
            dst_domain_name2nd, 
            dst_domain_name3rd 
        ), 
        created, 
        dst_page,
        src_page,
        src_domain_name1st,
        src_domain_name2nd,
        src_domain_name3rd
    )
);
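
The fast reads mentioned above come from querying a single partition. For illustration, a typical lookup might look like this (the concrete domain values are hypothetical, not from the question):

```sql
-- Hypothetical example: all links pointing at www.example.com,
-- newest first, served from one partition.
SELECT src_page, src_page_title, created
FROM lp_webmap.link
WHERE dst_domain_name1st = 'com'
  AND dst_domain_name2nd = 'example'
  AND dst_domain_name3rd = 'www'
ORDER BY created DESC;
```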

However, there are billions of rows in this table, and that is an issue for our hardware. Thus, every byte spared in the link table's design is a significant benefit for us.

Normalized solution?

An average select from the link table in the application returns tens or hundreds of rows; in the worst cases, thousands. So it could be (imho) wise to normalize the problem using this table...

CREATE TABLE IF NOT EXISTS lp_webmap.page (
   domain_name1st text,
   domain_name2nd text,
   domain_name3rd text,
   location text,
   title text,
   rank int,
   www boolean,
   update_interval smallint,
   updated timestamp,
   PRIMARY KEY (
      (domain_name1st, domain_name2nd, domain_name3rd, location),
      updated, rank, update_interval
   )
);

Question

If I used normalized link and page tables, I would need to join them in the application. That won't be a problem, but how can I select the corresponding rows from the page table efficiently? I sense that iterating through every resulting row from the link table and selecting the corresponding page row one by one isn't efficient.


Solution

  • It's true that a JOIN is not efficient, especially when one of the tables is very big. A possible solution is to build an extra materialized view, or some kind of index, for quick lookup on a specific column. This doubles the storage, but there is no way to achieve both: reducing space consumption and increasing JOIN query performance.

    You may even need an additional hard drive for the new views or indexes.

    One thing to notice is that when we build an extra view or index, updating some columns takes extra time (and resources). For example, say we have two tables, orders and users, and we search for all the orders of user "jack" with a JOIN. That is the normalized version. In a materialized view, all of user "jack"'s columns are merged into his orders for quick access:

    primary_key, order_id, order_product, order_payment, user_name, user_age, user_favorite_color
    1,           1,        iphone,        1000,          jack,      25,       blue
    2,           3,        book,          30,            jack,      25,       blue
    3,           6,        car,           10000,         jack,      25,       blue
    

    where user_age and user_favorite_color are redundant information extracted from the users table. When jack changes his favorite color, all these records have to update the corresponding column. Normally a DB system will start a background thread to do this update job, but it is still a time-consuming process; imagine jack has thousands of orders.
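
If the tables do stay normalized, the application-side join from the question can at least avoid one page query per link row. A minimal sketch, assuming hypothetical dict-shaped rows and a `fetch_page` callable standing in for the actual driver call: each distinct page is fetched only once and reused across all link rows that reference it.

```python
def join_links_with_pages(link_rows, fetch_page):
    """Application-side join of link rows with their source pages.

    link_rows  -- iterable of dicts carrying src_domain_name1st/2nd/3rd
                  and src_page keys (hypothetical row shape)
    fetch_page -- callable that queries the page table for one page key
                  (stands in for a prepared SELECT via the driver)
    """
    cache = {}   # page key -> page row, so each page is fetched once
    joined = []
    for link in link_rows:
        key = (link["src_domain_name1st"], link["src_domain_name2nd"],
               link["src_domain_name3rd"], link["src_page"])
        if key not in cache:          # first occurrence: one query only
            cache[key] = fetch_page(key)
        joined.append({**link, "page": cache[key]})
    return joined
```

In practice `fetch_page` would execute a prepared statement against `lp_webmap.page`, and the lookups for distinct keys could additionally be issued concurrently (e.g. with the driver's asynchronous execution) to hide per-query latency.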