I have looked at the Twissandra examples. I asked a similar question regarding this a few days back and received some tips I implemented here. However, by looking at the tables (column families) I see barely any difference between this and a relational database.
My scenario: A simple address book where a user can create his own contacts and group them (one contact can be placed in many groups, one group can contain many contacts). A contact may have multiple addresses for example.
I want to retrieve all the contacts who live in address x and are placed in group y. Therefore, I did the following:
CREATE TABLE if not exists User (user_id uuid, contact_id uuid, type varchar, email varchar, PRIMARY KEY(id));
CREATE TABLE if not exists Contact (contact_id uuid, firstname varchar,lastname varchar, photo blob, imagelength int, note varchar, PRIMARY KEY (id));
CREATE TABLE if not exists Address (address_id uuid, contact_id uuid, street varchar, number int, zipcode varchar, country varchar, PRIMARY KEY(address_id));
CREATE TABLE if not exists Group (group_id uuid, user_id, groupname varchar, PRIMARY KEY(group_id));
CREATE TABLE if not exists Group_Contact (group_id uuid, contact_id, PRIMARY KEY(id, contact_id));
However, based on this, this is literally exactly the same as a relational database, well, except that I believe Cassandra is putting this data in a different way than a RDBMS on disk. I don't see how this can be made better in Cassandra and whether I even modeled this the right way. It just feels as a plain relational database. I feel that I did something wrong since I have to use application level joins to get the address of the contacts. I really don't know how I can de-normalize this to allow multiple addresses (and maybe even phones, emails).
Any suggestions to improve this scenario would be greatly appreciated!
As jny indicated, data duplication, denormalization and query-based modeling are keys to building good Cassandra data models. If I wanted to take your tables above, and build a table to support address/contact queries based-on country, I could do it like this:
First, I'll create a user defined type for the contact's address.
aploetz@cqlsh:stackoverflow> CREATE TYPE contactAddress (
... street varchar,
... city varchar,
... zip_code varchar,
... country varchar);
Next, I'll create a table called UserContactsByCountry
to store user contact info, as well as any user contact addresses:
aploetz@cqlsh:stackoverflow> CREATE TABLE UserContactsByCountry (
... country varchar,
... user_id uuid,
... type varchar,
... email varchar,
... firstname varchar,
... lastname varchar,
... photo blob,
... imagelength int,
... note varchar,
... addresses map<text, frozen <contactAddress>>,
... PRIMARY KEY ((country),user_id));
A couple of things to note here:
country
as a partitioning key for querying, and addding user_id
as a clustering key for uniqueness.country
is being stored multiple in each row. Once as the partiiton key, and again with each address. Note that the country
partition key is the one which allows us to run our query.Next, I'll insert three user contacts, each with two addresses, two from the USA and one from Great Britain.
aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('USA',uuid(),'Tech','brycelynch@network23.com','Bryce','Lynch','Head of R&D at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'8192 N. 42nd St.',city:'New York',zip_code:'10025',country:'USA'}});
aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('USA',uuid(),'Reporter','edisoncarter@network23.com','Edison','Carter','Reporter at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'76534 N. 62nd St.',city:'New York',zip_code:'10024',country:'USA'}});
aploetz@cqlsh:stackoverflow> INSERT INTO usercontactsbycountry (country, user_id, type, email, firstname, lastname, note, addresses)
VALUES ('GBR',uuid(),'Reporter','theorajones@network23.com','Theora','Jones','Controller at Network 23',{'work':{street:'101 Big Network Drive',city:'New York',zip_code:'10023',country:'USA'},'home':{street:'821 Wembley St.',city:'London',zip_code:'W11 2BQ',country:'GBR'}});
Now I can query that table for all user contacts in the USA:
aploetz@cqlsh:stackoverflow> SELECT * FROM usercontactsbycountry WHERE country ='USA';
country | user_id | addresses | email | firstname | imagelength | lastname | note | photo | type
---------+--------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+-------------+----------+---------------------------+-------+----------
USA | 2dee94e2-4887-4988-8cf5-9aee5fd0ea1e | {'home': {street: '8192 N. 42nd St.', city: 'New York', zip_code: '10025', country: 'USA'}, 'work': {street: '101 Big Network Drive', city: 'New York', zip_code: '10023', country: 'USA'}} | brycelynch@network23.com | Bryce | null | Lynch | Head of R&D at Network 23 | null | Tech
USA | b92612dd-dbaa-42f2-8ff2-d36b6c601aeb | {'home': {street: '76534 N. 62nd St.', city: 'New York', zip_code: '10024', country: 'USA'}, 'work': {street: '101 Big Network Drive', city: 'New York', zip_code: '10023', country: 'USA'}} | edisoncarter@network23.com | Edison | null | Carter | Reporter at Network 23 | null | Reporter
(2 rows)
There are probably other ways in which this could be modeled, but this is one that I hoped to use to help you understand some of the techniques available.