
Number of reads compared to writes for database indexing


It is commonly said that database indexing only makes sense for large tables with more reads than writes, because maintaining the indices adds write overhead: every modification of the table data also requires a modification of the indices.

If we assume that the index structure is (a) a B+ tree or (b) a hash table, what is a rule of thumb for the ratio of reads to writes at which it starts to make sense to add a database index?
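
One way to make the question concrete is to write the trade-off as a toy cost model: an index pays off when the total time it saves on reads exceeds the total time it adds to writes. The sketch below uses purely hypothetical per-operation costs (`read_saving_ms`, `write_overhead_ms` are illustrative assumptions, not measurements for any particular database or index type) and is meant only to show that any "rule of thumb" ratio depends entirely on those costs.

```python
# Sketch only: a toy cost model for the read/write trade-off described above.
# The per-operation costs are illustrative assumptions, not measured values.

def index_pays_off(reads, writes, read_saving_ms, write_overhead_ms):
    """Return True if the time the index saves on reads exceeds the time
    spent maintaining it on writes, under the assumed per-operation costs."""
    saved = reads * read_saving_ms          # time saved across all reads
    overhead = writes * write_overhead_ms   # extra time spent updating the index
    return saved > overhead

# Example: assume each indexed read saves 4 ms versus scanning, and each write
# costs an extra 0.5 ms of index maintenance (hypothetical figures).
print(index_pays_off(reads=1_000, writes=5_000,
                     read_saving_ms=4.0, write_overhead_ms=0.5))
# -> True: 1,000 * 4 ms = 4,000 ms saved vs. 5,000 * 0.5 ms = 2,500 ms overhead.
# Under this model the break-even point is reads/writes > write_overhead_ms / read_saving_ms.
```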

For information on how database indexing works, check out How does database indexing work?


Solution

  • There are many factors involved here. For example: is the index data in the cache? Are the data blocks in the cache? How many rows does the query retrieve? How many rows are in the table? And so on. I have been asked this question (or a variant of it) many times. Rather than give a number, I prefer to give some information so that the person asking can make an informed decision. Here is a "back of the envelope" calculation with some stated assumptions.

    1. Let's say the table has 100M rows.
    2. Let's say a row is 200 bytes on average.
    3. Let's assume that the index is fully cached and takes zero time to traverse.
    4. Let's assume that the data has to be read from disk, at 5 ms per read.
    5. Let's assume that your IO subsystem can deliver data at 10 GB/s.
    6. Let's assume that you need to read 50,000 rows from your table. Let's also assume that the rows are spread out such that two of the required rows live in the same block.

    OK, now we can do some math.

    Scenario 1. Using an index

    With two required rows per block, fetching 50,000 rows costs roughly 25,000 random block reads. At 5 ms per read, that is 25,000 × 5 ms ≈ 125 seconds.

    Scenario 2. Scanning the table

    The whole table is 100M rows × 200 bytes = 20 GB. At 10 GB/s, reading all of it sequentially takes about 2 seconds.

    Therefore in this case, scanning the data is much faster than using an index.
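
    For readers who prefer to see the arithmetic spelled out, here is a small Python sketch of the same back-of-the-envelope calculation. The constants are exactly the assumptions listed above, not measurements.

```python
# Back-of-the-envelope check of the two scenarios, using the stated assumptions.

ROWS            = 100_000_000   # 100M rows in the table
ROW_SIZE_BYTES  = 200           # average row size
ROWS_NEEDED     = 50_000        # rows the query must return
ROWS_PER_BLOCK  = 2             # two required rows happen to share each block
READ_LATENCY_S  = 0.005         # 5 ms per random block read (index itself assumed cached, free)
SCAN_RATE_BPS   = 10 * 10**9    # 10 GB/s sequential throughput

# Scenario 1: index access -> one random read per block of required rows.
index_seconds = (ROWS_NEEDED / ROWS_PER_BLOCK) * READ_LATENCY_S

# Scenario 2: full table scan -> read every row sequentially.
scan_seconds = (ROWS * ROW_SIZE_BYTES) / SCAN_RATE_BPS

print(f"Index lookups: {index_seconds:.0f} s")  # 25,000 reads * 5 ms = 125 s
print(f"Full scan:     {scan_seconds:.0f} s")   # 20 GB / 10 GB/s    = 2 s
```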