Tags: database, distributed, distributed-caching

Why is a distributed in-memory cache faster than a database query?


The Medium article "Six Rules of Thumb for Scaling Software Architectures" (https://medium.com/@i.gorton/six-rules-of-thumb-for-scaling-software-architectures-a831960414f9) says the following about distributed caches:

Better, why query the database if you don’t need to? For data that is frequently read and changes rarely, your processing logic can be modified to first check a distributed cache, such as a memcached server. This requires a remote call, but if the data you need is in cache, on a fast network this is far less expensive than querying the database instance.
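
In code, the read path the quote describes is the cache-aside pattern. A minimal sketch in Python, assuming a memcached server on localhost (reached through the pymemcache client) and a hypothetical users table in SQLite standing in for the database:

```python
import json
import sqlite3

from pymemcache.client.base import Client  # pip install pymemcache

cache = Client(("localhost", 11211))  # assumes memcached is running locally
db = sqlite3.connect("app.db")        # SQLite standing in for the real database

def get_user(user_id):
    key = f"user:{user_id}"

    # 1. Try the cache first: one fast network round trip, no disk access.
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    # 2. Cache miss: fall back to the (slower) database query.
    row = db.execute("SELECT id, name FROM users WHERE id = ?", (user_id,)).fetchone()
    user = {"id": row[0], "name": row[1]}

    # 3. Populate the cache so later reads skip the database; the TTL
    #    bounds staleness for data that "changes rarely".
    cache.set(key, json.dumps(user), expire=300)
    return user
```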

The claim is that a distributed in-memory cache is faster than querying the database. Latency Numbers Every Programmer Should Know ranks the relevant operations like this: Main memory reference <<< Round trip within same datacenter < Read 1 MB sequentially from SSD <<< Send packet CA->Netherlands->CA.
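
For reference, the approximate figures from that list are:

  • Main memory reference: ~100 ns
  • Round trip within same datacenter: ~500 µs
  • Read 1 MB sequentially from SSD: ~1 ms
  • Send packet CA->Netherlands->CA: ~150 ms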

I interpret a network call to the distributed cache as "Send packet CA->Netherlands->CA", since the cached data may not be in the same datacenter. Am I wrong? Should I instead assume that the replication factor is high enough that cached data is available in every datacenter, so that the real comparison between a distributed cache and a database is "Round trip within same datacenter" vs "Read 1 MB sequentially from SSD"?


Solution

  • Databases typically have to fetch data from disk, which is slow. Although most will cache some data in memory, which makes frequently run queries faster, there are other overheads, such as:

    ◦ parsing and validating the query
    ◦ planning and optimising its execution
    ◦ acquiring locks and enforcing transactional guarantees
    ◦ serialising the result set over the client protocol

    All of which add latency.

    Caches have none of these overheads. Cache workloads are generally read-heavy, and a cache always has the value sitting in memory (unless the read is a cold miss). Writing to the cache doesn't block reads of the current value; synchronised writes just mean a slight delay between the write request and the new value becoming available (everywhere).
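
    To make that last point concrete, here is a sketch of the matching write path, continuing the earlier cache-aside sketch (same cache, db, and json objects, same hypothetical users table):

```python
def update_user(user_id, name):
    # 1. Write to the system of record first.
    db.execute("UPDATE users SET name = ? WHERE id = ?", (name, user_id))
    db.commit()

    # 2. Then refresh (or simply delete) the cached entry. Readers are never
    #    blocked; they keep seeing the old value until this set completes,
    #    which is the "slight delay" between the write request and the new
    #    value being visible everywhere.
    cache.set(f"user:{user_id}", json.dumps({"id": user_id, "name": name}), expire=300)
```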