Cache Strategies in Distributed Systems

🤔 Why Are These Strategies Even Needed?
The Core Problem
Your system has a cache (like Redis). It stores expensive data: DB query results, ML model outputs, user dashboards. Every cache entry has a TTL (Time To Live), after which it expires and is deleted.
Now here's the danger moment:
The instant that cache entry expires, every request that comes in finds nothing in the cache and goes straight to the database.
If 10,000 users are hitting that same key, that's 10,000 simultaneous DB queries in one second. Your database was never designed to handle that. It collapses. 💥
🧱 Why Can't You Just... Not Expire the Cache?
Because stale data is a real problem too.
Product prices change 💰
Stock values update every second 📈
User balances must be accurate 🏦
You need expiry. But expiry is exactly what causes the stampede. That's the tension these strategies resolve.
🧱 Why Can't You Just Scale the Database?
Scaling costs money and time. More importantly, the stampede hits instantly, while auto-scaling can take 30-60 seconds or longer to spin up new instances. By the time the new DB instances are ready, the damage is done and your system is already down.
You need a strategy that works at the moment of cache miss, not after.
🧱 Why Can't You Just Let Requests Retry?
This actually makes it worse. As we saw in the Thundering Herd blog, retries multiply traffic. If 10,000 clients each retry 3 times, you add up to 30,000 DB queries on top of the original 10,000. The system collapses even faster.
1. Jitter on TTL: Spread the Expiry
What it is: Adding random variation to cache expiration times so entries don't all expire simultaneously.
🏦 Real App Example: Banking Dashboard
Imagine 10,000 users have their "account summary" cached with a TTL of 60s, all written at about the same moment (say, right after a deploy or a cache flush). Without jitter, all 10,000 entries expire in the same second and your DB gets 10,000 simultaneous queries.
✅ With jitter (±10s), entries expire anywhere between 50 and 70s after being written. The rebuilds spread over a 20-second window instead of landing in a single second, and the DB breathes easy.
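Here is a minimal sketch of the idea in Python. The function name `jittered_ttl` is made up for this post, and the 60s ± 10s numbers are simply the ones from the banking example above.

```python
import random


def jittered_ttl(base_ttl: float, jitter: float) -> float:
    """Return base_ttl shifted by a random offset in [-jitter, +jitter]."""
    return base_ttl + random.uniform(-jitter, jitter)


# 10,000 entries written at the same moment no longer share one expiry second.
ttls = [jittered_ttl(60, 10) for _ in range(10_000)]
print(f"shortest TTL: {min(ttls):.1f}s, longest TTL: {max(ttls):.1f}s")
```

When writing to a real cache, you would pass the (rounded) result as the entry's expiry instead of the fixed 60.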
🔒 2. Mutex Locking: One Rebuilds, Rest Wait
What it is: Only one request regenerates the expired cache. All others wait for the fresh value.
1️⃣ Cache miss detected
2️⃣ Request #1 acquires the lock 🔒
3️⃣ Request #1 fetches from DB & updates cache
4️⃣ Lock released 🔓
5️⃣ Requests #2-5000 read the fresh cached value ✅
Only 1 DB hit instead of 5,000.
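The steps above can be sketched in a single process like this. Everything here is a stand-in: a plain dict plays the cache, `slow_db_query` is a hypothetical expensive call, and one `threading.Lock` replaces what would be a per-key distributed lock in a real deployment (with Redis that is commonly a `SET key value NX EX seconds`).

```python
import threading
import time

cache = {}                      # stand-in for Redis
cache_lock = threading.Lock()   # assumption: one lock; real systems lock per key
db_calls = 0


def slow_db_query(key):
    """Hypothetical expensive query; counts how often the DB is hit."""
    global db_calls
    db_calls += 1               # only ever called while holding cache_lock
    time.sleep(0.05)
    return f"value-for-{key}"


def get_with_mutex(key):
    value = cache.get(key)
    if value is not None:
        return value            # cache hit, no lock needed
    with cache_lock:
        # Re-check: another request may have rebuilt the entry while we waited.
        value = cache.get(key)
        if value is None:
            value = slow_db_query(key)
            cache[key] = value
    return value


threads = [threading.Thread(target=get_with_mutex, args=("dashboard",))
           for _ in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(db_calls)  # prints 1: fifty concurrent misses, one DB hit
```

The second `cache.get` inside the lock is the important part. Without it, every waiting request would query the DB as soon as it got the lock.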
🏦 Stock Trading: Order Execution
Scenario: A user places a "Buy 100 shares of TCS" order. The system must not execute it twice, even if the request is sent twice because of a network failure. The same "only one caller does the work" lock that protects a cache rebuild also protects this operation.
Money is involved, so absolute correctness is needed
Duplicate orders = financial loss
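One way to picture the order scenario is a lock combined with an idempotency key that the client sends with every attempt. This is only a sketch: `OrderExecutor`, the key format, and the in-memory dict are all hypothetical, and a real trading system would keep this state in durable storage.

```python
import threading


class OrderExecutor:
    """Executes each order at most once per idempotency key (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = {}      # idempotency key -> result of the first execution
        self.executions = 0

    def execute(self, idempotency_key, symbol, quantity):
        with self._lock:
            # A resent request finds the earlier result and does no new work.
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            self.executions += 1
            result = {"order_id": self.executions, "symbol": symbol,
                      "quantity": quantity, "status": "EXECUTED"}
            self._results[idempotency_key] = result
            return result


executor = OrderExecutor()
first = executor.execute("user42-order-001", "TCS", 100)
retry = executor.execute("user42-order-001", "TCS", 100)  # duplicate send
print(executor.executions, first == retry)
```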
🎲 3. Probabilistic Early Expiration (PER): Expire Before You Expire
What it is: PER uses a simple but effective formula from the XFetch algorithm. Instead of waiting for the cache entry to expire, each request has a chance of triggering a refresh, and that chance grows as the expiration time approaches.
📊 Analytics Dashboard
Dashboard widgets are cached for 30 seconds. If all widgets expire together, the analytics DB gets flooded.
With PER:
Widgets refresh at slightly different times
Load spreads naturally
Backend remains stable
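For the curious, the XFetch check is usually written as: refresh early when `now - delta * beta * ln(rand) >= expiry`, where `delta` is how long the last recompute took, `beta` is a tuning knob (1.0 by default, higher means earlier refreshes), and `rand` is a uniform random number in (0, 1]. The sketch below assumes that form; the function names and the numbers in the demo are made up for illustration.

```python
import math
import random
import time


def should_refresh_early(expiry, delta, beta=1.0, now=None, rng=random.random):
    """XFetch-style check: True means this request should refresh the entry now."""
    now = time.time() if now is None else now
    # 1 - rng() lies in (0, 1], so log() never receives zero.
    # log() of that value is <= 0, which pushes "now" forward by a random amount.
    return now - delta * beta * math.log(1.0 - rng()) >= expiry


def refresh_rate(seconds_left, trials=20_000):
    """Fraction of requests that would refresh with `seconds_left` until expiry."""
    rng = random.Random(42)  # fixed seed so the demo is repeatable
    hits = sum(
        should_refresh_early(expiry=100.0, delta=1.0,
                             now=100.0 - seconds_left, rng=rng.random)
        for _ in range(trials)
    )
    return hits / trials


# The closer to expiry, the larger the share of requests that refresh early.
for seconds_left in (5.0, 1.0, 0.1):
    print(seconds_left, round(refresh_rate(seconds_left), 3))
```

Because each request rolls its own dice, a few requests refresh early and the rest keep reading the cached value, so the entry is usually fresh again before it ever expires.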
♻️ 4. Stale-While-Revalidate (SWR)
What it is: Serve old (stale) data immediately → refresh in background → next request gets fresh data.
Use SWR when your users care more about speed than perfect freshness, and when showing data that's a few seconds (or minutes) old is completely acceptable.
How it works:
Cache entry has a defined TTL (Time-To-Live)
When the TTL expires, instead of blocking the request:
Return the stale (expired) value immediately
Trigger a background refresh to fetch fresh data
Once new data is ready, update the cache with the fresh value
📰 News Feed (Reddit, Twitter): Users get an instant feel, no waiting at all.
🛍️ Product Recommendations: ML recompute is slow. Show the last known recommendations instantly while the model quietly recomputes in the background.
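The SWR flow described above can be sketched with a small in-process cache. `SWRCache` and `load_feed` are hypothetical names, the background refresh is a plain thread, and a cold miss (nothing stale to serve yet) simply loads synchronously, which is an assumption of this sketch and not part of the pattern itself.

```python
import threading
import time


class SWRCache:
    """Serve stale values immediately and refresh them in the background."""

    def __init__(self, loader, ttl):
        self._loader = loader
        self._ttl = ttl
        self._lock = threading.Lock()
        self._entries = {}        # key -> (value, expires_at)
        self._refreshing = set()  # keys with a refresh already in flight

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, expires_at = entry
                expired = time.monotonic() >= expires_at
                if expired and key not in self._refreshing:
                    # Start exactly one background refresh per expired key.
                    self._refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,),
                                     daemon=True).start()
                return value      # fresh or stale, returned without waiting
        # Cold miss: no stale value exists yet, so load in the foreground.
        return self._refresh(key)

    def _refresh(self, key):
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, time.monotonic() + self._ttl)
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)


versions = []


def load_feed(key):
    """Hypothetical loader; each call produces a new version of the feed."""
    versions.append(key)
    return f"{key}-v{len(versions)}"


feed_cache = SWRCache(load_feed, ttl=0.05)
print(feed_cache.get("feed"))   # cold miss, loaded in the foreground
time.sleep(0.1)                 # let the entry expire
print(feed_cache.get("feed"))   # stale value served, refresh starts behind it
time.sleep(0.2)                 # give the background refresh time to finish
print(feed_cache.get("feed"))   # the refreshed value
```

Note that the request which triggers the refresh still sees the old value. Only later requests see the new one, which is exactly the trade SWR makes.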
Final Thoughts
Cache Stampede is not a rare edge case. It's a ticking time bomb in any system that relies on caching at scale. The moment your cache key expires under heavy load, the herd is already at the door.
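Each strategy attacks the problem differently: jitter spreads expirations out, mutex locking lets one request rebuild while the rest wait, PER refreshes before expiry, and SWR serves stale data while refreshing in the background.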
There is no single silver bullet. The best systems combine these strategies based on how critical freshness is, how expensive the DB call is, and how many concurrent users they serve.
Cache smart. Stay stable. Don't let the herd win. ⚡


