Minimizing cache stampedes

Warning: This blogpost has been posted over two years ago. That is a long time in development-world! The story here may not be relevant, complete or secure. Code might not be complete or obsoleted, and even my current vision might have (completely) changed on the subject. So please do read further, but use it with caution.
Posted on 29 Jul 2010
Tagged with: [ jumanji ]  [ memcache ]  [ PHP ]  [ stampede

Caching is THE magic solution when it comes to optimizing your web applications. There are a lot of caching strategies and applications outthere. Some prefer MySQL query caching, others use memcache to cache either queries, objects, html or other data. However, one of the biggest problems that a lot of people tend to ignore with memcache or other caching daemons is dealing with stampedes. This phenomenon occurs when for instance the caching server is unavailable (because the instance is down, or due to network issues) or, most of the time, when the time to live of an object expires.When that occurs, all your processes will ask the cache for data, find that it’s not present and will try to generate it for you.

Examine the following piece of code that either fetches data from the cache, or generates new data and places it into the cache when nothing was found:

// Fetch the data from the cache
$data = $cache->get ("cached_item");

// Check if the data was fetched correctly
if ($cache->getResult () == Cache::NotFound) {
  // Not found. generate data (expensive)
  $data = generateData();

  // Save the data into the cache. Expires after 30 seconds.
  $cache->set ("cached_item", $data, 30);
}

// Display the data
print $data;

Even though the fragment above will work, it does not prevent caching stampedes. If the generateData() function takes 3 seconds, and we have 10 pageviews per second, it means that at least 30(!) processes will call the generateData() function. Because this has a hughe impact on the system (which is why we cached this data in the first place), it might bring down either the database, webserver or both when you are very unlucky.

Preventing stampedes.

There are multiple ways of preventing stampedes to your database. I present two of them, both used with different types of data.

Method1 : Separate update process

Instead of having the web-server generate the data on demand, let a separate process update the cache. For instance, when you have statistical data that needs to be updated every 5 minutes (but takes a minute to calculate), let a separate process (a cronjob for instance) calculate this data and update the cache. This way, data is always present and if not, you do not need to worry about stampedes since your clients will never generate the data. Works perfectly for global data that does not need to be calculated on the fly. For hings like user-objects, friends-lists, comments etc its probably not a very good method.

Method 2: locking

Instead of an update process creating caches, you can use the normal flow of caching:  1. check cache, 2. if not exist, generate.  However, we build in a little extra measurement to prevent the stampedes. This is the code:

function get($key) {
    $result = $this->_mc->get($key);
    if ($this->_mc->getResultCode() == Memcached::RES_SUCCESS) return $result;

    // See if we can create the lock.
    $this->_mc->add ("stampede_lock_".$key, "locked", 30);

    if ($this->_mc->getResultCode() == Memcached::RES_SUCCESS) {
        $this->_setResultCode (Cache::GenerateData);
        return false;
    }

    $this->_setResultCode (Cache::NotFound);
    return false;
}

The code above does the following:

  • It first checks if the cache is present, if so, just return it and we’re done.
  • If it isn’t present, try to create a “lock”. This will work if (and only if) somebody hasn’t done it before us (since “add” is an atomic process and will succeed for only 1 process, even if multiple processes are trying at the same time).
  • When we did acquire the lock, we return a “GENERATE_DATA” response.
  • When we didn’t acquire the lock, we return a “NOT_FOUND” response.

Note that with this flow only 1 process will return a “generate_data” status. All others will return not_found. It’s up to your client to handle this properly by checking the return-status.

It is important that your stampede lock gets a time-to-live. Also make sure this TTL is high enough to actually generate the data in (when it takes 20 seconds to generate the cache, make it 30 seconds or so). Just make sure that this TTL is never higher than the TTL of the actual object. After 30 seconds, the lock will automatically be released.

One might be tempted to change the code above: instead of having a lock with a TTL, you will delete the lock when setting the data in a set() method. This, however, can trigger a race-condition which means that multiple processes still can generate the data (not at the same time, but right after we removed the lock).

Conclusion:

Even though the functionality above can be improved, they are very simple ways of preventing caching-stampedes. Always make sure your infrastructure cq database grows as your userbase grows even though caching makes it look like you can handle the load.