GraceKelly: A best effort cache synchronization library for distributed systems

GracKelly is a best effort cache synchronization library for distributed systems.

In the following blog post I’ll explore, the motivations for such a library and give a brief  introduction to GraceKelly.

GraceKelly is open source and is available here: GraceKelly on Github

A Chaotic Place

The average visitor on Flipkart is not aware of the scale at which we operate. Day in and day out we handle millions of page views, thousands of transactions, millions of updates for many things including price, availability, offers, recommendations.  Under the hood of the calm, functional facade of the website, there is a complex network of services that are constantly interacting with each other.

We have services that handle everything ranging from search to product serviceability at a given pin code.

An arbitrary web request that hit’s one of our webservers can spawn a plethora of requests to a bunch of back-end services, which in turn might be dependent on other services.

The back-end services respond with different responses at a variable latency and the responses are collated, made sense of, transformed and finally a web response is created and served to the user as a web-page.
The variability of the requests and responses that traverse the complex network of services while being transformed, multiplexed, demultiplexed and altered makes for a chaotic environment.

Distributed Service Environment

Chaos means unpredictability and unpredictability is bad. When a user requests for a page his page load time must be predictable. When a product goes out of stock, the amount of time it takes to reflect on the product page needs to be predictable. We need predictability around service SLAs. Service SLAs are dependent on the load under which the service is operating. This means, we need predictability around service load as well. We can’t operate in an environment where one minute a single server is able to handle production traffic and the next minute a whole cluster is buckling under the load. So we try to grab and hold on to as much predictability as we can, where ever possible.

Caches to the rescue

Caches act as sentinels in a distributed service environment. Although their primary function is to reduce latency, when used appropriately they excel and bringing predictability to a system. This is because a cache request is extremely predictable, with almost no variability, either in response times or the load per request. This is down to the simple data access pattern for a cache request. If a given front-end request hits caches at all the back-end services we can predict with high confidence the load and response latency of the given request on each service.  One could say that there is positive co-relation between the percentage of Cache hits and the predictability of a system/environment.

Caches To The Rescue

Throwing a spanner in the works

Every time there is a cache miss both our distributed environment and it’s SLAs become a little bit more vulnerable. In the face of these risks a common pattern of cache usage seems inappropriate. One of the most common ways of updating the data stored in caches is to have an expiry ttl for every cache entry. Once this time to live expires the cache entry is removed from the cache and is no longer accessible, until another request repopulates/synchronizes the cache. Using an expiry ttl in this way exposes the underlying system to potentially harmful request pattern load for the duration of synchronization. Imagine a sequence of events like the following

  • t0 – a heavily requested cache entry c1 expires
  • t1 – there is a cache miss for c1 and a request is sent to the service to fulfill
  • t2 – the cache has been repopulated with c2

The time between t1 and t2 is the duration of exposure. During that time all requests for c1 that miss the cache are let through into the distributed environment. The predictability of the target service and all the services it depends on during this time is affected by the the per request load and the qps of all requests that result in a cache miss for c1. Caches could to be updated better than this.

Refresh don’t expire

Refreshing the cache without removing the cache entry solves the problem of exposure that cache expiry brings. In a cache refresh strategy once a value is cached, all requests for the value are served out of the cache and don’t hit the service/services at the back-end. Periodically the cache is synchronized with values from back-end services to keep the data up-to date and consistent with back-end systems. This means for all the values that are cached, the load on the back-end systems is extremely predictable. At the same time the response latencies are highly predictable for these cached values.

Many services/systems would be better served by refreshing the cache rather than expiring it. The efficacy of such a strategy depends on the kind of service in question. For services that have zero tolerance for stale data, best effort refreshing instead of expiring the cache entry doesn’t make sense. However, many services can tolerate stale data to a certain degree. For example, a stock availability service cannot accommodate stale data, while a review and rating service can still have stale data cached for a little while.

There are some popular strategies that are used to implement a refreshing cache.

  1. Global TTL, with a refreshing process: the most common way of implementing a refreshing cache is by running a separate process or thread that periodically refreshes all the entries in the cache. The shortcoming of this strategy is that, it is only appropriate where there is no variability in the staleness of data that is cached. eg: A search engine service’s cache can be refreshed once every 30 minutes if the re-indexing happens only once every 30 minutes.
  2. Fine grained TTL, with a monitoring & refreshing process: In this strategy, a separate process or thread is constantly monitoring the cache entries to see which of them have expired and refreshes them accordingly. This approach gives finer grained control on the cache refresh lifecycle for each cache entry. However, running a separate process means one more component in your environment that needs to be monitored and maintained.

What would be good to have is a cache library with regular caching semantics but one that accommodates refreshing a cache entry rather than expiring it based on ttl. This is exactly what GraceKelly is, it’s inspired by Gooogle Guava’s LoadingCache.

Cache me if you can

GraceKelly is a best effort cache synchronization library that tries it’s best to refresh any cache entry that has expired. The refresh lifecycle is solely request triggered and doesn’t monitor/maintain the cache. This means the refresh is not started when a cache entry expires but rather when the first request for an expired cache entry is made. It is best effort because if synchronization/refresh of a cache entry fails, it can fall back to the stale version of the data already present in the cache.

For every request

  • It looks up the cache and returns the value if a cache entry is present.
  • If the returned cache entry has expired it dispatches a task to refresh the cache entry.
  • If for some reason the refresh fails, it can extend the ttl of the existing entry or do nothing.

Note that a cache entry is never removed(though it can be evicted by size constraints). This enables us to

  • Shield the backend services and systems from exposure to unnecessary request load.
  • Decouple response SLAs from backend degradation and availability concerns, there by allowing for graceful degradation with stale data as fallback.

The Library

GraceKelly the library consists of a single Class Kelly that takes implementations of two different interfaces, a CacheProvider and a CacheLoader. They pass around a generic type CacheEntry.

  • Kelly: This is the core of the library that acts as a proxy to CacheProvider and is responsible for reloading the cache using the CacheLoader.
  • CacheProvider: Interface whose implementation provides the actual caching functionality. eg: a CacheProvider implementation for CouchBase, a CacheProvider wrapper around a ConcurrentHashMap.
  • CacheLoader: Interface whose implementation allows one to reload a CacheEntry based on key and value of the expiring CacheEntry.
  • CacheEntry: Generic type that contains key, value and ttl information.

GraceKelly is open source and is available here with example code and

documentation: GraceKelly on Github

2 thoughts on “GraceKelly: A best effort cache synchronization library for distributed systems”

  1. Very good post. But all so complicated. I am pretty sure that the dev team at flipkart is very talented and skilled but I think lot of things can be taken care of by using good SSD based servers (just my opinion and I am in no way into lot of technical stuff). The latency which happens on the server is mainly due to the bottleneck of HDDs and as a result the db operations slow down. Beyond that the end user’s internet connection speed also comes into play. In my opinion it would be a waste of money if we spend on maintaining predictability of page load to such an extent because we cannot predict the end user’s page loading time.

    Between, the flipkart website is indeed fast for me.

    1. Hi gaurav,

      Although I alluded to predictability of page load time in my post, that is not the primary purpose of the library, just an example.
      We do make use of SSD drives at different places in our infrastructure, but Flipkart is not a simple two tier system with web servers and a database at the backend, which means at each service layer we might potentially have to deal with a db and disk IO (Even if it were SSD).

Comments are closed.