by Vinod V, Vivek Y S & Regunath B.
The Flipkart website has grown exponentially over the last few months, if one were to go by Alexa rankings and our own internal metrics for tracking website traffic. While this is heartening and good for the business, it also places an unstated expectation on us – that of sustained quality of service at large scale.
The Flipkart motto of ‘Service, Selection and Price’, in that order, reflects in multiple things we do. The Flipkart service experience begins at the website and is treated pretty seriously. In the first of a blog series under the title “Making Deliveries Faster”, we’ll talk about the Flipkart Cache Cluster – one of the key infrastructures that helps sustain the Flipkart website experience at scale.
This blog post has three parts. The first talks about the use cases that necessitate a cache and derives criteria for its selection. The second talks about the internals of the chosen solution (membase). The third talks about nuances in our deployment.
Part I – Why cache?
Larry Page once said: “Browsing should be as simple and fast as turning a page in a magazine”. A key metric that defines an online shopping experience is the user-perceived response time. Note here the reference to user perception. Here is a great reference to timing attributes.
Response time for a page request can be broadly broken into the following stages:
- Time spent on the network
- Time spent by the application in preparing data for the page – application security, reading data from persistence, transformation, applying business logic, data preparation for rendering etc.
- Rendering the page on the client browser
As one may see, latencies introduced in any of these stages adversely affect user-perceived response times.
e-Commerce sites are rich in content and the Flipkart pages are no exception. But serving these pages is not straightforward for a variety of reasons:
- Content for a page is personalized to improve relevance, hence needs to be dynamic
- Pages are composed using shared elements. For example, the product details of a fast-moving item may be shown in multiple places – the best-sellers widget, the Search Page and the Shopping Cart page.
- A number of product attributes are subject to change – stock/availability, recent reviews, price.
All product attributes need to be read from the persistent store at runtime. This persistent store needs to provide the lowest possible latencies at scale – for millions of products and across millions of requests. And that’s where caching comes in.
Another use case worth discussing is that of User sessions. Websites use sessions to keep track of a number of things – details of the logged-in Principal and activities performed by the user on the website, including but not limited to the shopping cart. This session data too can grow into millions of entries pretty quickly, especially if the retention period is a few days. Low-latency access to session data is needed for application security and for data preparation when generating a web-page response.
Most websites start off storing and serving these data from their primary persistence store, which is often an RDBMS. This is a very viable choice for reasons of strong data consistency and durability guarantees (ACID properties). However, over time, this infrastructure tends to become the most stressed, difficult to scale (rather, distribute), and gradually degrades into a recurring single point of failure.
The difficulty in scaling almost all distributed data stores is nicely characterized by the CAP theorem. The Product Catalog and User Session use cases described above may be revisited in the context of CAP, and the data system characteristics redefined as follows:
- Highly Available data store where downtimes affect a section of users, at most. Better if this can be further localized to say Geographic location, User profile, Data stored and Web-pages.
- Low Latency, High Throughput reads, i.e. serving millions of requests in tens of milliseconds.
- Guaranteed Writes – preferably to redundant copies, else, Eventual Consistency.
- Application-selectable guarantees on read-consistency with application driven reconciliation across versions if required.
The Flipkart use cases of Product Catalog and User Sessions diverge somewhat in their consistency requirements, especially in user experience when deployments are heavily distributed – say, multiple data centers connected over a WAN.
It should also be fairly obvious that the individual and cumulative data sizes are not very big and, when distributed across machines, can fit into memory. Data stores that leverage primary memory have the lowest latencies – they effectively avoid disk seeks and often disk I/O altogether. A memory-based cache infrastructure was therefore deemed the most viable option, with Availability, Durability and Consistency guarantees becoming additional selection criteria.
Part II – Membase
This part concentrates on the internals of memcache (a high-performing and heavily leveraged caching solution at Flipkart), its shortcomings, and how membase addresses them. It also covers a few architectural patterns used in memcache/membase that may be relevant elsewhere.
Typically memcache servers are organized as a farm of many servers. Each memcache server owns a subset of the key space. Either applications are aware of this key-to-server mapping, or a client library handles it. Either way, keys are sharded and key-value requests are sent to the right server. Here are some key characteristics of memcache.
Constant time operations
- Most memcache operations run in constant time, independent of the number of keys or the size of the keys stored.
Least Recently Used
- The memcache server is started with a pre-determined amount of memory. As long as all of the key-value pairs and their metadata fit in that pre-determined memory, the memcache server goes on allocating memory for the key-value pairs. But once all of the pre-determined memory is used, memcache starts to evict key-value pairs from memory to allocate space for new ones. Memcache uses an LRU policy to evict keys.
- Internally it uses a counter, which is a timestamp of the last access time, to figure out which key to evict.
- A single memcache server can easily saturate a 1Gbps network link and can easily perform 100K operations (gets and puts) per second. One might expect this allocation rate to result in terrible memory fragmentation, but surprisingly this does not happen on memcache servers.
- Memcache achieves this by having its own memory management system, called slab allocation. It is very similar to slab allocation in the Linux kernel. Let me explain.
- When the memcache server is started with the very verbose option, we see something like this:
slab class 1: chunk size 80 perslab 13107
slab class 2: chunk size 100 perslab 10485
slab class 3: chunk size 128 perslab 8192
slab class 4: chunk size 160 perslab 6553
- When the memcache server is started, it partitions all of the allocated memory into pages, each assigned to a slab class. Each page is 1MB in size, which coincides with the maximum size of a key-value pair. Each page is divided into chunks of the same size. There can be multiple pages with the same chunk size, but once a page is assigned to a slab class it cannot be re-assigned. For example, there can be 10 pages of slab class 3, i.e. with a chunk size of 128 bytes.
- The smallest chunk size starts at 80 bytes and grows by a factor of 1.25, rounded up for alignment.
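A small sketch of how these slab classes could be derived. The 4-byte alignment here is an assumption chosen to reproduce the verbose output above, not a value taken from the memcached source:

```python
PAGE_SIZE = 1 << 20  # 1MB page, which is also the max item size

def slab_classes(first_chunk=80, growth=1.25, count=4, align=4):
    """Yield (chunk_size, chunks_per_page) for successive slab classes.

    Chunk sizes grow by `growth` and are rounded up to a multiple of
    `align` bytes; the alignment is an illustrative assumption.
    """
    chunk = first_chunk
    for _ in range(count):
        yield chunk, PAGE_SIZE // chunk
        nxt = int(chunk * growth)
        if nxt % align:
            nxt += align - (nxt % align)
        chunk = nxt

# Reproduces the verbose output: 80/13107, 100/10485, 128/8192, 160/6553
```

Note how the per-slab count is just the 1MB page size divided by the chunk size.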
Let’s take a hypothetical scenario and see how memory allocation happens in memcache.
- Start the memcache server with 100MB of space. It requests 100MB of heap space from the OS and starts to manage it, creating 1 page of each slab class. The remaining memory is reserved without partitioning.
- When a client starts inserting key-value pairs, the server allocates chunks from the slabs created above.
- Once all the chunks in a particular slab class are used, another page is carved out of the unused memory and assigned to that slab class.
- One of the interesting consequences of allocating space this way is its effect on the LRU algorithm. Let’s take another hypothetical scenario to understand this.
- Just like the previous scenario, let’s start the memcache server with 100MB of space. It initializes the memory.
- Say the client inserts only 128-byte key-value pairs until all of the 100MB is used up.
- Then the client inserts a key-value pair of 1MB. It uses the already created page of the slab class with a 1MB chunk size.
- If the client inserts another 1MB key-value pair, memcache evicts the previously inserted 1MB key-value pair – even though many 128-byte key-value pairs were inserted before it.
- From the above scenario it is clear that LRU is applied per slab class.
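The scenario above can be modelled with a toy per-slab-class LRU (a sketch, not memcache’s actual implementation):

```python
from collections import OrderedDict

class SlabLRU:
    """Toy model of memcache's per-slab-class LRU.

    Each slab class keeps its own LRU list, so a new 1MB item can only
    evict other items in the 1MB class - never the 128-byte items.
    """
    def __init__(self, chunks_per_class):
        # chunk_size -> number of chunks available for that class
        self.capacity = dict(chunks_per_class)
        self.lru = {size: OrderedDict() for size in self.capacity}

    def _class_for(self, nbytes):
        # smallest chunk size that fits the item
        return min(s for s in self.capacity if s >= nbytes)

    def set(self, key, nbytes):
        """Store an item; return the evicted key, if any."""
        cls = self._class_for(nbytes)
        q = self.lru[cls]
        if key in q:
            q.move_to_end(key)
            return None
        evicted = None
        if len(q) >= self.capacity[cls]:
            evicted, _ = q.popitem(last=False)  # evict LRU of this class only
        q[key] = nbytes
        return evicted

# The scenario above: the second 1MB item evicts the first 1MB item,
# leaving every 128-byte item untouched.
cache = SlabLRU({128: 4, 1 << 20: 1})
for i in range(4):
    cache.set(f"small:{i}", 100)
cache.set("big:1", 1 << 20)
assert cache.set("big:2", 1 << 20) == "big:1"
```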
Some of the major shortcomings of memcache for our use cases:
- Horizontal scaling – whenever a new machine is added, key ownership changes unless applications or client libraries use consistent hashing.
- Availability – when a memcache server crashes or becomes unreachable due to a network partition, the cache is unavailable. One way to solve this is replication, but replication introduces another challenge: consistency.
- Persistence – this is not a design goal for memcache.
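To see the horizontal-scaling problem concretely, here is a sketch (with hypothetical server names) of the naive modulo sharding a simple client might use, and the key churn it causes when a server is added:

```python
import hashlib

def server_for(key, servers):
    """Naive modulo sharding, as a simple memcache client might do it."""
    digest = hashlib.md5(key.encode()).digest()
    return servers[int.from_bytes(digest[:4], "big") % len(servers)]

keys = [f"product:{i}" for i in range(1000)]
before = {k: server_for(k, ["s1", "s2", "s3"]) for k in keys}
after = {k: server_for(k, ["s1", "s2", "s3", "s4"]) for k in keys}

# With modulo sharding, adding one server remaps roughly 3 of every 4
# keys; consistent hashing (or vBuckets, below) keeps this churn small.
moved = sum(1 for k in keys if before[k] != after[k])
```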
Membase took a different approach, solving all three of the above problems with a single mechanism (called the vBucket). Before describing the internals of vBuckets, let’s start with some of the design goals membase aimed to achieve:
- Never service a request on the wrong server.
- Allow scaling up and down at will.
- Servers refuse commands that they should not service, but
- Servers still do not know about each other.
- We can hand data sets from one server to another atomically, but
- There are no temporal constraints.
- Consistency is guaranteed.
- Absolutely no network overhead is introduced in the normal case.
To achieve the above design goals, Membase has a cluster management server called ns_server (NorthScale server), written in Erlang.
Now, back to vBuckets. A vBucket is conceptually a computed subset of all possible keys.
One could think of it as 2-level hashing, with the first level computed dynamically and the second level mapped statically. The number of vBuckets in the cluster remains constant and is independent of the cluster topology. This means a given key x always maps to the same vBucket, as long as the same hash function is used.
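A minimal sketch of this 2-level scheme (the vBucket count and server names are illustrative):

```python
import hashlib

NUM_VBUCKETS = 6  # constant for the life of the cluster; real deployments use 1024-4096

# Second level: vBucket -> server table, maintained by the cluster manager.
# Rebalance updates entries in this table; key -> vBucket never changes.
vbucket_map = ["server-a"] * NUM_VBUCKETS

def vbucket_of(key: str) -> int:
    """First level: computed hash, independent of cluster topology."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_VBUCKETS

def server_for(key: str) -> str:
    """Second level: static table lookup."""
    return vbucket_map[vbucket_of(key)]
```

Moving a vBucket to another server changes one table entry; no key ever hashes to a different vBucket.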
This is a nice architectural pattern to achieve the effects of consistent hashing, without implementing a consistent hashing algorithm.
There are a few terms that need to be defined before the vBucket implementation can be explained.
Cluster : A collection of membase servers.
Server : An individual machine within the cluster.
vBucket : A subset of all possible keys.
Also a vBucket can be in any one of the following states:
Active : This server is servicing all the requests for this vBucket.
Dead : This server is not responsible for this vBucket.
Replica : Receives only replication commands but does not serve any clients.
Pending : This server will block all requests for this vBucket.
Let us look at the vBucket state transitions and how horizontal scaling is achieved by this.
Initially the membase cluster contains a single server. Typically the number of vBuckets in a deployment varies between 1024 and 4096; the maximum number of vBuckets is 65536. For the sake of simplicity let us take 6 vBuckets.
Now let’s add another server, so there is one active server and one new server.
Adding a new server does not, by itself, disturb the existing key mapping. All the vBuckets on the new server come up in the dead state.
In order to make the new server usable, the vBucket map has to be rebalanced. The rebalance process transfers vBuckets from the old server to the new server; this is done by an asynchronous thread. The rebalance process selects a subset of vBuckets that the new server should serve and sets them to the pending state. Then data is sent from the old server and placed on the new server. Once the transfer is complete, the state of the vBucket is set to active on the new server and dead on the old server.
The operations are performed in this exact order:
- The vBucket on the new server is placed in pending state.
- A vBucket extract tap stream is started.
- The vBucket tap stream atomically sets the state to dead when the queue is sufficiently drained.
- The new server only transitions from pending to active after it receives confirmation that the old server is no longer servicing requests.
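The ordering above can be sketched as state transitions (the tap-stream data transfer itself is elided):

```python
# States and step ordering from the rebalance description; this is a
# sketch of the protocol, not membase code.
ACTIVE, DEAD, REPLICA, PENDING = "active", "dead", "replica", "pending"

def transfer_vbucket(old_server, new_server, vb):
    """Move one vBucket; at no point are two servers active for it."""
    # 1. New server blocks requests for this vBucket.
    new_server[vb] = PENDING
    # 2. A tap stream copies the vBucket's data across (elided)...
    # 3. ...and atomically marks the old owner dead once drained.
    old_server[vb] = DEAD
    # 4. Only after confirming the old owner stopped serving does
    #    the new server go active.
    new_server[vb] = ACTIVE

old, new = {0: ACTIVE}, {0: DEAD}
transfer_vbucket(old, new, 0)
assert (old[0], new[0]) == (DEAD, ACTIVE)
```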
By performing the operations in this exact order, one can guarantee that no more than one server is active for any given vBucket at any given point in time, without any regard to actual chronology. (This is a major difference between data stores like Riak and Membase: Riak uses vector clocks to achieve a similar result.)
You may notice that there is a very short time period where a vBucket has no active server at all. This occurs at the very end of the transfer and causes requests to block briefly. Clients rarely notice this.
High availability in membase is achieved by replication. When replication is enabled and a new server is added, some vBuckets are put into the replica state. A replica vBucket looks like a dead vBucket from the client’s perspective: it does not serve clients, but it listens to replication commands.
If a server crashes, the static vBucket-to-server map is changed. The vBuckets owned by the crashed server are activated on other servers in the cluster: vBuckets in the replica state are changed to the active state so that clients can be served. This change happens so quickly that controlled failovers never result in client request failures.
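A sketch of that failover promotion, assuming a simple map of vBucket states (server names are hypothetical):

```python
ACTIVE, DEAD, REPLICA = "active", "dead", "replica"

def fail_over(vbucket_map, states, crashed):
    """Promote replicas of the crashed server's vBuckets (a sketch).

    vbucket_map: vb -> owning server; states: (server, vb) -> state.
    For each orphaned vBucket, whichever server holds a replica takes over.
    """
    for vb, owner in vbucket_map.items():
        if owner != crashed:
            continue
        replica = next(s for (s, v), st in states.items()
                       if v == vb and st == REPLICA)
        states[(replica, vb)] = ACTIVE   # replica starts serving clients
        states[(crashed, vb)] = DEAD
        vbucket_map[vb] = replica        # static map updated for clients

vb_map = {0: "s1", 1: "s2"}
st = {("s1", 0): ACTIVE, ("s2", 0): REPLICA,
      ("s2", 1): ACTIVE, ("s1", 1): REPLICA}
fail_over(vb_map, st, "s1")
assert vb_map[0] == "s2" and st[("s2", 0)] == ACTIVE
```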
Finally, it is clear that vBuckets solve two of the three shortcomings of memcache: availability and horizontal scaling.
Persistence in membase is an extension to memcache. Memcache has a module called the engine interface, using which a storage engine can be implemented. The complete RFC can be found here.
One of the best parts of the engine interface is that it is completely asynchronous, and memcache can be started with any storage engine that implements the interface. Currently Membase uses SQLite as the storage engine. With Couchbase 2.0 this will be replaced with CouchDB’s storage engine.
Part III – Nuances of Membase use at Flipkart
The Flipkart systems differ slightly in their usage of the Membase cache cluster. The difference is primarily around detecting and explicitly handling data consistency between the sources of data, the values in the cache, and the data as seen by clients. One such instance (the Product Catalog service) is highlighted below:
- Use the Flipkart cache wrapper library (Prometheus) to perform all cache operations. Prometheus was primarily written to support connection pooling.
- Use increment/decrement methods to implement atomic counters, which allows different backend servers to coordinate via versioning.
- Use add (put-if-absent) and cas (compare-and-set) operations, which allow list and set semantics to be shared across multiple clients.
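To illustrate the add/cas pattern, here is a minimal in-memory model of those semantics (this is not the Prometheus API, which is not shown in this post) implementing a set shared across clients:

```python
class CasStore:
    """In-memory model of memcache-style add/gets/cas semantics
    (illustrative only; not the Prometheus wrapper's real interface)."""
    def __init__(self):
        self._data = {}   # key -> (version, value)

    def add(self, key, value):
        """Put-if-absent: succeeds only if the key is new."""
        if key in self._data:
            return False
        self._data[key] = (0, value)
        return True

    def gets(self, key):
        """Return (cas_token, value) for a key."""
        return self._data.get(key, (None, None))

    def cas(self, key, token, value):
        """Compare-and-set: write only if nobody wrote since `gets`."""
        version, _ = self._data.get(key, (None, None))
        if version != token:
            return False
        self._data[key] = (version + 1, value)
        return True

def add_to_shared_set(store, key, member):
    """Retry loop every client runs; a lost race simply retries."""
    while True:
        if store.add(key, {member}):          # first writer creates the set
            return
        token, current = store.gets(key)
        if store.cas(key, token, current | {member}):
            return

store = CasStore()
add_to_shared_set(store, "wishlist:42", "book")
add_to_shared_set(store, "wishlist:42", "phone")
assert store.gets("wishlist:42")[1] == {"book", "phone"}
```

The cas token lets concurrent clients update the same set without a lock: a stale write fails and the client re-reads and retries.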
Shifting gears: one of the interesting behaviours observed on our production cluster was cache misses on Membase buckets when the bucket in question was using only about 70% of its allocated memory. The problem was that there was no free memory at the OS level; the OS was using it to cache I/O objects.
To circumvent this problem, a simple solution was adopted: reduce the swappiness of the Linux virtual memory system and regularly flush the OS cache.