Choosing a Datastore For The Flipkart User Engagement Platform

A User Engagement Platform

Tersely defined, a “User Engagement Platform” is a platform that solicits, accepts and displays user content and actions. These collections of content or actions are usually associated with both user and non user entities. Such a platform enables engagement paradigms like reviews, user lists, ratings, votes, comments, karma… etc. It’s easy to imagine such a system being used by millions of users, who are constantly submitting and accessing varied and rich content. This access pattern implies near real time updation of stats, ratings, votes, likes, plays, views… etc at scale. This usage creates loads of data, an abundance of which is volatile and ephemeral in nature.

In short, imagine a customer engagement platform for Flipkart, with about a million unique visitors per day, most of whom are actively engaging with the site and the community. We envision our user engagement platform handling such a load pattern with ease and ready to scale. One of the most critical, operational components in such a system is the data persistence layer. Ideally, this persistence layer should be highly performant, horizontally scalable with minimal overhead, consistent, always available and partition tolerant. But…..

CAP: Or why you can’t have your cake and eat it too

The CAP theorem, or Brewers theorem of distributed computing systems, states that it is impossible for a distributed system to simultaneously provide all three guarantees of

  1. Consistency: All nodes see the same data at the same time
  2. Availability: a guarantee that every request receives a response about whether it was successful or failed
  3. Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.

The implications of the CAP theorem are crucial for selecting the persistence layer of an application or service. One has to carefully analyze their respective app/service requirements and pick the appropriate guarantees from CAP that their prospective persistence layer will support.

Data access patterns for User engagement services

Historically, the data access pattern for user generated content was low on writes and heavy on reads. However, modern social engagement patterns around user content have modified this pattern to being write heavy as well, to accommodate more likes, more ratings, more upvotes, more comments, more shares. So, today’s and tomorrow’s engagement services should accommodate, heavy write loads, heavy read loads, heavy aggregate(counter), modify and read loads. What becomes apparent if we look at user engagement services in this way is that aggregation needs to be a first class function of engagement services that is near real time, scalable and highly available.

Eventual consistency and User experience

At the same time we can also note that most of today’s engagement heavy applications tradeoff consistency for eventual consistency to achieve better scalability through horizontal partitioning and availability. The extent to which a congruent user experience can be pulled off with eventual consistency differs greatly. Reddit’s upvote user experience is a good example of using eventual consistency without it adversely affecting how the user perceives consistency on the platform.

Youtube’s “301 Views” for a video with “20000 likes” falls at the other end of the spectrum of good user experience using eventual consistency. So, with careful application and service design, effective tradeoffs on data consistency can be made without affecting the user experience. By doing this we immediately free ourselves from the “C” constraint of CAP, which leaves us free to explore the “Availability and Partition tolerance” guarantees which are very much desired in this context. The following section gives a brief example of the kinds of use cases that our engagement platform should support and what they imply for the persistence layer.

A playlist of the people

Imagine a community created playlist on Flipkart’s Flyte. This is a playlist where people add songs to a global playlist and then upvote or downvote songs added by other users. The users should have an always on experience and neither their submissions nor votes should be missed. The implication here is that there shouldn’t be a single point of failure in the write path. Hundreds/thousands of users could be simultaneously upvoting/downvoting the same song so locking of counters should be avoided and Distributed Counters should be preferred. Not every user can see the same view of the data as the nature of the data is very transient to begin with, so eventual consistency should do fine. Given the massive amount of user engagement, the sort order of the playlist is going to change very often so one should avoid query time sorting and prefer data that is natively sorted. Adequate clarity around such usage scenarios enabled us to confidently transition from requirements assessment to technology selection.

Technology Selection: What we want from a persistence layer

Let us consider the attributes of an appropriate persistence layer for engagement services.

  • Availability and Partition tolerance, with tunable consistency: as discussed above we should be able to trade off on consistency to accommodate highly available and partition tolerant engagement services.
  • Linear Scalability: In the face of massive amounts of content being created by  users the system should be able to scale without degrading performance.
  • Operational Efficiency: The operational requirements of the highly available, distributed, partition tolerant persistence layer should be minimal.
  • Community Support: There should be a thriving community of users and developers that is both effective and helpful.
  • Parsable Code Base: The code base should be gorkable both in size and complexity, with either good documentation or help from the community.
  • Professional Support: It is preferable to have companies that are providing professional support for the platform.

Though this isn’t an exhaustive list, it’s a good starting point to explore different alternatives. However, there are also functional requirements of the persistence layer that must be considered.

  • Aggregators are a first class concern: aggregator modification is potentially the most heavy write load element of an engagement service. So the persistence layer should support highly performant aggregator modification over a distributed infrastructure.
  • Sorting is a first class concern: Most user generated content is going to be in a sorted form, ranging from most recent comments(like news feed) to sort by helpfulness. Sorting large amounts of data should be handled in a highly efficient manner.
  • Multiple pivot points for data elements: each complex data entity should be accessible through its attributes through a reverse index or filtering.
  • Offline stats with map reduce: the persistence layer should support map reduce on data natively or should be able to easily be plugged into a map reduce framework like hadoop.
  • Search integration: text search should either be native or easily pluggable into the persistence layer.
  • Selectable/Updatable individual attributes: Attributes of a data entity should be individually selectable and updatable.
  • Schema Less: The data model should be flexible and should not impose any constraints on what and how much data is stored. A schema less data store like columnar or key-value data stores provide great data model flexibility.
  • Native support for ephemeral data: engagement services are going to generate lots of data of an ephemeral nature, i.e data that is important/valid only for a short period of time. Ephemeral data should not clog up the system.
  • Replication should be first class concern: replication, replication awareness should be deeply integrated into the design and implementation of the database.

Cassandra

Considering all the above factors we ruled out traditional RDBMSs and ventured into the wild west of databases aka “NoSQL”. After evaluating and eliminating a bunch databases (Document stores, key value stores, Riak – attributes not selectable/updatable, HBASE – catered to consistency and partitioning….) we arrived at Cassandra. Cassandra purportedly supports many of the above mentioned functional and non-functional requirements.

The Good

  • Online load balancing and cluster growth: Cassandra is designed from the ground up to be deployed on a cluster. Growing and shrinking clusters is transparent and can deal gracefully with node loss in a cluster.
  • Flexible Schema: Cassandra is a column oriented database and there is inherent flexibility in the schema for the data.
  • Key Oriented Queries: All queries are key,value oriented making the database highly performant if appropriately configured.
  • CAP aware: Cassandra is CAP aware, and allows one to make tradeoffs between consistency and latency (Consistency and Partition tolerance). Consistency can be configured for different levels. One can also tune consistency at a per query level.
  • No SPF: No single point of failure. Cassandra can be configured to be incredibly resilient in the face of massive node loss in the cluster. This is because replication is a first class function of Cassandra.
  • DC and rackaware: Cassandra was built to be datacenter and rack aware and can be configured appropriately to use different strategies to achieve either better DR(disaster recovery) resiliency or minimal latency.
  • Row and Key Caching: row level and key level caching is baked into Cassandra. This makes Cassandra a viable persistent cache.

The Bad

  • Limited Adhoc Querying Capability: due to the absence of a query language as comprehensive as SQL, adhoc querying are severely limited on a Cassandra cluster.
  • Tight coupling of data model: the way in which Cassandra data models are created are heavily dependent on their access patterns from the application layer. In contrast RDBMS systems model data based on entities and their relationships. Hence, the Cassandra data model is tightly coupled with application access patterns. This means any app level features, changes will impact the data model to a large degree.
  • No Codebase stability: Cassandra is still rapidly evolving and the codebase and feature set change constantly.
  • Bad documentation: the documentation is very sparse and outdated/defunct due to the rapid pace of change.
  • Lack of transactions: Cassandra does not have a mechanism for rolling back writes. Since it values AP out of CAP there is no built in transaction support.

The Ugly

  • Steep learning curve: the steep learning curve combined with having to sift through the internet to find appropriate documentation makes Cassandra very hard to get a handle on.
  • No Operational experience: Zero or low operational experience across the organization with Cassandra when compared with RDBMS systems.
  • Here be dragons: One of the scary prospects of going the Cassandra way is fear of the unknown. If something goes wrong, how will we be able to debug, do an RCA in a timely manner without any tooling/experience and fix the issue.

Appendix I

Cassandra Riak HBase
CAP AP AP CP
Language Java Erlang, C, Javascript Java
Type Column oriented Key-Value Column Oriented
Protocol Thrift / custom REST/Custom REST/Thrift
Tuneable tradeoffs
for distribution and
replication
Yes (N,R,W) Yes (N,R,W) No
Map/Reduce Through Hadoop Built In Through Hadoop
Distributed Counters Yes No Yes
Built in Sorting Yes No (map reduce) Yes

Opt-In to save your card on flipkart.com

Enabling the ‘Save Card’ feature on Flipkart.com was a big decision. It involved quite a few process changes and addition of new processes. In our roll-out plan for this feature, we also had A-B testing to figure out if opt-in was better for uptake and ease of use, as compared to opt-out.

To this end, the feature changed from the opt-in model (with which it was launched) to opt-out. This has now changed back to the opt-in model, since the test schedule is over.

Saving cards with trusted merchants has proved to be a better overall customer experience for frequent shoppers online and many sites do this today – including Amazon.com, Apple Store among others. PCI-DSS certified merchants can save customer payment information with them.

We are confident that all cards saved on flipkart.com are safe due to our strict adherence to PCI-DSS and we are duly certified for the same (http://sisainfosec.com/site/certificate/71858558845676366851). But, going by some customer feedback, we believe we should have worded our communication and messaging around the opt-out model during the A-B testing phase.

Seeking customer consent explicitly is indeed the right thing to do in the opt-out phase. Considering that a few customers felt that our opt-out phase was not done in the right spirit, we owe them an apology for the way we rolled it out. Customers who had their card saved without their explicit consent can delete the cards. Cards are indeed deleted from our system when users delete them.

For customers who desire to take advantage of a faster checkout experience on flipkart.com, we have added explicit content messaging around it. Feel free to check the relevant box to save your card details with us (again duly encrypted and tokenized) – you have our word that your card details are safe and we have spent a lot of time and effort to make this process fully secure and useful for our customers. We also plan to send out a confirmation email to notify customers who save their cards shortly.

To know more about how FlipKart’s certification of Payment Card Industry Data Security Standard 2.0 helps protect your data, click on the link below:
http://www.thehindubusinessline.com/industry-and-economy/info-tech/flipkart-gets-security-certification-on-payment-solution/article4071223.ece

Note: All product and company names herein may be trademarks of their respective owners

Making of the Saved Card feature

by Abhishek Rajan, Sr. Product Manager, Payments

A click on time saves nine. It may actually save more than nine (around twenty seconds on an average), if you use the new saved card option on Flipkart.com. This post will describe some of the product/UX challenges we faced while implementing this, and the thought process behind some of our  decisions.

One of the early mock-ups of the checkout interface

To click or not to click

The adoption of the saved card feature depended to a large extent on just one single checkbox option – “Save this card for future payments”. So it was only natural for us to give it its fair share of time and ensure that we got it right.

There were primarily two user concerns that we wanted to address when presenting this option:

  1. What’s in it for me?
  2. Why should I trust Flipkart.com with my card details?

The guiding principle was to keep the communication short & crisp, which made it challenging to comprehensively address both the concerns in one shot.

We decided to address the first concern by explaining the user benefit through the display text of the checkbox.  We addressed the second concern by using a “Learn more” mouse over tooltip. While both the concerns are important for an e-commerce user, Flipkart.com has earned a certain level of trust among its customers and hence the first concern deserved more prominence.

The next logical step was to finalize the exact text to be displayed for this checkbox. We are currently running various A/B experiments on the text to arrive at the most optimum combination.

The Mask

When displaying a card number on a website, masking is typically done to ensure that the card number is not fully visible to the users. For the saved cards, we initially planned to display the first 6 and last 4 digits of the card number, keeping the remaining digits masked. We came up with different masking options:

5566 20** **** 1234
5566 20## #### 1234
5566 20XX XXXX 1234
5566-20XX-XXXX-1234
5566 20xx xxxx 1234

We settled for the last option because it looked neater and effectively played down (literally) the masked digits. When we released the saved card feature to an internal audience, one common feedback was that we were leaving too many digits unmasked, especially for Amex cards where we were revealing 10 out of the 15 digits.

One could argue that with 6 masked digits for a typical 16 digit VISA/Master card number, the chances of correctly guessing the card number are 1 in a million, but most users typically aren’t statisticians of sorts to appreciate the laws of probability. The user feedback was thus incorporated and the version that finally went live displays only the first 2 and last 4 digits of the card number.

Cards have a shelf life too

Almost all cards have an expiry date (barring a few Maestro cards). Imagine if your user finds out at the time of checkout on your website that his card has expired, and cannot be used to make the payment.  This can be very frustrating, and defeats our goal of reducing friction.

So what do you do if your customer’s credit card is nearing expiry?

If we were the card issuing bank, we would courier a new card to the customer. Since we can’t do that, the next best thing that we can do for the customer is to remind him that it’s time he got a new card.

If a card is nearing its expiry date then the message “Card is expiring” gets flashed near the card number. Try adding a card through “My Account > My Saved Cards” on Flipkart.com and specify the expiry date as current month to see this in action.

The next question that came to mind was – what to do with the saved card once it expires?

Initially we thought we’ll keep displaying the expired card till the user gets sick of it and finally removes it on his own. But why create more noise for our wonderful users? So we decided to display the expired card only for a month, after which the card will be removed automatically. And all this while the card will not be selectable and the card logo will be displayed in gray scale to make it appear unusable.

One of the suggestions received from our internal users was that we should send an email alert to our customers as and when their credit card expires. While the intention looked noble, we felt it would be too intrusive and most online users may not particularly appreciate the intent.

Expired Card Mockup
Expired Card

My Corporate Card

We realized that power users would like to save and use multiple cards while shopping online. However, it may not be intuitive enough for them to distinguish between their cards by just looking at the first 2 and last 4 digits. If the user is required to take out the card from his wallet to match the last 4 digits and accordingly select one of the saved cards for payment, then we would feel that we haven’t done our job well. So we added the card label, which could be used to give a unique personalized name to every saved card. E.g. My Corporate Card, My Shopping Card, etc.

However, this option is available only in My Account section. We deliberately removed this option from the Checkout flow to keep the number of input fields on the card payment page to the minimum.

Remove this card

Typically, sites that offer the saved card feature do not provide the option to delete the saved card during the checkout process. However, from day one, we were clear that we wanted to offer the option to enable users to delete their saved cards even on the checkout page.

From my previous experiences, I had learnt that many users select the card save option (like the mandatory T&C checkbox) without realizing that it will cause their card to be saved on that website. When such users return for a repeat purchase, they get surprised to see their card appearing as saved and frantically look for an option to delete it. If they don’t find the delete option, they will end up contacting customer support. End result, bad customer experience and additional operations overhead. Hence the explicit “Remove this card” self-care option on checkout.

There’s another interesting fact about the “Remove this card” option. Most users who click on it would expect to receive a dialog box for a final deletion confirmation. We felt this was a redundant step/click that should be avoided.

E=MI2

If the user has a saved credit card, his saved card (irrespective of the card’s bank) appears on selecting the Credit Card payment option on the checkout page. This behavior had to be modified for the Credit Card EMI payment option. Reason being, the EMI option for a given bank applies to only the credit cards issued by that same bank.

In other words, we shouldn’t display a saved ICICI Bank credit card if the user selects HDFC Bank EMI option. We had to therefore ensure that a saved ICICI Bank credit card is displayed only if the user selects ICICI Bank’s EMI option and is not displayed for any other Bank’s EMI option. This required us to identify the issuing bank name of the credit card, without explicitly asking the user for this information. This was challenging, though not impossible. Wondering how we solved this? Read on.

Card BIN Laden

Did you ever notice that all VISA credit/debit card numbers necessarily begin with a “4” while MasterCard begin with “5”?

The first 6 digits of a credit/debit card number are referred to as the BIN or the Bank Identification Number. This is a magical number that can reveal almost everything about the card – whether it’s a VISA or MasterCard, Credit or Debit card, Platinum or Gold card, Indian or US and also the Bank that has issued this card.

Unfortunately, we didn’t come across any single authentic source of BINs that was both comprehensive and accurate. There are several paid BIN databases available online, but none had the acceptable level of accuracy that we were aiming for. So we decided to compile our own list of BINs by collating information from multiple sources including the issuing banks and online BIN dbs.

One of our sharp dev team members came up with this really cool idea (please don’t try this at home!). We generated dummy card numbers for the BINs for which we didn’t have any details. Next we attempted a card transaction using these dummy numbers. The payment gateway redirected us to the 3dsecure page. Voila, the 3dsecure page contains the bank name of the card!

Another interesting trick to confirm whether a card is credit or debit, takes advantage of RBI’s recent mandate on reduced processing fee on debit card transactions. As a merchant, we receive a daily settlement report from the processing banks for each card transaction on our site. The report contains the processing fee for each transaction. Transactions with the lower fee would be debit card transactions.

Default card label

The BIN list compilation effort helped us introduce another favourite feature of ours – the default card label.

As mentioned earlier, when the user is saving a card during checkout, there is no input field for specifying the card label. We expected 80-90% of our users to save their cards during checkout process. For the benefit of such users we wanted to specify a default card label.

We looked at two options for the card label:

  1. Card holder’s name
  2. Bank name

We realized that most of the time, users will save their own cards. In such a case, using the card holders name as the card label wouldn’t help in differentiating between different card.

On the other hand, users are likely to have cards from different banks.  We decided to use the bank name suffixed with the card type (Credit/Debit) as the default label.  This provides a reasonable default for most users.  And for duplicates, users can change the label later.

Guess what!

When displaying the saved cards on the checkout page, if a user had saved multiple cards, we had two options:

  1. Display all the cards unselected and let the user choose the card
  2. Make a smart guess for the card that the user is most likely to use and show this card as pre-selected

The 2nd option had the advantage of one less click (~ more convenience for user). And even if our guess failed, worst case scenario, the user will have to click on some other saved card. That’s no worse than option 1. So we intuitively went for the 2nd option.

Now we had to figure out an algorithm to predict the card that the user was most likely to use. We decided to build a frequency counter that would track the number of times each saved card had been used to make a payment. The most frequently used card should be the best guess.

It seemed logical till we were confronted with a use case wherein a user switches from an old frequently used card to a brand new card. An extremely probable scenario for any user. Imagine if the user’s old saved card had a frequency count of 15. He will now have to use his new card at least 16 times before it gets picked by our so called “smart” guess algorithm. We thought why not refine our frequency logic and take into account only the last ‘x’ transactions instead of all transactions, to determine the most frequently used card. Now again, depending upon the value of ‘x’, the guess may work for some and may not work for others.

Sometimes, we unnecessarily complicate a problem that may have a rather simple answer. In this case the answer seemed to be x=1. Why not just look at the last used card? That would work for most of the use cases, except if the user keeps shuffling between his saved cards, a use case that we decided to keep outside the MVP (minimum viable product).

The Launch

The Saved Card feature was finally launched on 29th October. The team created a teaser video to announce the feature launch within Flipkart.com. The response so far has been very overwhelming..

We hope that the convenience offered by the saved card feature will motivate many of our net banking and cash on delivery users to give their credit/debit cards a try.

Making Deliveries Faster – the Flipkart cache cluster

by Vinod V, Vivek Y S & Regunath B.

Flipkart website has grown exponentially over the last few months, if one were to go by Alexa rankings and our own internal metrics for tracking web-site traffic. While this is heartening and good for the business, it also places an unstated expectation – that of sustained quality of service at large scale.

The Flipkart motto of ‘Service, Selection and Price’, in that order, reflects in multiple things we do. The Flipkart service experience begins at the website and is treated pretty seriously. In the first of a blog series under the title “Making Deliveries Faster”, we’ll talk about the Flipkart Cache Cluster – one of the key infrastructures that helps sustain the Flipkart website experience at scale.

This blog post has three parts.  The first talks about use-cases that necessitate a cache and derive criteria for its selection.  The second talks about internals of the chosen solution (membase).  The third talks about nuances in our deployment.

Part I – Why cache?

Larry Page once said : “Browsing should be as simple and fast as turning a page in a magazine”.  A key metric that defines an on-line shopping experience is the user perceived response time. Note here the reference to user perception. Here is a great reference to timing attributes.

Response time for a page request can be broadly broken into the following stages:

  1. Time spent on the network
  2. Time spent by the application in preparing data for the page – application security, reading data from persistence, transformation, applying business logic, data preparation for rendering etc.
  3. Rendering the page on the client browser

As one may see, latencies introduced in any of these stages adversely affect user-perceived response times.

e-Commerce sites are rich in content and the Flipkart pages are no exception. But serving these pages is not straightforward for a variety of reasons:

  • Content for a page is personalized to improve relevance, hence needs to be dynamic
  • Pages are composed using shared elements.  For example, the product details of a fast-moving item may be shown in multiple places – the best-sellers widget, the Search Page and the Shopping Cart page.
  • A number of product attributes are subject to change – stock/availability, recent reviews, price.

All product attributes need to be read from persistent store at runtime. This persistent store needs to provide lowest possible latencies at scale – for millions of products and across millions of requests.  And that’s where caching comes in.

Another use case worth discussing is that of User sessions. Websites use sessions to keep track of a number of things – details of the logged in Principal and activities performed by the user on the website, including but not limited to the shopping cart. This session data too can grow into millions pretty quickly, especially if the retention period is a few days. Low latency access to session data is a need in application security and data preparation by the application for generating a web-page response.

Most websites start off storing and serving these data from their primary persistence stores, which is often a RDBMS database. This is a very viable choice for reasons around strong data consistency and durability guarantees (ACID properties). However, over time, this infrastructure tends to become the most stressed, difficult to scale (rather, distribute) and gradually degrade into an often occurring single point of failure.

The difficulty in scaling almost all distributed data stores is nicely characterized by the CAP theorem. The Product Catalog and User Session use cases described above may be re-looked in the context of CAP and the data system characteristics re-defined as follows:

  • Highly Available data store where downtimes affect a section of users, at most. Better if this can be further localized to say Geographic location, User profile, Data stored and Web-pages.
  • Low Latency, High Throughput reads i.e. Serve millions of requests in tens of milli-seconds.
  • Guaranteed Writes – preferably to redundant copies, else, Eventual Consistency.
  • Application-selectable guarantees on read-consistency with application driven reconciliation across versions if required.

The Flipkart use cases of Product Catalog and User Sessions diverge somewhat in the requirements around consistency especially in user experience when deployments are heavily distributed- multiple data centers connected by WAN say.

It should also be fairly obvious that individual and cumulative data sizes are not very big, that when distributed across machines, can fit into memory. Data stores that leverage primary memory have least latencies –  as they effectively avoid disk seeks and often disk I/O. A memory-based cache infrastructure therefore was deemed as the most viable option with Availability, Durability and Consistency guarantees becoming additional selection criteria.

Part II – Membase

This part concentrates on the internals of memcache (a high-performing and very leveraged caching solution at Flipkart), its short comings and how membase solves it. Also covers a few architectural patterns that are used in memcache/membase that may be relevant in other places.

Typically memcache servers are organized as farm of many servers. Each memcache server is owner of a sub set of the key space. Now either applications are aware of this ownership-server mapping or a client library can achieve the same. Either way keys are sharded and key value requests are sent to the right server.  Here are some key characteristics of memcache.

Constant time operations

  • Most of the memcache operations are of constant time. They are independent of the number of keys or the size of they keys that stored.

Least Recently Used

  • Memcache server is started with a pre-determined amount of space. As long as all of the key/values and its metadata fits in that pre-determined memory, memcache server goes on allocating memory for the key value pairs. But once all of the pre-determined memory is used, memcahe starts to evict the key values from the memory to allocate space for new key values pairs. The memcache uses LRU policy to evict the keys.
  • Internally it uses a counter, which is a timestamp of last access time, to figure out the key to evict.

Memory allocation

  • A single memcache server can easily saturate a 1Gbps network link. It can easily perform 100K operations (gets and puts) per second. Ideally this should result in a terrible memory fragmentation. But surprisingly this does not happen on memcache servers.
  • Memcache is able to achieve this by having its own memory management system. It is called slab allocation. It is very similar to slab allocation in linux kernel. Let me explain.
  • When memcache server is started with very verbose option we see something like this
slab class 1: chunk size 80 per slab 13107
slab class 2: chunk size 100 per slab 10485
slab class 3: chunk size 128 per slab 8192
slab class 4: chunk size 160 perslab 6553
  • When memcache server is started, it partitions all of the allocated memory in to pages assigned to a slab class. Each page is of 1MB size, which coincides with the max size of a key value pair. Each page is divided into chunks of same size. There can be multiple pages with the same chunk size. But once a page is assigned to a slab class it cannot be re-assigned. For example there can be 10 pages with slab class 3. ie with chunk size 128 bytes.
  • The smallest chunk size starts from 80 bytes and increases with a factor of 1.25 rounded to the next power of 2.

Lets take a hypothetical scenario and see how memory allocation happens in memcache.

  1. Start the memcache server with 100MB of space. It  requests 100MB of heap space from the OS and starts to manage it. It starts creating 1 page of each slab class. The remaining memory is reserved with out partitioning.
  2. When client starts inserting key value pairs to the server, it uses the above created slabs to allocate.
  3. Once all the chunks in a particular slab are used, another page is created from the unused memory and assigned to that particular slab class.
  4. One of the interesting consequence of allocating space like this is on the working of LRU algorithm. Lets take another hypothetical scenario to understand this.
  5. Just like previous scenario, lets start the memcache server with 100MB space. It initializes the memory.
  6. Say the client starts to insert only 128 bytes key value pairs until all of 100MB is used up.
  7. Then client inserts a key value pair with 1MB. It uses the already created page of the slab class with 1MB chunk size.
  8. If the client inserts another 1MB key value pair, then memcache evicts the previously inserted 1MB key value pair. Even though there are many 128 bytes key value pair that were inserted before 1MB key value pair.
  9. From the above scenario it is clear that LRU is applied per slab class

Some of the major short comings of memcache for our use cases:

  • Horizontal scaling – Whenever a new machine needs to be added for horizontal scaling, key ownership changes unless applications or client libraries are using consistent hashing.
  • Availability – When a memcache server crashes or not available due to n/w partition, cache is not available. A way to solve this problem is by replication. But when replication is introduced another challenge that arises is consistency.
  • Persistence – This is not a design goal for memcache.

Membase took a different approach to solve all of the above 3 problems with single solution (called vBucket). Before describing the internals of vBucket, lets start with some of the design goals membase aimed to achieve:

  • Never service a request on the wrong server.
  • Allow scaling up and down at will.
  • Servers refuse commands that they should not service, but
  • Servers still do not know about each other.
  • We can hand data sets from one server to another atomically, but
  • There are no temporal constraints.
  • Consistency is guaranteed.
  • Absolutely no network overhead is introduced in the normal case.

To achieve above design goals, Membase has cluster management server called ns_server (north scale server) written in Erlang.

Now, back to vBuckets.  A vBucket is conceptually a computed subset of all possible keys.


One could think of it as 2-level hashing with first level of hashing computed dynamically and second level mapped statically. The number of vBuckets in the cluster remains constant and it is independent of the cluster topology.  This means a key say x maps always to same vBucket as long as the same hash function is used.

This is a nice architectural pattern to achieve the effects of consistent hashing, without implementing a consistent hashing algorithm.

There are few terms that needs to be defined before vBucket implementation can be explained.
Cluster : A collection of membase servers.
Server : An individual machine within the cluster.
vbucket : A subset of all possible keys.

Also a vBucket can be in any one of the following states:
Active : This server is servicing all the request for this vBucket
Dead : This server is not responsible for this vbucket.
Replica : Receives only replication commands but does not server any clients.
Pending : This server will block all request for this vBucket.

Let us look at the vBucket state transitions and how horizontal scaling is achieved by this.


Initially membase cluster contains a single server. Typically the number of vBuckets in a deployment vary b/w 1024 to 4096. The max number of vBuckets can be 65536.  For the sake of simplicity let us take 6 vBuckets.

 

 

 

Now lets add another server. So there is one active server & another new server.

Adding a new server will not unbalance the tree. All the vBuckets in the new server comes up in the dead state.

 

 

 

In order to make the new server usable, rebalance of the vbucket map has to be done. The rebalance process transfers the vBucket from old server to new server. This is done by an asynchronous thread. The rebalance process selects a subset of vBuckets that the new server should serve and set them to pending state. Then data is sent from the old server and placed into new server. Once the transfer is complete the state of the vBucket is set to active on the new server and dead on the old server.

These are the exact order in which the operations are performed.

  • The vBucket on the new server is placed in pending state.
  • A vBucket extract tap stream is started.
  • The vBucket tap stream atomically sets the state to dead when the queue is in a sufficient drain state.
  • The new server only transitions from pending to active state after it receives confirmation that old server is no longer servicing the requests.

By performing the operations in the exact order one can guarantee no more than one server is active for any given vBucket at any given point in time without any regard to actual chronology. (This is a major difference between data stores like Riak and Membase. Riak uses vector clock to achieve the same result).

You may notice that there is a very short time period where a vBucket has no active server at all. This occurs at the very end of the transfer mechanism and causes blocking to occur.  Client rarely notices this.

High availability in membase is achieved by replication. When replication is enabled, and a new server is added, some of the vbucket state is changed to replica. Replica vbucket is like dead vBucket from client perspective. It does not serve the clients. But it listens to replication commands.

If a server crashes, vBucket to server static map is changed. The vBucket owned by the crashed servers are activated on the other servers in the cluster. Those vBuckets in replica state are changed to active state so that clients can be served. This change happens so quick that controlled fail overs will never result in client request failures.

Finally it is clear that vBucket solves two of the three short comings of memcache (availability and horizontal scaling).

Persistency in membase is an extension to memcache. Memcache has a module called engine interface. Using engine interface storage engine can be implemented.  The complete RFC can be found here.

One of the best part of engine interface is, it completely asynchronous and memcache can be started with any storage engine that implements engine interface. Currently Membase uses SQLite as the storage engine. With couchbase 2.0 this will be replaced with couchdb’s storage engine.

Part III – Nuances of Membase use at Flipkart

The Flipkart systems differ slightly in their usage of the Membase cache cluster. The difference is primarily around detecting and explicitly handling data consistencies between sources of data, values in cache and data as seen by clients. One such instance (Product Catalog service) is highlighted below:

  • Use the Flipkart cache wrapper library (Prometheus) to perform all cache operations. Prometheus was primarily written to support connection pooling.
  • Use increment/decrement methods to implement atomic counters which allows different backend servers with versioning.
  • Use add (put if absent) and cas (compare and set) operations which allows for lists and set semantics shared across multiple clients.

Shifting gears to one of the interesting properties that was observed on our production cluster was cache miss on Membase buckets when the particular bucket was using about 70% of the allocated memory. The problem was there was no free memory on the OS. OS was caching the IO objects.

To circumvent this problem a simple solution was adopted. Reduced the swapiness of the linux virtual memory system and regularly flush the OS cache.

Acknowledgment
Most of the content about the internals of Membase are based on memcache and membase source code, documentation from couchbase and finally blog posts from couchbase developers.

It’s an exponential world

presented by Ashok Banerjee, VP of Engineering, at 5th Elephant, on July 27, 2012

Flipkart is India’s largest E-Commerce entity. It sells products in 14 verticals and counting, from books to perfumes, and lists a few million different products. And this number is growing every day. Each product has its demand forecasted, items procured into inventory and demand is localized to place items in the closest warehouse.

It is critical to forecast aggregate demand and plan inventory across current numbers, trends, seasonality and the gaussian noise. We have information on customer preferences, customer curiousity (browsing patterns and searches), buying habits, buying frequency, etc. These plans and forecasts seriously impact our buying decisions. Fundamentally Flipkart’s growth is based on word of mouth models, which are exponential in nature therefore the extreme relevance of this topic for Flipkart.

We often loosely talk about exponential growth, in this talk we will delve into the mathematical models of when a domain or market will undergo exponential growth. We often mistakenly believe the execution of one company is better than that of another, when in fact the domains and fundamental mathematical growth models of the two markets are different.

Exponential growth and exponential decay are often witnessed in many domains, not just business. These mathematical models have great fertility, from the growth of bacteria in your mouth every night, to the growth of population, to the spread of infections, to distribution of allergens, dust or mosquitos, to radioactive decay, to revolutions in the middle east and the decay of interest in topics on Twitter, to the decay of your sorrows, infatuation, etc.

This presentation will help users connect to the spaces/domains their current businesses, their lives via a fundamental mathematical model.

This understanding informs everything, from scaling of database, to scaling of message systems (the 2 have very different challenges), to demand forecasting, inventory planning to operations planning for (base, trend, seasonality and spike), and even staffing. Most often organizations undergoing these changes cannot comprehend the challenges that barrel at them but this structure enables deeper thinking.

We will also talk about when the exponential growth really ends and how the “epidemic” stabilizes.

The video is up at hasgeek‘s youtube channel

The presentation is up at slideshare.

Hello world!

by Mekin Maheshwari, President of Engineering, Flipkart

Two weeks, and I would have completed 3 years at Flipkart.

I would be lying if I said the journey has been exciting for me – Exciting is too mild a word for the journey that Flipkart has had, and this holds true for most folks in the Flipkart technology team, who have been on the journey for some time.

But in this sky-diving-like-fast-paced-journey, we have not done justice to sharing back enough.

We at Flipkart Technology have been strong believers in the power of community.  A lot of how we work internally is heavily influenced by communities, esp. the Open Source community.  We have presented in a few talks (especially at HasGeek conferences), and spoken a bit in interviews.  We have also talked about the technology and culture on Quora.  But, given all that goes on inside Flipkart Technology, I think we have done a terrible job of being transparent with the rest of the community.

This blog will fix that.

To give you a taste of what you can expect on this blog, we plan to share:
  • Our learnings when dealing with technologies and the impact they have on business and customers
  • Thoughts that individuals in Flipkart Technology team have about areas: Technology, Open Source, Community, Organization, Hiring, Hack-Days, etc ..
  • Our failures – and more importantly the learnings from them

We have had an internal tech blog for a while now and the posts there are top-notch (I’ll let you decide for yourself, as we plan to put out those posts on this blog slowly).

Looking forward to this becoming an avenue for sharing, learning, heated discussions, hot debates, ideas and collaboration.