Choosing a Datastore For The Flipkart User Engagement Platform

A User Engagement Platform

Tersely defined, a “User Engagement Platform” is a platform that solicits, accepts and displays user content and actions. These collections of content and actions are usually associated with both user and non-user entities. Such a platform enables engagement paradigms like reviews, user lists, ratings, votes, comments, karma and so on. It’s easy to imagine such a system being used by millions of users who are constantly submitting and accessing varied, rich content. This access pattern implies near-real-time updates of stats, ratings, votes, likes, plays, views and the like, at scale. This usage creates loads of data, much of which is volatile and ephemeral in nature.

In short, imagine a customer engagement platform for Flipkart, with about a million unique visitors per day, most of whom are actively engaging with the site and the community. We envision our user engagement platform handling such a load pattern with ease and being ready to scale further. One of the most critical operational components in such a system is the data persistence layer. Ideally, this persistence layer should be highly performant, horizontally scalable with minimal overhead, consistent, always available and partition tolerant. But…

CAP: Or why you can’t have your cake and eat it too

The CAP theorem, or Brewer’s theorem of distributed computing systems, states that it is impossible for a distributed system to simultaneously provide all three of the following guarantees:

  1. Consistency: all nodes see the same data at the same time.
  2. Availability: every request receives a response indicating whether it succeeded or failed.
  3. Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system.

The implications of the CAP theorem are crucial for selecting the persistence layer of an application or service. One has to carefully analyze the requirements of the app or service in question and pick the appropriate CAP guarantees that the prospective persistence layer must support.

Data access patterns for user engagement services

Historically, the data access pattern for user generated content was low on writes and heavy on reads. However, modern social engagement patterns around user content have made it write heavy as well: more likes, more ratings, more upvotes, more comments, more shares. So, today’s and tomorrow’s engagement services should accommodate heavy write loads, heavy read loads, and heavy aggregate (counter) modify-and-read loads. What becomes apparent when we look at user engagement services in this way is that aggregation needs to be a first class function of these services, and one that is near real time, scalable and highly available.

Eventual consistency and user experience

At the same time, we can note that most of today’s engagement-heavy applications trade strong consistency for eventual consistency to achieve better scalability (through horizontal partitioning) and availability. How well a coherent user experience can be maintained under eventual consistency varies greatly. Reddit’s upvote experience is a good example of using eventual consistency without adversely affecting how the user perceives consistency on the platform.

YouTube’s “301 views” on a video with “20,000 likes” falls at the other end of the spectrum of user experience under eventual consistency. So, with careful application and service design, effective tradeoffs on data consistency can be made without hurting the user experience. By doing this we immediately free ourselves from the “C” constraint of CAP, which leaves us free to pursue the “Availability” and “Partition tolerance” guarantees that are very much desired in this context. The following section gives a brief example of the kind of use case our engagement platform should support and what it implies for the persistence layer.

A playlist of the people

Imagine a community-created playlist on Flipkart’s Flyte: a playlist where people add songs to a global list and then upvote or downvote songs added by other users. Users should have an always-on experience, and neither their submissions nor their votes should be lost. The implication is that there should be no single point of failure in the write path. Hundreds or thousands of users could be simultaneously upvoting or downvoting the same song, so locking of counters should be avoided and distributed counters preferred. Not every user needs to see the same view of the data, which is transient to begin with, so eventual consistency should do fine. Given the massive amount of user engagement, the sort order of the playlist will change very often, so one should avoid query-time sorting and prefer data that is stored in sorted order. Adequate clarity around such usage scenarios enabled us to confidently transition from requirements assessment to technology selection.
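
To make the distributed-counter idea concrete, here is a minimal sketch of how such vote counters could be modelled, assuming the DataStax Python driver and hypothetical keyspace/table names (“engagement”, “song_votes”). Counter columns let many clients increment the same value concurrently without read-modify-write locking; this is an illustration of the approach, not our actual schema.

    from cassandra.cluster import Cluster

    # Contact point and keyspace name are assumptions for illustration;
    # the "engagement" keyspace is presumed to already exist.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("engagement")

    # One pair of distributed counters per (playlist, song) combination.
    session.execute("""
        CREATE TABLE IF NOT EXISTS song_votes (
            playlist_id text,
            song_id     text,
            upvotes     counter,
            downvotes   counter,
            PRIMARY KEY (playlist_id, song_id)
        )
    """)

    # Concurrent upvotes from many users simply increment the counter;
    # Cassandra merges the increments without locking the row.
    session.execute(
        "UPDATE song_votes SET upvotes = upvotes + 1 "
        "WHERE playlist_id = %s AND song_id = %s",
        ("community-mix", "song-42"),
    )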

Technology Selection: What we want from a persistence layer

Let us consider the attributes of an appropriate persistence layer for engagement services.

  • Availability and Partition tolerance, with tunable consistency: as discussed above, we should be able to trade off consistency to build highly available, partition-tolerant engagement services.
  • Linear Scalability: In the face of massive amounts of content being created by users, the system should be able to scale without degrading performance.
  • Operational Efficiency: The operational requirements of the highly available, distributed, partition tolerant persistence layer should be minimal.
  • Community Support: There should be a thriving community of users and developers that is both effective and helpful.
  • Parsable Code Base: The code base should be grokable in both size and complexity, with either good documentation or help from the community.
  • Professional Support: It is preferable to have companies that are providing professional support for the platform.

Though this isn’t an exhaustive list, it’s a good starting point to explore different alternatives. However, there are also functional requirements of the persistence layer that must be considered.

  • Aggregators are a first class concern: aggregator modification is potentially the heaviest write load in an engagement service, so the persistence layer should support highly performant aggregator modification over a distributed infrastructure.
  • Sorting is a first class concern: Most user generated content is going to be served in sorted form, ranging from most recent comments (as in a news feed) to sorting by helpfulness. Sorting large amounts of data should be handled in a highly efficient manner.
  • Multiple pivot points for data elements: each complex data entity should be accessible by its attributes, through a reverse index or filtering.
  • Offline stats with map reduce: the persistence layer should support MapReduce on data natively, or should be easy to plug into a MapReduce framework like Hadoop.
  • Search integration: text search should either be native or easily pluggable into the persistence layer.
  • Selectable/Updatable individual attributes: Attributes of a data entity should be individually selectable and updatable.
  • Schema Less: The data model should be flexible and should not impose constraints on what and how much data is stored. Schema-less data stores, such as columnar or key-value stores, provide great data model flexibility.
  • Native support for ephemeral data: engagement services are going to generate lots of data of an ephemeral nature, i.e. data that is important or valid only for a short period of time. Ephemeral data should not clog up the system (see the sketch after this list).
  • Replication should be a first class concern: replication and replication awareness should be deeply integrated into the design and implementation of the database.
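
As an illustration of the points above about sorting and ephemeral data, here is a minimal sketch using the DataStax Python driver, with a hypothetical “recent_views” table in an assumed “engagement” keyspace. A clustering order keeps rows stored newest-first, and a TTL on each write lets short-lived engagement data expire on its own instead of clogging up the system.

    from datetime import datetime

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # contact point is an assumption
    session = cluster.connect("engagement")   # keyspace is assumed to exist

    # Rows are stored sorted by viewed_at (newest first), so "most recent
    # views" needs no query-time sorting.
    session.execute("""
        CREATE TABLE IF NOT EXISTS recent_views (
            entity_id text,
            viewed_at timestamp,
            user_id   text,
            PRIMARY KEY (entity_id, viewed_at)
        ) WITH CLUSTERING ORDER BY (viewed_at DESC)
    """)

    # Each view event expires after 24 hours (86400 seconds) and is then
    # purged by Cassandra itself.
    session.execute(
        "INSERT INTO recent_views (entity_id, viewed_at, user_id) "
        "VALUES (%s, %s, %s) USING TTL 86400",
        ("product-123", datetime.utcnow(), "user-9"),
    )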

Cassandra

Considering all the above factors, we ruled out traditional RDBMSs and ventured into the wild west of databases, aka “NoSQL”. After evaluating and eliminating a bunch of databases (document stores, key-value stores, Riak: attributes not individually selectable/updatable, HBase: geared towards consistency and partition tolerance, …) we arrived at Cassandra. Cassandra purportedly supports many of the above mentioned functional and non-functional requirements.

The Good

  • Online load balancing and cluster growth: Cassandra is designed from the ground up to be deployed on a cluster. Growing and shrinking a cluster is transparent, and the cluster deals gracefully with node loss.
  • Flexible Schema: Cassandra is a column oriented database and there is inherent flexibility in the schema for the data.
  • Key Oriented Queries: All queries are key oriented, making the database highly performant when appropriately configured.
  • CAP aware: Cassandra is CAP aware and allows one to make tradeoffs between consistency and latency/availability. Consistency can be configured at different levels, and can also be tuned at a per-query level (see the sketch after this list).
  • No SPOF: No single point of failure. Cassandra can be configured to be incredibly resilient in the face of massive node loss in the cluster, because replication is a first-class function of Cassandra.
  • DC and rack aware: Cassandra was built to be datacenter and rack aware, and can be configured with different strategies to achieve either better DR (disaster recovery) resiliency or minimal latency.
  • Row and Key Caching: row-level and key-level caching are baked into Cassandra. This makes Cassandra a viable persistent cache.
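
The per-query consistency tuning mentioned above could look like the following minimal sketch, assuming the DataStax Python driver and a hypothetical “reviews” table in an assumed “engagement” keyspace. The same data can be written at QUORUM where stronger consistency matters and read at ONE where latency matters more.

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])          # contact point is an assumption
    session = cluster.connect("engagement")   # keyspace is assumed to exist

    # Write a review at QUORUM: a majority of replicas must acknowledge it.
    write = SimpleStatement(
        "INSERT INTO reviews (product_id, review_id, body) VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(write, ("product-123", "r-1", "Great phone"))

    # Read the review feed at ONE: lowest latency, eventually consistent.
    read = SimpleStatement(
        "SELECT review_id, body FROM reviews WHERE product_id = %s",
        consistency_level=ConsistencyLevel.ONE,
    )
    rows = session.execute(read, ("product-123",))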

The Bad

  • Limited Adhoc Querying Capability: due to the absence of a query language as comprehensive as SQL, ad hoc queries are severely limited on a Cassandra cluster.
  • Tight coupling of data model: the way in which Cassandra data models are created is heavily dependent on how they are accessed from the application layer, whereas RDBMSs model data based on entities and their relationships. The Cassandra data model is therefore tightly coupled with application access patterns, which means app-level feature changes will impact the data model to a large degree (see the sketch after this list).
  • No codebase stability: Cassandra is still evolving rapidly, and the codebase and feature set change constantly.
  • Bad documentation: the documentation is very sparse and often outdated or defunct due to the rapid pace of change.
  • Lack of transactions: Cassandra does not have a mechanism for rolling back writes. Since it favours AP out of CAP, there is no built-in transaction support.
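
A minimal sketch of the data-model coupling mentioned above, again using the DataStax Python driver with hypothetical table names. The same review is written to two tables, one per access pattern; supporting a new query path in the application typically means adding (and back-filling) yet another table.

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])          # contact point is an assumption
    session = cluster.connect("engagement")   # keyspace is assumed to exist

    # One table per access pattern: reviews for a product, reviews by a user.
    session.execute("""
        CREATE TABLE IF NOT EXISTS reviews_by_product (
            product_id text, review_id text, user_id text, body text,
            PRIMARY KEY (product_id, review_id))""")
    session.execute("""
        CREATE TABLE IF NOT EXISTS reviews_by_user (
            user_id text, review_id text, product_id text, body text,
            PRIMARY KEY (user_id, review_id))""")

    # Every write fans out to both tables; the application owns the duplication.
    session.execute(
        "INSERT INTO reviews_by_product (product_id, review_id, user_id, body) "
        "VALUES (%s, %s, %s, %s)",
        ("product-123", "r-1", "user-9", "Great phone"),
    )
    session.execute(
        "INSERT INTO reviews_by_user (user_id, review_id, product_id, body) "
        "VALUES (%s, %s, %s, %s)",
        ("user-9", "r-1", "product-123", "Great phone"),
    )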

The Ugly

  • Steep learning curve: the steep learning curve, combined with having to sift through the internet to find appropriate documentation, makes Cassandra very hard to get a handle on.
  • No Operational experience: zero or low operational experience with Cassandra across the organization, compared with RDBMSs.
  • Here be dragons: One of the scary prospects of going the Cassandra way is fear of the unknown. If something goes wrong, how will we debug it, do an RCA in a timely manner and fix the issue without any tooling or prior experience?

Appendix I

                                                      Cassandra         Riak                    HBase
CAP                                                   AP                AP                      CP
Language                                              Java              Erlang, C, JavaScript   Java
Type                                                  Column oriented   Key-value               Column oriented
Protocol                                              Thrift / custom   REST / custom           REST / Thrift
Tunable tradeoffs for distribution and replication    Yes (N, R, W)     Yes (N, R, W)           No
Map/Reduce                                            Through Hadoop    Built in                Through Hadoop
Distributed counters                                  Yes               No                      Yes
Built-in sorting                                      Yes               No (map reduce)         Yes

20 thoughts on “Choosing a Datastore For The Flipkart User Engagement Platform”

  1. I am a little surprised you guys didn’t go with Mongo. Did you evaluate it fully? From what I understand, Mongo delivers very well on every plus point you mentioned about Cassandra, with a few added advantages too.

    I work for one of the bigger Indian websites, and we have taken a decision to start porting most of our data to MongoDB, as our current RDBMS servers are giving us availability issues. We have already remodelled most of our application data as Mongo documents, and have ported a few of the smaller applications to MongoDB successfully.

    1. Hi AnonTechie,

      MongoDB is a great document store and the guys at 10Gen have done an amazing job with it. Several teams at Flipkart have extensively used MongoDB in the past (some might still be using it).

      The main reason why we did not consider MongoDB for our platform was that Mongo is inherently designed as a CP system (Consistency and Partition tolerance). For our platform we wanted a datastore that was built from the ground up to be highly available and partition tolerant (AP). Operationally, this means availability is something we shouldn’t need to worry about, whether we are facing node failures, replacing nodes or adding new nodes to the cluster.

      As far as the suitability of MongoDB as your main data store goes, you and your team would be ideally placed to answer that. However, I would advise anyone considering a NoSQL database as their main datastore to become operationally proficient with it and to benchmark/test it at production load before switching their stack to it. I would err on the side of RDBMSs most of the time when talking about the primary data store for a website, because “better the devil you know than the devil you don’t”.

      1. Thanks for taking the time to reply.
        Our need for consistent data was actually one of our most important considerations, if not the most important, given the data our applications are based on. We will be keeping our RDBMS as the main data store for now, but we will try to run some of the services off MongoDB servers.

        Anyway, thanks for writing an informative article. I will keep following this blog for updates now :)

    2. A few points:
      1. MongoDB has SPOFs (mongos, config servers). Or at least avoiding them (running multiple mongos and config servers) is hard to do without an ops nightmare.
      2. MongoDB is not masterless.
      3. I’m yet to hear about a Mongo cluster which scaled to, say, 20 or 30 nodes.
      4. Load balancing in a sharded environment can be very difficult. Once you have chosen a shard key it’s very difficult to go back and change that. This is where a Dynamo based system shines with its Ring approach.

      I would highly recommend reading http://shop.oreilly.com/product/0636920018308.do. MongoDB is full of dark shades, ifs and buts.

      1. MongoDB does deliver on many of the core features that Cassandra provides. And MongoDB does have success stories. One of the front runners is FourSquare. Here is a podcast and FourSquare engineering blog which talks about MongoDB @ FourSquare: http://engineering.foursquare.com/2011/12/21/show-and-tell-mongodb-at-foursquare/
        and
        http://engineering.foursquare.com/2011/06/17/presentation-on-how-foursquare-uses-mongodb/

        With the new 2.0 release they have sorted out most of the availability and SPOF issues :)

        Thanks
        NP
        http://about.me/phaneesh.n

        1. According to one of the slides, they have 8 clusters with 40 nodes. Anyone can guess what the biggest cluster size would be. It doesn’t sound impressive from a scalability point of view. Compare this with the number of nodes HBase and Cassandra have proven to scale up to.

          Cassandra:
          1. “The largest known Cassandra cluster has over 300 TB of data in over 400 machines. ” http://cassandra.apache.org/

          2. http://www.quora.com/Cassandra-database/What-is-the-largest-Cassandra-ring-cluster-anyone-has-used

          HBase: http://wiki.apache.org/hadoop/Hbase/PoweredBy

          1. Well! HBase & Cassandra can scale up to many thousands of petabytes. IMHO, very few (think Facebook, Google) would be looking at that scale, data volume/velocity and data processing requirement. Cassandra’s 300 TB over 400 machines deployment (compared to HBase @ FB) is not *very* impressive either. I have seen old school Oracle RAC and MySQL deployments with sizes nearing 1000 TB on less than two to three dozen nodes. Many internet companies (Craigslist, SourceForge, FourSquare) have been using MongoDB for quite some time, and that does prove it can support internet-scale operations. If the deployment/scale does not warrant a 1000-petabyte datastore and data processing requirements involving insightful analytics (like Google & Facebook), MongoDB can be a very worthy candidate to look at.

            Thanks
            NP
            http://about.me/phaneesh.n

    3. @AnonTechie
      Mongo uses MMAP to store data. I have personal experience deploying a large cluster in production (10+ TB of data). All performance numbers and operational ease go for a toss when the size of the cluster grows beyond primary memory. The cluster has limited ability to self-heal, especially when recovering from staleness for members in a shard. We had to perform a lot of manual data export/import to get the cluster back up. Needless to mention, we lost updates in the process. Mongo was therefore a non-starter in the selection list, adding to the fact that it is a CP system while we really needed an AP system in CAP parlance.

  2. I was wondering why you folks didn’t consider HBase, considering FB itself is powering its Messages tech with HBase instead of Cassandra…

    Regarding the missing ‘A’ above for HBase, I am guessing it’s due to the NN (NameNode) being a single point of failure… which would be addressed with NN HA coming out very soon…

    Also, it would be great if you could share some thoughts on the challenges of running Cassandra in production, supplemented with some estimates related to scale. Do you folks also happen to use secondary indexing in Cassandra?

    1. Hi inder,

      NN HA is one issue with HBase, and we were aware that NN HA was in the pipeline for it. Another availability concern with HBase is when a region server goes down: the regions served by that region server become unavailable for a while. Besides this, HBase requires a lot of operational overhead to set up and maintain; something we wanted to avoid.

      We have been using Cassandra in production for a self-contained service that sees decent load (nowhere near enough load to test the viability of Cassandra as the main data store). The learning curve for Cassandra is steep, and made all the more so by the sparse documentation and short release cycles. Operationally, once the cluster was set up we have had very little to do except for minor weekly maintenance. It also required us to completely rethink all the things we took for granted when working with an RDBMS. One always runs the risk of tightly coupling the data access layer with the persistence layer when working with a columnar store; this has to be carefully assessed and the right calls taken. On the one hand, coupling tightly with Cassandra would enable us to fully utilize its unique features, but on the other hand it would tie us to Cassandra. And yes, we do use secondary indexes heavily in our Cassandra service.

    2. In the latest branches (Hadoop 2.x and Hadoop 1.x mainstream) the NN HA issue has been “patched” using an NFS-based implementation, which will be moved to ZAB/Multi-Paxos and quorum commits (ZK): HDFS-3077: https://issues.apache.org/jira/browse/HDFS-3077. This code is already there in trunk. Also, NN automatic failover using ZK master election, just like HBase, is already in trunk (HDFS-3042): https://issues.apache.org/jira/browse/HDFS-3042.

      Even with the current 1.x mainstream and 2.x, NN hot standby using the shared NFS implementation does resolve the NN SPOF to a large extent.

      Thanks
      NP
      http://about.me/phaneesh.n

    3. @Inder Singh
      HBase is a strong contender for low latency data stores at Flipkart. HBase and the entire Hadoop technology stack are high on our tech roadmap, and we are building a few systems on them. Seeing the kind of deployments HBase and Cassandra have at web scale, there were no clear winners. The AP vs CP debate (of the CAP theorem) played in favor of Cassandra, all other things being equal.

      1. @ regunathb Thanks again for your reply.

        I would be very curious to know if you folks later post, in another blog entry or elsewhere, about “Challenges, if any, in using Cassandra”, specifically related to the following:

        1. Splitting hot regions
        2. Read Repairs when quorum isn’t reached.
        3. Index size and query performance on using secondary index at reasonable scale.

        Minor suggestion: it would be great if there were a way to get email notifications about updates here.

        1. Hi inder,

          There are no plans yet to do a follow up post on our operational experience with Cassandra (at least not until we have had significant operational experience to write about). As for your concerns,

          Hot Spots: Hot-spotting is a concern for any distributed data store, and it needs to be mitigated by making appropriate sharding and key design choices. If, however, hot-spotting does occur in Cassandra (even after good key design and a sound sharding strategy), we have decent wiggle room for splitting hotspots and rebalancing the cluster, thanks to Cassandra’s replication semantics.

          Read Repairs: It’s a good idea to run read repairs as a batch on your cluster on a daily/weekly basis; this is one of the things we do as a weekly maintenance activity. The issue of read repairs when quorum is not reached is much more complex once network outages and replication semantics are taken into account.

          Index Size and Query Performance: Secondary indexes are performing well within the SLA boundaries we set for our datastore in our limited use case. Will report if we have something significant to share on this front.

  3. If performance, scaling and self-management are the primary concerns, was Aerospike considered at any point?

    1. We did not consider any data store that wasn’t open source; we only considered data stores with a large community and a significant track record. Something like Aerospike would be ideal for low latency, highly performant transactional systems like ad exchanges or realtime ad retargeting/bidding platforms (which indeed look like a significant chunk of their customers).

    2. Aerospike was not a candidate for these reasons:
      1. It was not open source.
      2. Adoption and community were not as established as for other candidate technologies when we did the evaluation.
      3. No clear differentiation, at least not an obvious one. The beauty of OSS is that you can look into the internals of the software before deciding to use it.
      This is not to say Aerospike could not or would not have worked.
