Proxies for resilience and fault tolerance in distributed SOA

On-line Content and Transactions
OLTP systems are characterized by their ability to “respond immediately to user requests”. The common understanding of a transaction in OLTP is within the context of a database, where data is read and written with appropriate, albeit varying, durability, integrity and consistency guarantees.
OLTP applications are quite varied and depend largely on domain and purpose. The volume and variety characteristics of the data differ as well. For example, consider the differences among a banking application, an eCommerce web-site and a social media portal. The OLTP classification of systems is therefore quite broad, but the basic premise remains: respond immediately to user requests.

Over the years, systems have started to embrace BASE over ACID, and Eventual Consistency has become acceptable. Fundamental assumptions around write-time consistency are being challenged (Eric Brewer on BASE vs ACID), and Availability trumps Consistency at scale. At Flipkart, web-site availability is treated very seriously as the website evolves into a platform for delivering rich, personalized and relevant content to the user.

A typical web page on Flipkart has a good mix of static content (delivered off CDN) and data sourced from a number of backend systems as shown here:

Rendering a single such page requires about 2MB of data read/write, comprising Product information, Category tree, User session, Logging and Metrics. The data volume for a single day works out to about 30TB. Delivering this reliably is hard due to the inescapable consequences of the CAP theorem. However, responding immediately to users is doable if the loss in consistency and data is statistically insignificant.

The Availability Myth
The following listing maps functionality to data stores and protocol services:
website services
Evidently, different stores are used, often with good reason. A few examples: Category information is structured and is stored in MySQL; User sessions are many and are sharded across Couchbase and MySQL; Search uses Apache Solr as a secondary index; Notification data for a user exhibits Temporal Proximity and is stored in HBase; User Recommendations are a keyed lookup on Redis; and Metrics are stored in the OpenTSDB time series database.

Access to the varied data stores is implemented as SOA services, with the primary objectives of distribution, de-coupling and interface-defined abstraction. Each service cluster has redundant nodes and provides an availability guarantee of 99.9% or more.
Running a website that depends on 15 services, each with 99.9% availability, we get

99.9% ^ 15 = 98.5% uptime
(the probability that all 15 services are available at the same instant, assuming their failures are independent)

This translates to 2+ hours of downtime per week. In reality, it is generally worse. The cost of running an “always available” service or data store is prohibitively high once redundancy, backups, near real-time replication of data (strong consistency) and seamless failover are accounted for. Even then, this can only be attempted with datastore software that supports all of this and really works!
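The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the 15 services fail independently of each other, and the function names are illustrative rather than taken from any Flipkart code.

```python
def compound_availability(per_service: float, n_services: int) -> float:
    """Probability that all services are up at the same instant,
    assuming their failures are independent of each other."""
    return per_service ** n_services


def weekly_downtime_hours(availability: float) -> float:
    """Expected unavailable hours in a 7 x 24 hour week."""
    return (1.0 - availability) * 7 * 24


uptime = compound_availability(0.999, 15)
print(f"compound uptime: {uptime:.2%}")                      # compound uptime: 98.51%
print(f"downtime: {weekly_downtime_hours(uptime):.1f} h/week")  # downtime: 2.5 h/week
```

Correlated failures (shared networks, shared racks, cascading timeouts) break the independence assumption, which is one reason the observed figure tends to be worse.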

Latency impacting Availability
Part of the availability problem lies in service invocation. Different styles of service access, and their use among developers, are depicted in this infographic:

Service access

Services are often synchronous, and the invocation pattern easily translates to making an API call. The API method signature is complete w.r.t. data types, method name and errors/exceptions. Service clients handle errors intuitively, and at times are forced to by the API contract. Consequently, most of us code to handle Exceptions/Errors, not Latency!

Handling latencies, on the other hand, is more involved and requires techniques like callbacks and implementations such as Java Futures. Programming is not straightforward, as callbacks don’t compose well: sequencing and combining async calls is not easy. Moreover, there aren’t many service client libraries that do this transparently.
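To make the contrast with exception handling concrete, here is a minimal sketch of coding for latency with a future and a deadline. It uses Python's concurrent.futures as a stand-in for Java Futures; fetch_recommendations and call_with_deadline are hypothetical names invented for this illustration, and the sleep simulates a slow upstream service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def fetch_recommendations(user_id, delay_s):
    """Hypothetical stand-in for a remote call with variable latency."""
    time.sleep(delay_s)
    return [f"item-for-{user_id}"]


def call_with_deadline(pool, fn, args, timeout_s, fallback):
    """Run fn on the pool; give up and return fallback after timeout_s."""
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        future.cancel()  # best effort: a task that is already running is not interrupted
        return fallback


with ThreadPoolExecutor(max_workers=4) as pool:
    # Fast upstream: the real answer arrives well inside the deadline.
    fast = call_with_deadline(pool, fetch_recommendations, ("u1", 0.01), 0.5, [])
    # Slow upstream: the caller stops waiting and uses the fallback.
    slow = call_with_deadline(pool, fetch_recommendations, ("u2", 0.3), 0.05, [])

print(fast)  # ['item-for-u1']
print(slow)  # []
```

Note that the timed-out task still occupies a worker thread until it finishes, so a deadline on the caller's side bounds the wait but not the resource usage. That gap is part of what an isolating proxy is meant to close.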

Another often-repeated practice concerns measurements, where the emphasis is on the Mean and Median of service response times. Variance in response times at the long tail does matter at scale, for example when tens of servers handle millions of page view requests on a web-site. Consider the Flipkart web-site, which uses PHP as the front end: each web server is configured to run a fixed maximum number of concurrent PHP processes, and the number of servers is sized by the expected load on the website. This means resources like CPU, Memory and Processes/Threads are limited and shared, and each web-page is served by borrowing, using and immediately returning the shared resource to the pool. Each unit of work is expected to be short-lived and execute in a timely manner. Latency build-up, however small and even in only a subset of services, can impact availability and user experience as shown here:

latency affecting availability
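A back-of-the-envelope model shows why a small share of slow calls can exhaust a fixed process pool. The sketch below applies Little's law (average concurrency = arrival rate x time in system). The pool size, request rate and latencies are made-up illustrative numbers, not Flipkart measurements.

```python
def mean_latency_s(base_s, slow_s, slow_fraction):
    """Mean page latency when a fraction of requests hits a slow service."""
    return (1 - slow_fraction) * base_s + slow_fraction * slow_s


def busy_workers(requests_per_s, latency_s):
    """Little's law: average number of workers occupied at any instant."""
    return requests_per_s * latency_s


POOL_SIZE = 64   # hypothetical max concurrent PHP processes per server
RATE = 400.0     # hypothetical requests per second per server

# Healthy: every page takes 100 ms.
healthy = busy_workers(RATE, mean_latency_s(0.100, 0.100, 0.0))
# Degraded: 5% of pages wait 2 s on one slow upstream service.
degraded = busy_workers(RATE, mean_latency_s(0.100, 2.0, 0.05))

print(f"healthy:  {healthy:.0f} of {POOL_SIZE} workers busy")   # healthy:  40 of 64 workers busy
print(f"degraded: {degraded:.0f} of {POOL_SIZE} workers busy")  # degraded: 78 of 64 workers busy
```

In this model, slowing down just 5% of requests pushes demand past the pool size, at which point requests that never touch the slow service also start to queue or fail.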

Fault Tolerance – Fail Fast, Recover Quickly
The engineering team at Flipkart built resilience into the website technology stack by having it deal with imminent failures in upstream services. The fk-w3-agent (aka W3-agent) daemon was already being used successfully to scale PHP and get around some of its limitations (see slide no. 77 onwards in this presentation: How Flipkart scales PHP). A detailed presentation on the evolution of the Flipkart web-site architecture is available here: Flipkart architecture: Mistakes & Learnings.
The W3-agent was redesigned to be a high performance RPC system that could serve as a transparent service proxy. A few design principles for this new system were:

  • Prevent cascading failures – Fail fast and Recover quickly
  • Provide Reasonable fallbacks around failures – the exact behavior can be service specific
  • Support multiple protocols and codecs in order to enable transparent proxying – Unix Domain Sockets, TCP/IP, HTTP and Thrift
  • High performance runtime with low overhead – ability for a single local instance to handle hundreds of millions of API/Service calls per day

The fail fast and fallback behavior is entirely functional and implemented as alternate path logic by the respective service owner. The invocation of primary vs alternate path flow is at the discretion of the service proxy.
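One common way for a proxy to choose between the primary and the alternate path is a circuit breaker. The following is a minimal sketch of that idea, not Phantom's actual implementation: the class name, thresholds and injectable clock are all assumptions made for illustration.

```python
import time


class CircuitBreaker:
    """Opens after max_failures consecutive failures, then allows a
    trial call on the primary path once reset_after_s has elapsed."""

    def __init__(self, max_failures=3, reset_after_s=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()   # fail fast: do not touch the primary
            self.opened_at = None   # cool-down over: allow one trial call
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0           # recover quickly on the first success
        return result


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(max_failures=2, reset_after_s=10.0, clock=lambda: now[0])
attempts = []


def failing_primary():
    attempts.append("primary")
    raise ConnectionError("upstream down")


results = [breaker.call(failing_primary, lambda: "cached") for _ in range(4)]
now[0] = 11.0  # the cool-down has passed and the upstream is healthy again
recovered = breaker.call(lambda: "fresh", lambda: "cached")

print(results, len(attempts), recovered)
# ['cached', 'cached', 'cached', 'cached'] 2 fresh
```

The service owner supplies only the two callables. When each one runs is decided by the breaker, which mirrors the split of responsibilities described above.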

The Flipkart Phantom

Proxy servers & processes are used extensively as intermediaries for requests from clients seeking resources from other servers. There are different types of proxies, and one specific type, the Reverse Proxy, can hide the existence of origin servers: requests from clients and responses from servers are relayed back and forth in a transparent manner. The proxy also offers a runtime for implementing routing or highly localized business logic, for example executing a custom expression to sort data elements returned in the service response.

Phantom is a high performance proxy for accessing distributed services. It is an RPC system with support for different transports and protocols. Phantom is inspired by Twitter Finagle and builds on the capabilities of technologies like Netty, Unix Domain Sockets, Netflix Hystrix and Trooper (Spring).

This design diagram depicts logical layering of the Phantom tech stack and technologies used:
Phantom tech stack
The layer abstraction in the design helps to:

  • Support incoming requests using a number of protocols and transports. New ones (say UDP) may be added as needed. Mixing different incoming (e.g. HTTP) and outgoing (e.g. Thrift) transports is also supported.
  • Create protocol specific codecs – e.g. HTTP, Thrift. Adding a new Thrift proxy end-point requires only configuration edits, no code change needed.
  • Automatically wrap API calls in Hystrix commands, with reasonable defaults for Thread/Semaphore isolation and Thread pools. Users of Phantom are not required to program to the Hystrix API and can focus on implementing service calls and fallback behavior. Fallback behavior is influenced by configured parameters (timeouts, thread pool size) and real time statistics comprising latent requests, thread-pool rejections and failure counts (see Hystrix’s support for this: How Hystrix works).
  • Define an API layer for calling services. This is optional and promotes request-response data driven interfaces.
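The semaphore isolation mentioned in the list above can be illustrated without Hystrix itself. This sketch is a simplified stand-in written in Python: SemaphoreIsolated and its fields are hypothetical names, not Hystrix or Phantom APIs. It caps the number of in-flight calls to one service and sends the overflow straight to the fallback.

```python
import threading


class SemaphoreIsolated:
    """Caps concurrent calls to one service; excess calls are rejected
    immediately and served by the fallback instead of queueing."""

    def __init__(self, max_concurrent):
        self._permits = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self.rejected = 0

    def execute(self, primary, fallback):
        if not self._permits.acquire(blocking=False):
            with self._lock:
                self.rejected += 1
            return fallback()
        try:
            return primary()
        except Exception:
            return fallback()
        finally:
            self._permits.release()


# Demo: two slow calls hold both permits, so a third call is rejected.
gate = threading.Event()
in_flight = threading.Barrier(3, timeout=5)  # two workers plus the main thread


def slow_call():
    in_flight.wait()       # signal that a permit is held
    gate.wait(timeout=5)   # simulate a latent upstream service
    return "live"


bulkhead = SemaphoreIsolated(max_concurrent=2)
outcomes = []
workers = [
    threading.Thread(
        target=lambda: outcomes.append(bulkhead.execute(slow_call, lambda: "fallback"))
    )
    for _ in range(2)
]
for worker in workers:
    worker.start()
in_flight.wait()  # both permits are now taken
third = bulkhead.execute(lambda: "live", lambda: "fallback")
gate.set()
for worker in workers:
    worker.join()

print(outcomes, third, bulkhead.rejected)  # ['live', 'live'] fallback 1
```

Because the rejection is immediate, a latent service can tie up at most max_concurrent callers, and the rejection count is the kind of real-time statistic that can feed the fallback decision.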

Phantom is open source and available here: Phantom on GitHub

Phantom proxies have been used to serve hundreds of millions of API calls in production deployments at Flipkart. More than 1 billion Thread/Semaphore isolated API and service calls are executed on Phantom every day. The proxy processes were monitored and found to incur a marginal increase in resource utilization, while response times remained the same at the various percentiles measured.
Phantom deployment

Responding immediately to user requests – redefining the user experience
Proxies like Phantom provide the technical infrastructure for shielding an application from latencies in upstream services in a distributed SOA. The proxies are transparent to service clients and services, and are therefore non-intrusive. Fallback behavior for each service, however, needs to be implemented by service owners. Also, recovering from failed transactions (if required at all) is outside the scope of Phantom.
For example, email campaign hits are stored in a database, and the fallback behavior in case of database failure is to append this data to logs. Recovering the data from the logs and appending it to the database is an operational activity implemented outside Phantom. Another example is displaying product information, where Phantom fails over to a local cache cluster if the Product Catalog Management System is down. This behavior can result in consistency issues: price changes and stock availability changes may not be reflected. The application (i.e. the web-site) and the end business processes (fulfillment of orders placed based on cached data) will need to change to redefine the user experience.
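The email campaign example can be sketched as a write path with a log fallback, plus a separate replay step. This is an illustration only: the function names, the JSON-lines log format and the in-memory "database" are assumptions, and the replay function stands for the operational activity that would run outside the proxy.

```python
import json
import os
import tempfile


def record_hit(hit, db_insert, log_path):
    """Try the primary store; on failure, append the hit to a log file."""
    try:
        db_insert(hit)
        return "db"
    except Exception:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(hit) + "\n")
        return "log"


def replay_log(log_path, db_insert):
    """Operational step run once the database is back: load logged hits
    into the database, then remove the log. Returns the number replayed."""
    with open(log_path, encoding="utf-8") as log:
        hits = [json.loads(line) for line in log if line.strip()]
    for hit in hits:
        db_insert(hit)
    os.remove(log_path)
    return len(hits)


def failing_insert(hit):
    raise ConnectionError("database unavailable")


rows = []  # in-memory stand-in for the campaign database
log_path = os.path.join(tempfile.mkdtemp(), "campaign_hits.log")

where = record_hit({"campaign": "c1", "user": "u1"}, failing_insert, log_path)
replayed = replay_log(log_path, rows.append)

print(where, replayed, rows)
# log 1 [{'campaign': 'c1', 'user': 'u1'}]
```

A production replay would also need to handle partial failures and duplicate inserts, which this sketch leaves out.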