All posts by Santanu Sinha

Foxtrot – Event Analytics at Scale

Health and monitoring of actors participating in a SOA based distributed system like the one we have at Flipkart is critical for production stability. When things go south (and they do), not only do we need to know about the occurrence instantly, but get down to exactly when  and where the problem has happened. This points to the need for us to get access to and act on aggregated metrics as well as raw events that can be sliced and diced at will. With server side apps spanning hundreds of virtual machines and client side apps running on millions of diverse phones this gets amplified. Individual requests need to be tracked across systems, and issues found and reported much before the customers get impacted.

Introducing Foxtrot

Foxtrot is a system built to ingest, store and analyze rich events from multiple systems, providing aggregated metrics as well as the ability to slice, dice and drill down into raw events.

Foxtrot in action

Foxtrot provides a configurable console to get a view of systems; enables people to dig into as well as aggregate raw events across hosts and/or client devices; and provides a familiar SQL based language to interact with the system both from the console as well as from the command line.

A case for event analytics at scale

Let’s consider a few real-life situations:
  • Each day, hundreds of thousands of people place orders at Flipkart through the Checkout system. The system interacts with a bunch of downstream services to make this possible. Any of these services failing in such situations degrade the buying experience considerably. We strive to be on top of such situations and proactively monitor service call rates and response times and quickly stop, debug and re-enable access to the degraded services.
  • Mobile apps need to be faster and efficient both in terms of memory and power. Interestingly, we have very little insight into what is actually happening on the customer’s device as they are interacting with the apps. Getting proper data about how our app affects these devices helps us make our app, faster lighter and more power efficient.
  • Push notifications bring important snippets of information and great offers to millions of users of our apps. It’s of critical importance for us to know what happens to these messages, and control the quality and frequency of these notifications to users.
  • Book publishers push data like book names, authors to us. We process and publish this data to be shown on the apps and the website. We need to monitor these processing systems to ensure things are working properly and quickly catch and fix any problems.

There and many more use-cases currently being satisfied by Foxtrot at Flipkart and Olacabs.

Basic Abstractions

Foxtrot works on the following very simple abstractions:

  • Table – The basic multi-tenancy container used to logically separate data coming in from different systems. So your Foxtrot system might have tables for website, checkout, oms, apps etc. A table in Foxtrot has a TTL and the data does not remain queriable from FQL, Console and JSON queries (see below) once the TTL expires. Data for the table is saved in Hbase and can be used independently by using M/R jobs on hadoop depending on the expiry time specified for the column-family in the table.
  • Document – Most granular data/event representation in Foxtrot. A document is composed of:
    • id – A unique id for the document. This must be provided.
    • timestamp – A timestamp representing the time of generation of this event.
    • data – A json node with fields representing the metadata for the event.
//A sample document
{
    "id": "569b3782-255b-48d7-a53f-7897d605db0b",
   "timestamp": 1401650819000,
    "data": {
        "event": "APP_LOAD",
        "os": "android",
        "device": "XperiaZ",
        "appVersion": {
            "major": "1",
            "minor": "2"
        }
    }
}

Usage

Foxtrot has been designed grounds up for simplicity of use. The following sections detail how events can be pushed to Foxtrot and how they can be summarized, viewed and downloaded.

Pushing data to Foxtrot

Pushing data involves POSTing documents to the single or bulk ingestion API. For more details on ingestion APIs please consult the wiki.

Accessing data from Foxtrot

Once events are ingested into Foxtrot, they can be accessed through any of the many interfaces available.

Foxtrot Console (Builder)

Foxtrot in action

Foxtrot provides a console system with configurable widgets that can be backed by JSON queries. Once configured and saved, these consoles can be shared by url and accessed from the consoles dropdown.

Currently, the following widgets are supported:

  • Pie charts – Show comparative counts of different types for a given time period
  • Bar charts – Show comparative counts of different types for a given time period
  • Histogram – Show count of events bucketed by mins/hours/days over time period
  • Classified histogram – Show count of multiple type of events bucketed by mins/hours/days over time period
  • Numeric Stats histogram – Show stats like max, min, avg and percentiles over numeric fields  bucketed by mins/hours/days over time period
  • Tables – Auto-refreshing tabular data display based on FQL query (see below)

Each of these widgets come with customizable parameters like:

  • Time window of query
  • Filters
  • Whether to show a legend or not etc.

Foxtrot Query Language

A big user of these kinds  of systems  is our analysts, and they really love SQL. We, therefore, support a subset of SQL, that we call as FQL (obviously!!).

Example: The find the count of all app loads and app crashes in the last one hour grouped by operating system:

select * from test where eventType in ('APP_LOAD', 'APP_CRASH') and last('1h') group by eventType, os

Results:
+-------------------------------+
| eventType | os      |   count |
+-------------------------------+
| APP_CRASH | android |   38330 |
| APP_CRASH | ios     |    2888 |
| APP_LOAD  | android | 2749803 |
| APP_LOAD  | ios     |   35380 |
+-------------------------------+

FQL queries can be run from the console as well by accessing the “FQL” tab.

Details about FQL can be found in the wiki.

CSV Downloads

Data can be downloaded from the system by POSTing a FQL query to the /fql/download endpoint.

Json Analytics Interface

This is the simplest and fundamental mode of access of data stored in Foxtrot. Foxtrot provides the /analytics endpoint that can be hit with simple json queries to access and analyze the data that has been ingested into Foxtrot.

Example: The find the count of all app loads crashes in the last one hour grouped by operating system and version:

//Request
{
    "opcode": "group",
    "table": "test",
    "filters": [
        {
            "field": "event",
            "operator": "in",
            "value": "APP_LOAD"
        },
        {
            "operator" : "last",
            "duration" : "1h"
        }
    ],
    "nesting": [
        "os",
        "version"
    ]
}
//Response
{
    "opcode": "group",
    "result": {
        "android": {
            "3.2": 2019,
            "4.3": 299430,
            "4.4": 100000032023
        },
        "ios": {
            "6": 12281,
            "7": 23383773637
        }
    }
}

A list of all analytics and their usage can be found in the wiki.

Raw access

The raw data is available on HBase for access from map-reduce jobs.

Technical Background

Foxtrot was designed and built for scale, with optimizations in both code and query paths to give the fastest possible experience to the user.

Requirements

We set out to build Foxtrot with the following basic requirements:
  • Rapid drill down into events with segmentation on arbitrary fields spanning over a reasonable number of days – This is a big one. Basically that, analysts and users will be able to slice and dice the data based on arbitrary fields in the data. This facility will be provided over data for a reasonable amount of time.
  • Fast ingestion into the system. We absolutely cannot slow down the client system due to it’s wanting to push data to us.
  • Metrics on systems and how they are performing in near real-time. Derived from above, basically aggregations over the events based on arbitrary criteria.
  • Multi-tenancy of clients pushing data to it. Clean abstractions regarding multi-tenancy right down to the storage layers both for query and raw stores, so that multiple-teams can push and view data without them having to worry about stability and/or their data getting corrupted by event stream from other systems
  • Consoles that are easy to build and  share. We needed to have a re-configurable console that teams could customize and share with each other. The console should not buckle under the pressure of millions of events getting ingested and many people having such consoles open to view real-time data on the same.
  • SQL interface to run queries on the data. Our analysts requested us to provide a SQL-ish interface to the system so that they can easily access and analyze the data without getting into the JSON DSL voodoo.
  • CSV downloads of recent data. This, again, was a request from analysts, as they wanted access to the raw events and aggregated results as CSV.
  • Extensibility of the system. We understood that teams will keep on requesting newer analytics and we would not be able to ship everything together at start. We had to plan to support quick development of newer analytics functions without having to touch a lot of the code.

Technology Used

Foxtrot unifies and abstracts a bunch of complex and proven technologies to bring a simple user experience in terms of event aggregation and metrics. Foxtrot uses the following tech stack:

  • HBase – Used as a simple key value store and saves data to be served out for queries and for usage as raw store in long-term map-reduce batch jobs.
  • Elasticsearch – Used as the primary query index. All fields of an event are indexed for the stipulated time and TTLs out. It does not store any document, but row keys only. We provide an optimized mapping in the source tree to the max performance out of elasticsearch.
  • Hazelcast – Used as a distributed caching layer till now, and caches query results for 30 secs. Cache keys are time-dependent and changes with the passage of time. Older data gets TTLd out.
  • SQL parser – For parsing FQL and converting them into Foxtrot JSON queries
  • Bootstrap, Jquery, HTML5, CSS3 –  Used to build the console

Architecture

The Foxtrot system uses a bunch of battle tested technologies to provide a robust and scalable system that meets the various requirements for  teams and allows for development for more types of analytics function on them.

Foxtrot Architecture

  • During ingestion, data is written to Elasticsearch quorum and HBase tables. Writes succeed only when events are written to both stores. HBase is written to before Elasticsearch. Data written to HBase will not become discoverable till it is indexed in Elasticsearch.
  • During query, the following steps are performed:
    • Translate FQL to Foxtrot JSON/native query format
    • Create cache key from the query.
    • Return results if found for this cache key.
    • Otherwise, figure out the time window for the analytics
    • Translate Foxtrot filters to elasticsearch filters
    • Execute the query on Elasticsearch and HBase and cache the results on Hazelcast if the action is cacheable and of reasonable size.

Misc

  • We try to avoid any heavy lifting during document ingestion. We recommend and internally use the /bulk API for event ingestion.
  • We have built in support for discovery of Foxtrot nodes using the /cluster/members API. We use this to distribute the indexing requests onto these different hosts, rather than sending data through a single load-balancer endpoint. We have a Java Client Library that can be embedded and used in client applications for this purpose. The Java client supports both on disk-queue based and async/direct senders for events.
  • We optimize the queries for every analytics by controlling the type of elasticsearch queries and filters being generated. Newer analytics can be easily added to the system by implementing the necessary abstract classes and with proper annotation. The server scans the classpath to pick up newer annotations, so it’s not mandatory for the analytics to be a part of the Foxtrot source tree.
  • In the runtime, the time window is detected and queries routed to indexes that are relevant to the query period. For efficiency reasons always use a time based filter in your queries. Most analytics will automatically add a filter to limit the query. Details for this can be found in individual analytics wiki pages.
  • We have added a simple console (Cluster tab of the console) with relevant parameters that can be used to monitor the health of the elasticsearch cluster.

Conclusion

Foxtrot has been in production for the better part of the year now at Flipkart and has ingested and provided analytics over hundreds of millions of events a day ranging over terabytes of data. Quite a few of our teams depend on this system for their application level metrics, as well as debugging and tracing  requirements.
Foxtrot is being released with Apache 2 License on github.
The source, wiki and issue tracker for Foxtrot is available at: Foxtrot github repository.
We look forward to you taking a look at the project and actively use and contribute to it. Please drop us a note if you find Foxtrot useful. Use github issues to file issues, feature requests. Use pull requests to send us code.
Special thanks to the Mobile API, checkout and notifications team for the support. Many thanks to Regu (@RegunathB) for providing some very important and much needed feedback on the usability of the system and the documentation as a whole and for guiding us meticulously through the process of open-sourcing this.