Is Data Locality Always Out Of The Box in Hadoop ? Not Really !

Introduction

Hadoop optimizes on Data Locality: Moving compute to data is cheaper than moving data to compute. It can schedule jobs to nodes that are local for input stream, which results in high performance. But, is this really out of the box ? Not always !  This blog explains couple of data locality issues that we identified and fixed.

User Insights Team

Flipkart collects terabytes of user data like browse, search, orders etc. User Insights Team uses these signals and derives useful insights about the user. These can be broadly classified as one of the following

  • Direct Insights Directly derived from user data. e.g. Location,
  • Descriptive Insights Describes the user, his preferences, likes/dislikes, habits, etc. e.g. Brand Affinities, Product Category Affinities, Life-stage
  • Predictive Insights Predict the next behavior or event of a user. e.g. Probability of purchase in session, cancellation/return.

These insights are used to personalize the user experience, give better recommendations and target relevant advertisements.

User Insights
User Insights

HBase as the data store for User Insights

The data collected is stored distributively and processed on Hadoop Cluster. HBase is used for fast lookup of user signals during generation of insights.

Multiple mappers hitting the same HBase region server

At times, full HBase table scans are required. During these scans some mappers finish in less than 10 minutes where as many take hours to finish for approximately same size of HBase regions. We picked the regions which were taking hours to finish and started running them one by one, to our surprise they also finished in under 10 minutes. But when ran with some other regions on the same region server they were slowing down. We started profiling the nodes where the jobs were running very slow and saw that CPU, Memory were fine, but network was choking. We used network profiling tools like nload and iftop to look at the network usage. This showed us that the outbound network traffic was maxing on the bandwidth of the network card and slowing down outbound network transfers. iftop showed us that most of the outbound traffic was from the HBase region server process.

We suspected that the way the input data was split and maps were launched caused too many mappers to hit the same region server. Digging into HBase code which creates the TableInputFormat having TableSplits for HBase scan jobs, found that if the scan range has consecutive regions belonging to the same region server, it will return them in a sequential order and mappers will be launched and allocated in the same order. This will cause all the mappers to hit the same region server and choke the network bandwidth. We wrote the below RoundRobinTableInputFormat which tries to order the splits in a round robin way for each of the used region servers, thereby reducing the probability of the having lot of mappers hitting the same region server.

Code
Gist Link: https://gist.github.com/sudhir-reddy/4eb430d4be9f9f404c38

YARN Fair Scheduler and Data Locality

Even when the cluster was free, we observed that we had 5-10% of data locality for a job. Something seemed suspicious and we started looking at how data locality happens in YARN. Client submits job to YARN Resource Manager(RM), RM allocates a container for running the Application Master(AM) on one of the cluster node. Data to be processed is split into InputSplit for each mapper task. InputSplit has information on where the data is located for the split task. AM asks RM to provide containers for running the mapper tasks preferably on the nodes where the data is located. RM uses the scheduler config to figure out which level of data locality has to be offered to the container request. RM allocates Containers to run on Node Manager(NM) and AM executes the mapper tasks in the allocated containers.

YARN Resource Manager
YARN Resource Manager

We are using FairScheduler in our YARN setup. FairScheduler has configs yarn.scheduler.fair.locality.threshold.node and yarn.scheduler.fair.locality.threshold.rack to tweak data locality. On setting these values to max data locality(=1), we were not able to observe any change in data locality tasks. Digging deeper in to the scheduler code, we found that these values will have no effect without setting the undocumented configs: yarn.scheduler.fair.locality-delay-node-ms and yarn.scheduler.fair.locality-delay-rack-ms. Even on setting these values, not even a single mapper was going to the data local node. Looking further into the scheduler code and container assignment code, figured out that the data local requests will be allocated if the available YARN Node Manager node name and Application Master requesting container for data on a given node name matches. Looking at the YARN node manager console saw that hosts were added with non-FQDN(Fully Qualified Domain Name, eg: hadoop01) and HBase region servers were returning data locality region server names with FQDN(hadoop01.something.flipkart.com). These hosts have non FQDN names in /etc/hostname and kernel(can be seen by `hostname` command) . After updating the hostname to FQDN and running the job again, we were able to get around 97.6% data locality and job’s runtime got reduced by 37% ! Looking for filing a bug on YARN, found that there is one bug already filed and fixed in version YARN 2.1.0. If anyone is using version of YARN less than 2.1.0, should set the hostname to FQDN for getting better data locality.

The above were two instances that cautions us about data locality assumptions in Hadoop and is worth addressing by teams that run large Hadoop based MR workloads.