Chad Carson and Sean Suchter
Effective Multi-Tenant Distributed Systems
Challenges and Solutions when Running Complex Environments
Effective Multi-Tenant Distributed Systems
by Chad Carson and Sean Suchter
Copyright © 2017 Pepperdata, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department:
800-998-9938 or corporate@oreilly.com.
Editors: Nicole Taché and Debbie Hardin
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

October 2016: First Edition
Revision History for the First Edition
2016-10-10: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Effective Multi-Tenant Distributed Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents
1. Introduction to Multi-Tenant Distributed Systems
    The Benefits of Distributed Systems
    Performance Problems in Distributed Systems
    Lack of Visibility Within Multi-Tenant Distributed Systems
    The Impact on Business from Performance Problems
    Scope of This Book

2. Scheduling in Distributed Systems
    Introduction
    Dominant Resource Fairness Scheduling
    Aggressive Scheduling for Busy Queues
    Special Scheduling Treatment for Small Jobs
    Workload-Specific Scheduling Considerations
    Inefficiencies in Scheduling
    Summary

3. CPU Performance Considerations
    Introduction
    Algorithm Efficiency
    Kernel Scheduling
    I/O Waiting and CPU Cache Impacts
    Summary

4. Memory Usage in Distributed Systems
    Introduction
    Physical Versus Virtual Memory
    Node Thrashing
    Kernel Out-Of-Memory Killer
    Implications of Memory-Intensive Workloads for Multi-Tenant Distributed Systems
    Summary

5. Disk Performance: Identifying and Eliminating Bottlenecks
    Introduction
    Overview of Disk Performance Limits
    Disk Behavior When Using Multiple Disks
    Disk Performance in Multi-Tenant Distributed Systems
    Controlling Disk I/O Usage to Improve Performance for High-Priority Applications
    Solid-State Drives and Distributed Systems
    Measuring Performance and Diagnosing Problems
    Summary

6. Network Performance Limits: Causes and Solutions
    Introduction
    Bandwidth Problems in Distributed Systems
    Other Network-Related Bottlenecks and Problems
    Measuring Network Performance and Debugging Problems
    Summary

7. Other Bottlenecks in Distributed Systems
    Introduction
    NameNode Contention
    ResourceManager Contention
    ZooKeeper
    Locks
    External Databases and Related Systems
    DNS Servers
    Summary

8. Monitoring Performance: Challenges and Solutions
    Introduction
    Why Monitor?
    What to Monitor
    Systems and Performance Aspects of Monitoring
    Algorithmic and Logical Aspects of Monitoring
    Measuring the Effect of Attempted Improvements
    Allocating Cluster Costs Across Tenants
    Summary

9. Conclusion: Performance Challenges and Solutions for Effective Multi-Tenant Distributed Systems
CHAPTER 1
Introduction to Multi-Tenant Distributed Systems
The Benefits of Distributed Systems
The past few decades have seen an explosion of computing power. Search engines, social networks, cloud-based storage and computing, and similar services now make seemingly infinite amounts of information and computation available to users across the globe. The tremendous scale of these services would not be possible without distributed systems. Distributed systems make it possible for many hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer. The primary benefit of a distributed system is clear: the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago.
Performance Problems in Distributed Systems
As more and more nodes are added to the distributed system and interact with one another, and as more and more developers write and run applications on the system, complications arise. Operators of distributed systems must address an array of challenges that affect the performance of the system as a whole as well as individual applications’ performance.
These performance challenges are different from those faced when operating a data center of computers that are running more or less independently, such as a web server farm. In a true distributed system, applications are split into smaller units of work, which are spread across many nodes and communicate with one another either directly or via shared input/output data.
Additional performance challenges arise with multi-tenant distributed systems, in which different users, groups, and possibly business units run different applications on the same cluster. (This is in contrast to a single, large distributed application, such as a search engine, which is quite complex and has intertask dependencies but is still just one overall application.) These challenges that come with multitenancy result from the diversity of applications running together on any node as well as the fact that the applications are written by many different developers instead of one engineering team focused on ensuring that everything in a single distributed application works well together.
Scheduling
One of the primary challenges in a distributed system is in scheduling jobs and their component processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those jobs. Even sophisticated distributed-system schedulers have limitations that can lead to underutilization of cluster hardware, unpredictable job run times, or both. Examples include assuming the worst-case resource usage to avoid overcommitting, failing to plan for different resource types across different applications, and overlooking one or more dependencies, thus causing deadlock or starvation.
The scheduling challenges become more severe on multi-tenant clusters, which add fairness of resource access among users as a scheduling goal, in addition to (and often in conflict with) the goals of high overall hardware utilization and predictable run times for high-priority applications. Aside from the challenge of balancing utilization and fairness, in some extreme cases the scheduler might go too far in trying to ensure fairness, scheduling just a few tasks from many jobs for many users at once. This can result in latency for every job on the cluster and cause the cluster to use resources inefficiently because the system is trying to do too many disparate things at the same time.
Hardware Bottlenecks
Beyond scheduling challenges, there are many ways a distributed system can suffer from hardware bottlenecks and other inefficiencies. For example, a single job can saturate the network or disk I/O, slowing down every other job. These potential problems are only exacerbated in a multi-tenant environment—usage of a given hardware resource such as CPU or disk is often less efficient when a node has many different processes running on it. In addition, operators cannot tune the cluster for a particular access pattern, because the access patterns are both diverse and constantly changing. (Again, contrast this situation with a farm of servers, each of which is independently running a single application, or a large cluster running a single coherently designed and tuned application like a search engine.)

Distributed systems are also subject to performance problems due to bottlenecks from centralized services used by every node in the system. One common example is the master node performing job admission and scheduling; others include the master node for a distributed file system storing data for the cluster as well as common services like domain name system (DNS) servers.
These potential performance challenges are exacerbated by the fact that a primary design goal for many modern distributed systems is to enable large numbers of developers, data scientists, and analysts to use the system simultaneously. This is in stark contrast to earlier distributed systems such as high-performance computing (HPC) systems, in which the only people who could write programs to run on the cluster had a systems programming background. Today, distributed systems are opening up enormous computing power to people without a systems background, so they often don’t understand or even think about system performance. Such a user might easily write a job that accidentally brings a cluster to its knees, affecting every other job and user.
Lack of Visibility Within Multi-Tenant Distributed Systems
Because multi-tenant distributed systems simultaneously run many applications, each with different performance characteristics and written by different developers, it can be difficult to determine what’s going on with the system, whether (and why) there’s a problem, which users and applications are the cause of any problem, and what to do about such problems.
Traditional cluster monitoring systems are generally limited to tracking metrics at the node level; they lack visibility into detailed hardware usage by each process. Major blind spots can result—when there’s a performance problem, operators are unable to pinpoint exactly which application caused it, or what to do about it. Similarly, application-level monitoring systems tend to focus on overall application semantics (overall run times, data volumes, etc.) and do not drill down to performance-level metrics for actual hardware resources on each node that is running a part of the application.
Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a sufficient level of granularity for each interesting process on each node. Gathering, processing, and presenting this data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable fashion) and the presentation-level logic and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.
The Impact on Business from Performance Problems
The performance challenges described in this book can easily lead to business impacts such as the following:

Inconsistent, unpredictable application run times
Batch jobs might run late, interactive applications might respond slowly, and the ingestion and processing of new incoming data for use by other applications might be delayed.
Underutilized hardware
Job queues can appear full even when the cluster hardware is not running at full capacity. This inefficiency can result in higher capital and operating expenses; it can also result in significant delays for new projects due to insufficient hardware, or even the need to build out new data-center space to add new machines for additional processing power.
Cluster instability
In extreme cases, nodes can become unresponsive or a distributed file system (DFS) might become overloaded, so applications cannot run or are significantly delayed in accessing data.

Aside from these obvious effects, performance problems also cause businesses to suffer in subtler but ultimately more significant ways. Organizations might informally “learn” that a multi-tenant cluster is unpredictable and build implicit or explicit processes to work around the unpredictability, such as the following:

• Limit cluster access to a subset of developers or analysts, out of a concern that poorly written jobs will slow down or even crash the cluster for everyone.

• Build separate clusters for different groups or different workloads so that the most important applications are insulated from others. Doing so increases overall cost due to inefficiency in resource usage, adds operational overhead and cost, and reduces the ability to share data across groups.
• Set up “development” and “production” clusters, with a committee or other cumbersome process to approve jobs before they can be run on a production cluster. Adding these hurdles can dramatically hinder innovation, because they significantly slow the feedback loop of learning from production data, building and testing a new model or new feature, deploying it to production, and learning again.1

1 We saw an example of the benefits of having an extremely short feedback loop at Yahoo in 2006–2007, when the sponsored search R&D team was an early user of the very first production Hadoop cluster anywhere. By moving to Hadoop and being able to deploy new click prediction models directly into production, we increased the number of simultaneous experiments by five times or more and reduced the feedback loop time by a similar factor. As a result, our models could improve an order of magnitude faster, and the revenue gains from those improvements similarly compounded that much faster.
These responses to unpredictable performance can limit a business’s ability to fully benefit from the potential of distributed systems. Eliminating performance problems on the cluster can improve performance of the business overall.
Scope of This Book
In this book, we consider the performance challenges that arise from scheduling inefficiencies, hardware bottlenecks, and lack of visibility. We examine each problem in detail and present solutions that organizations use today to overcome these challenges and benefit from the tremendous scale and efficiency of distributed systems.
Hadoop: An Example Distributed System
This book uses Hadoop as an example of a multi-tenant distributed system. Hadoop serves as an ideal example of such a system because of its broad adoption across a variety of industries, from healthcare to finance to transportation. Due to its open source availability and a robust ecosystem of supporting applications, Hadoop’s adoption is increasing among small and large organizations alike.
Hadoop is also an ideal example because it is used in highly multi-tenant production deployments (running jobs from many hundreds of developers) and is often used to simultaneously run large batch jobs, real-time stream processing, interactive analysis, and customer-facing databases. As a result, it suffers from all of the performance challenges described herein.
Of course, Hadoop is not the only important distributed system; a few other examples include the following:2

2 Various distributed systems are designed to make different tradeoffs among Consistency, Availability, and Partition tolerance. For more information, see Gilbert, Seth, and Nancy Ann Lynch, “Perspectives on the CAP Theorem,” Institute of Electrical and Electronics Engineers, 2012 (http://hdl.handle.net/1721.1/79112) and https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed.
• Classic HPC clusters using MPI, TORQUE, and Moab
• Distributed databases such as Oracle RAC, Teradata, Cassandra, and MongoDB
• Render farms used for animation
• Simulation systems used for physics and manufacturing
Task or container
A single atomic unit of work that is part of a job. This work is done on a single node, generally running as a single (sometimes multithreaded) process on the node.
Host, machine, or node
A single computing node, which can be an actual physical computer or a virtual machine.
CHAPTER 2
Scheduling in Distributed Systems
Introduction
In distributed computing, a scheduler is responsible for managing incoming container requests and determining which containers to run next, on which node to run them, and how many containers to run in parallel on the node. (Container is a general term for individual parts of a job; some systems use other terms such as task to refer to a container.) Schedulers range in complexity, with the simplest having a straightforward first-in–first-out (FIFO) policy. Different schedulers place more or less importance on various (often conflicting) goals, such as the following:
• Utilizing cluster resources as fully as possible
• Giving each user and group fair access to the cluster
• Ensuring that high-priority or latency-sensitive jobs complete on time
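For concreteness, the simplest policy mentioned above, FIFO, can be sketched in a few lines; the job structure and class here are a made-up illustration, not any real scheduler's API:

```python
from collections import deque

# Minimal FIFO scheduler sketch: jobs run strictly in arrival order,
# with no notion of fairness, priority, or per-user share.
class FifoScheduler:
    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def assign(self, free_slots):
        """Hand out containers from the oldest job first."""
        scheduled = []
        while free_slots > 0 and self.queue:
            job = self.queue[0]
            take = min(free_slots, job["pending"])
            job["pending"] -= take
            free_slots -= take
            scheduled.append((job["name"], take))
            if job["pending"] == 0:
                self.queue.popleft()
        return scheduled
```

The weakness is visible directly in `assign`: a huge job at the head of the queue monopolizes every free slot until it finishes, which is what motivates the fairness-oriented policies discussed next.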
Multi-tenant distributed systems generally prioritize fairness among users and groups over optimal packing and maximal resource usage; without fairness, users would be likely to maximize their own access to the cluster without regard to others’ needs. Also, different groups and business units would be inclined to run their own smaller, less efficient cluster to ensure access for their users.
In the context of Hadoop, one of two schedulers is most commonly used: the capacity scheduler and the fair scheduler. Historically, each scheduler was written as an extension of the simple FIFO scheduler,
and initially each had a different goal, as their names indicate. Over time, the two schedulers have experienced convergent evolution, with each incorporating improvements from the other; today, they are mostly different in details. Both schedulers have the concept of multiple queues of jobs to be scheduled, with admission to each queue determined based on user- or operator-specified policies.

Recent versions of Hadoop1 perform two-level scheduling, in which a centralized scheduler running on the ResourceManager node assigns cluster resources (containers) to each application, and an ApplicationMaster running in one of those containers uses the other containers to run individual tasks for the application. The ApplicationMaster manages the details of the application, including communication and coordination among tasks. This architecture is much more scalable than Hadoop’s original one-level scheduling, in which a single central node (the JobTracker) did the work of both the ResourceManager and every ApplicationMaster.

1 The new architecture is referred to as Yet Another Resource Negotiator (YARN) or MapReduce v2. See https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html.
2 See http://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html.
3 See https://www.quora.com/How-does-two-level-scheduling-work-in-Apache-Mesos.
Many other modern distributed systems like Dryad and Mesos have schedulers that are similar to Hadoop’s schedulers. For example, Mesos also supports a pluggable scheduler interface much like Hadoop,2 and it performs two-level scheduling,3 with a central scheduler that registers available resources and assigns them to applications (“frameworks”).
Dominant Resource Fairness Scheduling
Historically, most schedulers considered only a single type of hardware resource when deciding which container to schedule next—both in calculating the free resources on each node and in calculating how much a given user, group, or queue was already using (e.g., from the point of view of fairness in usage). In the case of Hadoop, only memory usage was considered.

However, in a multi-tenant distributed system, different jobs and containers generally have widely different hardware usage profiles—some containers require significant memory, whereas some use CPU much more heavily (see Figure 2-1). Not considering CPU usage in scheduling meant that the system might be significantly underutilized, and some users would end up getting more or less than their true fair share of the cluster. A policy called Dominant Resource Fairness (DRF)4 addresses these limitations by considering multiple resource types and expressing the usage of each resource in a common currency (the share of the total allocation of that resource), and then scheduling based on the resource each container is using most heavily.

4 Ghodsi, Ali, et al. “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types.” NSDI Vol. 11, 2011. https://www.cs.berkeley.edu/~alig/papers/drf.pdf
Figure 2-1. Per-container physical memory usage versus CPU usage during a representative period of time on a production cluster. Note that some jobs consume large amounts of memory while using relatively little CPU; others use significant CPU but relatively little memory.
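The DRF calculation itself is compact. In this sketch, the cluster totals and the two users (one memory-heavy, one CPU-heavy, echoing Figure 2-1) are invented for illustration and do not come from any real scheduler:

```python
# Sketch of Dominant Resource Fairness (DRF). Cluster totals, users, and
# per-task demands below are invented for illustration.
TOTAL = {"cpu": 9, "memory_gb": 18}

users = {
    "alice": {"cpu": 0, "memory_gb": 0},
    "bob": {"cpu": 0, "memory_gb": 0},
}
# alice's tasks are memory-heavy; bob's are CPU-heavy.
demands = {
    "alice": {"cpu": 1, "memory_gb": 4},
    "bob": {"cpu": 3, "memory_gb": 1},
}

def dominant_share(usage):
    # Express usage of each resource as a share of the cluster total (the
    # "common currency"), then take the largest: that is the dominant share.
    return max(usage[r] / TOTAL[r] for r in TOTAL)

def schedule_next():
    # DRF launches the next task for the user with the smallest dominant share.
    user = min(users, key=lambda u: dominant_share(users[u]))
    for r in TOTAL:
        users[user][r] += demands[user][r]
    return user
```

Calling `schedule_next()` repeatedly equalizes the users' dominant shares (alice's memory share against bob's CPU share) rather than their raw task counts, which is the core of the DRF policy.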
In Hadoop, operators can configure both the Fair Scheduler and the Capacity Scheduler to consider both memory and CPU (using the DRF framework) when considering which container to launch next on a given node.5

5 See http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html and http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html.
Aggressive Scheduling for Busy Queues
Often a multi-tenant cluster might be in a state where some but not all queues are full; that is, some tenants currently don’t have enough work to use their full share of the cluster, but others have more work than they are guaranteed based on the scheduler’s configured allocation. In such cases, the scheduler might launch more containers from the busy queues to keep the cluster fully utilized.
Sometimes, after those extra containers are launched, new jobs are submitted to a queue that was previously empty; based on the scheduler’s policy, containers from those jobs should be scheduled immediately, but because the scheduler has already opportunistically launched extra containers from other queues, the cluster is full. In those cases, the scheduler might preempt those extra containers by killing some of them in order to reflect the desired fairness policy (see Figure 2-2). Preemption is a common feature in schedulers for multi-tenant distributed systems, including both popular Hadoop schedulers (capacity and fair).
Because preemption inherently results in lost work, it’s important for the scheduler to strike a good balance between starting many opportunistic containers to make use of idle resources and avoiding too much preemption and the waste that it causes. To help reduce the negative impacts of preemption, the scheduler can slightly delay killing containers (to avoid wasting the work of containers that are almost complete) and generally chooses to kill containers that have recently launched (again, to avoid wasted work).
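That victim-selection heuristic, sparing nearly finished containers and preferring the newest, can be sketched as follows; the `progress` field and the 0.9 cutoff are illustrative assumptions, not any scheduler's actual configuration:

```python
# Sketch of choosing preemption victims among opportunistic containers.
# Field names and the 0.9 "almost done" cutoff are illustrative assumptions.
def pick_victims(containers, slots_needed):
    candidates = [
        c for c in containers
        if c["opportunistic"] and c["progress"] < 0.9  # spare nearly finished work
    ]
    # Kill the most recently launched first, wasting the least completed work.
    candidates.sort(key=lambda c: c["start_time"], reverse=True)
    return candidates[:slots_needed]
```

Only containers running beyond their queue's guarantee are candidates; guaranteed containers are never touched, which preserves the fairness policy the preemption exists to enforce.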
Figure 2-2. When new jobs arrive in Queue A, they might be scheduled if there is sufficient unused cluster capacity, allowing Queue A to use more than its guaranteed share. If jobs later arrive in Queue B, the scheduler might then preempt some of the Queue A jobs to provide Queue B its guaranteed share.

A related concept is used by Google’s Borg system,6 which has a concept of priorities and quotas; a quota represents a set of hardware resource quantities (CPU, memory, disk, etc.) for a period of time, and higher-priority quota costs more than lower-priority quota. Borg never allocates more production-priority quota than is available on a given cluster; this guarantees production jobs the resources they need. At any given time, excess resources that are not being used by production jobs can be used by lower-priority jobs, but those jobs can be killed if the production jobs’ usage later increases. (This behavior is similar to another kind of distributed system, Amazon Web Services, which has a concept of guaranteed instances and spot instances; spot instances cost much less than guaranteed ones but are subject to being killed at any time.)

6 Verma, Abhishek, et al. “Large-scale cluster management at Google with Borg.” Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015. http://research.google.com/pubs/archive/43438.pdf
Special Scheduling Treatment for Small Jobs
Some cluster operators provide special treatment for small or fast jobs; in a sense, this is the opposite of preemption. One example is LinkedIn’s “fast queue” for Hadoop, which is a small queue that is used only for jobs that take less than an hour total to run and whose containers each take less than 15 minutes.7 If jobs or containers violate this limit, they are automatically killed. This feature provides fast response for smaller jobs even when the cluster is bogged down by large batch jobs; it also encourages developers to optimize their jobs to run faster.

The Hadoop vendor MapR provides somewhat similar functionality with its ExpressLane,8 which schedules small jobs (as defined by having few containers, each with low memory usage and small input data sizes) to run on the cluster even when the cluster is busy and has no additional capacity for normal jobs. This is also an interesting example of using the input data size as a cue to the scheduler about how fast a container is likely to be.

7 See slide 9 of http://www.slideshare.net/Hadoop_Summit/hadoop-operations-at-linkedin.
8 See http://doc.mapr.com/display/MapR/ExpressLane.
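A watchdog enforcing fast-queue limits might look like the following sketch. The two thresholds match the limits cited above for LinkedIn's fast queue, while the job structure and function itself are hypothetical:

```python
# Sketch of enforcing a "fast queue" policy: jobs over 1 hour total, or any
# container over 15 minutes, are killed. Structures here are hypothetical.
JOB_LIMIT_S = 60 * 60        # 1 hour for the whole job
CONTAINER_LIMIT_S = 15 * 60  # 15 minutes per container

def violations(job, now):
    """Return the reasons this fast-queue job should be killed, if any."""
    reasons = []
    if now - job["start_time"] > JOB_LIMIT_S:
        reasons.append("job exceeded 1 hour")
    for c in job["containers"]:
        if now - c["start_time"] > CONTAINER_LIMIT_S:
            reasons.append(f"container {c['id']} exceeded 15 minutes")
    return reasons
```

A periodic sweep would call `violations` for every job in the fast queue and kill any job with a nonempty result.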
Workload-Specific Scheduling Considerations
Aside from the general goals of high utilization and fairness across users and queues, schedulers might take other factors into account when deciding which containers to launch and where to run them. For example, a key design point of Hadoop is to move computation to the data. (The goal is to not just get the nodes to work as hard as they can, but also get them to work more efficiently.) The scheduler tries to accomplish this goal by preferring to place a given container on one of the nodes that have the container’s input HDFS data stored locally; if that can’t be done within a certain amount of time, it then tries to place the container on the same rack as a node that has the HDFS data; if that also can’t be done after waiting a certain amount of time, the container is launched on any node that has available computing resources. Although this approach increases overall system efficiency, it complicates the scheduling problem.
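The escalating preference just described (node-local, then rack-local, then anywhere) can be sketched as follows; the wait thresholds and data structures are illustrative assumptions, not Hadoop's actual configuration names:

```python
# Sketch of locality-aware placement with timed fallback, in the spirit of
# Hadoop's delay scheduling. Thresholds and structures are made up.
NODE_LOCAL_WAIT_S = 3
RACK_LOCAL_WAIT_S = 6

def place(container, free_nodes, waited_s):
    """Return a node for the container, or None to keep waiting."""
    # Best case: a free node that actually stores the container's HDFS input.
    for n in free_nodes:
        if n["host"] in container["block_hosts"]:
            return n
    if waited_s < NODE_LOCAL_WAIT_S:
        return None  # hold out a little longer for a node-local slot
    # Next best: a free node on the same rack as a replica of the input.
    for n in free_nodes:
        if n["rack"] in container["block_racks"]:
            return n
    if waited_s < RACK_LOCAL_WAIT_S:
        return None  # hold out a little longer for a rack-local slot
    # Give up on locality: any free node will do.
    return free_nodes[0] if free_nodes else None
```

Returning `None` is what makes this a delay strategy: the scheduler skips the container this round, betting that a better-placed slot will free up before the wait threshold expires.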
An example of a different kind of placement constraint is the support for pods in Kubernetes. A pod is a group of containers, such as Docker containers, that are scheduled at the same time on the same node. Pods are frequently used to provide services that act as helper programs for an application. Unlike the preference for data locality in Hadoop scheduling, the colocation and coscheduling of containers in a pod is a hard requirement; in many cases the application simply would not work without the auxiliary services running on the same node.

9 Nurmi, Daniel, et al. “Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction.” Proceedings of the 2006 ACM/IEEE conference on Supercomputing. ACM, 2006. http://www.cs.ucsb.edu/~nurmi/nurmi_workflow.pdf
A weaker constraint than colocation is the concept of gang scheduling, in which an application requires all of its resources to run concurrently, but they don’t need to run on the same node. An example is a distributed database like Impala, which needs to have all of its “query fragments” running in order to serve queries. Although some distributed systems’ schedulers support gang scheduling natively, Hadoop doesn’t currently support gang scheduling; applications that require concurrent containers mimic gang scheduling by keeping containers alive but idle until all of the required containers are running. This workaround clearly wastes resources because these idle containers hold resources and stop other containers from running. However, even when gang scheduling is done “cleanly” by the scheduler, it can lead to inefficiencies because the scheduler needs to avoid fully loading the cluster with other containers to ensure that enough space will eventually be available for the entire gang to be scheduled.
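A scheduler with native gang support admits an application only when the entire gang fits, all or nothing. The following is an illustrative sketch; the greedy placement and all data structures are assumptions, not any real scheduler's logic:

```python
# Sketch of gang-scheduling admission: launch all of an application's
# containers at once, or none of them. Structures are illustrative.
def try_launch_gang(app, nodes):
    """Reserve a slot for every container in the gang, atomically."""
    plan = []
    free = {n["name"]: n["free_slots"] for n in nodes}  # tentative view
    for _ in range(app["containers"]):
        # Greedily pick the node with the most free slots for each container.
        name = max(free, key=free.get)
        if free[name] == 0:
            return None  # the gang does not fit; launch nothing at all
        free[name] -= 1
        plan.append(name)
    # Commit the reservations only once the entire gang has a slot.
    for n in nodes:
        n["free_slots"] = free[n["name"]]
    return plan
```

Because the tentative view is only committed on success, a gang that does not fit leaves the cluster state untouched, avoiding the held-but-idle containers of the Hadoop workaround described above.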
As a side note, workflow schedulers such as Oozie are given information about the dependencies among jobs in a complex workflow that must happen in order; the workflow scheduler then submits the individual jobs to the distributed system on behalf of the user. A workflow scheduler can take into account the required inputs and outputs of each stage (including inputs that depend on some off-cluster process to write new data to the cluster), the time of day the workflow should be started, awareness of the full directed acyclic graph (DAG) of the entire workflow, and similar constraints. Generally, the workflow scheduler is distinct from the distributed system’s own scheduler that determines exactly where and when containers are launched on each node, but there are cases when overall scheduling can be much more efficient if workflow scheduling and resource scheduling are combined.9
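The core of that dependency handling, submitting each job only after everything it depends on has finished, reduces to a topological walk of the workflow DAG. This standalone sketch uses Python's standard library and a made-up workflow, not Oozie's actual format:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Sketch of workflow ordering: each job lists the jobs it depends on.
# The workflow itself is hypothetical.
workflow = {
    "ingest": [],
    "clean": ["ingest"],
    "features": ["clean"],
    "train": ["features"],
    "report": ["clean"],  # needs only the cleaned data, not the model
}

def submission_order(dag):
    """Return one valid order in which to submit the jobs."""
    return list(TopologicalSorter(dag).static_order())
```

A real workflow scheduler would additionally submit independent jobs (here, `features` and `report`) concurrently and wait on each job's completion before releasing its dependents.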
Inefficiencies in Scheduling
Although schedulers have become more sophisticated over time, they continue to suffer from inefficiencies related to the diversity of workloads running on multi-tenant distributed systems. These inefficiencies arise from the need to avoid overcommitting memory when doing up-front scheduling, a limited ability to consider all types of hardware resources, and challenges in considering the dependencies among all jobs and containers within complicated workflows.
The Need to be Conservative with Memory
Distributed system schedulers generally make scheduling decisions based on conservative assumptions about the hardware resources—especially memory—required by each container. These requirements are usually declared by the job author based on the worst-case usage, not the actual usage. This difference is critical because often different containers from the same job have different actual resource usage, even if they are running identical code. (This happens, for example, when the input data for one container is larger or otherwise different from the input data for other containers, resulting in a need for more processing or more space in memory.)
If a node’s resources are fully scheduled and the node is “unlucky” in the mix of containers it’s running, the node can be overloaded; if the resource that is overloaded is memory, the node might run out of memory and crash or start swapping badly. In a large distributed system, some nodes are bound to be unlucky in this way, so if the scheduler does not use conservative resource usage estimates, the system will nearly always be in a bad state.
The need to be conservative with memory allocation means that most nodes will be underutilized most of the time; containers generally do not often use their theoretical maximum memory, and even when they do, it’s not for the full lifetime of the container (see Figure 2-3). (In some cases, containers can use even more than their declared maximum. Systems can be more or less stringent about enforcing what the developer declares—some systems kill containers when they exceed their maximum memory, but others do not.10)

10 For example, Google’s Borg kills containers that try to exceed their declared memory limit. Hadoop by default lets containers go over, but operators can configure it to kill such containers.
Figure 2-3. Actual physical memory usage compared to the container size (the theoretical maximum) for a typical container. Note that the actual usage changes over time and is much smaller than the reserved amount.
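The gap between declared worst-case sizes and typical actual usage is easy to quantify; in this sketch every number is invented for illustration:

```python
# Sketch of why worst-case memory reservations leave nodes underutilized.
# All container sizes below are invented for illustration.
containers = [
    {"declared_gb": 4.0, "actual_gb": 1.2},
    {"declared_gb": 4.0, "actual_gb": 3.9},  # an "unlucky" container near its cap
    {"declared_gb": 2.0, "actual_gb": 0.6},
    {"declared_gb": 8.0, "actual_gb": 2.5},
]

# The scheduler must budget against declared (worst-case) sizes...
reserved_gb = sum(c["declared_gb"] for c in containers)
# ...but the node only ever experiences the actual usage.
actual_gb = sum(c["actual_gb"] for c in containers)

utilization = actual_gb / reserved_gb  # fraction of the reservation in use
```

In this made-up mix the node is fully booked on paper while less than half the reserved memory is actually in use, which is exactly the waste that the balancing act described next tries to recover.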
To reduce the waste associated with this underutilization, operators of large multi-tenant distributed systems often must perform a balancing act, trying to increase cluster utilization without pushing nodes over the edge. As described in Chapter 4, software like Pepperdata provides a way to increase utilization for distributed systems such as Hadoop by monitoring actual physical memory usage and dynamically allowing more or fewer processes to be scheduled on a given node, based on the current and projected future memory usage on that node.
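The basic idea can be sketched as a simple admission check. This is an illustrative model only, not Pepperdata's actual algorithm; the function name, parameters, and headroom threshold are invented for the example:

```python
# Illustrative sketch: a node-level admission check that overcommits
# declared memory when actual usage is low. Names and the 10% safety
# headroom are assumptions for this example, not a real scheduler's.
def can_admit(declared_mb, node_capacity_mb, scheduled_declared_mb,
              actual_used_mb, headroom_fraction=0.1):
    """Decide whether one more container fits on a node.

    A purely static scheduler compares only declared reservations; a
    dynamic one also looks at what is actually in use, allowing
    overcommit as long as real usage plus the newcomer's declared
    worst case still leaves some safety headroom.
    """
    headroom = node_capacity_mb * headroom_fraction
    static_ok = scheduled_declared_mb + declared_mb <= node_capacity_mb
    dynamic_ok = actual_used_mb + declared_mb + headroom <= node_capacity_mb
    return static_ok or dynamic_ok

# A 64 GB node with 60 GB "reserved" but only 20 GB actually in use:
print(can_admit(8192, 65536, 61440, 20480))  # -> True (dynamic path)
```

A static scheduler would reject this container outright; the dynamic check admits it because the node's real memory pressure is far below its declared commitments.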
Inability to Effectively Schedule the Use of Other Resources
Similar inefficiencies can occur due to the natural variation over time in the resource usage for a single container, not just variation across containers. For a given container, memory usage tends to vary by a factor of two or three over the lifetime of the container, and the variation is generally smooth. CPU usage varies quite a bit over time, but the maximum usage is generally limited to a single core. In contrast, disk I/O and network usage frequently vary by orders of magnitude, and they spike very quickly. They are also effectively unlimited in how much of the corresponding resource they use: one single thread on one machine can easily saturate the full network bandwidth of the node or use up all available disk I/O operations per second (IOPS) and bandwidth from dozens of disks (including even disks on multiple machines, when the thread is requesting data stored on another node). See Figure 2-4 for the usage of various resources for a sample job. The left column shows overall usage for all map tasks (red, starting earlier) and reduce tasks (green, starting later). The right column shows a breakdown by individual task. (For this particular job, there is only one reduce task.)
Because CPU, disk, and network usage can change so quickly, it is impossible for any system that only does up-front scheduling to optimize cluster utilization and provide true fairness in the use of hardware resources.
Figure 2-4. The variation over time in usage of different hardware resources for a typical MapReduce job (source: Pepperdata).
11. Wang, Yang and Wei Shi. "Budget-driven scheduling algorithms for batches of MapReduce jobs in heterogeneous clouds." IEEE Transactions on Cloud Computing 2.3 (2014): 306-319. https://www.researchgate.net/publication/277583513_Budget-Driven-Scheduling-Algorithms-for-Batches-of-MapReduce-Jobs

12. See, for example, Chekuri, Chandra and Sanjeev Khanna. "On multidimensional packing problems." SIAM Journal on Computing 33.4 (2004): 837-851.
Deadlock and Starvation
In some cases, schedulers might choose to start some containers in a job's DAG even before the preceding containers (the dependencies) have completed. This is done to reduce the total run time of the job or spread out resource usage over time.
In the interest of concreteness, the discussion in this section uses map and reduce containers, but similar effects can happen any time a job has some containers that depend on the output of others; the problems are not specific to MapReduce or Hadoop.
An example is Hadoop's "slow start" feature, in which reduce containers might be launched before all of the map containers they depend on have completed. This behavior can help minimize spikes in network bandwidth usage by spreading out the heavy network traffic of transferring data from mappers to reducers. However, starting a reduce container too early means that it might end up just sitting on a node waiting for its input data (from map containers) to be generated, which means that other containers are not able to use the memory the reduce container is holding, thus affecting overall system utilization.13

13. See http://stackoverflow.com/questions/11672676/when-do-reduce-tasks-start-in-hadoop/11673808#11673808.

14. See https://issues.apache.org/jira/browse/MAPREDUCE-314 for an example.
This problem is especially common on very busy clusters with many tenants because often not all map containers from a job can be scheduled in quick succession; similarly, if a map container fails (for example, due to node failure), it might take a long time to get rescheduled, especially if other, higher-priority jobs have been submitted after the reducers from this job were scheduled. In extreme cases this can lead to deadlock, when the cluster is occupied by reduce containers that are unable to proceed because the containers they depend on cannot be scheduled.14 Even if deadlock does not occur, the cluster can still be utilized inefficiently, and overall job completion can be unnecessarily slow as measured by wall-clock time, if the scheduler launches just a small number of containers from each of many users at one time.
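The slow-start decision itself is a simple threshold. Hadoop's real knob is the `mapreduce.job.reduce.slowstart.completedmaps` property (default 0.05); the surrounding scheduler logic here is a simplified sketch:

```python
# Sketch of Hadoop-style "slow start": reducers become eligible to
# launch once a configured fraction of the job's maps has finished.
# The property name is Hadoop's actual configuration knob; the
# function around it is a simplification for illustration.
SLOWSTART_THRESHOLD = 0.05  # mapreduce.job.reduce.slowstart.completedmaps

def reducers_eligible(completed_maps, total_maps,
                      threshold=SLOWSTART_THRESHOLD):
    if total_maps == 0:
        return True
    return completed_maps / total_maps >= threshold

print(reducers_eligible(4, 100))  # -> False: too few maps done yet
print(reducers_eligible(5, 100))  # -> True: threshold reached
```

On a busy multi-tenant cluster, operators often raise this threshold toward 1.0, so reducers do not camp on memory while their maps are still waiting to be scheduled.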
A similar scheduling problem is starvation, which can occur on a heavily loaded cluster. For example, consider a case in which one job has containers that each need a larger amount of memory than containers from other jobs. When one of the small containers completes on a node, a naive scheduler will see that the node has a small amount of memory available, but because it can't fit one of the large containers there, it will schedule a small container to run. In the extreme case, the larger containers might never be scheduled. In Hadoop and other systems, the concept of a reservation allows an application to reserve available space on a node, even if the application can't immediately use it.15 (This behavior can help avoid starvation, but it also means that the overall utilization of the system is lower, because some amount of resources might be reserved but unused at any particular time.)

15. See Sulistio, Anthony, Wolfram Schiffmann, and Rajkumar Buyya. "Advanced reservation-based scheduling of task graphs on clusters." International Conference on High-Performance Computing. Springer Berlin Heidelberg, 2006. http://www.cloudbus.org/papers/workflow_hipc2006.pdf. For related recent work in Hadoop, see Curino, Carlo et al. "Reservation-based Scheduling: If You're Late Don't Blame Us!" Proceedings of the ACM Symposium on Cloud Computing. ACM, 2014. https://www.microsoft.com/en-us/research/publication/reservation-based-scheduling-if-youre-late-dont-blame-us/.

16. This is different from the standard use of the term "speculative execution," in which pipelined microprocessors sometimes execute both sides of a conditional branch before knowing which branch will be taken.
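The starvation scenario and the reservation fix can be shown with a toy scheduler; the function, queue, and memory figures below are invented purely for illustration:

```python
# Toy model of starvation and a reservation-based fix. A naive
# scheduler always runs whatever fits; a reservation holds freed
# memory for a large container even though it can't run yet.
def schedule(free_mb, queue, reserved_for=None):
    """Return the first queued container size that fits, or None."""
    if reserved_for is not None and reserved_for > free_mb:
        return None  # hold the node idle so freed memory accumulates
    for i, need in enumerate(queue):
        if need <= free_mb:
            return queue.pop(i)
    return None

# Naive: 1 GB containers keep slipping in ahead of the 8 GB one,
# which may never run on a busy cluster.
print(schedule(2048, [8192, 1024, 1024]))                     # -> 1024
# With a reservation, the node waits (lower utilization, no starvation):
print(schedule(2048, [8192, 1024, 1024], reserved_for=8192))  # -> None
```

Once enough memory has been freed under the reservation, the large container finally fits and is scheduled.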
Waste Due to Speculative Execution
Operators can configure Hadoop to use speculative execution, in
which the scheduler can observe that a given container seems to berunning more slowly than is typical for that kind of container andstart another copy of that container on another node This behavior
is primarily intended to avoid cases in which a particular node isperforming badly (usually due to a hardware problem) and an entirejob could be slowed down due to just one straggler container.16
While speculative execution can reduce job completion time due to node problems, it wastes resources when the container that is duplicated simply had more work to do than other containers and so naturally ran longer. In practice, experienced operators typically disable speculative execution on multi-tenant clusters, both because there is generally inherent container variation (not due to hardware problems) and because the operators are constantly watching for bad hardware, so speculative execution does not enhance performance.
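The core of such a straggler detector is a comparison against peer tasks. The heuristic below is a deliberate simplification (Hadoop's real implementation estimates remaining time rather than comparing raw rates, and the 0.5 factor is an assumption):

```python
# Hedged sketch of straggler detection: flag tasks whose progress
# rate is well below the median of their peers. Simplified relative
# to Hadoop's actual speculative-execution estimator.
from statistics import median

def pick_stragglers(progress_rates, slowness_factor=0.5):
    """Return indices of tasks progressing far slower than the median."""
    m = median(progress_rates)
    return [i for i, r in enumerate(progress_rates)
            if r < m * slowness_factor]

rates = [1.0, 0.9, 1.1, 0.2]   # task 3 is crawling at ~20% speed
print(pick_stragglers(rates))  # -> [3]
```

Note the failure mode described above: a task that is slow because its input split is simply larger gets flagged just the same as one on a broken disk, which is exactly why duplicating it wastes resources.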
Summary
Over time, distributed system schedulers have grown in sophistication from a very simple FIFO algorithm to add the twin goals of fairness across users and increased cluster utilization. Those two goals must be balanced against each other; on multi-tenant distributed systems, operators often prioritize fairness. They do so to reduce the level of user-visible scheduling issues as well as to keep multiple business units satisfied to use shared infrastructure rather than running their own separate clusters. (In contrast, configuring the scheduler to maximize utilization could save money in the short term but waste it in the long term, because many small clusters are less efficient than one large one.)
Schedulers have also become more sophisticated by better taking into account multiple hardware resource requirements (for example, not considering only memory) and effectively treating different kinds of workloads differently when scheduling decisions are made. However, they still suffer from limitations, for example being conservative in resource allocation to avoid instability due to overcommitting resources such as memory. That conservatism can keep the cluster stable, but it results in lower utilization and slower run times than the hardware could actually support. Software solutions that make real-time, fine-grained decisions about resource usage can provide increased utilization while maintaining cluster stability and providing more predictable job run times.
Today, distributed systems tend to run applications for which the large scale is driven by the size of the input data rather than the amount of computation needed; examples include both special-purpose distributed systems (such as those powering web search among billions of documents) and general-purpose systems such as Hadoop. (However, even in those general systems, there are still some cases, such as iterative algorithms for machine learning, where making efficient use of the CPU is critical.)
As a result, the CPU is often not the primary bottleneck limiting a distributed system; nevertheless, it is important to be aware of the impacts of CPU on overall speed and throughput.
At a high level, the effect of CPU performance on distributed systems is driven by three primary factors:
• The efficiency of the program that's running, at the level of the code as well as how the work is broken into pieces and distributed across nodes

• Low-level kernel scheduling and prioritization of the computational work done by the CPU, when the CPU is not waiting for data

• The amount of time the CPU spends waiting for data from memory, disk, or network

These factors are important for the performance even of single applications running on a single machine; they are just as important, and even more complicated, for multi-tenant distributed systems due to the increased number and diversity of processes running on those systems, and their varied input data sources.

1. See https://github.com/linkedin/dr-elephant/wiki.
to profile and optimize a single instance of a program running on aparticular machine
For distributed systems, it can be equally important (if not more so) to break down the work into units effectively. For example, with MapReduce programs, some arrangements of map-shuffle-reduce steps are more efficient than others. Likewise, whether using MapReduce, Spark, or another distributed framework, using the right level of parallelism is important. For example, because every map and reduce task requires a nontrivial amount of setup and teardown work, running too many small tasks can lead to grossly inefficient overhead; we've seen systems with thousands of map tasks that each require several seconds for setup and teardown but spend less than one second on useful computation.
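The overhead argument is easy to put in numbers. With a fixed amount of real work and a fixed per-task setup/teardown cost (the 3-second figure below is an assumption matching the scenario just described):

```python
# How much of the cluster's time is useful computation, as a function
# of task granularity? Per-task overhead is assumed to be constant.
def useful_fraction(num_tasks, total_work_s, overhead_per_task_s):
    total_s = total_work_s + num_tasks * overhead_per_task_s
    return total_work_s / total_s

# 1000 seconds of real work, 3 s of setup + teardown per task:
print(round(useful_fraction(10, 1000, 3.0), 3))    # -> 0.971 (few big tasks)
print(round(useful_fraction(2000, 1000, 3.0), 3))  # -> 0.143 (tiny tasks)
```

With thousands of sub-second tasks, roughly six of every seven cluster-seconds are spent on setup and teardown rather than computation.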
In the case of Hadoop, open source tools like Dr. Elephant1 (as well as some commercial tools) provide performance measurement and recommendations to improve the overall flow of jobs, identifying problems such as a suboptimal breakdown of work into individual units.
Kernel Scheduling
The operating system kernel (Linux, for example) decides which threads run where and when, distributing a fixed amount of CPU resource across threads (and thus ultimately across applications).
Every N (~5) milliseconds, the kernel takes control of a given core and decides which thread's instructions will run there for the next N milliseconds. For each candidate thread, the kernel's scheduler must consider several factors:
• Is the thread ready to do anything at all (versus waiting for I/O)?

• If yes, is it ready to do something on this core?

• If yes, what is its dynamic priority? This computation takes several factors into account, including the static priority of the process, how much CPU time the thread has been allocated recently, and other signals depending on the kernel version.

• How does this thread's dynamic priority compare to that of other threads that could be run now?
The Linux kernel exposes several control knobs to affect the static (a priori) priority of a process; nice and control groups (cgroups) are the most commonly used. With cgroups, priorities can be set, and scheduling affected, for a group of processes rather than a single process or thread; conceptually, cgroups divide the access to CPU across the entire group. This division across groups of processes means that applications running many processes on a node do not receive unfair advantage over applications with just one or a few processes.
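Both knobs can be exercised directly. The sketch below lowers a batch child's priority with nice and shows a cgroup v1-style share adjustment; the cgroup path is illustrative (it assumes cgroup v1 mounted at /sys/fs/cgroup, and writing it requires root; cgroup v2 uses a cpu.weight file instead):

```python
# Minimal sketch: per-process static priority via nice, and a
# collective CPU share for a whole cgroup. Paths/values illustrative.
import os
import subprocess

# Launch a batch child 10 nice levels below normal priority
# (raising one's own niceness never requires privileges):
proc = subprocess.Popen(["sleep", "0"], preexec_fn=lambda: os.nice(10))
proc.wait()

def set_cpu_shares(cgroup_name, shares):
    """Set the collective CPU weight for every process in a cgroup
    (cgroup v1; 1024 is the default share). Requires root."""
    path = "/sys/fs/cgroup/cpu/%s/cpu.shares" % cgroup_name
    with open(path, "w") as f:
        f.write(str(shares))
```

The cgroup route matches the point above: the share is divided across the entire group, so a tenant running fifty processes gets no more CPU than one running a single process in an equally weighted group.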
In considering the impact of CPU usage, it is helpful to distinguish between latency-sensitive and latency-insensitive applications:
• In a latency-sensitive application, a key consideration is the timing of the CPU cycles assigned to it. Performance can be defined by the question "How much CPU do I get when I need it?"
Trang 37• In a latency-insensitive application, the opposite situation exists:the exact timing of the CPU cycles assigned to it is unimportant;the most important consideration is the total number of CPUcycles assigned to it over time (usually minutes or hours).This distinction is important for distributed systems, which oftenrun latency-sensitive applications alongside batch workloads, such
as MapReduce in the case of Hadoop Examples of latency-sensitivedistributed applications include search engines, key-value stores,clustered databases, video streaming systems, and advertising sys‐tems with real-time bidding that must respond in milliseconds.Examples of latency-insensitive distributed applications includeindex generation and loading for search engines, garbage collectionfor key-value stores or databases, and offline machine learning foradvertising systems
An interesting point is that even the same binary can have very different requirements when used in different applications. For example, a distributed data store like HBase can be latency-sensitive for reading data when serving end-customer queries, and latency-insensitive when updating the underlying data; or it can be latency-sensitive for writing data streamed from consumer devices, and latency-insensitive when supporting analyst queries against the stored data. The semantics of the specific application matter when setting priorities and measuring performance.
Intentional or Accidental Bad Actors
As is the case with other hardware resources, CPU is subject to either intentional or accidental "bad actors" who can use more than their fair share of the CPU on a node or even the distributed system as a whole. These problems are specific to multi-tenant distributed systems, not single-node systems or distributed systems running a single application.
A common problem case is due to multithreading. If most applications running in a system are single threaded, but one developer writes a multithreaded application, the system might not be tuned appropriately to handle the new type of workload. Not only can this cause general performance problems, it is considered unfair because that one developer can nearly monopolize the system. Some systems like Hadoop try to mitigate this problem by allowing developers to specify how many cores each task will use (with multithreaded programs specifying multiple cores), but this can be wasteful of resources, because if a task is not fully using the specified number of cores, the cores might remain reserved and thus go unused.
Applying the Control Mechanisms in Multi-Tenant Distributed Systems
Over time, kernel mechanisms have added additional knobs like cgroups and CPU pinning, but today there is still no general end-to-end system that makes those mechanisms practical to use. For example, there is no established policy mechanism to require applications to state their need, and no distributed system framework connects application policies with kernel-level primitives.
It's common practice among Unix system administrators to run system processes at a higher priority than user processes, but the desired settings vary from system to system depending on the applications running there. Getting things to run smoothly requires the administrator to have a good "feel" for the way the cluster normally behaves, and to watch and tune it constantly.
In some special cases, software developers have designed their platforms so that they can use CPU priorities to affect overall application performance. For example, the Teradata architecture was designed to make all queries CPU bound, so that CPU priorities can be used to control overall query prioritization and performance. Similarly, HPC frameworks like Portable Batch System (PBS) and Terascale Open-Source Resource and QUEue Manager (TORQUE) support cgroups.
For general-purpose, multi-tenant distributed systems like Hadoop, making effective use of kernel primitives such as cgroups is more difficult, because a given system might be running a multitude of diverse workloads at any given time. Even if CPU were the only limited resource, it would be difficult to adjust the settings correctly in such an environment, because the amount of CPU required by the various applications changes constantly. Accounting for RAM, disk I/O, and network only multiplies the complexity. Further complicating the situation is the fact that distributed systems necessarily divide applications into tens, hundreds, or thousands of processes across many nodes, and giving one particular process a higher priority might not affect the run time of the overall application in a predictable way.
2. Typically 64-256 KB for the L1 and L2 cache for each core, and a few megabytes for the L3 cache that is shared across cores; see https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29.
Software such as Pepperdata helps address these complications and other limitations of Hadoop. With Pepperdata, Hadoop administrators set high-level priorities for individual applications and groups of applications, and Pepperdata constantly monitors each process's use of hardware and responds in real time to enforce those priorities, adjusting kernel primitives like nice and cgroups.
I/O Waiting and CPU Cache Impacts
The performance impact of waiting for disk and network I/O on multi-tenant distributed systems is covered in Chapters 4 and 5 of this book; this section focuses on the behavior of the CPU cache.
In modern systems, CPU chip speeds are orders of magnitude faster than memory speeds, so processors have on-chip caches to reduce the time spent waiting for data from memory (see Figure 3-1). However, because these caches are limited in size, the CPU often has cache misses, when it needs to wait for data to come from slower caches (such as L2 or L3), or even from main memory.2
Figure 3-1 Typical cache architecture for a multicore CPU chip.
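The cost of descending this hierarchy can be made concrete with the standard average memory access time (AMAT) formula. The latencies below are rough, assumed ballpark numbers for a modern x86 core, not measurements:

```python
# Worked model of the Figure 3-1 hierarchy: average memory access
# time under the classic hit/miss formula. Latencies are assumed
# ballpark figures (ns), not measured values.
def amat_ns(l1_hit, l2_hit, l3_hit,
            l1=0.5, l2=4.0, l3=15.0, dram=80.0):
    """Average time per memory access, in nanoseconds."""
    return l1 + (1 - l1_hit) * (l2 + (1 - l2_hit) * (l3 + (1 - l3_hit) * dram))

# A process running mostly alone vs. one whose cached data is
# constantly evicted by other tenants sharing the core:
print(amat_ns(0.95, 0.90, 0.90))  # sub-nanosecond average access
print(amat_ns(0.60, 0.50, 0.50))  # an order of magnitude slower
```

Dropping the hit rates from "warm" to "thrashed" levels multiplies average access time by more than 15x, which is exactly the pathological behavior described next.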
3. See http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html for interesting data on context switch times.
Well-designed programs that are predominantly running alone on one or more cores can often make very effective use of the L1/L2/L3 caches and thus spend most of their CPU time performing useful computation. In contrast, multi-tenant distributed systems are an inherently more chaotic environment, with many processes running on the same machine and often on the same core. In such a situation, each time a different process runs on a core, the data it needs might not be in the cache, so it must wait for data to come from main memory; and when it does, that new data replaces what was previously in the cache, so when the CPU switches back to a process it had already been running, that process, in turn, must fetch its data from main memory. These pathological cache misses can cause most of the CPU time to be wasted waiting for data instead of processing it. (Such situations can be difficult to detect because memory access/wait times show up in most metrics as CPU time.)
Along with the problems due to cache misses, running a large number of processes on a single machine can slow things down because the kernel must spend a lot of CPU time engaged in context switching. (This excessive kernel overhead can be seen in kernel metrics such as voluntary_ctxt_switches and nonvoluntary_ctxt_switches via the /proc filesystem, or by using a tool such as SystemTap.)3
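On POSIX systems a process can also read its own counters via getrusage(), whose ru_nvcsw and ru_nivcsw fields correspond to the kernel's voluntary and involuntary context-switch counts:

```python
# Observing a process's own context-switch counters via getrusage()
# (POSIX-only; mirrors the /proc counters named above).
import resource
import time

def ctxt_switches():
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_nvcsw, ru.ru_nivcsw  # (voluntary, involuntary)

before = ctxt_switches()
time.sleep(0.01)  # blocking calls typically yield the CPU voluntarily
after = ctxt_switches()
print("voluntary:", after[0], "involuntary:", after[1])
```

Sampling these counters over time for every process on a node is one way to spot the excessive-context-switching condition described above.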
The nature of multi-tenant systems also exacerbates the cache miss problem because no single developer or operator is tuning and shaping the processes within the box; a single developer therefore has no control over what else is running on the box, and the environment is constantly changing as new workloads come and go. In contrast, special-purpose systems (even distributed ones) can be designed and tuned to minimize the impact of cache misses and similar performance problems. For example, in a web-scale search engine, each user query needs the system to process different data to produce the search results. A naive implementation would distribute queries randomly across the cluster, resulting in high cache miss rates, with the CPU cache constantly being overwritten. Search engine developers can avoid this problem by assigning particular queries to subsets of the cluster. This kind of careful design and tun‐