Chad Carson and Sean Suchter
Effective Multi-Tenant Distributed Systems
Challenges and Solutions when Running Complex Environments
Effective Multi-Tenant Distributed Systems
by Chad Carson and Sean Suchter
Copyright © 2017 Pepperdata, Inc. All rights reserved.
Printed in the United States of America.
Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department:
800-998-9938 or corporate@oreilly.com.
Editors: Nicole Taché and Debbie Hardin
Production Editor: Nicholas Adams
Copyeditor: Octal Publishing Inc.
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

October 2016: First Edition
Revision History for the First Edition
2016-10-10: First Release
The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Effective Multi-Tenant Distributed Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.
While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
Table of Contents
1. Introduction to Multi-Tenant Distributed Systems
    The Benefits of Distributed Systems
    Performance Problems in Distributed Systems
    Lack of Visibility Within Multi-Tenant Distributed Systems
    The Impact on Business from Performance Problems
    Scope of This Book

2. Scheduling in Distributed Systems
    Introduction
    Dominant Resource Fairness Scheduling
    Aggressive Scheduling for Busy Queues
    Special Scheduling Treatment for Small Jobs
    Workload-Specific Scheduling Considerations
    Inefficiencies in Scheduling
    Summary

3. CPU Performance Considerations
    Introduction
    Algorithm Efficiency
    Kernel Scheduling
    I/O Waiting and CPU Cache Impacts
    Summary

4. Memory Usage in Distributed Systems
    Introduction
    Physical Versus Virtual Memory
    Node Thrashing
    Kernel Out-Of-Memory Killer
    Implications of Memory-Intensive Workloads for Multi-Tenant Distributed Systems
    Summary

5. Disk Performance: Identifying and Eliminating Bottlenecks
    Introduction
    Overview of Disk Performance Limits
    Disk Behavior When Using Multiple Disks
    Disk Performance in Multi-Tenant Distributed Systems
    Controlling Disk I/O Usage to Improve Performance for High-Priority Applications
    Solid-State Drives and Distributed Systems
    Measuring Performance and Diagnosing Problems
    Summary

6. Network Performance Limits: Causes and Solutions
    Introduction
    Bandwidth Problems in Distributed Systems
    Other Network-Related Bottlenecks and Problems
    Measuring Network Performance and Debugging Problems
    Summary

7. Other Bottlenecks in Distributed Systems
    Introduction
    NameNode Contention
    ResourceManager Contention
    ZooKeeper
    Locks
    External Databases and Related Systems
    DNS Servers
    Summary

8. Monitoring Performance: Challenges and Solutions
    Introduction
    Why Monitor?
    What to Monitor
    Systems and Performance Aspects of Monitoring
    Algorithmic and Logical Aspects of Monitoring
    Measuring the Effect of Attempted Improvements
    Allocating Cluster Costs Across Tenants
    Summary

9. Conclusion: Performance Challenges and Solutions for Effective Multi-Tenant Distributed Systems
CHAPTER 1
Introduction to Multi-Tenant Distributed Systems
The Benefits of Distributed Systems
The past few decades have seen an explosion of computing power. Search engines, social networks, cloud-based storage and computing, and similar services now make seemingly infinite amounts of information and computation available to users across the globe. The tremendous scale of these services would not be possible without distributed systems. Distributed systems make it possible for many hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer. The primary benefit of a distributed system is clear: the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago.
Performance Problems in Distributed Systems
As more and more nodes are added to the distributed system and interact with one another, and as more and more developers write and run applications on the system, complications arise. Operators of distributed systems must address an array of challenges that affect the performance of the system as a whole as well as individual applications’ performance.
These performance challenges are different from those faced when operating a data center of computers that are running more or less independently, such as a web server farm. In a true distributed system, applications are split into smaller units of work, which are spread across many nodes and communicate with one another either directly or via shared input/output data.
Additional performance challenges arise with multi-tenant distributed systems, in which different users, groups, and possibly business units run different applications on the same cluster. (This is in contrast to a single, large distributed application, such as a search engine, which is quite complex and has intertask dependencies but is still just one overall application.) These challenges that come with multitenancy result from the diversity of applications running together on any node as well as the fact that the applications are written by many different developers instead of one engineering team focused on ensuring that everything in a single distributed application works well together.
Scheduling
One of the primary challenges in a distributed system is in scheduling jobs and their component processes. Computing power might be quite large, but it is always finite, and the distributed system must decide which jobs should be scheduled to run where and when, and the relative priority of those jobs. Even sophisticated distributed-system schedulers have limitations that can lead to underutilization of cluster hardware, unpredictable job run times, or both. Examples include assuming the worst-case resource usage to avoid overcommitting, failing to plan for different resource types across different applications, and overlooking one or more dependencies, thus causing deadlock or starvation.
The scheduling challenges become more severe on multi-tenant clusters, which add fairness of resource access among users as a scheduling goal, in addition to (and often in conflict with) the goals of high overall hardware utilization and predictable run times for high-priority applications. Aside from the challenge of balancing utilization and fairness, in some extreme cases the scheduler might go too far in trying to ensure fairness, scheduling just a few tasks from many jobs for many users at once. This can result in latency for every job on the cluster and cause the cluster to use resources inefficiently because the system is trying to do too many disparate things at the same time.
Hardware Bottlenecks
Beyond scheduling challenges, there are many ways a distributed system can suffer from hardware bottlenecks and other inefficiencies. For example, a single job can saturate the network or disk I/O, slowing down every other job. These potential problems are only exacerbated in a multi-tenant environment—usage of a given hardware resource such as CPU or disk is often less efficient when a node has many different processes running on it. In addition, operators cannot tune the cluster for a particular access pattern, because the access patterns are both diverse and constantly changing. (Again, contrast this situation with a farm of servers, each of which is independently running a single application, or a large cluster running a single coherently designed and tuned application like a search engine.)

Distributed systems are also subject to performance problems due to bottlenecks from centralized services used by every node in the system. One common example is the master node performing job admission and scheduling; others include the master node for a distributed file system storing data for the cluster as well as common services like domain name system (DNS) servers.
These potential performance challenges are exacerbated by the fact that a primary design goal for many modern distributed systems is to enable large numbers of developers, data scientists, and analysts to use the system simultaneously. This is in stark contrast to earlier distributed systems such as high-performance computing (HPC) systems, in which the only people who could write programs to run on the cluster had a systems programming background. Today, distributed systems are opening up enormous computing power to people without a systems background, so they often don’t understand or even think about system performance. Such a user might easily write a job that accidentally brings a cluster to its knees, affecting every other job and user.
Lack of Visibility Within Multi-Tenant Distributed Systems
Because multi-tenant distributed systems simultaneously run many applications, each with different performance characteristics and written by different developers, it can be difficult to determine what’s going on with the system, whether (and why) there’s a problem, which users and applications are the cause of any problem, and what to do about such problems.
Traditional cluster monitoring systems are generally limited to tracking metrics at the node level; they lack visibility into detailed hardware usage by each process. Major blind spots can result—when there’s a performance problem, operators are unable to pinpoint exactly which application caused it, or what to do about it. Similarly, application-level monitoring systems tend to focus on overall application semantics (overall run times, data volumes, etc.) and do not drill down to performance-level metrics for actual hardware resources on each node that is running a part of the application.
Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a sufficient level of granularity for each interesting process on each node. Gathering, processing, and presenting this data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable fashion) and the presentation-level logic and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.
The Impact on Business from Performance Problems
The performance challenges described in this book can easily lead to business impacts such as the following:

Inconsistent, unpredictable application run times
Batch jobs might run late, interactive applications might respond slowly, and the ingestion and processing of new incoming data for use by other applications might be delayed.
Underutilized hardware
Job queues can appear full even when the cluster hardware is not running at full capacity. This inefficiency can result in higher capital and operating expenses; it can also result in significant delays for new projects due to insufficient hardware, or even the need to build out new data-center space to add new machines for additional processing power.
Cluster instability
In extreme cases, nodes can become unresponsive or a distributed file system (DFS) might become overloaded, so applications cannot run or are significantly delayed in accessing data.

Aside from these obvious effects, performance problems also cause businesses to suffer in subtler but ultimately more significant ways. Organizations might informally “learn” that a multi-tenant cluster is unpredictable and build implicit or explicit processes to work around the unpredictability, such as the following:

• Limit cluster access to a subset of developers or analysts, out of a concern that poorly written jobs will slow down or even crash the cluster for everyone.

• Build separate clusters for different groups or different workloads so that the most important applications are insulated from others. Doing so increases overall cost due to inefficiency in resource usage, adds operational overhead and cost, and reduces the ability to share data across groups.
• Set up “development” and “production” clusters, with a committee or other cumbersome process to approve jobs before they can be run on a production cluster. Adding these hurdles can dramatically hinder innovation, because they significantly slow the feedback loop of learning from production data, building and testing a new model or new feature, deploying it to production, and learning again.1

1 We saw an example of the benefits of having an extremely short feedback loop at Yahoo in 2006–2007, when the sponsored search R&D team was an early user of the very first production Hadoop cluster anywhere. By moving to Hadoop and being able to deploy new click prediction models directly into production, we increased the number of simultaneous experiments by five times or more and reduced the feedback loop time by a similar factor. As a result, our models could improve an order of magnitude faster, and the revenue gains from those improvements similarly compounded that much faster.
These responses to unpredictable performance can limit a business’s ability to fully benefit from the potential of distributed systems. Eliminating performance problems on the cluster can improve performance of the business overall.
Scope of This Book
In this book, we consider the performance challenges that arise from scheduling inefficiencies, hardware bottlenecks, and lack of visibility. We examine each problem in detail and present solutions that organizations use today to overcome these challenges and benefit from the tremendous scale and efficiency of distributed systems.
Hadoop: An Example Distributed System
This book uses Hadoop as an example of a multi-tenant distributed system. Hadoop serves as an ideal example of such a system because of its broad adoption across a variety of industries, from healthcare to finance to transportation. Due to its open source availability and a robust ecosystem of supporting applications, Hadoop’s adoption is increasing among small and large organizations alike.
Hadoop is also an ideal example because it is used in highly multi-tenant production deployments (running jobs from many hundreds of developers) and is often used to simultaneously run large batch jobs, real-time stream processing, interactive analysis, and customer-facing databases. As a result, it suffers from all of the performance challenges described herein.
Of course, Hadoop is not the only important distributed system; a few other examples include the following:2

2 Various distributed systems are designed to make different tradeoffs among Consistency, Availability, and Partition tolerance. For more information, see Gilbert, Seth, and Nancy Ann Lynch, “Perspectives on the CAP Theorem,” Institute of Electrical and Electronics Engineers, 2012 (http://hdl.handle.net/1721.1/79112) and https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed.
• Classic HPC clusters using MPI, TORQUE, and Moab
• Distributed databases such as Oracle RAC, Teradata, Cassandra, and MongoDB
• Render farms used for animation
• Simulation systems used for physics and manufacturing
Task or container
A single atomic unit of work that is part of a job. This work is done on a single node, generally running as a single (sometimes multithreaded) process on the node.
Host, machine, or node
A single computing node, which can be an actual physical computer or a virtual machine.
CHAPTER 2
Scheduling in Distributed Systems
Introduction
In distributed computing, a scheduler is responsible for managing incoming container requests and determining which containers to run next, on which node to run them, and how many containers to run in parallel on the node. (Container is a general term for individual parts of a job; some systems use other terms such as task to refer to a container.) Schedulers range in complexity, with the simplest having a straightforward first-in–first-out (FIFO) policy. Different schedulers place more or less importance on various (often conflicting) goals, such as the following:
• Utilizing cluster resources as fully as possible
• Giving each user and group fair access to the cluster
• Ensuring that high-priority or latency-sensitive jobs complete on time
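For concreteness, the simplest policy mentioned above, FIFO, can be sketched in a few lines; the job structure and class here are a made-up illustration, not any real scheduler's API:

```python
from collections import deque

# Minimal FIFO scheduler sketch: jobs run strictly in arrival order,
# with no notion of fairness, priority, or per-user share.
class FifoScheduler:
    def __init__(self):
        self.queue = deque()

    def submit(self, job):
        self.queue.append(job)

    def assign(self, free_slots):
        """Hand out containers from the oldest job first."""
        scheduled = []
        while free_slots > 0 and self.queue:
            job = self.queue[0]
            take = min(free_slots, job["pending"])
            job["pending"] -= take
            free_slots -= take
            scheduled.append((job["name"], take))
            if job["pending"] == 0:
                self.queue.popleft()
        return scheduled
```

The weakness is visible directly in `assign`: a huge job at the head of the queue monopolizes every free slot until it finishes, which is what motivates the fairness-oriented policies discussed next.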
Multi-tenant distributed systems generally prioritize fairness among users and groups over optimal packing and maximal resource usage; without fairness, users would be likely to maximize their own access to the cluster without regard to others’ needs. Also, different groups and business units would be inclined to run their own smaller, less efficient cluster to ensure access for their users.
In the context of Hadoop, one of two schedulers is most commonly used: the capacity scheduler and the fair scheduler. Historically, each scheduler was written as an extension of the simple FIFO scheduler,
and initially each had a different goal, as their names indicate. Over time, the two schedulers have experienced convergent evolution, with each incorporating improvements from the other; today, they are mostly different in details. Both schedulers have the concept of multiple queues of jobs to be scheduled, with admission to each queue determined based on user- or operator-specified policies.

Recent versions of Hadoop1 perform two-level scheduling, in which a centralized scheduler running on the ResourceManager node assigns cluster resources (containers) to each application, and an ApplicationMaster running in one of those containers uses the other containers to run individual tasks for the application. The ApplicationMaster manages the details of the application, including communication and coordination among tasks. This architecture is much more scalable than Hadoop’s original one-level scheduling, in which a single central node (the JobTracker) did the work of both the ResourceManager and every ApplicationMaster.

1 The new architecture is referred to as Yet Another Resource Negotiator (YARN) or MapReduce v2. See https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html.
2 See http://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html.
3 See https://www.quora.com/How-does-two-level-scheduling-work-in-Apache-Mesos.
Many other modern distributed systems like Dryad and Mesos have schedulers that are similar to Hadoop’s schedulers. For example, Mesos also supports a pluggable scheduler interface much like Hadoop,2 and it performs two-level scheduling,3 with a central scheduler that registers available resources and assigns them to applications (“frameworks”).
Dominant Resource Fairness Scheduling
Historically, most schedulers considered only a single type of hardware resource when deciding which container to schedule next—both in calculating the free resources on each node and in calculating how much a given user, group, or queue was already using (e.g., from the point of view of fairness in usage). In the case of Hadoop, only memory usage was considered.

However, in a multi-tenant distributed system, different jobs and containers generally have widely different hardware usage profiles—some containers require significant memory, whereas some use CPU much more heavily (see Figure 2-1). Not considering CPU usage in scheduling meant that the system might be significantly underutilized, and some users would end up getting more or less than their true fair share of the cluster. A policy called Dominant Resource Fairness (DRF)4 addresses these limitations by considering multiple resource types and expressing the usage of each resource in a common currency (the share of the total allocation of that resource), and then scheduling based on the resource each container is using most heavily.

4 Ghodsi, Ali, et al. “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types.” NSDI Vol. 11, 2011. https://www.cs.berkeley.edu/~alig/papers/drf.pdf
Figure 2-1. Per-container physical memory usage versus CPU usage during a representative period of time on a production cluster. Note that some jobs consume large amounts of memory while using relatively little CPU; others use significant CPU but relatively little memory.
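The DRF calculation itself is compact. In this sketch, the cluster totals and the two users (one memory-heavy, one CPU-heavy, echoing Figure 2-1) are invented for illustration and do not come from any real scheduler:

```python
# Sketch of Dominant Resource Fairness (DRF). Cluster totals, users, and
# per-task demands below are invented for illustration.
TOTAL = {"cpu": 9, "memory_gb": 18}

users = {
    "alice": {"cpu": 0, "memory_gb": 0},
    "bob": {"cpu": 0, "memory_gb": 0},
}
# alice's tasks are memory-heavy; bob's are CPU-heavy.
demands = {
    "alice": {"cpu": 1, "memory_gb": 4},
    "bob": {"cpu": 3, "memory_gb": 1},
}

def dominant_share(usage):
    # Express usage of each resource as a share of the cluster total (the
    # "common currency"), then take the largest: that is the dominant share.
    return max(usage[r] / TOTAL[r] for r in TOTAL)

def schedule_next():
    # DRF launches the next task for the user with the smallest dominant share.
    user = min(users, key=lambda u: dominant_share(users[u]))
    for r in TOTAL:
        users[user][r] += demands[user][r]
    return user
```

Calling `schedule_next()` repeatedly equalizes the users' dominant shares (alice's memory share against bob's CPU share) rather than their raw task counts, which is the core of the DRF policy.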
In Hadoop, operators can configure both the Fair Scheduler and the Capacity Scheduler to consider both memory and CPU (using the DRF framework) when considering which container to launch next on a given node.5

5 See http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html and http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html.
Aggressive Scheduling for Busy Queues
Often a multi-tenant cluster might be in a state where some but not all queues are full; that is, some tenants currently don’t have enough work to use their full share of the cluster, but others have more work than they are guaranteed based on the scheduler’s configured allocation. In such cases, the scheduler might launch more containers from the busy queues to keep the cluster fully utilized.
Sometimes, after those extra containers are launched, new jobs are submitted to a queue that was previously empty; based on the scheduler’s policy, containers from those jobs should be scheduled immediately, but because the scheduler has already opportunistically launched extra containers from other queues, the cluster is full. In those cases, the scheduler might preempt those extra containers by killing some of them in order to reflect the desired fairness policy (see Figure 2-2). Preemption is a common feature in schedulers for multi-tenant distributed systems, including both popular Hadoop schedulers (capacity and fair).
Because preemption inherently results in lost work, it’s important for the scheduler to strike a good balance between starting many opportunistic containers to make use of idle resources and avoiding too much preemption and the waste that it causes. To help reduce the negative impacts of preemption, the scheduler can slightly delay killing containers (to avoid wasting the work of containers that are almost complete) and generally chooses to kill containers that have recently launched (again, to avoid wasted work).
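That victim-selection heuristic, sparing nearly finished containers and preferring the newest, can be sketched as follows; the `progress` field and the 0.9 cutoff are illustrative assumptions, not any scheduler's actual configuration:

```python
# Sketch of choosing preemption victims among opportunistic containers.
# Field names and the 0.9 "almost done" cutoff are illustrative assumptions.
def pick_victims(containers, slots_needed):
    candidates = [
        c for c in containers
        if c["opportunistic"] and c["progress"] < 0.9  # spare nearly finished work
    ]
    # Kill the most recently launched first, wasting the least completed work.
    candidates.sort(key=lambda c: c["start_time"], reverse=True)
    return candidates[:slots_needed]
```

Only containers running beyond their queue's guarantee are candidates; guaranteed containers are never touched, which preserves the fairness policy the preemption exists to enforce.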
Figure 2-2. When new jobs arrive in Queue A, they might be scheduled if there is sufficient unused cluster capacity, allowing Queue A to use more than its guaranteed share. If jobs later arrive in Queue B, the scheduler might then preempt some of the Queue A jobs to provide Queue B its guaranteed share.

A related concept is used by Google’s Borg system,6 which has a concept of priorities and quotas; a quota represents a set of hardware resource quantities (CPU, memory, disk, etc.) for a period of time, and higher-priority quota costs more than lower-priority quota. Borg never allocates more production-priority quota than is available on a given cluster; this guarantees production jobs the resources they need. At any given time, excess resources that are not being used by production jobs can be used by lower-priority jobs, but those jobs can be killed if the production jobs’ usage later increases. (This behavior is similar to another kind of distributed system, Amazon Web Services, which has a concept of guaranteed instances and spot instances; spot instances cost much less than guaranteed ones but are subject to being killed at any time.)

6 Verma, Abhishek, et al. “Large-scale cluster management at Google with Borg.” Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015. http://research.google.com/pubs/archive/43438.pdf
Special Scheduling Treatment for Small Jobs
Some cluster operators provide special treatment for small or fast jobs; in a sense, this is the opposite of preemption. One example is LinkedIn’s “fast queue” for Hadoop, which is a small queue that is used only for jobs that take less than an hour total to run and whose containers each take less than 15 minutes.7 If jobs or containers violate this limit, they are automatically killed. This feature provides fast response for smaller jobs even when the cluster is bogged down by large batch jobs; it also encourages developers to optimize their jobs to run faster.

The Hadoop vendor MapR provides somewhat similar functionality with its ExpressLane,8 which schedules small jobs (as defined by having few containers, each with low memory usage and small input data sizes) to run on the cluster even when the cluster is busy and has no additional capacity for normal jobs. This is also an interesting example of using the input data size as a cue to the scheduler about how fast a container is likely to be.

7 See slide 9 of http://www.slideshare.net/Hadoop_Summit/hadoop-operations-at-linkedin.
8 See http://doc.mapr.com/display/MapR/ExpressLane.
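A watchdog enforcing fast-queue limits might look like the following sketch. The two thresholds match the limits cited above for LinkedIn's fast queue, while the job structure and function itself are hypothetical:

```python
# Sketch of enforcing a "fast queue" policy: jobs over 1 hour total, or any
# container over 15 minutes, are killed. Structures here are hypothetical.
JOB_LIMIT_S = 60 * 60        # 1 hour for the whole job
CONTAINER_LIMIT_S = 15 * 60  # 15 minutes per container

def violations(job, now):
    """Return the reasons this fast-queue job should be killed, if any."""
    reasons = []
    if now - job["start_time"] > JOB_LIMIT_S:
        reasons.append("job exceeded 1 hour")
    for c in job["containers"]:
        if now - c["start_time"] > CONTAINER_LIMIT_S:
            reasons.append(f"container {c['id']} exceeded 15 minutes")
    return reasons
```

A periodic sweep would call `violations` for every job in the fast queue and kill any job with a nonempty result.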
Workload-Specific Scheduling Considerations
Aside from the general goals of high utilization and fairness across users and queues, schedulers might take other factors into account when deciding which containers to launch and where to run them. For example, a key design point of Hadoop is to move computation to the data. (The goal is to not just get the nodes to work as hard as they can, but also get them to work more efficiently.) The scheduler tries to accomplish this goal by preferring to place a given container on one of the nodes that have the container’s input HDFS data stored locally; if that can’t be done within a certain amount of time, it then tries to place the container on the same rack as a node that has the HDFS data; if that also can’t be done after waiting a certain amount of time, the container is launched on any node that has available computing resources. Although this approach increases overall system efficiency, it complicates the scheduling problem.
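The escalating preference just described (node-local, then rack-local, then anywhere) can be sketched as follows; the wait thresholds and data structures are illustrative assumptions, not Hadoop's actual configuration names:

```python
# Sketch of locality-aware placement with timed fallback, in the spirit of
# Hadoop's delay scheduling. Thresholds and structures are made up.
NODE_LOCAL_WAIT_S = 3
RACK_LOCAL_WAIT_S = 6

def place(container, free_nodes, waited_s):
    """Return a node for the container, or None to keep waiting."""
    # Best case: a free node that actually stores the container's HDFS input.
    for n in free_nodes:
        if n["host"] in container["block_hosts"]:
            return n
    if waited_s < NODE_LOCAL_WAIT_S:
        return None  # hold out a little longer for a node-local slot
    # Next best: a free node on the same rack as a replica of the input.
    for n in free_nodes:
        if n["rack"] in container["block_racks"]:
            return n
    if waited_s < RACK_LOCAL_WAIT_S:
        return None  # hold out a little longer for a rack-local slot
    # Give up on locality: any free node will do.
    return free_nodes[0] if free_nodes else None
```

Returning `None` is what makes this a delay strategy: the scheduler skips the container this round, betting that a better-placed slot will free up before the wait threshold expires.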
An example of a different kind of placement constraint is the support for pods in Kubernetes. A pod is a group of containers, such as Docker containers, that are scheduled at the same time on the same node. Pods are frequently used to provide services that act as helper programs for an application. Unlike the preference for data locality in Hadoop scheduling, the colocation and coscheduling of containers in a pod is a hard requirement; in many cases the application simply would not work without the auxiliary services running on the same node.

9 Nurmi, Daniel, et al. “Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction.” Proceedings of the 2006 ACM/IEEE conference on Supercomputing. ACM, 2006. http://www.cs.ucsb.edu/~nurmi/nurmi_workflow.pdf
A weaker constraint than colocation is the concept of gang scheduling, in which an application requires all of its resources to run concurrently, but they don’t need to run on the same node. An example is a distributed database like Impala, which needs to have all of its “query fragments” running in order to serve queries. Although some distributed systems’ schedulers support gang scheduling natively, Hadoop doesn’t currently support gang scheduling; applications that require concurrent containers mimic gang scheduling by keeping containers alive but idle until all of the required containers are running. This workaround clearly wastes resources because these idle containers hold resources and stop other containers from running. However, even when gang scheduling is done “cleanly” by the scheduler, it can lead to inefficiencies because the scheduler needs to avoid fully loading the cluster with other containers to ensure that enough space will eventually be available for the entire gang to be scheduled.
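A scheduler with native gang support admits an application only when the entire gang fits, all or nothing. The following is an illustrative sketch; the greedy placement and all data structures are assumptions, not any real scheduler's logic:

```python
# Sketch of gang-scheduling admission: launch all of an application's
# containers at once, or none of them. Structures are illustrative.
def try_launch_gang(app, nodes):
    """Reserve a slot for every container in the gang, atomically."""
    plan = []
    free = {n["name"]: n["free_slots"] for n in nodes}  # tentative view
    for _ in range(app["containers"]):
        # Greedily pick the node with the most free slots for each container.
        name = max(free, key=free.get)
        if free[name] == 0:
            return None  # the gang does not fit; launch nothing at all
        free[name] -= 1
        plan.append(name)
    # Commit the reservations only once the entire gang has a slot.
    for n in nodes:
        n["free_slots"] = free[n["name"]]
    return plan
```

Because the tentative view is only committed on success, a gang that does not fit leaves the cluster state untouched, avoiding the held-but-idle containers of the Hadoop workaround described above.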
As a side note, workflow schedulers such as Oozie are given information about the dependencies among jobs in a complex workflow that must happen in order; the workflow scheduler then submits the individual jobs to the distributed system on behalf of the user. A workflow scheduler can take into account the required inputs and outputs of each stage (including inputs that depend on some off-cluster process to write new data to the cluster), the time of day the workflow should be started, awareness of the full directed acyclic graph (DAG) of the entire workflow, and similar constraints. Generally, the workflow scheduler is distinct from the distributed system’s own scheduler that determines exactly where and when containers are launched on each node, but there are cases when overall scheduling can be much more efficient if workflow scheduling and resource scheduling are combined.9
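The core of that dependency handling, submitting each job only after everything it depends on has finished, reduces to a topological walk of the workflow DAG. This standalone sketch uses Python's standard library and a made-up workflow, not Oozie's actual format:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Sketch of workflow ordering: each job lists the jobs it depends on.
# The workflow itself is hypothetical.
workflow = {
    "ingest": [],
    "clean": ["ingest"],
    "features": ["clean"],
    "train": ["features"],
    "report": ["clean"],  # needs only the cleaned data, not the model
}

def submission_order(dag):
    """Return one valid order in which to submit the jobs."""
    return list(TopologicalSorter(dag).static_order())
```

A real workflow scheduler would additionally submit independent jobs (here, `features` and `report`) concurrently and wait on each job's completion before releasing its dependents.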
Inefficiencies in Scheduling
Although schedulers have become more sophisticated over time, they continue to suffer from inefficiencies related to the diversity of workloads running on multi-tenant distributed systems. These inefficiencies arise from the need to avoid overcommitting memory when doing up-front scheduling, a limited ability to consider all types of hardware resources, and challenges in considering the dependencies among all jobs and containers within complicated workflows.
The Need to be Conservative with Memory
Distributed system schedulers generally make scheduling decisions based on conservative assumptions about the hardware resources—especially memory—required by each container. These requirements are usually declared by the job author based on the worst-case usage, not the actual usage. This difference is critical because often different containers from the same job have different actual resource usage, even if they are running identical code. (This happens, for example, when the input data for one container is larger or otherwise different from the input data for other containers, resulting in a need for more processing or more space in memory.)
If a node’s resources are fully scheduled and the node is “unlucky” in the mix of containers it’s running, the node can be overloaded; if the resource that is overloaded is memory, the node might run out of memory and crash or start swapping badly. In a large distributed system, some nodes are bound to be unlucky in this way, so if the scheduler does not use conservative resource usage estimates, the system will nearly always be in a bad state.
The need to be conservative with memory allocation means that most nodes will be underutilized most of the time; containers generally do not often use their theoretical maximum memory, and even when they do, it’s not for the full lifetime of the container (see Figure 2-3). (In some cases, containers can use even more than their declared maximum. Systems can be more or less stringent about enforcing what the developer declares—some systems kill containers when they exceed their maximum memory, but others do not.10)

10 For example, Google’s Borg kills containers that try to exceed their declared memory limit. Hadoop by default lets containers go over, but operators can configure it to kill such containers.
Figure 2-3. Actual physical memory usage compared to the container size (the theoretical maximum) for a typical container. Note that the actual usage changes over time and is much smaller than the reserved amount.
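The gap between declared worst-case sizes and typical actual usage is easy to quantify; in this sketch every number is invented for illustration:

```python
# Sketch of why worst-case memory reservations leave nodes underutilized.
# All container sizes below are invented for illustration.
containers = [
    {"declared_gb": 4.0, "actual_gb": 1.2},
    {"declared_gb": 4.0, "actual_gb": 3.9},  # an "unlucky" container near its cap
    {"declared_gb": 2.0, "actual_gb": 0.6},
    {"declared_gb": 8.0, "actual_gb": 2.5},
]

# The scheduler must budget against declared (worst-case) sizes...
reserved_gb = sum(c["declared_gb"] for c in containers)
# ...but the node only ever experiences the actual usage.
actual_gb = sum(c["actual_gb"] for c in containers)

utilization = actual_gb / reserved_gb  # fraction of the reservation in use
```

In this made-up mix the node is fully booked on paper while less than half the reserved memory is actually in use, which is exactly the waste that the balancing act described next tries to recover.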
To reduce the waste associated with this underutilization, operators of large multi-tenant distributed systems often must perform a balancing act, trying to increase cluster utilization without pushing nodes over the edge. As described in Chapter 4, software like Pepperdata provides a way to increase utilization for distributed systems such as Hadoop by monitoring actual physical memory usage and dynamically allowing more or fewer processes to be scheduled on a given node, based on the current and projected future memory usage on that node.
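The basic idea can be sketched as a simple admission check. This is an illustrative model only, not Pepperdata's actual algorithm; the function name, parameters, and headroom threshold are invented for the example:

```python
# Illustrative sketch: a node-level admission check that overcommits
# declared memory when actual usage is low. Names and the 10% safety
# headroom are assumptions for this example, not a real scheduler's.
def can_admit(declared_mb, node_capacity_mb, scheduled_declared_mb,
              actual_used_mb, headroom_fraction=0.1):
    """Decide whether one more container fits on a node.

    A purely static scheduler compares only declared reservations; a
    dynamic one also looks at what is actually in use, allowing
    overcommit as long as real usage plus the newcomer's declared
    worst case still leaves some safety headroom.
    """
    headroom = node_capacity_mb * headroom_fraction
    static_ok = scheduled_declared_mb + declared_mb <= node_capacity_mb
    dynamic_ok = actual_used_mb + declared_mb + headroom <= node_capacity_mb
    return static_ok or dynamic_ok

# A 64 GB node with 60 GB "reserved" but only 20 GB actually in use:
print(can_admit(8192, 65536, 61440, 20480))  # -> True (dynamic path)
```

A static scheduler would reject this container outright; the dynamic check admits it because the node's real memory pressure is far below its declared commitments.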
Inability to Effectively Schedule the Use of Other Resources
Similar inefficiencies can occur due to the natural variation over time in the resource usage for a single container, not just variation across containers. For a given container, memory usage tends to vary by a factor of two or three over the lifetime of the container, and the variation is generally smooth. CPU usage varies quite a bit over time, but the maximum usage is generally limited to a single core. In contrast, disk I/O and network usage frequently vary by orders of magnitude, and they spike very quickly. They are also effectively unlimited in how much of the corresponding resource they use: one single thread on one machine can easily saturate the full network bandwidth of the node or use up all available disk I/O operations per second (IOPS) and bandwidth from dozens of disks (including even disks on multiple machines, when the thread is requesting data stored on another node). See Figure 2-4 for the usage of various resources for a sample job. The left column shows overall usage for all map tasks (red, starting earlier) and reduce tasks (green, starting later). The right column shows a breakdown by individual task. (For this particular job, there is only one reduce task.)
Because CPU, disk, and network usage can change so quickly, it is impossible for any system that only does up-front scheduling to optimize cluster utilization and provide true fairness in the use of hardware resources.
Figure 2-4. The variation over time in usage of different hardware resources for a typical MapReduce job (source: Pepperdata).
11. Wang, Yang and Wei Shi. "Budget-driven scheduling algorithms for batches of MapReduce jobs in heterogeneous clouds." IEEE Transactions on Cloud Computing 2.3 (2014): 306-319. https://www.researchgate.net/publication/277583513_Budget-Driven-Scheduling-Algorithms-for-Batches-of-MapReduce-Jobs

12. See, for example, Chekuri, Chandra and Sanjeev Khanna. "On multidimensional packing problems." SIAM Journal on Computing 33.4 (2004): 837-851.
Deadlock and Starvation
In some cases, schedulers might choose to start some containers in a job's DAG even before the preceding containers (the dependencies) have completed. This is done to reduce the total run time of the job or spread out resource usage over time.
In the interest of concreteness, the discussion in this section uses map and reduce containers, but similar effects can happen any time a job has some containers that depend on the output of others; the problems are not specific to MapReduce or Hadoop.
An example is Hadoop's "slow start" feature, in which reduce containers might be launched before all of the map containers they depend on have completed. This behavior can help minimize spikes in network bandwidth usage by spreading out the heavy network traffic of transferring data from mappers to reducers. However, starting a reduce container too early means that it might end up just sitting on a node waiting for its input data (from map containers) to be generated, which means that other containers are not able to use the memory the reduce container is holding, thus affecting overall system utilization.13

13. See http://stackoverflow.com/questions/11672676/when-do-reduce-tasks-start-in-hadoop/11673808#11673808.

14. See https://issues.apache.org/jira/browse/MAPREDUCE-314 for an example.
This problem is especially common on very busy clusters with many tenants because often not all map containers from a job can be scheduled in quick succession; similarly, if a map container fails (for example, due to node failure), it might take a long time to get rescheduled, especially if other, higher-priority jobs have been submitted after the reducers from this job were scheduled. In extreme cases this can lead to deadlock, when the cluster is occupied by reduce containers that are unable to proceed because the containers they depend on cannot be scheduled.14 Even if deadlock does not occur, the cluster can still be utilized inefficiently, and overall job completion can be unnecessarily slow as measured by wall-clock time, if the scheduler launches just a small number of containers from each of many users at one time.
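The slow-start decision itself is a simple threshold. Hadoop's real knob is the `mapreduce.job.reduce.slowstart.completedmaps` property (default 0.05); the surrounding scheduler logic here is a simplified sketch:

```python
# Sketch of Hadoop-style "slow start": reducers become eligible to
# launch once a configured fraction of the job's maps has finished.
# The property name is Hadoop's actual configuration knob; the
# function around it is a simplification for illustration.
SLOWSTART_THRESHOLD = 0.05  # mapreduce.job.reduce.slowstart.completedmaps

def reducers_eligible(completed_maps, total_maps,
                      threshold=SLOWSTART_THRESHOLD):
    if total_maps == 0:
        return True
    return completed_maps / total_maps >= threshold

print(reducers_eligible(4, 100))  # -> False: too few maps done yet
print(reducers_eligible(5, 100))  # -> True: threshold reached
```

On a busy multi-tenant cluster, operators often raise this threshold toward 1.0, so reducers do not camp on memory while their maps are still waiting to be scheduled.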
A similar scheduling problem is starvation, which can occur on a heavily loaded cluster. For example, consider a case in which one job has containers that each need a larger amount of memory than containers from other jobs. When one of the small containers completes on a node, a naive scheduler will see that the node has a small amount of memory available, but because it can't fit one of the large containers there, it will schedule a small container to run. In the extreme case, the larger containers might never be scheduled. In Hadoop and other systems, the concept of a reservation allows an application to reserve available space on a node, even if the application can't immediately use it.15 (This behavior can help avoid starvation, but it also means that the overall utilization of the system is lower, because some amount of resources might be reserved but unused at any particular time.)

15. See Sulistio, Anthony, Wolfram Schiffmann, and Rajkumar Buyya. "Advanced reservation-based scheduling of task graphs on clusters." International Conference on High-Performance Computing. Springer Berlin Heidelberg, 2006. http://www.cloudbus.org/papers/workflow_hipc2006.pdf. For related recent work in Hadoop, see Curino, Carlo et al. "Reservation-based Scheduling: If You're Late Don't Blame Us!" Proceedings of the ACM Symposium on Cloud Computing. ACM, 2014. https://www.microsoft.com/en-us/research/publication/reservation-based-scheduling-if-youre-late-dont-blame-us/.

16. This is different from the standard use of the term "speculative execution," in which pipelined microprocessors sometimes execute both sides of a conditional branch before knowing which branch will be taken.
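The starvation scenario and the reservation fix can be shown with a toy scheduler; the function, queue, and memory figures below are invented purely for illustration:

```python
# Toy model of starvation and a reservation-based fix. A naive
# scheduler always runs whatever fits; a reservation holds freed
# memory for a large container even though it can't run yet.
def schedule(free_mb, queue, reserved_for=None):
    """Return the first queued container size that fits, or None."""
    if reserved_for is not None and reserved_for > free_mb:
        return None  # hold the node idle so freed memory accumulates
    for i, need in enumerate(queue):
        if need <= free_mb:
            return queue.pop(i)
    return None

# Naive: 1 GB containers keep slipping in ahead of the 8 GB one,
# which may never run on a busy cluster.
print(schedule(2048, [8192, 1024, 1024]))                     # -> 1024
# With a reservation, the node waits (lower utilization, no starvation):
print(schedule(2048, [8192, 1024, 1024], reserved_for=8192))  # -> None
```

Once enough memory has been freed under the reservation, the large container finally fits and is scheduled.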
Waste Due to Speculative Execution
Operators can configure Hadoop to use speculative execution, in
which the scheduler can observe that a given container seems to berunning more slowly than is typical for that kind of container andstart another copy of that container on another node This behavior
is primarily intended to avoid cases in which a particular node isperforming badly (usually due to a hardware problem) and an entirejob could be slowed down due to just one straggler container.16
While speculative execution can reduce job completion time due to node problems, it wastes resources when the container that is duplicated simply had more work to do than other containers and so naturally ran longer. In practice, experienced operators typically disable speculative execution on multi-tenant clusters, both because there is generally inherent container variation (not due to hardware problems) and because the operators are constantly watching for bad hardware, so speculative execution does not enhance performance.
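The core of such a straggler detector is a comparison against peer tasks. The heuristic below is a deliberate simplification (Hadoop's real implementation estimates remaining time rather than comparing raw rates, and the 0.5 factor is an assumption):

```python
# Hedged sketch of straggler detection: flag tasks whose progress
# rate is well below the median of their peers. Simplified relative
# to Hadoop's actual speculative-execution estimator.
from statistics import median

def pick_stragglers(progress_rates, slowness_factor=0.5):
    """Return indices of tasks progressing far slower than the median."""
    m = median(progress_rates)
    return [i for i, r in enumerate(progress_rates)
            if r < m * slowness_factor]

rates = [1.0, 0.9, 1.1, 0.2]   # task 3 is crawling at ~20% speed
print(pick_stragglers(rates))  # -> [3]
```

Note the failure mode described above: a task that is slow because its input split is simply larger gets flagged just the same as one on a broken disk, which is exactly why duplicating it wastes resources.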
Summary
Over time, distributed system schedulers have grown in sophistication from a very simple FIFO algorithm to add the twin goals of fairness across users and increased cluster utilization. Those two goals must be balanced against each other; on multi-tenant distributed systems, operators often prioritize fairness. They do so to reduce the level of user-visible scheduling issues as well as to keep multiple business units satisfied to use shared infrastructure rather than running their own separate clusters. (In contrast, configuring the scheduler to maximize utilization could save money in the short term but waste it in the long term, because many small clusters are less efficient than one large one.)
Schedulers have also become more sophisticated by better taking into account multiple hardware resource requirements (for example, not considering only memory) and effectively treating different kinds of workloads differently when scheduling decisions are made. However, they still suffer from limitations, for example being conservative in resource allocation to avoid instability due to overcommitting resources such as memory. That conservatism can keep the cluster stable, but it results in lower utilization and slower run times than the hardware could actually support. Software solutions that make real-time, fine-grained decisions about resource usage can provide increased utilization while maintaining cluster stability and providing more predictable job run times.
Today, distributed systems tend to run applications for which the large scale is driven by the size of the input data rather than the amount of computation needed; examples include both special-purpose distributed systems (such as those powering web search among billions of documents) and general-purpose systems such as Hadoop. (However, even in those general systems, there are still some cases, such as iterative algorithms for machine learning, where making efficient use of the CPU is critical.)
As a result, the CPU is often not the primary bottleneck limiting a distributed system; nevertheless, it is important to be aware of the impacts of CPU on overall speed and throughput.
At a high level, the effect of CPU performance on distributed systems is driven by three primary factors:
• The efficiency of the program that's running, at the level of the code as well as how the work is broken into pieces and distributed across nodes

• Low-level kernel scheduling and prioritization of the computational work done by the CPU, when the CPU is not waiting for data

• The amount of time the CPU spends waiting for data from memory, disk, or network

These factors are important for the performance even of single applications running on a single machine; they are just as important, and even more complicated, for multi-tenant distributed systems due to the increased number and diversity of processes running on those systems, and their varied input data sources.

1. See https://github.com/linkedin/dr-elephant/wiki.
to profile and optimize a single instance of a program running on aparticular machine
For distributed systems, it can be equally important (if not more so) to break down the work into units effectively. For example, with MapReduce programs, some arrangements of map-shuffle-reduce steps are more efficient than others. Likewise, whether using MapReduce, Spark, or another distributed framework, using the right level of parallelism is important. For example, because every map and reduce task requires a nontrivial amount of setup and teardown work, running too many small tasks can lead to grossly inefficient overhead; we've seen systems with thousands of map tasks that each require several seconds for setup and teardown but spend less than one second on useful computation.
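The overhead argument is easy to put in numbers. With a fixed amount of real work and a fixed per-task setup/teardown cost (the 3-second figure below is an assumption matching the scenario just described):

```python
# How much of the cluster's time is useful computation, as a function
# of task granularity? Per-task overhead is assumed to be constant.
def useful_fraction(num_tasks, total_work_s, overhead_per_task_s):
    total_s = total_work_s + num_tasks * overhead_per_task_s
    return total_work_s / total_s

# 1000 seconds of real work, 3 s of setup + teardown per task:
print(round(useful_fraction(10, 1000, 3.0), 3))    # -> 0.971 (few big tasks)
print(round(useful_fraction(2000, 1000, 3.0), 3))  # -> 0.143 (tiny tasks)
```

With thousands of sub-second tasks, roughly six of every seven cluster-seconds are spent on setup and teardown rather than computation.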
In the case of Hadoop, open source tools like Dr. Elephant1 (as well as some commercial tools) provide performance measurement and recommendations to improve the overall flow of jobs, identifying problems such as a suboptimal breakdown of work into individual units.
Kernel Scheduling
The operating system kernel (Linux, for example) decides which threads run where and when, distributing a fixed amount of CPU resource across threads (and thus ultimately across applications).
Every N (~5) milliseconds, the kernel takes control of a given core and decides which thread's instructions will run there for the next N milliseconds. For each candidate thread, the kernel's scheduler must consider several factors:
• Is the thread ready to do anything at all (versus waiting for I/O)?

• If yes, is it ready to do something on this core?

• If yes, what is its dynamic priority? This computation takes several factors into account, including the static priority of the process, how much CPU time the thread has been allocated recently, and other signals depending on the kernel version.

• How does this thread's dynamic priority compare to that of other threads that could be run now?
The Linux kernel exposes several control knobs to affect the static (a priori) priority of a process; nice and control groups (cgroups) are the most commonly used. With cgroups, priorities can be set, and scheduling affected, for a group of processes rather than a single process or thread; conceptually, cgroups divide the access to CPU across the entire group. This division across groups of processes means that applications running many processes on a node do not receive unfair advantage over applications with just one or a few processes.
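Both knobs can be exercised directly. The sketch below lowers a batch child's priority with nice and shows a cgroup v1-style share adjustment; the cgroup path is illustrative (it assumes cgroup v1 mounted at /sys/fs/cgroup, and writing it requires root; cgroup v2 uses a cpu.weight file instead):

```python
# Minimal sketch: per-process static priority via nice, and a
# collective CPU share for a whole cgroup. Paths/values illustrative.
import os
import subprocess

# Launch a batch child 10 nice levels below normal priority
# (raising one's own niceness never requires privileges):
proc = subprocess.Popen(["sleep", "0"], preexec_fn=lambda: os.nice(10))
proc.wait()

def set_cpu_shares(cgroup_name, shares):
    """Set the collective CPU weight for every process in a cgroup
    (cgroup v1; 1024 is the default share). Requires root."""
    path = "/sys/fs/cgroup/cpu/%s/cpu.shares" % cgroup_name
    with open(path, "w") as f:
        f.write(str(shares))
```

The cgroup route matches the point above: the share is divided across the entire group, so a tenant running fifty processes gets no more CPU than one running a single process in an equally weighted group.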
In considering the impact of CPU usage, it is helpful to distinguish between latency-sensitive and latency-insensitive applications:
• In a latency-sensitive application, a key consideration is the timing of the CPU cycles assigned to it. Performance can be defined by the question "How much CPU do I get when I need it?"
Trang 37• In a latency-insensitive application, the opposite situation exists:the exact timing of the CPU cycles assigned to it is unimportant;the most important consideration is the total number of CPUcycles assigned to it over time (usually minutes or hours).This distinction is important for distributed systems, which oftenrun latency-sensitive applications alongside batch workloads, such
as MapReduce in the case of Hadoop Examples of latency-sensitivedistributed applications include search engines, key-value stores,clustered databases, video streaming systems, and advertising sys‐tems with real-time bidding that must respond in milliseconds.Examples of latency-insensitive distributed applications includeindex generation and loading for search engines, garbage collectionfor key-value stores or databases, and offline machine learning foradvertising systems
An interesting point is that even the same binary can have very different requirements when used in different applications. For example, a distributed data store like HBase can be latency-sensitive for reading data when serving end-customer queries, and latency-insensitive when updating the underlying data; or it can be latency-sensitive for writing data streamed from consumer devices, and latency-insensitive when supporting analyst queries against the stored data. The semantics of the specific application matter when setting priorities and measuring performance.
Intentional or Accidental Bad Actors
As is the case with other hardware resources, CPU is subject to either intentional or accidental "bad actors" who can use more than their fair share of the CPU on a node or even the distributed system as a whole. These problems are specific to multi-tenant distributed systems, not single-node systems or distributed systems running a single application.
A common problem case is due to multithreading. If most applications running in a system are single threaded, but one developer writes a multithreaded application, the system might not be tuned appropriately to handle the new type of workload. Not only can this cause general performance problems, it is considered unfair because that one developer can nearly monopolize the system. Some systems like Hadoop try to mitigate this problem by allowing developers to specify how many cores each task will use (with multithreaded programs specifying multiple cores), but this can be wasteful of resources, because if a task is not fully using the specified number of cores, the cores might remain reserved and thus go unused.
Applying the Control Mechanisms in Multi-Tenant Distributed Systems
Over time, kernel mechanisms have added additional knobs like cgroups and CPU pinning, but today there is still no general end-to-end system that makes those mechanisms practical to use. For example, there is no established policy mechanism to require applications to state their need, and no distributed system framework connects application policies with kernel-level primitives.
It's common practice among Unix system administrators to run system processes at a higher priority than user processes, but the desired settings vary from system to system depending on the applications running there. Getting things to run smoothly requires the administrator to have a good "feel" for the way the cluster normally behaves, and to watch and tune it constantly.
In some special cases, software developers have designed their platforms so that they can use CPU priorities to affect overall application performance. For example, the Teradata architecture was designed to make all queries CPU bound, so that CPU priorities can be used to control overall query prioritization and performance. Similarly, HPC frameworks like Portable Batch System (PBS) and Terascale Open-Source Resource and QUEue Manager (TORQUE) support cgroups.
For general-purpose, multi-tenant distributed systems like Hadoop, making effective use of kernel primitives such as cgroups is more difficult, because a given system might be running a multitude of diverse workloads at any given time. Even if CPU were the only limited resource, it would be difficult to adjust the settings correctly in such an environment, because the amount of CPU required by the various applications changes constantly. Accounting for RAM, disk I/O, and network only multiplies the complexity. Further complicating the situation is the fact that distributed systems necessarily divide applications into tens, hundreds, or thousands of processes across many nodes, and giving one particular process a higher priority might not affect the run time of the overall application in a predictable way.
2. Typically 64-256 KB for the L1 and L2 cache for each core, and a few megabytes for the L3 cache that is shared across cores; see https://en.wikipedia.org/wiki/Haswell_%28microarchitecture%29.
Software such as Pepperdata helps address these complications and other limitations of Hadoop. With Pepperdata, Hadoop administrators set high-level priorities for individual applications and groups of applications, and Pepperdata constantly monitors each process's use of hardware and responds in real time to enforce those priorities, adjusting kernel primitives like nice and cgroups.
I/O Waiting and CPU Cache Impacts
The performance impact of waiting for disk and network I/O on multi-tenant distributed systems is covered in Chapters 4 and 5 of this book; this section focuses on the behavior of the CPU cache.
In modern systems, CPU chip speeds are orders of magnitude faster than memory speeds, so processors have on-chip caches to reduce the time spent waiting for data from memory (see Figure 3-1). However, because these caches are limited in size, the CPU often has cache misses, when it needs to wait for data to come from slower caches (such as L2 or L3), or even from main memory.2
Figure 3-1 Typical cache architecture for a multicore CPU chip.
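The cost of descending this hierarchy can be made concrete with the standard average memory access time (AMAT) formula. The latencies below are rough, assumed ballpark numbers for a modern x86 core, not measurements:

```python
# Worked model of the Figure 3-1 hierarchy: average memory access
# time under the classic hit/miss formula. Latencies are assumed
# ballpark figures (ns), not measured values.
def amat_ns(l1_hit, l2_hit, l3_hit,
            l1=0.5, l2=4.0, l3=15.0, dram=80.0):
    """Average time per memory access, in nanoseconds."""
    return l1 + (1 - l1_hit) * (l2 + (1 - l2_hit) * (l3 + (1 - l3_hit) * dram))

# A process running mostly alone vs. one whose cached data is
# constantly evicted by other tenants sharing the core:
print(amat_ns(0.95, 0.90, 0.90))  # sub-nanosecond average access
print(amat_ns(0.60, 0.50, 0.50))  # an order of magnitude slower
```

Dropping the hit rates from "warm" to "thrashed" levels multiplies average access time by more than 15x, which is exactly the pathological behavior described next.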
3. See http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html for interesting data on context switch times.
Well-designed programs that are predominantly running alone on one or more cores can often make very effective use of the L1/L2/L3 caches and thus spend most of their CPU time performing useful computation. In contrast, multi-tenant distributed systems are an inherently more chaotic environment, with many processes running on the same machine and often on the same core. In such a situation, each time a different process runs on a core, the data it needs might not be in the cache, so it must wait for data to come from main memory; and when it does, that new data replaces what was previously in the cache, so when the CPU switches back to a process it had already been running, that process, in turn, must fetch its data from main memory. These pathological cache misses can cause most of the CPU time to be wasted waiting for data instead of processing it. (Such situations can be difficult to detect because memory access/wait times show up in most metrics as CPU time.)
Along with the problems due to cache misses, running a large number of processes on a single machine can slow things down because the kernel must spend a lot of CPU time engaged in context switching. (This excessive kernel overhead can be seen in kernel metrics such as voluntary_ctxt_switches and nonvoluntary_ctxt_switches via the /proc filesystem, or by using a tool such as SystemTap.)3
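On POSIX systems a process can also read its own counters via getrusage(), whose ru_nvcsw and ru_nivcsw fields correspond to the kernel's voluntary and involuntary context-switch counts:

```python
# Observing a process's own context-switch counters via getrusage()
# (POSIX-only; mirrors the /proc counters named above).
import resource
import time

def ctxt_switches():
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return ru.ru_nvcsw, ru.ru_nivcsw  # (voluntary, involuntary)

before = ctxt_switches()
time.sleep(0.01)  # blocking calls typically yield the CPU voluntarily
after = ctxt_switches()
print("voluntary:", after[0], "involuntary:", after[1])
```

Sampling these counters over time for every process on a node is one way to spot the excessive-context-switching condition described above.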
The nature of multi-tenant systems also exacerbates the cache miss problem because no single developer or operator is tuning and shaping the processes within the box; a single developer therefore has no control over what else is running on the box, and the environment is constantly changing as new workloads come and go. In contrast, special-purpose systems (even distributed ones) can be designed and tuned to minimize the impact of cache misses and similar performance problems. For example, in a web-scale search engine, each user query needs the system to process different data to produce the search results. A naive implementation would distribute queries randomly across the cluster, resulting in high cache miss rates, with the CPU cache constantly being overwritten. Search engine developers can avoid this problem by assigning particular queries to subsets of the cluster. This kind of careful design and tun‐