Effective multi tenant distributed systems



Strata


Effective Multi-Tenant Distributed Systems

Challenges and Solutions when Running Complex Environments

Chad Carson and Sean Suchter


Effective Multi-Tenant Distributed Systems

by Chad Carson and Sean Suchter

Printed in the United States of America

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Taché and Debbie Hardin

Production Editor: Nicholas Adams

Copyeditor: Octal Publishing Inc.

Interior Designer: David Futato

Cover Designer: Randy Comer

Illustrator: Rebecca Demarest

October 2016: First Edition


Revision History for the First Edition

2016-10-10: First Release

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Effective Multi-Tenant Distributed Systems, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the authors have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the authors disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96183-4

[LSI]

Chapter 1 Introduction to Multi-Tenant Distributed Systems

The Benefits of Distributed Systems

The past few decades have seen an explosion of computing power. Search engines, social networks, cloud-based storage and computing, and similar services now make seemingly infinite amounts of information and computation available to users across the globe.

The tremendous scale of these services would not be possible without distributed systems. Distributed systems make it possible for many hundreds or thousands of relatively inexpensive computers to communicate with one another and work together, creating the outward appearance of a single, high-powered computer. The primary benefit of a distributed system is clear: the ability to massively scale computing power relatively inexpensively, enabling organizations to scale up their businesses to a global level in a way that was not possible even a decade ago.


Performance Problems in Distributed Systems

As more and more nodes are added to the distributed system and interact with one another, and as more and more developers write and run applications on the system, complications arise. Operators of distributed systems must address an array of challenges that affect the performance of the system as a whole as well as individual applications’ performance.

These performance challenges are different from those faced when operating a data center of computers that are running more or less independently, such as a web server farm. In a true distributed system, applications are split into smaller units of work, which are spread across many nodes and communicate with one another either directly or via shared input/output data.

Additional performance challenges arise with multi-tenant distributed systems, in which different users, groups, and possibly business units run different applications on the same cluster. (This is in contrast to a single, large distributed application, such as a search engine, which is quite complex and has intertask dependencies but is still just one overall application.) These challenges that come with multitenancy result from the diversity of applications running together on any node as well as the fact that the applications are written by many different developers instead of one engineering team focused on ensuring that everything in a single distributed application works well together.


deadlock or starvation.

The scheduling challenges become more severe on multi-tenant clusters, which add fairness of resource access among users as a scheduling goal, in addition to (and often in conflict with) the goals of high overall hardware utilization and predictable run times for high-priority applications. Aside from the challenge of balancing utilization and fairness, in some extreme cases the scheduler might go too far in trying to ensure fairness, scheduling just a few tasks from many jobs for many users at once. This can result in latency for every job on the cluster and cause the cluster to use resources inefficiently because the system is trying to do too many disparate things at the same time.


Hardware Bottlenecks

Beyond scheduling challenges, there are many ways a distributed system can suffer from hardware bottlenecks and other inefficiencies. For example, a single job can saturate the network or disk I/O, slowing down every other job. These potential problems are only exacerbated in a multi-tenant environment: usage of a given hardware resource such as CPU or disk is often less efficient when a node has many different processes running on it. In addition, operators cannot tune the cluster for a particular access pattern, because the access patterns are both diverse and constantly changing. (Again, contrast this situation with a farm of servers, each of which is independently running a single application, or a large cluster running a single coherently designed and tuned application like a search engine.)

Distributed systems are also subject to performance problems due to bottlenecks from centralized services used by every node in the system. One common example is the master node performing job admission and scheduling; others include the master node for a distributed file system storing data for the cluster as well as common services like domain name system (DNS) servers.

These potential performance challenges are exacerbated by the fact that a primary design goal for many modern distributed systems is to enable large numbers of developers, data scientists, and analysts to use the system simultaneously. This is in stark contrast to earlier distributed systems such as high-performance computing (HPC) systems in which the only people who could write programs to run on the cluster had a systems programming background. Today, distributed systems are opening up enormous computing power to people without a systems background, so they often don’t understand or even think about system performance. Such a user might easily write a job that accidentally brings a cluster to its knees, affecting every other job and user.


Lack of Visibility Within Multi-Tenant Distributed Systems

Because multi-tenant distributed systems simultaneously run many applications, each with different performance characteristics and written by different developers, it can be difficult to determine what’s going on with the system, whether (and why) there’s a problem, which users and applications are the cause of any problem, and what to do about such problems.

Traditional cluster monitoring systems are generally limited to tracking metrics at the node level; they lack visibility into detailed hardware usage by each process. Major blind spots can result: when there’s a performance problem, operators are unable to pinpoint exactly which application caused it, or what to do about it. Similarly, application-level monitoring systems tend to focus on overall application semantics (overall run times, data volumes, etc.) and do not drill down to performance-level metrics for actual hardware resources on each node that is running a part of the application.

Truly useful monitoring for multi-tenant distributed systems must track hardware usage metrics at a sufficient level of granularity for each interesting process on each node. Gathering, processing, and presenting this data for large clusters is a significant challenge, in terms of both systems engineering (to process and store the data efficiently and in a scalable fashion) and the presentation-level logic and math (to present it usefully and accurately). Even for limited, node-level metrics, traditional monitoring systems do not scale well on large clusters of hundreds to thousands of nodes.


The Impact on Business from Performance Problems

The performance challenges described in this book can easily lead to business impacts such as the following:

Inconsistent, unpredictable application run times

Batch jobs might run late, interactive applications might respond slowly, and the ingestion and processing of new incoming data for use by other applications might be delayed.

Underutilized hardware

Job queues can appear full even when the cluster hardware is not running at full capacity. This inefficiency can result in higher capital and operating expenses; it can also result in significant delays for new projects due to insufficient hardware, or even the need to build out new data-center space to add new machines for additional processing power.

Cluster instability

In extreme cases, nodes can become unresponsive or a distributed file system (DFS) might become overloaded, so applications cannot run or are significantly delayed in accessing data.

Aside from these obvious effects, performance problems also cause businesses to suffer in subtler but ultimately more significant ways. Organizations might informally “learn” that a multi-tenant cluster is unpredictable and build implicit or explicit processes to work around the unpredictability, such as the following:

Limit cluster access to a subset of developers or analysts, out of a concern that poorly written jobs will slow down or even crash the cluster for everyone.

Build separate clusters for different groups or different workloads so that the most important applications are insulated from others. Doing so increases overall cost due to inefficiency in resource usage, adds operational overhead and cost, and reduces the ability to share data across groups.

Set up “development” and “production” clusters, with a committee or other cumbersome process to approve jobs before they can be run on a production cluster. Adding these hurdles can dramatically hinder innovation, because they significantly slow the feedback loop of learning from production data, building and testing a new model or new feature, deploying it to production, and learning again.1

These responses to unpredictable performance can limit a business’s ability to fully benefit from the potential of distributed systems. Eliminating performance problems on the cluster can improve performance of the business overall.


Scope of This Book

In this book, we consider the performance challenges that arise from scheduling inefficiencies, hardware bottlenecks, and lack of visibility. We examine each problem in detail and present solutions that organizations use today to overcome these challenges and benefit from the tremendous scale and efficiency of distributed systems.


Hadoop: An Example Distributed System

This book uses Hadoop as an example of a multi-tenant distributed system. Hadoop serves as an ideal example of such a system because of its broad adoption across a variety of industries, from healthcare to finance to transportation. Due to its open source availability and a robust ecosystem of supporting applications, Hadoop’s adoption is increasing among small and large organizations alike.

Hadoop is also an ideal example because it is used in highly multi-tenant production deployments (running jobs from many hundreds of developers) and is often used to simultaneously run large batch jobs, real-time stream processing, interactive analysis, and customer-facing databases. As a result, it suffers from all of the performance challenges described herein.

Of course, Hadoop is not the only important distributed system; a few other examples include the following:2

Classic HPC clusters using MPI, TORQUE, and Moab

Distributed databases such as Oracle RAC, Teradata, Cassandra, and MongoDB

Render farms used for animation

Simulation systems used for physics and manufacturing


Throughout the book, we use the following sets of terms interchangeably:

Application or job

A program submitted by a particular user to be run on a distributed system. (In some systems, this might be termed a query.)

Container or task

An atomic unit of work that is part of a job. This work is done on a single node, generally running as a single (sometimes multithreaded) process on the node.

Host, machine, or node

A single computing node, which can be an actual physical computer or a virtual machine.

1 We saw an example of the benefits of having an extremely short feedback loop at Yahoo in 2006–2007, when the sponsored search R&D team was an early user of the very first production Hadoop cluster anywhere. By moving to Hadoop and being able to deploy new click prediction models directly into production, we increased the number of simultaneous experiments by five times or more and reduced the feedback loop time by a similar factor. As a result, our models could improve an order of magnitude faster, and the revenue gains from those improvements similarly compounded that much faster.

2 Various distributed systems are designed to make different tradeoffs among Consistency, Availability, and Partition tolerance. For more information, see Gilbert, Seth, and Nancy Ann Lynch. “Perspectives on the CAP Theorem.” Institute of Electrical and Electronics Engineers, 2012 (http://hdl.handle.net/1721.1/79112) and https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed.


Chapter 2 Scheduling in Distributed Systems


In distributed computing, a scheduler is responsible for managing incoming container requests and determining which containers to run next, on which node to run them, and how many containers to run in parallel on the node. (Container is a general term for individual parts of a job; some systems use other terms such as task to refer to a container.) Schedulers range in complexity, with the simplest having a straightforward first-in–first-out (FIFO) policy. Different schedulers place more or less importance on various (often conflicting) goals, such as the following:

Utilizing cluster resources as fully as possible

Giving each user and group fair access to the cluster

Ensuring that high-priority or latency-sensitive jobs complete on time

Multi-tenant distributed systems generally prioritize fairness among users and groups over optimal packing and maximal resource usage; without fairness, users would be likely to maximize their own access to the cluster without regard to others’ needs. Also, different groups and business units would be inclined to run their own smaller, less efficient cluster to ensure access for their users.

In the context of Hadoop, one of two schedulers is most commonly used: the capacity scheduler and the fair scheduler. Historically, each scheduler was written as an extension of the simple FIFO scheduler, and initially each had a different goal, as their names indicate. Over time, the two schedulers have experienced convergent evolution, with each incorporating improvements from the other; today, they are mostly different in details. Both schedulers have the concept of multiple queues of jobs to be scheduled, with admission to each queue determined based on user- or operator-specified policies.

Recent versions of Hadoop1 perform two-level scheduling, in which a centralized scheduler running on the ResourceManager node assigns cluster resources (containers) to each application, and an ApplicationMaster running in one of those containers uses the other containers to run individual tasks for the application. The ApplicationMaster manages the details of the application, including communication and coordination among tasks. This architecture is much more scalable than Hadoop’s original one-level scheduling, in which a single central node (the JobTracker) did the work of both the ResourceManager and every ApplicationMaster.

Many other modern distributed systems like Dryad and Mesos have schedulers that are similar to Hadoop’s schedulers. For example, Mesos also supports a pluggable scheduler interface much like Hadoop,2 and it performs two-level scheduling,3 with a central scheduler that registers available resources and assigns them to applications (“frameworks”).
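The division of labor in two-level scheduling can be illustrated with a small Python sketch. The class names borrow Hadoop's terminology, but the methods and the simple slot-counting model are assumptions made for illustration; they are not Hadoop's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Central scheduler: tracks free container slots, knows nothing about tasks."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        """Grant up to `requested` containers, limited by what is free."""
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted


@dataclass
class ApplicationMaster:
    """Per-application scheduler: maps its own tasks onto granted containers."""
    name: str
    pending_tasks: list[str]
    running: list[str] = field(default_factory=list)

    def run(self, rm: ResourceManager) -> None:
        # The master itself occupies one container, so ask for one extra.
        granted = rm.allocate(len(self.pending_tasks) + 1)
        if granted == 0:
            return  # not even room for the master
        worker_slots = granted - 1
        self.running = self.pending_tasks[:worker_slots]
        self.pending_tasks = self.pending_tasks[worker_slots:]


rm = ResourceManager(free_containers=5)
etl = ApplicationMaster("etl", ["map-0", "map-1", "map-2"])
report = ApplicationMaster("report", ["map-0", "map-1"])
etl.run(rm)      # granted 4 containers: 1 for the master, 3 for tasks
report.run(rm)   # only 1 container left: the master runs, its tasks wait
```

The central component only counts containers, while each application decides which of its tasks to run in them; that separation is what keeps the central scheduler's work small as the number of applications grows.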


Dominant Resource Fairness Scheduling

Historically, most schedulers considered only a single type of hardware resource when deciding which container to schedule next, both in calculating the free resources on each node and in calculating how much a given user, group, or queue was already using (e.g., from the point of view of fairness in usage). In the case of Hadoop, only memory usage was considered.

However, in a multi-tenant distributed system, different jobs and containers generally have widely different hardware usage profiles: some containers require significant memory, whereas some use CPU much more heavily (see Figure 2-1). Not considering CPU usage in scheduling meant that the system might be significantly underutilized, and some users would end up getting more or less than their true fair share of the cluster. A policy called Dominant Resource Fairness (DRF)4 addresses these limitations by considering multiple resource types and expressing the usage of each resource in a common currency (the share of the total allocation of that resource), and then scheduling based on the resource each container is using most heavily.


Figure 2-1. Per-container physical memory usage versus CPU usage during a representative period of time on a production cluster. Note that some jobs consume large amounts of memory while using relatively little CPU; others use significant CPU but relatively little memory.

In Hadoop, operators can configure both the Fair Scheduler and the Capacity Scheduler to consider both memory and CPU (using the DRF framework) when considering which container to launch next on a given node.5
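To make the idea of a dominant share concrete, here is a minimal Python sketch of DRF-style allocation. It is a simplified illustration, not the code used by Hadoop's schedulers: the function name, the dictionary layout, and the example cluster (9 CPUs and 18 GB of memory shared by two users) are assumptions chosen for the example.

```python
def drf_allocate(capacity, demands):
    """Repeatedly give one task to the user with the lowest dominant share.

    capacity: total of each resource, e.g. {"cpu": 9, "mem": 18}.
    demands: {user: per-task resource needs}.
    Returns the number of tasks launched per user.
    """
    used = {r: 0.0 for r in capacity}
    allocated = {u: {r: 0.0 for r in capacity} for u in demands}
    tasks = {u: 0 for u in demands}

    def dominant_share(user):
        # A user's dominant share is their largest fractional use of any resource.
        return max(allocated[user][r] / capacity[r] for r in capacity)

    active = set(demands)
    while active:
        user = min(sorted(active), key=dominant_share)
        need = demands[user]
        if any(used[r] + need[r] > capacity[r] for r in capacity):
            active.discard(user)   # this user's next task no longer fits
            continue
        for r in capacity:
            used[r] += need[r]
            allocated[user][r] += need[r]
        tasks[user] += 1
    return tasks


shares = drf_allocate(
    {"cpu": 9, "mem": 18},
    {"A": {"cpu": 1, "mem": 4},    # memory-heavy tasks
     "B": {"cpu": 3, "mem": 1}},   # CPU-heavy tasks
)
print(shares)  # {'A': 3, 'B': 2}
```

In this example, user A ends up with two-thirds of the memory and user B with two-thirds of the CPU, so their dominant shares are equal even though their raw allocations look very different.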


Aggressive Scheduling for Busy Queues

Often a multi-tenant cluster might be in a state where some but not all queues are full; that is, some tenants currently don’t have enough work to use their full share of the cluster, but others have more work than they are guaranteed based on the scheduler’s configured allocation. In such cases, the scheduler might launch more containers from the busy queues to keep the cluster fully utilized.

Sometimes, after those extra containers are launched, new jobs are submitted to a queue that was previously empty; based on the scheduler’s policy, containers from those jobs should be scheduled immediately, but because the scheduler has already opportunistically launched extra containers from other queues, the cluster is full. In those cases, the scheduler might preempt those extra containers by killing some of them in order to reflect the desired fairness policy (see Figure 2-2). Preemption is a common feature in schedulers for multi-tenant distributed systems, including both popular Hadoop schedulers (capacity and fair).

Because preemption inherently results in lost work, it’s important for the scheduler to strike a good balance between starting many opportunistic containers to make use of idle resources and avoiding too much preemption and the waste that it causes. To help reduce the negative impacts of preemption, the scheduler can slightly delay killing containers (to avoid wasting the work of containers that are almost complete) and generally chooses to kill containers that have recently launched (again, to avoid wasted work).
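The victim-selection rule described above can be sketched in a few lines of Python. This is an illustration only: the `Container` fields, the availability of a per-container progress estimate, and the 90 percent "almost complete" threshold are all assumptions, not the behavior of any particular scheduler.

```python
from dataclasses import dataclass


@dataclass
class Container:
    cid: str
    queue: str
    runtime_s: float   # how long the container has been running
    progress: float    # 0.0 to 1.0 (assumed to be reported by the task)


def choose_victims(containers, over_share_queues, needed, spare_if_progress=0.9):
    """Pick up to `needed` containers to kill from queues that exceed their share.

    Nearly finished containers are spared; among the rest, the most recently
    launched go first because they have the least work to lose.
    """
    candidates = [
        c for c in containers
        if c.queue in over_share_queues and c.progress < spare_if_progress
    ]
    candidates.sort(key=lambda c: c.runtime_s)   # youngest first
    return [c.cid for c in candidates[:needed]]


running = [
    Container("a1", "A", runtime_s=3000, progress=0.95),  # almost done: spared
    Container("a2", "A", runtime_s=40, progress=0.05),
    Container("a3", "A", runtime_s=600, progress=0.50),
    Container("b1", "B", runtime_s=10, progress=0.01),    # queue B is within its share
]
victims = choose_victims(running, over_share_queues={"A"}, needed=2)
print(victims)  # ['a2', 'a3']
```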


Figure 2-2. When new jobs arrive in Queue A, they might be scheduled if there is sufficient unused cluster capacity, allowing Queue A to use more than its guaranteed share. If jobs later arrive in Queue B, the scheduler might then preempt some of the Queue A jobs to provide Queue B its guaranteed share.

A related concept is used by Google’s Borg system,6 which has a concept of priorities and quotas; a quota represents a set of hardware resource quantities (CPU, memory, disk, etc.) for a period of time, and higher-priority quota costs more than lower-priority quota. Borg never allocates more production-priority quota than is available on a given cluster; this guarantees production jobs the resources they need. At any given time, excess resources that are not being used by production jobs can be used by lower-priority jobs, but those jobs can be killed if the production jobs’ usage later increases. (This behavior is similar to another kind of distributed system, Amazon Web Services, which has a concept of guaranteed instances and spot instances; spot instances cost much less than guaranteed ones but are subject to being killed at any time.)


Special Scheduling Treatment for Small Jobs

Some cluster operators provide special treatment for small or fast jobs; in a sense, this is the opposite of preemption. One example is LinkedIn’s “fast queue” for Hadoop, which is a small queue that is used only for jobs that take less than an hour total to run and whose containers each take less than 15 minutes.7 If jobs or containers violate this limit, they are automatically killed. This feature provides fast response for smaller jobs even when the cluster is bogged down by large batch jobs; it also encourages developers to optimize their jobs to run faster.

The Hadoop vendor MapR provides somewhat similar functionality with its ExpressLane,8 which schedules small jobs (as defined by having few containers, each with low memory usage and small input data sizes) to run on the cluster even when the cluster is busy and has no additional capacity for normal jobs. This is also an interesting example of using the input data size as a cue to the scheduler about how fast a container is likely to be.


Workload-Specific Scheduling Considerations

Aside from the general goals of high utilization and fairness across users and queues, schedulers might take other factors into account when deciding which containers to launch and where to run them.

For example, a key design point of Hadoop is to move computation to the data. (The goal is to not just get the nodes to work as hard as they can, but also get them to work more efficiently.) The scheduler tries to accomplish this goal by preferring to place a given container on one of the nodes that have the container’s input HDFS data stored locally; if that can’t be done within a certain amount of time, it then tries to place the container on the same rack as a node that has the HDFS data; if that also can’t be done after waiting a certain amount of time, the container is launched on any node that has available computing resources. Although this approach increases overall system efficiency, it complicates the scheduling problem.
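The fallback from node-local to rack-local to any node can be sketched as a single placement function. The function name, its parameters, and the wait thresholds of 5 and 15 seconds are hypothetical values chosen for illustration; real schedulers express these delays in their own configuration.

```python
def place_container(waited_s, nodes_with_data, free_nodes, rack_of,
                    node_wait_s=5.0, rack_wait_s=15.0):
    """Return (node, locality) for a container, or None to keep waiting.

    waited_s: how long the container has been waiting for a slot.
    nodes_with_data: nodes holding the container's input blocks.
    free_nodes: nodes with spare capacity right now.
    rack_of: mapping of node name to rack name.
    """
    local = [n for n in free_nodes if n in nodes_with_data]
    if local:
        return local[0], "node-local"
    if waited_s < node_wait_s:
        return None                      # hold out for a node-local slot
    data_racks = {rack_of[n] for n in nodes_with_data}
    same_rack = [n for n in free_nodes if rack_of[n] in data_racks]
    if same_rack:
        return same_rack[0], "rack-local"
    if waited_s < rack_wait_s:
        return None                      # hold out for a rack-local slot
    return (free_nodes[0], "off-rack") if free_nodes else None


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
print(place_container(1, {"n1"}, ["n2", "n3"], rack_of))   # None
print(place_container(6, {"n1"}, ["n2", "n3"], rack_of))   # ('n2', 'rack-local')
print(place_container(20, {"n1"}, ["n3"], rack_of))        # ('n3', 'off-rack')
```

The waiting periods are what complicate scheduling: a container that returns `None` leaves a free slot temporarily unused in the hope of a better placement.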

An example of a different kind of placement constraint is the support for pods in Kubernetes. A pod is a group of containers, such as Docker containers, that are scheduled at the same time on the same node. Pods are frequently used to provide services that act as helper programs for an application. Unlike the preference for data locality in Hadoop scheduling, the colocation and coscheduling of containers in a pod is a hard requirement; in many cases the application simply would not work without the auxiliary services running on the same node.

A weaker constraint than colocation is the concept of gang scheduling, in which an application requires all of its resources to run concurrently, but they don’t need to run on the same node. An example is a distributed database like Impala, which needs to have all of its “query fragments” running in order to serve queries. Although some distributed systems’ schedulers support gang scheduling natively, Hadoop doesn’t currently support gang scheduling; applications that require concurrent containers mimic gang scheduling by keeping containers alive but idle until all of the required containers are running. This workaround clearly wastes resources because these idle containers hold resources and stop other containers from running. However, even when gang scheduling is done “cleanly” by the scheduler, it can lead to inefficiencies because the scheduler needs to avoid fully loading the cluster with other containers to ensure that enough space will eventually be available for the entire gang to be scheduled.
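The all-or-nothing nature of gang scheduling can be shown with a short Python sketch. The slot-counting model and the function name are assumptions for illustration; the point is only that no member of the gang is placed unless every member fits.

```python
def gang_schedule(free_slots, gang):
    """All-or-nothing placement: every member of `gang` gets a slot, or none do.

    free_slots: {node: number of free container slots}.
    gang: list of container names that must run concurrently.
    Returns {container: node}, or None if the whole gang does not fit.
    """
    # Expand the free capacity into one entry per slot, in a stable node order.
    slots = [node for node in sorted(free_slots) for _ in range(free_slots[node])]
    if len(slots) < len(gang):
        return None          # place nothing, so no container sits idle waiting
    return dict(zip(gang, slots))


fragments = ["frag-0", "frag-1", "frag-2"]
print(gang_schedule({"n1": 1, "n2": 2}, fragments))
# {'frag-0': 'n1', 'frag-1': 'n2', 'frag-2': 'n2'}
print(gang_schedule({"n1": 1, "n2": 1}, fragments))  # None
```

By contrast, the workaround described above would place the two fragments that fit and leave them idle until a third slot appears.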

As a side note, workflow schedulers such as Oozie are given information about the dependencies among jobs in a complex workflow that must happen in order; the workflow scheduler then submits the individual jobs to the distributed system on behalf of the user. A workflow scheduler can take into account the required inputs and outputs of each stage (including inputs that depend on some off-cluster process to write new data to the cluster), the time of day the workflow should be started, awareness of the full directed acyclic graph (DAG) of the entire workflow, and similar constraints. Generally, the workflow scheduler is distinct from the distributed system’s own scheduler that determines exactly where and when containers are launched on each node, but there are cases when overall scheduling can be much more efficient if workflow scheduling and resource scheduling are combined.9
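As a small illustration of how a workflow scheduler can use the DAG, the following Python sketch uses the standard library's `graphlib.TopologicalSorter` to work out which jobs can be submitted at each point. The job names are hypothetical, and this is not how Oozie itself is implemented or configured.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "ingest": set(),
    "clean": {"ingest"},
    "train_model": {"clean"},
    "daily_report": {"clean"},
    "publish": {"train_model", "daily_report"},
}

sorter = TopologicalSorter(workflow)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())   # jobs whose dependencies are all done
    waves.append(ready)                  # a real scheduler would submit these now
    sorter.done(*ready)                  # pretend they finished successfully

print(waves)
# [['ingest'], ['clean'], ['daily_report', 'train_model'], ['publish']]
```

Jobs in the same wave have no dependency on one another, so the workflow scheduler is free to submit them to the cluster at the same time.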


Inefficiencies in Scheduling

Although schedulers have become more sophisticated over time, they continue to suffer from inefficiencies related to the diversity of workloads running on multi-tenant distributed systems. These inefficiencies arise from the need to avoid overcommitting memory when doing up-front scheduling, a limited ability to consider all types of hardware resources, and challenges in considering the dependencies among all jobs and containers within complicated workflows.


The Need to be Conservative with Memory

Distributed system schedulers generally make scheduling decisions based on conservative assumptions about the hardware resources (especially memory) required by each container. These requirements are usually declared by the job author based on the worst-case usage, not the actual usage. This difference is critical because often different containers from the same job have different actual resource usage, even if they are running identical code. (This happens, for example, when the input data for one container is larger or otherwise different from the input data for other containers, resulting in a need for more processing or more space in memory.)

If a node’s resources are fully scheduled and the node is “unlucky” in the mix of containers it’s running, the node can be overloaded; if the resource that is overloaded is memory, the node might run out of memory and crash or start swapping badly. In a large distributed system, some nodes are bound to be unlucky in this way, so if the scheduler does not use conservative resource usage estimates, the system will nearly always be in a bad state.

The need to be conservative with memory allocation means that most nodes will be underutilized most of the time; containers generally do not often use their theoretical maximum memory, and even when they do, it’s not for the full lifetime of the container (see Figure 2-3). (In some cases, containers can use even more than their declared maximum. Systems can be more or less stringent about enforcing what the developer declares: some systems kill containers when they exceed their maximum memory, but others do not.10)
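A little arithmetic shows how large the gap between reserved and used memory can be. The Python sketch below uses invented numbers for a single 64 GB node; the function name and the container figures are purely illustrative and are not measurements from any real cluster.

```python
def node_memory_report(node_mem_gb, containers):
    """Summarize reserved versus actually used memory on one node.

    containers: list of (declared_max_gb, actual_peak_gb) pairs.
    """
    reserved = sum(declared for declared, _ in containers)
    # Summing individual peaks overstates real usage, since peaks rarely coincide.
    peak_used = sum(actual for _, actual in containers)
    return {
        "reserved_fraction": reserved / node_mem_gb,
        "peak_used_fraction": peak_used / node_mem_gb,
        "idle_but_reserved_gb": reserved - peak_used,
    }


report = node_memory_report(64, [(8, 3), (8, 5), (16, 6), (16, 9), (16, 4)])
print(report)
# {'reserved_fraction': 1.0, 'peak_used_fraction': 0.421875, 'idle_but_reserved_gb': 37}
```

In this made-up case the scheduler sees the node as completely full, yet even if every container hit its peak at the same moment, more than half of the node's memory would remain untouched.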


Figure 2-3. Actual physical memory usage compared to the container size (the theoretical maximum) for a typical container. Note that the actual usage changes over time and is much smaller than the reserved amount.

To reduce the waste associated with this underutilization, operators of large multi-tenant distributed systems often must perform a balancing act, trying to increase cluster utilization without pushing nodes over the edge. As described in Chapter 4, software like Pepperdata provides a way to increase utilization for distributed systems such as Hadoop by monitoring actual physical memory usage and dynamically allowing more or fewer processes to be scheduled on a given node, based on the current and projected future memory usage on that node.


Inability to Effectively Schedule the Use of Other

limited to a single core. In contrast, disk I/O and network usage frequently vary by orders of magnitude, and they spike very quickly. They are also effectively unlimited in how much of the corresponding resource they use: one single thread on one machine can easily saturate the full network bandwidth of the node or use up all available disk I/O operations per second (IOPS) and bandwidth from dozens of disks (including even disks on multiple machines, when the thread is requesting data stored on another node). See Figure 2-4 for the usage of various resources for a sample job. The left column shows overall usage for all map tasks (red, starting earlier) and reduce tasks (green, starting later). The right column shows a breakdown by individual task. (For this particular job, there is only one reduce task.)

Because CPU, disk, and network usage can change so quickly, it is impossible for any system that only does up-front scheduling to optimize cluster utilization and provide true fairness in the use of hardware resources.


Figure 2-4. The variation over time in usage of different hardware resources for a typical MapReduce job (source: Pepperdata).

Some schedulers (such as those in Hadoop) characterize computing nodes in a fairly basic way, allocating containers to a machine based on its total RAM and the number of cores. A more powerful scheduler would be aware of different hardware profiles (such as CPU speed and the number and type of hard drives) and match the workload to the right machine. (A somewhat related approach is budget-driven scheduling for heterogeneous clusters, where each node type might have both different hardware profiles and different costs.11) Similarly, although modern schedulers use DRF to help ensure fairness across jobs that have different resource usage characteristics, DRF does not optimize efficiency; an improved scheduler could use the cluster as a whole more efficiently by ensuring that each node has a mix of different types of workloads, such as CPU-heavy workloads running alongside data-intensive workloads that use much more disk I/O and memory. (This multidimensional packing problem is NP-hard,12 but simple heuristics could help performance significantly.)
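One example of such a simple heuristic is sketched below in Python: each container goes to the node whose most heavily loaded resource stays lowest after the placement. This is only one of many possible heuristics, it is not taken from any scheduler discussed here, and the two-resource model (`cpu` and `io`) and all names are assumptions made for illustration.

```python
def pack(containers, nodes, capacity):
    """Greedy heuristic: put each container on the node whose most-loaded
    resource stays lowest after placement. Returns {container: node}.

    containers: {name: {"cpu": ..., "io": ...}}; capacity: per-node limits.
    """
    load = {n: {r: 0.0 for r in capacity} for n in nodes}
    placement = {}
    # Place the largest containers first, a common bin-packing heuristic.
    order = sorted(containers, key=lambda c: -max(
        containers[c][r] / capacity[r] for r in capacity))
    for name in order:
        need = containers[name]
        best, best_peak = None, None
        for n in nodes:
            after = [(load[n][r] + need[r]) / capacity[r] for r in capacity]
            if max(after) > 1.0:
                continue                 # would overload some resource
            if best is None or max(after) < best_peak:
                best, best_peak = n, max(after)
        if best is None:
            continue                     # nothing fits; leave it queued
        for r in capacity:
            load[best][r] += need[r]
        placement[name] = best
    return placement


placement = pack(
    {"cpu_a": {"cpu": 7, "io": 1}, "cpu_b": {"cpu": 7, "io": 1},
     "io_a": {"cpu": 1, "io": 7}, "io_b": {"cpu": 1, "io": 7}},
    nodes=["n1", "n2"],
    capacity={"cpu": 10, "io": 10},
)
print(placement)
# {'cpu_a': 'n1', 'cpu_b': 'n2', 'io_a': 'n1', 'io_b': 'n2'}
```

Each node ends up with one CPU-heavy and one I/O-heavy container, whereas placing the two CPU-heavy containers together would not fit at all.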


Deadlock and Starvation

In some cases, schedulers might choose to start some containers in a job’s DAG even before the preceding containers (the dependencies) have completed. This is done to reduce the total run time of the job or spread out resource usage over time.

NOTE

In the interest of concreteness, the discussion in this section uses map and reduce containers, but similar effects can happen any time a job has some containers that depend on the output of others; the problems are not specific to MapReduce or Hadoop.

An example is Hadoop’s “slow start” feature, in which reduce containers might be launched before all of the map containers they depend on have completed. This behavior can help minimize spikes in network bandwidth usage by spreading out the heavy network traffic of transferring data from mappers to reducers. However, starting a reduce container too early means that it might end up just sitting on a node waiting for its input data (from map containers) to be generated, which means that other containers are not able to use the memory the reduce container is holding, thus affecting overall system utilization.13
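The decision behind slow start boils down to a threshold on the fraction of completed map containers. The Python sketch below is a simplified model with hypothetical names; in Hadoop MapReduce the corresponding threshold is, to our understanding, set through the `mapreduce.job.reduce.slowstart.completedmaps` property.

```python
def may_launch_reducers(completed_maps, total_maps, slowstart_fraction):
    """True once enough map containers have finished to start reducers.

    A slowstart_fraction near 0 starts reducers early (spreading shuffle traffic
    but holding memory idle); 1.0 waits for every map container to finish.
    """
    if total_maps == 0:
        return True
    return completed_maps / total_maps >= slowstart_fraction


print(may_launch_reducers(4, 100, 0.05))    # False
print(may_launch_reducers(5, 100, 0.05))    # True
print(may_launch_reducers(99, 100, 1.0))    # False
```

Raising the threshold trades a sharper burst of shuffle traffic for less time in which reduce containers hold memory without doing useful work.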

This problem is especially common on very busy clusters with many tenants because often not all map containers from a job can be scheduled in quick succession; similarly, if a map container fails (for example, due to node failure), it might take a long time to get rescheduled, especially if other, higher-priority jobs have been submitted after the reducers from this job were scheduled. In extreme cases this can lead to deadlock, when the cluster is occupied by reduce containers that are unable to proceed because the containers they depend on cannot be scheduled.14 Even if deadlock does not occur, the cluster can still be utilized inefficiently, and overall job completion can be unnecessarily slow as measured by wall-clock time, if the scheduler launches just a small number of containers from each of many users at one time.


A similar scheduling problem is starvation, which can occur on a heavily loaded cluster. For example, consider a case in which one job has containers that each need a larger amount of memory than containers from other jobs. When one of the small containers completes on a node, a naive scheduler will see that the node has a small amount of memory available, but because it can’t fit one of the large containers there, it will schedule a small container to run. In the extreme case, the larger containers might never be scheduled. In Hadoop and other systems, the concept of a reservation allows an application to reserve available space on a node, even if the application can’t immediately use it.15 (This behavior can help avoid starvation, but it also means that the overall utilization of the system is lower, because some amount of resources might be reserved but unused at any particular time.)
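The starvation scenario and the effect of a reservation can be reproduced with a toy model. The sketch below is a deliberately simplified assumption, not Hadoop's reservation mechanism: it models a single node, memory in arbitrary units, one small container finishing per step, and an unlimited backlog of small containers waiting behind one large one. The function name simulate and all of its parameters are hypothetical.

```python
def simulate(use_reservation, steps=12, capacity=4, small=1, large=3):
    """Return the step at which the large container starts, or None.

    Toy model: one node; a small container finishes every step, and an
    unlimited backlog of small containers waits behind one large one.
    """
    running_small = capacity // small   # node starts full of small containers
    free = capacity - running_small * small
    for step in range(1, steps + 1):
        # One small container completes and releases its memory.
        if running_small > 0:
            running_small -= 1
            free += small
        if free >= large:
            return step                 # the large container finally fits
        if not use_reservation:
            # Naive policy: fill the gap with whatever fits right now.
            while free >= small:
                running_small += 1
                free -= small
        # With a reservation, the freed memory is held idle for the
        # large container instead of being handed to small ones.
    return None


print(simulate(use_reservation=False))  # large container never starts
print(simulate(use_reservation=True))   # large container starts at step 3
```

In the reservation case the node sits partly idle for two steps while memory accumulates, which is the utilization cost described above.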


Waste Due to Speculative Execution

Operators can configure Hadoop to use speculative execution, in which the scheduler can observe that a given container seems to be running more slowly than is typical for that kind of container and start another copy of that container on another node. This behavior is primarily intended to avoid cases in which a particular node is performing badly (usually due to a hardware problem) and an entire job could be slowed down due to just one straggler container.16

While speculative execution can reduce job completion time due to node problems, it wastes resources when the container that is duplicated simply had more work to do than other containers and so naturally ran longer. In practice, experienced operators typically disable speculative execution on multi-tenant clusters, both because there is generally inherent container variation (not due to hardware problems) and because the operators are constantly watching for bad hardware, so speculative execution does not enhance performance.
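The reason speculation wastes resources can be seen in a small sketch of a straggler detector. This is not Hadoop's actual speculation algorithm; the function pick_speculation_candidates, the progress-rate rule, and the 0.5 threshold are all assumptions chosen for illustration. The detector sees only how fast each container is progressing, so it cannot tell a container on a failing node from one that simply has more input to process.

```python
from statistics import median


def pick_speculation_candidates(tasks, slowdown=0.5):
    """Flag tasks whose progress rate is well below the median rate.

    tasks maps a task name to (progress fraction in [0, 1], elapsed seconds).
    The 0.5 threshold is an arbitrary value chosen for this example.
    """
    rates = {name: progress / elapsed
             for name, (progress, elapsed) in tasks.items()}
    typical = median(rates.values())
    return sorted(name for name, rate in rates.items()
                  if rate < slowdown * typical)


tasks = {
    "map-1": (0.80, 100),   # healthy node, average input
    "map-2": (0.75, 100),   # healthy node, average input
    "map-3": (0.78, 100),   # healthy node, average input
    "map-4": (0.20, 100),   # failing disk: a genuine straggler
    "map-5": (0.25, 100),   # healthy node, but about 3x the input data
}
print(pick_speculation_candidates(tasks))  # ['map-4', 'map-5']
```

Both map-4 and map-5 are flagged, but duplicating map-5 gains nothing: the copy has the same amount of work to do and only consumes another container's worth of resources.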


Over time, distributed system schedulers have grown in sophistication from a very simple FIFO algorithm to add the twin goals of fairness across users and increased cluster utilization. Those two goals must be balanced against each other; on multi-tenant distributed systems, operators often prioritize fairness. They do so to reduce the level of user-visible scheduling issues as well as to keep multiple business units satisfied to use shared infrastructure rather than running their own separate clusters. (In contrast, configuring the scheduler to maximize utilization could save money in the short term but waste it in the long term, because many small clusters are less efficient than one large one.)

Schedulers have also become more sophisticated by better taking into account multiple hardware resource requirements (for example, not considering only memory) and effectively treating different kinds of workloads differently when scheduling decisions are made. However, they still suffer from limitations, for example being conservative in resource allocation to avoid instability due to overcommitting resources such as memory. That conservatism can keep the cluster stable, but it results in lower utilization and slower run times than the hardware could actually support. Software solutions that make real-time, fine-grained decisions about resource usage can provide increased utilization while maintaining cluster stability and providing more predictable job run times.

1. The new architecture is referred to as Yet Another Resource Negotiator (YARN) or MapReduce v2. See https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/YARN.html.

2. See http://mesos.apache.org/api/latest/java/org/apache/mesos/Scheduler.html.

3. See https://www.quora.com/How-does-two-level-scheduling-work-in-Apache-Mesos.

4. Ghodsi, Ali, et al. “Dominant Resource Fairness: Fair Allocation of Multiple Resource Types.” NSDI, Vol. 11, 2011. https://www.cs.berkeley.edu/~alig/papers/drf.pdf

5. See http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/FairScheduler.html and http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html.

6. Verma, Abhishek, et al. “Large-scale cluster management at Google with Borg.” Proceedings of the Tenth European Conference on Computer Systems. ACM, 2015.

7. See slide 9 of http://www.slideshare.net/Hadoop_Summit/hadoop-operations-at-linkedin.

8. See http://doc.mapr.com/display/MapR/ExpressLane.

9. Nurmi, Daniel, et al. “Evaluation of a workflow scheduler using integrated performance modelling and batch queue wait time prediction.” Proceedings of the 2006 ACM/IEEE conference on Supercomputing. ACM, 2006. http://www.cs.ucsb.edu/~nurmi/nurmi_workflow.pdf

10. For example, Google’s Borg kills containers that try to exceed their declared memory limit. Hadoop by default lets containers go over, but operators can configure it to kill such containers.

11. Wang, Yang and Wei Shi. “Budget-driven scheduling algorithms for batches of MapReduce jobs in heterogeneous clouds.” IEEE Transactions on Cloud Computing 2.3 (2014): 306-319. https://www.researchgate.net/publication/277583513_Budget-Driven-Scheduling-Algorithms-for-Batches-of-MapReduce-Jobs

12. See, for example, Chekuri, Chandra and Sanjeev Khanna. “On multidimensional packing problems.” SIAM Journal on Computing 33.4 (2004): 837-851. http://repository.upenn.edu/cgi/viewcontent.cgi?article=1080&context=cis_papers

13. See http://stackoverflow.com/questions/11672676/when-do-reduce-tasks-start-in-hadoop/11673808#11673808.

14. See https://issues.apache.org/jira/browse/MAPREDUCE-314 for an example.

15. See Sulistio, Anthony, Wolfram Schiffmann, and Rajkumar Buyya. “Advanced reservation-based scheduling of task graphs on clusters.” International Conference on High-Performance Computing. Springer Berlin Heidelberg, 2006. http://www.cloudbus.org/papers/workflow_hipc2006.pdf For related recent work in Hadoop, see Curino, Carlo et al. “Reservation-based Scheduling: If You’re Late Don’t Blame Us!” Proceedings of the ACM Symposium on Cloud Computing. ACM, 2014. https://www.microsoft.com/en-us/research/publication/reservation-based-scheduling-if-youre-late-dont-blame-us/.

16. This is different from the standard use of the term “speculative execution” in which pipelined microprocessors sometimes execute both sides of a conditional branch before knowing which branch will be taken.


Chapter 3. CPU Performance Considerations


Historically, large-scale distributed systems were designed to perform massive amounts of numerical computation, for example in scientific simulations run on high-performance computing (HPC) platforms. In most cases, the work done on such systems was extremely compute intensive, so the CPU was often the primary bottleneck.

Today, distributed systems tend to run applications for which the large scale is driven by the size of the input data rather than the amount of computation needed; examples include both special-purpose distributed systems (such as those powering web search among billions of documents) and general-purpose systems such as Hadoop. (However, even in those general systems, there are still some cases such as iterative algorithms for machine learning where making efficient use of the CPU is critical.)

As a result, the CPU is often not the primary bottleneck limiting a distributed system; nevertheless, it is important to be aware of the impacts of CPU on overall speed and throughput.

At a high level, the effect of CPU performance on distributed systems is driven by three primary factors:

The efficiency of the program that’s running, at the level of the code as well as how the work is broken into pieces and distributed across nodes

Low-level kernel scheduling and prioritization of the computational work done by the CPU, when the CPU is not waiting for data

The amount of time the CPU spends waiting for data from memory, disk, or network

These factors are important for the performance even of single applications running on a single machine; they are just as important, and even more complicated, for multi-tenant distributed systems due to the increased number and diversity of processes running on those systems, and their varied input data sources.
