NATIONAL UNIVERSITY OF SINGAPORE

Performance Analysis of MapReduce Computing Framework

FOR THE DEGREE OF MASTER OF SCIENCE

September 2011
Performance Analysis of MapReduce Computing Framework

Hou Song
songhou@comp.nus.edu.sg
Abstract
MapReduce is an increasingly popular distributed computing framework, especially for large-scale data analysis. Although it has been adopted in many places, a theoretical analysis of its behavior is lacking. This thesis introduces an analytical model for MapReduce with three parts: the average task performance, the random behavior and the waiting time. The model is then validated using measured data from three categories of workloads. Its usefulness is demonstrated by three optimization processes, which give reasonable conclusions that nevertheless differ from current understandings.
Acknowledgments

I would like to express my gratitude to my supervisor, Professor Tay Yong Chiang, for his countless guidance and advice. Without his help I could not have completed this thesis in time. I would also like to thank my University for supporting me financially, without which I could not have survived.
My parents and brother always trust me and give me continuous support. I owe them so much.
Finally, I would like to thank Shi Lei, Vo Hoang Tam and many other lab mates for their generous reviews and suggestions.
Table of Contents

Acknowledgments
List of Figures
List of Tables
Summary
1 Introduction
  1.1 Background
  1.2 Motivation for an Analytical Model
  1.3 Overview
2 Related Work
  2.1 Data Storage
    2.1.1 The Google File System
    2.1.2 Bigtable
  2.2 Data Manipulation
    2.2.1 MapReduce
    2.2.2 Dryad
3 System Description
  3.1 Architecture of Hadoop MapReduce
  3.2 Representative Workload
  3.3 Experimental Setup and Measured Results
4 Model Description
  4.1 Assumptions
  4.2 Related Theory Overview
  4.3 Notation Table
  4.4 Disassembled Sub-models
    4.4.1 The Average Task Performance
    4.4.2 The Random Behavior
    4.4.3 The Waiting Time
  4.5 The Global Model
5 Model Validation
  5.1 Database Query
  5.2 Random Number Generation
  5.3 Sorting
6 Model Application
  6.1 Procedures of Optimization using Gradient Descent
  6.2 Optimal Number of Reducers Per Job
  6.3 Optimal Block Size
  6.4 Optimal Cluster Size
  6.5 Summary
List of Figures

1 Basic infrastructure of GFS
2 An example table that stores Web pages
3 Bigtable tablet representation
4 Basic structure of MapReduce framework
5 A Dryad job DAG
6 Example figure of throughput
7 Example figure of times of each phase
8 Example figure of numbers of tasks in each phase
9 Queueing model of a slave node
10 Histograms of tasks' response time
11 Example figure of randomness T∆
12 Regular pattern for T∆
13 Transformation procedure to get equation for T∆
14 Measured time vs. calculated time for database query
15 Measured ∆ vs. calculated ∆ for database query
16 Measured response time vs. calculated response time for database query
17 Measured time vs. calculated time for random number generation
18 Measured ∆ vs. calculated ∆ for random number generation
19 Measured response time vs. calculated response time for random number generation
20 Measured time vs. calculated time for sorting
21 Measured ∆ vs. calculated ∆ for sorting
22 Measured response time vs. calculated response time for sorting
23 An example of gradient descent usage
24 The process of gradient descent for the optimization of Hr for sorting
25 The gradients for the optimization of Hr for sorting
26 The trend of gradients for the optimization of Hr for sorting
27 The process of gradient descent for the optimization of Hr for database query
28 The gradients for the optimization of Hr for database query
29 The process of gradient descent for the optimization of block size for sorting
30 The gradients for the optimization of block size for sorting
31 The trend of gradients for the optimization of block size for sorting
32 The process of gradient descent for the optimization of block size for database query
33 The gradients for the optimization of block size for database query
34 The process of gradient descent for cluster size for database query
35 The gradients for the optimization of cluster size for database query
36 The process of gradient descent for cluster size for sorting
37 The gradients for the optimization of cluster size for sorting
List of Tables

1 Symbols and notations
2 System default values
3 System optimization conclusions
1 Introduction

The problems people are trying to solve are growing far beyond the capability of a single computer, and distributed computing is an inevitable direction. For example, Internet companies use tens of thousands of machines to process an enormous number of concurrent user requests. MapReduce is a useful and popular distributed computing framework that is widely adopted in both industry and academia, because it is simple to use yet able to provide good scalability and high performance. However, its performance has not been fully studied yet. There are papers that try to improve MapReduce's design and implementation through experiments or simulations, but no one has yet proposed an analytical model, which can overcome the weaknesses of the first two methods.
This thesis investigates the details of MapReduce and proposes an analytical model based on a categorization of typical workloads. The model consists of three parts: the average task performance, which is a modified multiclass processor sharing queue; the random behavior, which is a curve fitted from common observations; and the waiting time, which uses a deterministic waiting equation. The model is then validated using measured data from all three workload categories. Finally, the model is applied in three optimizations, demonstrating its usefulness in configuring MapReduce computations. The conclusions of these optimizations provide new understandings of MapReduce's performance behavior.
There are various proposed frameworks and tools intended to help developers. Message Passing Interface (MPI) [12] is a successful communication protocol with a wide range of adoption. It provides convenient ways to send point-to-point or multicast messages. MPI is scalable and portable, and remains dominant for high performance computing systems that focus on raw computation, such as traditional climate simulation. However, it does not remove the difficulty that developers still have to resort to low-level primitives to accomplish complex logic such as synchronization, failure detection and recovery, comparable to writing a sequential program in assembly language. These issues make a distributed system hard to design and implement, and tricky to verify for correctness. Many of these aspects, however, share common operations that could be provided by the underlying system, thus relieving the burden on programmers.

From a broader point of view, there are two types of high performance computer systems: one for raw computation power and the other for data processing. The first type has a longer history, with its concentration on the number of calculation operations per second; the systems in the TOP500 list [28] are good examples. As people collect and generate more and more data, such as Internet web pages, photos and videos on social network sites, health records, telescope imagery, transaction logs and so on, automatic processing of these data using large computer systems is in high demand. For example, successful data mining of transaction records from a supermarket can give the manager a better understanding of the business and its customers,
therefore improving the business to a new level. Fast and accurate processing of telescope images may lead to breakthrough scientific discoveries. Database systems are designed to manage large data, but up to 70% to 80% of online data are unstructured and may be used only a few times, and their processing is inefficient, or even infeasible, with existing commercial databases [7]. New systems are being designed [4, 11], and companies such as Google, Microsoft and Amazon have turned these designs into commercial systems that operate tens of thousands of computers. The accumulated power of these low-end computers makes it possible to analyse the whole Internet in a timely fashion, support large transaction systems, and much more.
This new trend is also attracting attention from smaller companies and researchers, who do not have access to large computing infrastructures like those of Google and Microsoft. However, in the era of cloud computing, people can rent machines in remote clouds and start their own cluster at a very low price. The immediate question is then: given the workload and service level objectives, how many machines are needed? Other challenges include cluster parameter optimization, system upgrading, scheduler design decisions, and cluster sharing. To answer these questions, engineers and researchers need to understand the relationship between system performance, system parameter settings and the characteristics of the workload, which is the major research direction of this thesis.
Although there are many such successful systems, their performance has not been studied deeply, especially for the newly established massive-scale data processing systems. MapReduce [9], the distributed computing framework originally from Google, is an example. It is very popular thanks to its expressive power and simplicity of use. There is some work [24, 36, 29] trying to study and improve the performance of MapReduce, but to the best of our knowledge, no one has proposed an analytic model for it. This thesis focuses mainly on MapReduce, and tries to design an analytical model that characterizes its performance.
There are usually three ways to study a system: experiment, simulation and analytical modelling. Experiments are accurate, but sometimes too slow to be feasible. Simulation extracts only the necessary details and runs faster, but is still not practical when the parameter space is too large; for example, a distributed system could have hundreds of parameters. An analytical model describes a system abstractly with mathematical equations, taking system parameters as input and producing system performance metrics as output. A well developed analytical model can hide unnecessary details, explore the whole parameter space conveniently, and possibly unveil obscure truths that are impossible to obtain otherwise.
In this thesis we first study the details of Hadoop, an open source implementation of MapReduce, and categorize ordinary workloads into three groups according to the size of their input and output. We divide MapReduce execution into several pieces, and develop models for each of them. Then we validate the accuracy of this model using measured data. Finally we show its applications using three examples, which provide useful conclusions. For example, the common practice of using a larger block size to get a better sorting benchmark may lead to even longer response time, and MapReduce's scalability can be influenced by the design of MapReduce jobs, so improvements could be made according to this finding. The thesis concludes in Section 7, which also sets the plan for future work.
2 Related Work
Modern data processing systems are growing in both size and complexity, but their theory and guidelines do not change very frequently. The prediction from [10] remains valid today: large scale database systems should be built on a shared-nothing architecture made of conventional hardware. During the last few decades the shared-nothing architecture has developed quickly, attracting popularity from both academia and industry. A number of systems have been implemented in this area, which will be introduced later in this section.
Intuitively, a data processing system has two major parts: how data are stored and how they are manipulated. Therefore, this section on related work is organized into two corresponding subsections.
2.1 Data Storage

At the lowest level, data storage is a sequence of bytes kept on disks or other stable storage devices. However, bytes can be interpreted in different forms, ranging from simple bit streams to groups of records to nested objects. In order to control and share data more conveniently, database technologies were invented and are now widely deployed.
In recent years the requirements for data storage have become more demanding. These requirements include, but are not limited to, volumes that scale up to petabytes, concurrency control among billions of objects, fault tolerance and failover, and high throughput for both reads and writes. Traditional storage and database systems can become awkward when handling all of these, so new approaches have been proposed.

2.1.1 The Google File System
The Google File System (GFS) [13] was originally designed and implemented by Google to meet their needs. Many decisions and optimizations were made to fit the environment they were in. This environment is not unique to Google, and its open source counterpart, the Hadoop Distributed File System [16], is now widely used in many companies and institutes [15].
GFS is based on several assumptions. First, failures are the norm rather than the exception. Because of the large number of machines gathered together, expensive, highly reliable hardware does not help much. Instead, Google's system consists of thousands of machines built from commodity hardware, so fault tolerance is necessary. Second, the system should manage very large files gracefully. The system is able to store small files, but the priority is large files whose sizes are measured in gigabytes. Third, the workload is primarily one type of write and two types of reads: large sequential writes that append to the end of files, and large sequential reads or small random reads. Finally, the objective is sustainable performance for a large number of concurrent clients, with more emphasis on high overall throughput than on the response time of a single job [13].
Under these assumptions, GFS adopts a single-master, multiple-slave architecture as its basic form. Files are divided into chunks, 64MB in size by default, which are the unit of file distribution. The master maintains all the meta-data, such as the file system structure, the chunk identifiers of files, and the locations of chunks. It has all the information and controls all the actions. The slave nodes, called chunkservers, store the actual data in chunks assigned by the master, and serve them directly to clients. The master and chunkservers exchange all kinds of information, such as primary copy leases, meta-data updates and server registrations, periodically through heartbeats. The work of the master node is minimized to avoid overloading it.
When a client wants to read a file, it retrieves the chunk handles and locations from the master using the file name and the offset within the file. The client caches this piece of meta-data locally to limit the communication with the master. It then chooses the closest location among all possibilities (local or within the same rack), and initiates a connection with that chunkserver to stream the actual data.
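The read path can be summarized in a few lines of code. The following is a minimal sketch with toy in-memory Master, Chunkserver and Client classes and invented names, assuming the default 64MB chunk size; it only illustrates the flow described above (chunk lookup by offset, client-side meta-data caching, choosing the closest replica), not the actual GFS interfaces.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS default chunk size

class Master:
    """Holds only meta-data: (filename, chunk index) -> (chunk handle, replica locations)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table
    def lookup(self, filename, chunk_index):
        return self.chunk_table[(filename, chunk_index)]

class Chunkserver:
    """Stores chunk data keyed by chunk handle."""
    def __init__(self, name, distance, chunks):
        self.name, self.distance, self.chunks = name, distance, chunks
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # meta-data cache to limit traffic to the master
    def read(self, filename, offset, length):
        idx = offset // CHUNK_SIZE                         # which chunk holds this offset
        if (filename, idx) not in self.cache:              # ask the master only on a cache miss
            self.cache[(filename, idx)] = self.master.lookup(filename, idx)
        handle, replicas = self.cache[(filename, idx)]
        closest = min(replicas, key=lambda s: s.distance)  # prefer the local / same-rack replica
        return closest.read(handle, offset % CHUNK_SIZE, length)

# toy usage: one file, one chunk, two replicas at different "distances"
local = Chunkserver("cs-local", 0, {"h1": b"hello gfs"})
remote = Chunkserver("cs-remote", 2, {"h1": b"hello gfs"})
master = Master({("/logs/a", 0): ("h1", [remote, local])})
print(Client(master).read("/logs/a", 6, 3))  # -> b'gfs', streamed from the closest replica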
Writing is a bit more complex. Because GFS uses replication to improve reliability and read performance, consistency has to be taken into consideration. Each chunk has a primary copy selected by the master, and every file modification goes through the primary copy so that the operations on the chunk are ordered properly. In order to provide high performance, ordinary writes that update a certain region of an existing file are supported, but the system is more optimized for appends, which are used when an application wants to add a record at the end of a file without caring about the record's exact location. The primary copy of the last chunk of that file decides the location where the record is written. Record appends are therefore atomic operations and the system can serve a large number of concurrent operations, because the order is decided at the writing site and no further synchronization is required. Before an operation is acknowledged to the application, the primary copy pushes it to all the secondary copies to ensure all copies stay the same.
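As a toy illustration of this record-append ordering, consider the sketch below with invented Replica and PrimaryReplica classes: the primary alone chooses the offset of each appended record and pushes the mutation, in that order, to every secondary before acknowledging, so all copies end up identical without global synchronization. It is only a sketch of the idea, not GFS's actual protocol (leases, retries and padding are omitted).

class Replica:
    def __init__(self):
        self.data = bytearray()
    def apply(self, offset, record):
        # secondaries apply mutations in exactly the order chosen by the primary
        end = offset + len(record)
        if len(self.data) < end:
            self.data.extend(b"\x00" * (end - len(self.data)))
        self.data[offset:end] = record

class PrimaryReplica(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_offset = 0
    def record_append(self, record):
        offset = self.next_offset        # the primary alone decides where the record goes
        self.next_offset += len(record)
        self.apply(offset, record)
        for s in self.secondaries:       # push the mutation to every secondary before acknowledging
            s.apply(offset, record)
        return offset                    # the application learns where its record ended up

secondaries = [Replica(), Replica()]
primary = PrimaryReplica(secondaries)
for rec in (b"rec-A;", b"rec-B;"):
    primary.record_append(rec)
assert all(r.data == primary.data for r in secondaries)  # all copies are the same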
Fault tolerance is one of the key features of GFS. No hardware is trusted, so the software needs to deal with all kinds of failures. File chunks are replicated across multiple nodes and racks. Checksums are used heavily to rule out data corruption. The master replicates its state and logs so that, in case of failure, it can restore itself locally or on other nodes. Shadow masters provide read-only access during the master's failure. Servers are designed for fast recovery, and downtime can be reduced to a few seconds.
Figure 1 Basic infrastructure of GFS
In addition, there are other useful functionalities, such as snapshots, garbage collection, integrity tests, re-replication and re-balancing. The structure of GFS is shown in Figure 1.
2.1.2 Bigtable
GFS is a reliable, high performance distributed file system that serves raw files, but many applications need structured data resembling a table in a relational database. Bigtable [6], also from Google, fulfills this need. HBase [14] from Hadoop is an open source version.
Figure 2 An example table that stores Web pages
In Bigtable, tables are not organized in a strict relational data model; instead, each table has one row key and an unfixed, unlimited number of columns, and each field has multiple versions indexed by timestamp. This data model supports dynamic data control and gives applications more choices in how to express their data. Internally, Bigtable is a distributed, efficient map from row key, column name and timestamp to the actual value in that cell. Columns are grouped into column families, which are the basic units of access control. Bigtable is sorted in lexicographic order by row key, and dynamically partitioned into tablets, each holding several adjacent rows. This design exploits data locality and improves overall performance. An example is shown in Figure 2, which is a small part of a Web page table. The row has the row key "com.cnn.www", two column families "contents" and "anchor", and the "contents" value has three versions.
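A minimal sketch of this data model, using a plain in-memory dict (the real system distributes and persists the map): cells are addressed by row key, column name in "family:qualifier" form, and timestamp, and a read returns the newest version by default. The class and column names here are invented for illustration.

import bisect

class Table:
    """Toy Bigtable-style map: (row key, column, timestamp) -> value."""
    def __init__(self):
        self.cells = {}          # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), [])
        if timestamp is None:                       # default: latest version
            return versions[-1][1] if versions else None
        older = [v for t, v in versions if t <= timestamp]
        return older[-1] if older else None

t = Table()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", 2, "CNN")
print(t.get("com.cnn.www", "contents:"))                # newest contents version
print(t.get("com.cnn.www", "contents:", timestamp=2))   # the version as of time 2 -> v1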
Bigtable is built on top of several other Google infrastructures. It uses GFS as persistent storage for data files and log files. It depends on a cluster management system to schedule resources, relies on Chubby [5] as the lock service provider and meta-data manager, and runs in clusters of machine pools shared with other applications.
In the implementation, there are one master and many tablet servers. The master takes charge of the whole system, and tablet servers manage the tablets assigned to them by the master. Tablet information is stored in meta-data indexed through a specific Chubby file. Data in a tablet are stored in two parts: the old values, which are immutable and stored in a special SSTable file format sorted by row key, and a commit log of mutations on the tablet. Both kinds of files are stored in GFS. When a tablet server starts, it reads the SSTables and the commit log for the tablet, and constructs a sorted buffer named "memtable" holding the most recent view of the values. When a read operation is received, the tablet server searches the merged view of the latest tablet state from both the SSTable files and the commit log, and returns the value. When a write operation is received, the operation is written to the log file, and the memtable is modified accordingly. The whole process is shown in Figure 3.
Figure 3 Bigtable tablet representation
Many refinements are used to achieve high performance and reliability. Compression is applied to save storage space and speed up transfers. Caching is used heavily on both the server side and the client side to relieve the load on the network and disks. Because most tables are sparse, Bloom filters are used to speed up searches for non-existent fields. Logs on a tablet server are co-mingled into one, and tablet recovery is designed to be rapid to minimize downtime.
2.2 Data Manipulation
Making good use of data involves more than just storing it. People could write dedicated distributed programs for a particular kind of processing, but such programs are hard to write and maintain, and each has to deal with data distribution, scheduling, failure detection and recovery, and machine communication. A central framework can provide these common features, so that users can rely on it and concentrate only on the logic unique to their jobs. Here two such systems are analyzed in detail.
2.2.1 MapReduce

MapReduce [9] is a powerful programming model for distributed massive scale data processing. Originally designed and implemented at Google, MapReduce is now a hot topic and is applied in fields it was not originally intended for. Hadoop also has its own open source MapReduce version. MapReduce is built on top of GFS and Bigtable, and uses them as input source and output destination.
MapReduce's programming model comes from functional languages and consists of two functions: map and reduce. The map function takes input in the form of <key, value> pairs, does some computation on a single pair and produces a set of intermediate <key', value'> pairs. All the intermediate pairs are then grouped and sorted. The reduce function takes an intermediate key and the list of values for that key as input, does some computation and writes out the final <key'', value''> pairs as the result. A lot of practical jobs can be expressed in this model, including grep, inverted lists of web pages and PageRank computation.
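A minimal word-count sketch of this model (not Hadoop's actual API): map_fn emits <word, 1> pairs for each input record, a single-process dictionary stands in for the distributed group-and-sort step, and reduce_fn sums the values for each key.

from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused), value: document text
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: the list of counts emitted for that word
    yield key, sum(values)

def run_job(inputs):
    groups = defaultdict(list)                 # stand-in for the shuffle/sort phase
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k in sorted(groups) for kv in reduce_fn(k, groups[k]))

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
print(run_job(docs))   # e.g. 'the' -> 3, 'fox' -> 2, every other word -> 1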
Google's MapReduce is implemented in a master-slave architecture. When a new job starts, it generates a master, a number of workers controlled by the master, and some number of mapper and reducer tasks. The master assigns mapper and reducer tasks to free workers. Mapper tasks write their intermediate results to their local disks and notify the master of their completion, and the master informs the reducer tasks to fetch the map outputs. When some tasks finish, if there are still unassigned tasks, the master continues its scheduling. Only when all mapper tasks have finished are the reducer tasks started. After all reducers have finished, the job is complete and the result is returned to the client. During job execution, if any worker dies, all the tasks on that worker are marked as failed and re-scheduled later until they finish successfully. Figure 4 shows an example with an input file of 5 splits, 4 mappers and 2 reducers. Note that Hadoop's version of MapReduce differs slightly from Google's, as we will show later.
Figure 4 Basic structure of MapReduce framework
Other researchers have proposed several enhancements to MapReduce. Traditional MapReduce focuses on long-running, data-intensive batch jobs that aim at high throughput rather than short response time. Outputs of both the map phase and the reduce phase are written to disk, either the local file system or the distributed file system, to simplify fault tolerance. MapReduce Online [8] was proposed to allow data to be pipelined between phases and between consecutive jobs. Intermediate <key, value> pairs are sent to the next operator soon after they are generated, from mappers to reducers in the same job, or from reducers in one job to mappers in the next. Because a reducer executes with only a portion of all intermediate results, the final results are not always accurate; MapReduce Online therefore takes snapshots as the mappers proceed, runs reducers on these snapshots and approximates the real answer. Many refinements are used to improve performance and fault tolerance, and to support online aggregation and continuous queries.
Although MapReduce can express many algorithms in areas such as information retrieval and machine learning, it is hard to use in some database settings, especially table joins. Map-Reduce-Merge [33] extends the original model by adding a third phase, the merge phase, at the end of the reduce phase. A merger takes the resulting <key, value> pairs from the reducers of two MapReduce jobs, runs default or user-defined functions, and generates the final output files. With the help of the merge phase and some operators, Map-Reduce-Merge is able to express many relational operators such as projection, aggregation, selection and set operations. More importantly, it is able to run most join algorithms, such as sort-merge join, hash join and nested-loop join.
Job scheduling is an important factor in MapReduce's performance. Hadoop assumes that all nodes are the same, which results in bad performance in heterogeneous environments, such as virtual machines in Amazon's EC2 [2], where the performance of machines can differ significantly. Speculative tasks are one of the reasons. [36] proposes a new scheduling algorithm, Longest Approximate Time to End (LATE), that fits heterogeneous environments well and launches speculative tasks more accurately. The key idea is to estimate which task will finish farthest in the future. LATE uses a simple heuristic that assumes a task's progress rate is constant: the progress rate is calculated from the completed fraction of the work and the elapsed time, and the remaining time is then the remaining fraction divided by this rate. The task estimated to finish last determines the job's response time, and is therefore the first one to re-execute if speculative tasks are needed. Some tuning parameters are used to further enhance this strategy.
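The estimate itself can be sketched in a few lines; the function names and the example numbers below are made up, and the real LATE scheduler adds thresholds and caps on the number of speculative copies.

def time_left(progress, elapsed):
    # constant-progress-rate assumption: rate = progress / elapsed
    rate = progress / elapsed
    return (1.0 - progress) / rate

def pick_speculative_task(tasks):
    """tasks: list of (task_id, progress in [0,1], elapsed seconds).
    Returns the running task expected to finish farthest in the future."""
    running = [t for t in tasks if 0 < t[1] < 1]
    return max(running, key=lambda t: time_left(t[1], t[2]))[0] if running else None

tasks = [("m1", 0.9, 90),    # fast task, almost done (about 10 s left)
         ("m2", 0.3, 90),    # straggler: same elapsed time, far less progress (about 210 s left)
         ("m3", 1.0, 60)]    # already finished
print(pick_speculative_task(tasks))   # -> 'm2', the best speculation candidate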
In order to use MapReduce more conveniently, other applications have been built on top of it to provide simpler interfaces. Hive [26] is a data warehousing solution built on Hadoop. It organizes data in relational tables with partitions, and uses the SQL-like declarative language HiveQL as its query language. Many database operations are supported in Hive, such as equi-join, selection, group-by, sort-by and some aggregation functions. Users can define and plug in their own functions, further extending Hive's functionality. Its ability to translate SQL into directed acyclic graphs of MapReduce jobs lets ordinary database users analyze extremely large datasets, so it is becoming more and more popular.
Trang 23Pig Latin [20] is a data processing language for large dataset, and can be compiled into asequence of MapReduce jobs Not like SQL, Pig Latin is a hybrid procedural query languageespecially suitable for programmers, and users can easily control its data flow It employs a flexiblenested data model, and internally supports many operators such as co-group and equi-join LikeHive it is easily extensible using user-defined functions Furthermore, Pig Latin is easy to learnand easy to debug.
Sawzall [22] was designed and implemented at Google and runs on top of Google's infrastructure, such as protocol buffers, GFS and MapReduce. Like MapReduce, Sawzall has two phases: a filtering phase and an aggregation phase. In the filtering phase records are processed one by one and emitted to the aggregation phase, which performs aggregation operations such as sum, maximum and histogram. In practice Sawzall acts as a wrapper around MapReduce, and presents a view of pure data operations.
MapReduce has also been introduced into areas other than large scale data analysis. As multi-core systems become popular, how to easily exploit multiple cores is a hot topic, and MapReduce is useful here as well. [23] describes the Phoenix system, a MapReduce runtime for multi-core and multi-processor systems using shared memory. Like MapReduce, Phoenix consists of many map and reduce workers, each of which runs in a thread on a core. Shared memory is used as the storage for intermediate results, so that data need not be copied, which saves a lot of time. At the end of the reduce phase, the outputs from different reducers are merged into one output file in a bushy-tree fashion. Phoenix provides a small API that is flexible and easy to use.
Phoenix shows a new way to apply MapReduce in a shared memory system, but it may perform badly in a distributed environment. Phoenix Rebirth [34] revises the original Phoenix into a new version for NUMA systems. It utilizes locality information when making scheduling decisions to minimize remote memory traffic. Mappers are scheduled onto machines that hold the data or are near the data. Combiners are used to reduce the size of mapper outputs, and therefore reduce the remote memory demand. In the merge phase, merge sort is performed first within a locality group, and then among different groups.
Another area is general purpose computation on graphics processors. GPUs are being used in high-performance computing because of their high internal parallelism. However, programs for GPUs are hard to write and not portable. Mars [17] implements a MapReduce framework on GPUs that is easy to use, flexible and portable. During execution, inputs are prepared in main memory, copied to the GPU's device memory, and then mappers are started on the GPU's hundreds of cores. After the mappers complete, reducers are scheduled. Finally the outputs are merged into one and copied back to main memory. Mars exposes a small API and yet achieves performance that is sometimes 16 times faster than its CPU based counterpart.
2.2.2 Dryad

Although MapReduce is now widely adopted in many areas, it is awkward for expressing some algorithms, such as large graph computations, which are needed in many cases. Dryad [18] can be seen as an extended MapReduce that enables users to control the topology of the computation. Dryad is a general purpose, data parallel, low level distributed execution engine. It organizes its jobs as directed acyclic graphs, in which nodes are simple computation programs that are usually sequential, and edges are data transmissions between nodes.
Creating a Dryad job is easy. Dryad uses a simple graph description language to define a graph. This description language has 8 basic topology building blocks, which are sufficient to represent all DAGs when combined. Users define computation nodes by inheriting a C++ base node class and integrating them into the graph. Edges in the graph have three forms: files, TCP connections and shared memory. An example job DAG is shown in Figure 5. Other tedious work, such as job scheduling, resource management, synchronization, failure detection and recovery, and data transportation, is performed internally by the Dryad framework itself.
The system architecture is again a single-master, multiple-slave style to ensure efficiency. The execution of a Dryad job is coordinated by the job manager, which also monitors the running states of all slaves. When a new job arrives, the job manager starts the computation nodes that take their input directly from files, according to the job's graph description. When a node finishes, its output is fed to its child nodes. The job manager keeps looking for nodes that have all their inputs ready, and starts them in real time. If a node fails, it is restarted on another machine. When all computation nodes finish, the whole job finishes, and the output is returned to the user.
Figure 5 A Dryad job DAG
A lot of optimization is needed to make Dryad useful in practice. First of all, the user-provided execution plan may not be optimal and may need refinement. For example, if a large number of vertices in the execution graph aggregate into a single vertex, that vertex may become a bottleneck.
At run time, these source vertices may be grouped into subsets, with corresponding intermediate vertices added in the middle. Second, the number of vertices may be far larger than the number of available machines, and these vertices are not always independent, so how to map the logical vertices onto physical resources is of great importance. Some vertices are so closely related that it is better to schedule them on the same machine or even in the same process. Vertices can run one after another, or at the same time with data pipelined in between. Moreover, there are three kinds of data transmission channels: shared memory, TCP network connections and temporary files, each with different characteristics; using an inappropriate channel can cause large overhead. Last but not least, how failure recovery is accomplished affects the whole system. Note that these optimization decisions are correlated. For example, if some vertices are executed in the same process, the channels between them should be shared memory to minimize overhead, but failure recovery becomes more complex: once one vertex fails, the other vertices in the same process need to be re-executed as well, because the intermediate data are stored in memory and are volatile to failures.
Although Dryad is powerful enough to express many algorithms, it is too low level for daily work: Dryad users still need to consider details of the computation topology and data manipulation. Therefore some higher level systems have been designed and implemented on top of Dryad. The Nebula language [18] is one example. Nebula exposes Dryad as a generalization of a simple pipelining mechanism, providing developers a clean interface and hiding Dryad's unnecessary details. Nebula also has a front-end that integrates Nebula scripts with simple SQL queries. DryadLINQ [35] is an extension on top of Dryad, aiming to give programmers the illusion of a single powerful virtual computer so that they can focus on the primary logic of their applications. It automatically translates high level sequential LINQ programs, which have an SQL-style syntax and many extra enhancements, into Dryad plans, and applies both static and dynamic optimizations to speed up their execution. A strongly typed data model is inherited from the original LINQ language, and old LINQ programs that were written for traditional relational databases can now deal with extremely large volumes of data on Dryad clusters without any change. Debugging environments are also provided to help developers.
3 System Description
In this section we first investigate the fundamental behavior of Hadoop MapReduce, followed by a categorization of typical workloads. We then describe the experimental setup, and plot a set of experimental results as a preliminary impression of MapReduce's performance.
3.1 Architecture of Hadoop MapReduce

Hadoop [16] is a suite of distributed processing software that closely mimics its counterparts in Google's system. In this study we choose Hadoop MapReduce along with the Hadoop Distributed File System (HDFS), and analyze their architectures here in order to gain enough insight to set up the environment for the model.
HDFS is organized in a single-master, multiple-slave style. The master is called the Namenode; it maintains the file system structure and controls all read/write operations. The slaves are called Datanodes; they store the actual data and carry out the read/write operations. As stated in Section 2.1.1, file data are stored in blocks of fixed size, which improves scalability and fault tolerance. The available functionality is deliberately confined to keep the system simple and efficient.
When a read operation arrives, the Namenode first checks its validity, and redirects the operation to a list of corresponding Datanodes according to the file name and the offset inside the file. The sender of the operation then contacts one of those Datanodes for the data it requires. The Datanode closest to the sender is chosen first, in special cases the sender itself, so as to save network bandwidth. If the current block is exhausted, the next block is chosen by the Namenode, and the operation restarts from the new offset at the beginning of that block.

When a write operation arrives, the Namenode again checks its validity, and chooses a list of Datanodes to store the different replicas of the written data. The sender streams the data to the first Datanode, the first Datanode streams the same data to the next Datanode at the same time, and so on until no data is left. If the current block is full, a new list of Datanodes is chosen to store the remaining data, and the operation restarts.
Hadoop MapReduce is built on top of HDFS, and it similarly has a master-slave architecture. Its master is called the Jobtracker, which controls the progress of a job, including submission, scheduling, cleaning up and so on. The slaves are called Tasktrackers, which run the tasks assigned to them by the Jobtracker. To keep the description concise, only the major actions are shown.
When a new job arrives, the Jobtracker sets up the data structures needed to keep track of the job's progress, and then initializes the right number of mappers and reducers, which are put into the pool of available tasks. The scheduler monitors that pool and allocates new tasks to free Tasktrackers. Many strategies can help the scheduling, such as exploiting the locality of input files and rearranging tasks in a better order. If no Tasktracker is available, the new tasks are queued up until some Tasktracker finishes an old task and is ready for a new one.
When a Tasktracker receives a task from the Jobtracker, it spawns a new process to run the actual code for that particular task, collects the running information and sends it back to the Jobtracker. Depending on the specific configuration of the task, it may read data from the local disk or remote nodes, compute the output and write it to the local disk or HDFS. Three important components are therefore involved: the CPU, the local disk and the network interface card.
According to MapReduce's topology, a job ideally has two phases: map and reduce. After careful study, however, we find two synchronization points, one after the mappers and one after the reducers: all reducers start only after all mappers finish, and the job result is returned only after all reducers finish. In this work we focus on average performance, and the randomness of the response times of individual map or reduce tasks makes the average task time insufficient to calculate the total job time. We measure this randomness by the difference ∆ between the response time of a job and the average times of its map and reduce phases. Furthermore, a job has to wait at the master node if there are currently no free slots for new jobs; we call this waiting time the fourth part. In total we have four parts: map, reduce, ∆ and waiting.
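As a small illustration of this decomposition (with made-up task times, and assuming the job's response time is simply the sum of the four parts), ∆ captures how much the synchronization barriers after the slowest mapper and the slowest reducer add on top of the average phase times:

def job_response_time(avg_map, avg_reduce, delta, waiting):
    # the four parts described above: map, reduce, the randomness term delta, and waiting
    return avg_map + avg_reduce + delta + waiting

map_times = [10, 12, 11, 19]     # seconds per map task (one straggler)
reduce_times = [20, 24]          # seconds per reduce task
avg_map = sum(map_times) / len(map_times)            # 13.0
avg_reduce = sum(reduce_times) / len(reduce_times)   # 22.0

# With the two synchronization barriers, a job cannot finish before its slowest
# mapper and slowest reducer; delta is the gap between those maxima and the averages.
delta = (max(map_times) + max(reduce_times)) - (avg_map + avg_reduce)   # 8.0

print(job_response_time(avg_map, avg_reduce, delta, waiting=0.0))       # 43.0 seconds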
3.2 Representative Workload

MapReduce is powerful in its expressiveness, especially for large scale problems. Typical workloads include sorting, log processing, index building and database operations such as selections, projections and joins. In order to make our analysis applicable to a variety of scenarios, we divide these workloads into three types according to the relation between input and output, because most MapReduce applications are I/O bound programs, meaning that they require more disk or network I/O time than CPU time.
The first type is large-in-small-out, in which a large amount of data is read and processed but only a small amount of output is generated. Examples include log item aggregation and table selections in a database. The second type is large-in-large-out, in which a large amount of data is read and a comparable amount of output is generated. Examples include sorting, index building and table projection in a database. The last type is small-in-large-out, in which only a small amount of input is needed but a large amount of output can be generated. Examples include random data generation and some table joins in a database.
To reach an accurate model step by step, we first build the model for the first type of workload, and then verify and extend the model to incorporate the other two types. The workload we choose to set up the basic model is a random SQL query performing an equi-join of two tables with projection and selection, shown in Listing 1. The two chosen input tables are large, and the size of the output is negligible. In this query the value of u_name is chosen randomly so that the queries differ from each other, the way a real workload would look.
Listing 1 Workload query
3.3 Experimental Setup and Measured Results

All experiments are run on an in-house cluster with at most 72 working nodes, although we do not use all of them all the time. Each node has a 4-core CPU, 8GB of main memory, a 400GB disk and a Gigabit Ethernet network card.
In order to get a complete overview of MapReduce's performance, we arrange the experiments in the following systematic way. First, the limit on concurrent tasks is raised high enough to support more concurrent jobs. Then, in each experiment, the number of concurrently running jobs is fixed, and as the jobs run we measure the numbers of maps, reduces and shuffles, and the amount of time spent in each phase. Each experiment may be run multiple times and averaged to improve accuracy. We then vary the job concurrency to obtain the whole curve, which shows how the performance changes with the workload. After getting the curve for one setting, we change the number of nodes used, or the system parameters, to observe the effects of different settings.
The usual patterns for the throughput, the time spent in each phase and the number of tasks in each phase are shown in Figures 6, 7 and 8 respectively. For different settings the specific numbers in these curves may differ, but the shapes are similar. In the throughput curve of Figure 6, the throughput first increases almost linearly, then gradually decelerates to reach a maximum, after which it drops a little and remains steady. The last changing point is at concurrency 40 in this example. If we dissect the running time into the four parts mentioned earlier, we get Figure 7. The first two parts (map and reduce) have a similar pattern: a linear increase followed by a steady constant. The ∆ part is different; it stabilizes after an exponential-like increase. The waiting part remains 0 and then increases linearly after a turning point. One thing to notice is that the turning points of these four parts coincide, at concurrency 40 in this example, which is also where the throughput turns. Finally, in Figure 8, which shows the numbers of concurrent tasks, the two phases have a similar pattern, a linear increase followed by a constant, and the turning points are the same as in Figures 7 and 6. However, not all workloads produce exactly the same curves; the extent of the performance drop depends on the system parameters and the characteristics of the workload, and may disappear in some scenarios. In the next chapter we explain why the curves have these shapes.
Figure 6 Example figure of throughput
Figure 7 Example figure of times of each phase
Figure 8 Example figure of numbers of tasks in each phase
4 Model Description
This section starts with the assumptions and reviews related theory in analytical performance modelling. The model is then developed, in three major parts, in the form of an open system. Finally the whole model is assembled, and a corresponding formula is derived for closed systems.

4.1 Assumptions
Assumption 4. Individual jobs are not too large, and their tasks can run at the same time. MapReduce jobs could be very large and take, for example, hours to finish, but it is not realistic to treat such long jobs as failure free. Because our model does not consider failures, due to limits of time and budget, our primary focus is on the throughput of short running jobs, which take only several minutes to finish.
4.2 Related Theory Overview

Before we delve deep into the model, we first review some related theory that will be used later. One of the most important theorems in performance analysis is Little's Law, which has the form given in Theorem 1.
Theorem 1 (Little's Law). In the steady state of a system, N = XT, where N is the average number of jobs in the system, X is the throughput (equal to the arrival rate in steady state), and T is the average time a job spends in the system.
Theorem 2. For an M/M/1 queue with arrival rate λ and service rate µ, where λ < µ, the expected response time is T = 1/(µ − λ).

An open queueing model assumes that jobs arrive from an unbounded external population at a given rate; if the system is instead meant to support a fixed, large number of concurrent users, a closed model may be more suitable.
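As a small numeric illustration of the two theorems (the rates below are made up): a node that serves tasks at µ = 10 tasks/s and receives them at λ = 8 tasks/s has an expected response time of half a second, and by Little's Law holds 4 tasks on average.

lam, mu = 8.0, 10.0            # arrival and service rates (tasks per second), assumed values
T = 1.0 / (mu - lam)           # Theorem 2: expected response time of an M/M/1 queue
N = lam * T                    # Theorem 1 (Little's Law): average number of tasks in the node
print(T, N)                    # 0.5 seconds per task, 4 tasks in the system on average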
The classic M/M/1 queue uses the First Come First Served (FCFS) discipline, which is simple and easy to study. Processor sharing is another useful discipline, in the sense that many systems behave in a processor sharing pattern, such as modern time-sharing operating systems. In such systems, the server divides its computing power evenly among all the customers present. Furthermore, if there are multiple classes of customers in a queueing system, the analysis becomes much more complicated. In [1] the authors summarize several models for processor sharing queues, and here we use the equations relating the response time to the arrival rate, unconditioned on the number of jobs in the queue, as described in Theorem 3.
Theorem 3. In a queueing system with K classes of customers in total and the processor sharing discipline, let T_k be the expected response time of a class-k customer, λ_k its arrival rate and µ_k its service rate. If the service requirements are exponentially distributed, then the T_k satisfy a system of K linear equations of the form

    T_k = B_k0 + Σ_{j=1}^{K} ( ... ),    k = 1, ..., K,    (5)

where the summation couples T_k to the arrival rates λ_j and the response times of the other classes; the full expression is given in [1].
Large and complex systems, such as the Internet with its countless routers, switches and end hosts, are usually hard to model with the aforementioned techniques because of the large number of sub-systems they contain. Bottleneck analysis [25] is helpful in this scenario. Among all sub-systems, such as all the links in the Internet, the one with the highest utilization is called the bottleneck, and the bottleneck defines a performance bound for the whole system. For example, if an end user A in the Internet sends data to another user B, the data transfer speed is limited by the speed of the bottleneck link between A and B. A model of the bottleneck sub-system is a good approximation of the whole system, accurate enough in many different cases.
4.3 Notation Table
Table 1 shows the symbols and their descriptions that will be used throughout the thesis. Several less important notation symbols are introduced where they are needed.
X Job arrival rate (in open systems) or throughput (in closed systems)
Table 1 Symbols and notations
4.4 Disassembled Sub-models

In this subsection we first model the average performance of individual tasks, and then use their random behavior to calculate the response time of an ordinary job. Finally we consider the waiting time incurred when existing jobs occupy all free slots and newly arriving jobs have to wait.
4.4.1 The Average Task Performance
Returning to the MapReduce framework, a job is decomposed into many tasks that run on slave nodes, and these tasks read from and write to HDFS files spread across all nodes. As a result, every job is potentially related to every node, which means that the busiest slave node is potentially the slowest point for all jobs. Intuitively, from Section 4.2 we know the tasks on the busiest node are the slowest tasks, so the busiest node is the bottleneck of the whole MapReduce framework; if we model the busiest node accurately, we have a model for the slave nodes. Therefore, we first focus on the model for a single node, then use the parameters of the bottleneck node to directly calculate the performance of that particular node, and indirectly the performance of the slave nodes as a whole. We introduce a parameter p to represent this imbalance, defined in Equation 8:
    p = (the amount of work on the slowest node) / (the average amount of work per node),    (8)
where the amount of work is the number of running operating system processes, including MapReduce mapper and reducer tasks, their management processes, and the distributed file system processes. The more running processes a machine has, the slower each of these processes gets. This cluster imbalance factor p is affected by the type of work and the cluster size, which will be discussed later.
When a slave node receives a new task, it sets up the environment, initializes the task, and then launches a new process to run it. Although all parts of a computer system, such as the CPU, memory, disk and network interface, are involved in the execution of tasks, we treat the node as a black box to simplify the problem; we will later validate this simplification using measured data. Slave nodes are usually managed by a modern time-sharing operating system, Linux in our case. There are two types of tasks running on a slave, as mentioned before, and therefore a multiclass processor sharing model is a reasonable fit. Theorem 3 gives the precise equations that will be used later. Although the equations in (5) do not necessarily imply that the performance curve is linear as in Figure 7, we will show later that a small modification validates this model for our system.
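As a simplified illustration of this node-level view (not the thesis's modified model), the sketch below uses the classical egalitarian processor sharing result, in which every class's mean response time is its isolated service time inflated by 1/(1 − ρ), with ρ the total utilization over all classes; the class names and rates are made up.

def ps_response_times(classes):
    """classes: dict name -> (arrival rate, service rate) for one slave node.
    Returns the expected response time per class under egalitarian processor sharing."""
    rho = sum(lam / mu for lam, mu in classes.values())   # total node utilization
    if rho >= 1.0:
        raise ValueError("node is saturated (utilization >= 1)")
    return {name: (1.0 / mu) / (1.0 - rho) for name, (lam, mu) in classes.items()}

# hypothetical per-node rates for the two task classes running on a slave
node_classes = {"map": (0.04, 0.10),      # tasks/s arriving, tasks/s served in isolation
                "reduce": (0.01, 0.05)}
print(ps_response_times(node_classes))    # every class is slowed by the same factor 1/(1 - rho)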