NATIONAL UNIVERSITY OF SINGAPORE

Performance Analysis of MapReduce Computing Framework

FOR THE DEGREE OF MASTER OF SCIENCE

September 2011
Performance Analysis of MapReduce Computing Framework

Hou Song
songhou@comp.nus.edu.sg
Abstract
MapReduce is an increasingly popular distributed computing framework, especially for large-scale data analysis. Although it has been adopted in many places, a theoretical analysis of its behavior is lacking. This thesis introduces an analytical model for MapReduce with three parts: the average task performance, the random behavior and the waiting time. The model is then validated using measured data from three categories of workloads. Its usefulness is demonstrated by three optimization processes, which give reasonable conclusions that nevertheless differ from current understandings.
Acknowledgments

I would like to express my gratitude to my supervisor, Professor Tay Yong Chiang, for his countless guidance and advice. Without his help I could not have completed this thesis in time. I would also like to thank my University for supporting me financially, without which I could not have survived.
My parents and brother always trust me and give me continuous support. I owe them so much.
Finally, I would like to thank Shi Lei, Vo Hoang Tam and many other lab mates for their generous reviews and suggestions.
Table of Contents

Acknowledgments
List of Figures
List of Tables
Summary
1 Introduction
  1.1 Background
  1.2 Motivation for an Analytical Model
  1.3 Overview
2 Related Work
  2.1 Data Storage
    2.1.1 The Google File System
    2.1.2 Bigtable
  2.2 Data Manipulation
    2.2.1 MapReduce
    2.2.2 Dryad
3 System Description
  3.1 Architecture of Hadoop MapReduce
  3.2 Representative Workload
  3.3 Experimental Setup and Measured Results
4 Model Description
  4.1 Assumptions
  4.2 Related Theory Overview
  4.3 Notation Table
  4.4 Disassembled Sub-models
    4.4.1 The Average Task Performance
    4.4.2 The Random Behavior
    4.4.3 The Waiting Time
  4.5 The Global Model
5 Model Validation
  5.1 Database Query
  5.2 Random Number Generation
  5.3 Sorting
6 Model Application
  6.1 Procedures of Optimization using Gradient Descent
  6.2 Optimal Number of Reducers Per Job
  6.3 Optimal Block Size
  6.4 Optimal Cluster Size
  6.5 Summary
List of Figures

1 Basic infrastructure of GFS
2 An example table that stores Web pages
3 Bigtable tablet representation
4 Basic structure of MapReduce framework
5 A Dryad job DAG
6 Example figure of throughput
7 Example figure of times of each phase
8 Example figure of numbers of tasks in each phase
9 Queueing model of a slave node
10 Histograms of tasks' response time
11 Example figure of randomness T∆
12 Regular pattern for T∆
13 Transformation procedure to get equation for T∆
14 Measured time vs. calculated time for database query
15 Measured ∆ vs. calculated ∆ for database query
16 Measured response time vs. calculated response time for database query
17 Measured time vs. calculated time for random number generation
18 Measured ∆ vs. calculated ∆ for random number generation
19 Measured response time vs. calculated response time for random number generation
20 Measured time vs. calculated time for sorting
21 Measured ∆ vs. calculated ∆ for sorting
22 Measured response time vs. calculated response time for sorting
23 An example of gradient descent usage
24 The process of gradient descent for the optimization of Hr for sorting
25 The gradients for the optimization of Hr for sorting
26 The trend of gradients for the optimization of Hr for sorting
27 The process of gradient descent for the optimization of Hr for database query
28 The gradients for the optimization of Hr for database query
29 The process of gradient descent for the optimization of block size for sorting
30 The gradients for the optimization of block size for sorting
31 The trend of gradients for the optimization of block size for sorting
32 The process of gradient descent for the optimization of block size for database query
33 The gradients for the optimization of block size for database query
34 The process of gradient descent for cluster size for database query
35 The gradients for the optimization of cluster size for database query
36 The process of gradient descent for cluster size for sorting
37 The gradients for the optimization of cluster size for sorting
List of Tables

1 Symbols and notations
2 System default values
3 System optimization conclusions
1 Introduction

The problems people are trying to solve are growing far beyond the capability of a single computer, and distributed computing is an inevitable direction. For example, Internet companies use tens of thousands of machines to process an enormous number of concurrent user requests. MapReduce is a useful and popular distributed computing framework that is widely adopted in both industry and academia, because it is simple to use yet able to provide good scalability and high performance. However, its performance has not been fully studied yet. There are papers that try to improve MapReduce's design and implementation through experiments or simulations, but no one has yet proposed an analytical model, which can overcome the weaknesses of the first two methods.
This thesis investigates the details of MapReduce and proposes an analytical model based on a categorization of typical workloads. The model consists of three parts: the average task performance, which is a modified multiclass processor sharing queue; the random behavior, which is a curve fitted from common observations; and the waiting time, which uses a deterministic waiting equation. The model is then validated using measured data from all three workload categories. Finally, the model is applied in three optimizations, demonstrating its usefulness in configuring MapReduce computations. The conclusions of these optimizations provide new understandings of MapReduce's performance behavior.
There are various proposed frameworks and tools intended to help developers. Message Passing Interface (MPI) [12] is a successful communication protocol with a wide range of adoption. It provides convenient ways to send point-to-point or multicast messages. MPI is scalable and portable, and remains dominant for high performance computing systems that focus on raw computation, such as traditional climate simulation. However, it does not remove the difficulty that developers still have to resort to low-level primitives to accomplish complex logic such as synchronization, failure detection and recovery, comparable to writing a sequential program in assembly language. These issues make a distributed system hard to design and implement, and tricky to verify for correctness. Many of these aspects, however, share common operations that could be provided by the underlying system, thus relieving the burden on programmers.

From a broader point of view, there are two types of high performance computer systems: one for raw computation power and the other for data processing. The first type has a longer history, with its concentration on the number of calculation operations per second; the systems in the TOP500 list [28] are good examples. As people collect and generate more and more data, such as Internet web pages, photos and videos on social network sites, health records, telescope imagery, transaction logs and so on, automatic processing of these data using large computer systems is in high demand. For example, successful data mining of transaction records from a supermarket can give the manager a better understanding of the business and its customers,
therefore improving the business to a new level. Fast and accurate processing of telescope images may lead to breakthrough scientific discoveries. Database systems are designed to manage large data, but up to 70% to 80% of online data are unstructured and may be used only a few times, and their processing is inefficient, or even infeasible, with existing commercial databases [7]. New systems are being designed [4, 11], and companies such as Google, Microsoft and Amazon have turned these designs into commercial systems that operate tens of thousands of computers. The accumulated power of these low-end computers makes it possible to analyse the whole Internet in a timely fashion, support large transaction systems, and much more.
This new trend is also attracting attention from smaller companies and researchers, who do not have access to large computing infrastructures like those of Google and Microsoft. However, in the era of cloud computing, people can rent machines in remote clouds and start their own cluster at a very low price. The immediate question is then: given the workload and service level objectives, how many machines are needed? Other challenges include cluster parameter optimization, system upgrading, scheduler design decisions, and cluster sharing. To answer these questions, engineers and researchers need to understand the relationship between system performance, system parameter settings and the characteristics of the workload, which is the major research direction of this thesis.
Although there are many such successful systems, their performance has not been studied deeply, especially for the newly established massive-scale data processing systems. MapReduce [9], the distributed computing framework originally from Google, is an example. It is very popular thanks to its expressive power and simplicity of use. There is some work [24, 36, 29] trying to study and improve the performance of MapReduce, but to the best of our knowledge, no one has proposed an analytic model for it. This thesis focuses mainly on MapReduce, and tries to design an analytical model that characterizes its performance.
There are usually three ways to study a system: experiment, simulation and analytical modelling. Experiments are accurate, but sometimes too slow to be feasible. Simulation extracts only the necessary details and runs faster, but is still not practical when the parameter space is too large; for example, a distributed system could have hundreds of parameters. An analytical model describes a system abstractly with mathematical equations, taking system parameters as input and producing system performance metrics as output. A well developed analytical model can hide unnecessary details, explore the whole parameter space conveniently, and possibly unveil obscure truths that are impossible to obtain otherwise.
In this thesis we first study the details of Hadoop, an open source implementation of MapReduce, and categorize ordinary workloads into three groups according to the size of their input and output. We divide MapReduce execution into several pieces, and develop models for each of them. Then we validate the accuracy of this model using measured data. Finally we show its applications using three examples, which provide useful conclusions. For example, the common practice of using a larger block size to get a better sorting benchmark may lead to even longer response time, and MapReduce's scalability can be influenced by the design of MapReduce jobs, so improvements could be made according to this finding. The thesis concludes in Section 7, which also sets the plan for future work.
2 Related Work
Modern data processing systems are growing in both size and complexity, but their theory and guidelines do not change very frequently. The prediction from [10] remains valid today: large scale database systems should be built on a shared-nothing architecture made of conventional hardware. During the last few decades the shared-nothing architecture has developed quickly, attracting popularity from both academia and industry. A number of systems have been implemented in this area, which will be introduced later in this section.
Intuitively, a data processing system has two major parts: how data are stored and how they are manipulated. Therefore, this section on related work is organized into two corresponding subsections.
2.1 Data Storage

At the lowest level, data storage is a sequence of bytes kept on disks or other stable storage devices. However, bytes can be interpreted in different forms, ranging from simple bit streams to groups of records to nested objects. In order to control and share data more conveniently, database technologies were invented and are now widely deployed.
In recent years the requirements for data storage have become more demanding. These requirements include, but are not limited to, volumes that scale up to petabytes, concurrency control among billions of objects, fault tolerance and failover, and high throughput for both reads and writes. Traditional storage and database systems can become awkward when handling all of these, so new approaches have been proposed.

2.1.1 The Google File System
The Google File System (GFS) [13] was originally designed and implemented by Google to meet their needs. Many decisions and optimizations were made to fit the environment they were in. This environment is not unique to Google, and its open source counterpart, the Hadoop Distributed File System [16], is now widely used in many companies and institutes [15].
GFS is based on several assumptions. First, failures are the norm rather than the exception. Because of the large number of machines gathered together, expensive, highly reliable hardware does not help much. Instead, Google's system consists of thousands of machines built from commodity hardware, so fault tolerance is necessary. Second, the system should manage very large files gracefully. The system is able to store small files, but the priority is large files whose sizes are measured in gigabytes. Third, the workload is primarily one type of write and two types of reads: large sequential writes that append to the end of files, and large sequential reads or small random reads. Finally, the objective is sustainable performance for a large number of concurrent clients, with more emphasis on high overall throughput than on the response time of a single job [13].
Under these assumptions, GFS adopts a single-master, multiple-slave architecture as its basic form. Files are divided into chunks, 64MB in size by default, which are the unit of file distribution. The master maintains all the meta-data, such as the file system structure, the chunk identifiers of files, and the locations of chunks. It has all the information and controls all the actions. The slave nodes, called chunkservers, store the actual data in chunks assigned by the master, and serve them directly to clients. The master and chunkservers exchange all kinds of information, such as primary copy leases, meta-data updates and server registrations, periodically through heartbeats. The work of the master node is minimized to avoid overloading it.
When a client wants to read a file, it retrieves the chunk handles and locations from the master using the file name and the offset within the file. The client caches this piece of meta-data locally to limit the communication with the master. It then chooses the closest location among all possibilities (local or within the same rack), and initiates a connection with that chunkserver to stream the actual data.
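The read path can be summarized in a few lines of code. The following is a minimal sketch with toy in-memory Master, Chunkserver and Client classes and invented names, assuming the default 64MB chunk size; it only illustrates the flow described above (chunk lookup by offset, client-side meta-data caching, choosing the closest replica), not the actual GFS interfaces.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS default chunk size

class Master:
    """Holds only meta-data: (filename, chunk index) -> (chunk handle, replica locations)."""
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table
    def lookup(self, filename, chunk_index):
        return self.chunk_table[(filename, chunk_index)]

class Chunkserver:
    """Stores chunk data keyed by chunk handle."""
    def __init__(self, name, distance, chunks):
        self.name, self.distance, self.chunks = name, distance, chunks
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # meta-data cache to limit traffic to the master
    def read(self, filename, offset, length):
        idx = offset // CHUNK_SIZE                         # which chunk holds this offset
        if (filename, idx) not in self.cache:              # ask the master only on a cache miss
            self.cache[(filename, idx)] = self.master.lookup(filename, idx)
        handle, replicas = self.cache[(filename, idx)]
        closest = min(replicas, key=lambda s: s.distance)  # prefer the local / same-rack replica
        return closest.read(handle, offset % CHUNK_SIZE, length)

# toy usage: one file, one chunk, two replicas at different "distances"
local = Chunkserver("cs-local", 0, {"h1": b"hello gfs"})
remote = Chunkserver("cs-remote", 2, {"h1": b"hello gfs"})
master = Master({("/logs/a", 0): ("h1", [remote, local])})
print(Client(master).read("/logs/a", 6, 3))  # -> b'gfs', streamed from the closest replica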
Writing is a bit more complex. Because GFS uses replication to improve reliability and read performance, consistency has to be taken into consideration. Each chunk has a primary copy selected by the master, and every file modification goes through the primary copy so that the operations on the chunk are ordered properly. In order to provide high performance, ordinary writes that update a certain region of an existing file are supported, but the system is more optimized for appends, which are used when an application wants to add a record at the end of a file without caring about the record's exact location. The primary copy of the last chunk of that file decides the location where the record is written. Record appends are therefore atomic operations and the system can serve a large number of concurrent operations, because the order is decided at the writing site and no further synchronization is required. Before an operation is acknowledged to the application, the primary copy pushes it to all the secondary copies to ensure all copies stay the same.
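As a toy illustration of this record-append ordering, consider the sketch below with invented Replica and PrimaryReplica classes: the primary alone chooses the offset of each appended record and pushes the mutation, in that order, to every secondary before acknowledging, so all copies end up identical without global synchronization. It is only a sketch of the idea, not GFS's actual protocol (leases, retries and padding are omitted).

class Replica:
    def __init__(self):
        self.data = bytearray()
    def apply(self, offset, record):
        # secondaries apply mutations in exactly the order chosen by the primary
        end = offset + len(record)
        if len(self.data) < end:
            self.data.extend(b"\x00" * (end - len(self.data)))
        self.data[offset:end] = record

class PrimaryReplica(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_offset = 0
    def record_append(self, record):
        offset = self.next_offset        # the primary alone decides where the record goes
        self.next_offset += len(record)
        self.apply(offset, record)
        for s in self.secondaries:       # push the mutation to every secondary before acknowledging
            s.apply(offset, record)
        return offset                    # the application learns where its record ended up

secondaries = [Replica(), Replica()]
primary = PrimaryReplica(secondaries)
for rec in (b"rec-A;", b"rec-B;"):
    primary.record_append(rec)
assert all(r.data == primary.data for r in secondaries)  # all copies are the same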
Fault tolerance is one of the key features of GFS. No hardware is trusted, so the software needs to deal with all kinds of failures. File chunks are replicated across multiple nodes and racks. Checksums are used heavily to rule out data corruption. The master replicates its state and logs so that, in case of failure, it can restore itself locally or on other nodes. Shadow masters provide read-only access during the master's failure. Servers are designed for fast recovery, and downtime can be reduced to a few seconds.
Figure 1 Basic infrastructure of GFS
In addition, there are other useful functionalities, such as snapshots, garbage collection, integrity tests, re-replication and re-balancing. The structure of GFS is shown in Figure 1.
2.1.2 Bigtable
GFS is a reliable, high performance distributed file system that serves raw files, but many applications need structured data resembling a table in a relational database. Bigtable [6], also from Google, fulfills this need. HBase [14] from Hadoop is an open source version.
Figure 2 An example table that stores Web pages
In Bigtable, tables are not organized in a strict relational data model; instead, each table has one row key and an unfixed, unlimited number of columns, and each field has multiple versions indexed by timestamp. This data model supports dynamic data control and gives applications more choices in how to express their data. Internally, Bigtable is a distributed, efficient map from row key, column name and timestamp to the actual value in that cell. Columns are grouped into column families, which are the basic units of access control. Bigtable is sorted in lexicographic order by row key, and dynamically partitioned into tablets, each holding several adjacent rows. This design exploits data locality and improves overall performance. An example is shown in Figure 2, which is a small part of a Web page table. The row has the row key "com.cnn.www", two column families "contents" and "anchor", and the "contents" value has three versions.
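A minimal sketch of this data model, using a plain in-memory dict (the real system distributes and persists the map): cells are addressed by row key, column name in "family:qualifier" form, and timestamp, and a read returns the newest version by default. The class and column names here are invented for illustration.

import bisect

class Table:
    """Toy Bigtable-style map: (row key, column, timestamp) -> value."""
    def __init__(self):
        self.cells = {}          # (row, column) -> sorted list of (timestamp, value)

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        bisect.insort(versions, (timestamp, value))

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), [])
        if timestamp is None:                       # default: latest version
            return versions[-1][1] if versions else None
        older = [v for t, v in versions if t <= timestamp]
        return older[-1] if older else None

t = Table()
t.put("com.cnn.www", "contents:", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:", 3, "<html>v3</html>")
t.put("com.cnn.www", "anchor:cnnsi.com", 2, "CNN")
print(t.get("com.cnn.www", "contents:"))                # newest contents version
print(t.get("com.cnn.www", "contents:", timestamp=2))   # the version as of time 2 -> v1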
Bigtable is built on top of several other Google infrastructures. It uses GFS as persistent storage for data files and log files. It depends on a cluster management system to schedule resources, relies on Chubby [5] as the lock service provider and meta-data manager, and runs in clusters of machine pools shared with other applications.
In the implementation, there are one master and many tablet servers. The master takes charge of the whole system, and tablet servers manage the tablets assigned to them by the master. Tablet information is stored in meta-data indexed through a specific Chubby file. Data in a tablet are stored in two parts: the old values, which are immutable and stored in a special SSTable file format sorted by row key, and a commit log of mutations on the tablet. Both kinds of files are stored in GFS. When a tablet server starts, it reads the SSTables and the commit log for the tablet, and constructs a sorted buffer named "memtable" holding the most recent view of the values. When a read operation is received, the tablet server searches the merged view of the latest tablet state from both the SSTable files and the commit log, and returns the value. When a write operation is received, the operation is written to the log file, and the memtable is modified accordingly. The whole process is shown in Figure 3.
Figure 3 Bigtable tablet representation
Many refinements are used to achieve high performance and reliability. Compression is applied to save storage space and speed up transfers. Caching is used heavily on both the server side and the client side to relieve the load on the network and disks. Because most tables are sparse, Bloom filters are used to speed up searches for non-existent fields. Logs on a tablet server are co-mingled into one, and tablet recovery is designed to be rapid to minimize downtime.
2.2 Data Manipulation
Making good use of data involves more than just storing it. People could write dedicated distributed programs for a particular kind of processing, but such programs are hard to write and maintain, and each has to deal with data distribution, scheduling, failure detection and recovery, and machine communication. A central framework can provide these common features, so that users can rely on it and concentrate only on the logic unique to their jobs. Here two such systems are analyzed in detail.
2.2.1 MapReduce

MapReduce [9] is a powerful programming model for distributed massive scale data processing. Originally designed and implemented at Google, MapReduce is now a hot topic and is applied in fields it was not originally intended for. Hadoop also has its own open source MapReduce version. MapReduce is built on top of GFS and Bigtable, and uses them as input source and output destination.
MapReduce's programming model comes from functional languages and consists of two functions: map and reduce. The map function takes input in the form of <key, value> pairs, does some computation on a single pair and produces a set of intermediate <key', value'> pairs. All the intermediate pairs are then grouped and sorted. The reduce function takes an intermediate key and the list of values for that key as input, does some computation and writes out the final <key'', value''> pairs as the result. A lot of practical jobs can be expressed in this model, including grep, inverted lists of web pages and PageRank computation.
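A minimal word-count sketch of this model (not Hadoop's actual API): map_fn emits <word, 1> pairs for each input record, a single-process dictionary stands in for the distributed group-and-sort step, and reduce_fn sums the values for each key.

from collections import defaultdict

def map_fn(key, value):
    # key: document name (unused), value: document text
    for word in value.split():
        yield word, 1

def reduce_fn(key, values):
    # key: a word, values: the list of counts emitted for that word
    yield key, sum(values)

def run_job(inputs):
    groups = defaultdict(list)                 # stand-in for the shuffle/sort phase
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k in sorted(groups) for kv in reduce_fn(k, groups[k]))

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
print(run_job(docs))   # e.g. 'the' -> 3, 'fox' -> 2, every other word -> 1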
Google's MapReduce is implemented in a master-slave architecture. When a new job starts, it generates a master, a number of workers controlled by the master, and some number of mapper and reducer tasks. The master assigns mapper and reducer tasks to free workers. Mapper tasks write their intermediate results to their local disks and notify the master of their completion, and the master informs the reducer tasks to fetch the map outputs. When some tasks finish, if there are still unassigned tasks, the master continues its scheduling. Only when all mapper tasks have finished are the reducer tasks started. After all reducers have finished, the job is complete and the result is returned to the client. During job execution, if any worker dies, all the tasks on that worker are marked as failed and re-scheduled later until they finish successfully. Figure 4 shows an example with an input file of 5 splits, 4 mappers and 2 reducers. Note that Hadoop's version of MapReduce differs slightly from Google's, as we will show later.
Figure 4 Basic structure of MapReduce framework
Other researchers have proposed several enhancements to MapReduce. Traditional MapReduce focuses on long-running, data-intensive batch jobs that aim at high throughput rather than short response time. Outputs of both the map phase and the reduce phase are written to disk, either the local file system or the distributed file system, to simplify fault tolerance. MapReduce Online [8] was proposed to allow data to be pipelined between phases and between consecutive jobs. Intermediate <key, value> pairs are sent to the next operator soon after they are generated, from mappers to reducers in the same job, or from reducers in one job to mappers in the next. Because a reducer executes with only a portion of all intermediate results, the final results are not always accurate; MapReduce Online therefore takes snapshots as the mappers proceed, runs reducers on these snapshots and approximates the real answer. Many refinements are used to improve performance and fault tolerance, and to support online aggregation and continuous queries.
Although MapReduce can express many algorithms in areas such as information retrieval and machine learning, it is hard to use in some database settings, especially table joins. Map-Reduce-Merge [33] extends the original model by adding a third phase, the merge phase, at the end of the reduce phase. A merger takes the resulting <key, value> pairs from the reducers of two MapReduce jobs, runs default or user-defined functions, and generates the final output files. With the help of the merge phase and some operators, Map-Reduce-Merge is able to express many relational operators such as projection, aggregation, selection and set operations. More importantly, it is able to run most join algorithms, such as sort-merge join, hash join and nested-loop join.
Job scheduling is an important factor in MapReduce's performance. Hadoop assumes that all nodes are the same, which results in bad performance in heterogeneous environments, such as virtual machines in Amazon's EC2 [2], where the performance of machines can differ significantly. Speculative tasks are one of the reasons. [36] proposes a new scheduling algorithm, Longest Approximate Time to End (LATE), that fits heterogeneous environments well and launches speculative tasks more accurately. The key idea is to estimate which task will finish farthest in the future. LATE uses a simple heuristic that assumes a task's progress rate is constant: the progress rate is calculated from the completed fraction of the work and the elapsed time, and the remaining time is then the remaining fraction divided by this rate. The task estimated to finish last determines the job's response time, and is therefore the first one to re-execute if speculative tasks are needed. Some tuning parameters are used to further enhance this strategy.
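The estimate itself can be sketched in a few lines; the function names and the example numbers below are made up, and the real LATE scheduler adds thresholds and caps on the number of speculative copies.

def time_left(progress, elapsed):
    # constant-progress-rate assumption: rate = progress / elapsed
    rate = progress / elapsed
    return (1.0 - progress) / rate

def pick_speculative_task(tasks):
    """tasks: list of (task_id, progress in [0,1], elapsed seconds).
    Returns the running task expected to finish farthest in the future."""
    running = [t for t in tasks if 0 < t[1] < 1]
    return max(running, key=lambda t: time_left(t[1], t[2]))[0] if running else None

tasks = [("m1", 0.9, 90),    # fast task, almost done (about 10 s left)
         ("m2", 0.3, 90),    # straggler: same elapsed time, far less progress (about 210 s left)
         ("m3", 1.0, 60)]    # already finished
print(pick_speculative_task(tasks))   # -> 'm2', the best speculation candidate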
In order to use MapReduce more conveniently, other applications have been built on top of it to provide simpler interfaces. Hive [26] is a data warehousing solution built on Hadoop. It organizes data in relational tables with partitions, and uses the SQL-like declarative language HiveQL as its query language. Many database operations are supported in Hive, such as equi-join, selection, group-by, sort-by and some aggregation functions. Users can define and plug in their own functions, further extending Hive's functionality. Its ability to translate SQL into directed acyclic graphs of MapReduce jobs lets ordinary database users analyze extremely large datasets, so it is becoming more and more popular.
Trang 23Pig Latin [20] is a data processing language for large dataset, and can be compiled into asequence of MapReduce jobs Not like SQL, Pig Latin is a hybrid procedural query languageespecially suitable for programmers, and users can easily control its data flow It employs a flexiblenested data model, and internally supports many operators such as co-group and equi-join LikeHive it is easily extensible using user-defined functions Furthermore, Pig Latin is easy to learnand easy to debug.
Sawzall [22] was designed and implemented at Google and runs on top of Google's infrastructure, such as protocol buffers, GFS and MapReduce. Like MapReduce, Sawzall has two phases: a filtering phase and an aggregation phase. In the filtering phase records are processed one by one and emitted to the aggregation phase, which performs aggregation operations such as sum, maximum and histogram. In practice Sawzall acts as a wrapper around MapReduce, and presents a view of pure data operations.
MapReduce has also been introduced into areas other than large scale data analysis. As multi-core systems become popular, how to easily exploit multiple cores is a hot topic, and MapReduce is useful here as well. [23] describes the Phoenix system, a MapReduce runtime for multi-core and multi-processor systems using shared memory. Like MapReduce, Phoenix consists of many map and reduce workers, each of which runs in a thread on a core. Shared memory is used as the storage for intermediate results, so that data need not be copied, which saves a lot of time. At the end of the reduce phase, the outputs from different reducers are merged into one output file in a bushy-tree fashion. Phoenix provides a small API that is flexible and easy to use.
Phoenix shows a new way to apply MapReduce in a shared memory system, but it may perform badly in a distributed environment. Phoenix Rebirth [34] revises the original Phoenix into a new version for NUMA systems. It utilizes locality information when making scheduling decisions to minimize remote memory traffic. Mappers are scheduled onto machines that hold the data or are near the data. Combiners are used to reduce the size of mapper outputs, and therefore reduce the remote memory demand. In the merge phase, merge sort is performed first within a locality group, and then among different groups.
Another area is general purpose computation on graphics processors. GPUs are being used in high-performance computing because of their high internal parallelism. However, programs for GPUs are hard to write and not portable. Mars [17] implements a MapReduce framework on GPUs that is easy to use, flexible and portable. During execution, inputs are prepared in main memory, copied to the GPU's device memory, and then mappers are started on the GPU's hundreds of cores. After the mappers complete, reducers are scheduled. Finally the outputs are merged into one and copied back to main memory. Mars exposes a small API and yet achieves performance that is sometimes 16 times faster than its CPU based counterpart.
2.2.2 Dryad

Although MapReduce is now widely adopted in many areas, it is awkward for expressing some algorithms, such as large graph computations, which are needed in many cases. Dryad [18] can be seen as an extended MapReduce that enables users to control the topology of the computation. Dryad is a general purpose, data parallel, low level distributed execution engine. It organizes its jobs as directed acyclic graphs, in which nodes are simple computation programs that are usually sequential, and edges are data transmissions between nodes.
Creating a Dryad job is easy. Dryad uses a simple graph description language to define a graph. This description language has 8 basic topology building blocks, which are sufficient to represent all DAGs when combined. Users define computation nodes by inheriting a C++ base node class and integrating them into the graph. Edges in the graph have three forms: files, TCP connections and shared memory. An example job DAG is shown in Figure 5. Other tedious work, such as job scheduling, resource management, synchronization, failure detection and recovery, and data transportation, is performed internally by the Dryad framework itself.
The system architecture is again a single-master, multiple-slave style to ensure efficiency. The execution of a Dryad job is coordinated by the job manager, which also monitors the running states of all slaves. When a new job arrives, the job manager starts the computation nodes that take their input directly from files, according to the job's graph description. When a node finishes, its output is fed to its child nodes. The job manager keeps looking for nodes that have all their inputs ready, and starts them in real time. If a node fails, it is restarted on another machine. When all computation nodes finish, the whole job finishes, and the output is returned to the user.
Figure 5 A Dryad job DAG
A lot of optimization is needed to make Dryad useful in practice. First of all, the user-provided execution plan may not be optimal and may need refinement. For example, if a large number of vertices in the execution graph aggregate into a single vertex, that vertex may become a bottleneck.
At run time, these source vertices may be grouped into subsets, with corresponding intermediate vertices added in the middle. Second, the number of vertices may be far larger than the number of available machines, and these vertices are not always independent, so how to map the logical vertices onto physical resources is of great importance. Some vertices are so closely related that it is better to schedule them on the same machine or even in the same process. Vertices can run one after another, or at the same time with data pipelined in between. Moreover, there are three kinds of data transmission channels: shared memory, TCP network connections and temporary files, each with different characteristics; using an inappropriate channel can cause large overhead. Last but not least, how failure recovery is accomplished affects the whole system. Note that these optimization decisions are correlated. For example, if some vertices are executed in the same process, the channels between them should be shared memory to minimize overhead, but failure recovery becomes more complex: once one vertex fails, the other vertices in the same process need to be re-executed as well, because the intermediate data are stored in memory and are volatile to failures.
Although Dryad is powerful enough to express many algorithms, it is too low level for daily work: Dryad users still need to consider details of the computation topology and data manipulation. Therefore some higher level systems have been designed and implemented on top of Dryad. The Nebula language [18] is one example. Nebula exposes Dryad as a generalization of a simple pipelining mechanism, providing developers a clean interface and hiding Dryad's unnecessary details. Nebula also has a front-end that integrates Nebula scripts with simple SQL queries. DryadLINQ [35] is an extension on top of Dryad, aiming to give programmers the illusion of a single powerful virtual computer so that they can focus on the primary logic of their applications. It automatically translates high level sequential LINQ programs, which have an SQL-style syntax and many extra enhancements, into Dryad plans, and applies both static and dynamic optimizations to speed up their execution. A strongly typed data model is inherited from the original LINQ language, and old LINQ programs that were written for traditional relational databases can now deal with extremely large volumes of data on Dryad clusters without any change. Debugging environments are also provided to help developers.
3 System Description
In this section we first investigate the fundamental behavior of Hadoop MapReduce, followed by a categorization of typical workloads. We then describe the experimental setup, and plot a set of experimental results as a preliminary impression of MapReduce's performance.
3.1 Architecture of Hadoop MapReduce

Hadoop [16] is a suite of distributed processing software that closely mimics its counterparts in Google's system. In this study we choose Hadoop MapReduce along with the Hadoop Distributed File System (HDFS), and analyze their architectures here in order to gain enough insight to set up the environment for the model.
HDFS is organized in a single-master, multiple-slave style. The master is called the Namenode; it maintains the file system structure and controls all read/write operations. The slaves are called Datanodes; they store the actual data and carry out the read/write operations. As stated in Section 2.1.1, file data are stored in blocks of fixed size, which improves scalability and fault tolerance. The available functionality is deliberately confined to keep the system simple and efficient.
When a read operation arrives, the Namenode first checks its validity, and redirects the operation to a list of corresponding Datanodes according to the file name and the offset inside the file. The sender of the operation then contacts one of those Datanodes for the data it requires. The Datanode closest to the sender is chosen first, in special cases the sender itself, so as to save network bandwidth. If the current block is exhausted, the next block is chosen by the Namenode, and the operation restarts from the new offset at the beginning of that block.

When a write operation arrives, the Namenode again checks its validity, and chooses a list of Datanodes to store the different replicas of the written data. The sender streams the data to the first Datanode, the first Datanode streams the same data to the next Datanode at the same time, and so on until no data is left. If the current block is full, a new list of Datanodes is chosen to store the remaining data, and the operation restarts.
Hadoop MapReduce is built on top of HDFS, and it similarly has a master-slave architecture. Its master is called the Jobtracker, which controls the progress of a job, including submission, scheduling, cleaning up and so on. The slaves are called Tasktrackers, which run the tasks assigned to them by the Jobtracker. To keep the description concise, only the major actions are shown.
When a new job arrives, the Jobtracker sets up the data structures needed to keep track of the job's progress, and then initializes the right number of mappers and reducers, which are put into the pool of available tasks. The scheduler monitors that pool and allocates new tasks to free Tasktrackers. Many strategies can help the scheduling, such as exploiting the locality of input files and rearranging tasks in a better order. If no Tasktracker is available, the new tasks are queued up until some Tasktracker finishes an old task and is ready for a new one.
When a Tasktracker receives a task from the Jobtracker, it spawns a new process to run the actual code for that particular task, collects the running information and sends it back to the Jobtracker. Depending on the specific configuration of the task, it may read data from the local disk or remote nodes, compute the output and write it to the local disk or HDFS. Three important components are therefore involved: the CPU, the local disk and the network interface card.
According to MapReduce's topology, a job ideally has two phases: map and reduce. After careful study, however, we find two synchronization points, one after the mappers and one after the reducers: all reducers start only after all mappers finish, and the job result is returned only after all reducers finish. In this work we focus on average performance, and the randomness of the response times of individual map or reduce tasks makes the average task time insufficient to calculate the total job time. We measure this randomness by the difference ∆ between the response time of a job and the average times of its map and reduce phases. Furthermore, a job has to wait at the master node if there are currently no free slots for new jobs; we call this waiting time the fourth part. In total we have four parts: map, reduce, ∆ and waiting.
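As a small illustration of this decomposition (with made-up task times, and assuming the job's response time is simply the sum of the four parts), ∆ captures how much the synchronization barriers after the slowest mapper and the slowest reducer add on top of the average phase times:

def job_response_time(avg_map, avg_reduce, delta, waiting):
    # the four parts described above: map, reduce, the randomness term delta, and waiting
    return avg_map + avg_reduce + delta + waiting

map_times = [10, 12, 11, 19]     # seconds per map task (one straggler)
reduce_times = [20, 24]          # seconds per reduce task
avg_map = sum(map_times) / len(map_times)            # 13.0
avg_reduce = sum(reduce_times) / len(reduce_times)   # 22.0

# With the two synchronization barriers, a job cannot finish before its slowest
# mapper and slowest reducer; delta is the gap between those maxima and the averages.
delta = (max(map_times) + max(reduce_times)) - (avg_map + avg_reduce)   # 8.0

print(job_response_time(avg_map, avg_reduce, delta, waiting=0.0))       # 43.0 seconds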
3.2 Representative Workload

MapReduce is powerful in its expressiveness, especially for large scale problems. Typical workloads include sorting, log processing, index building and database operations such as selections, projections and joins. In order to make our analysis applicable to a variety of scenarios, we divide these workloads into three types according to the relation between input and output, because most MapReduce applications are I/O bound programs, meaning that they require more disk or network I/O time than CPU time.
The first type is large-in-small-out, in which a large amount of data is read and processed but only a small amount of output is generated. Examples include log item aggregation and table selections in a database. The second type is large-in-large-out, in which a large amount of data is read and a comparable amount of output is generated. Examples include sorting, index building and table projection in a database. The last type is small-in-large-out, in which only a small amount of input is needed but a large amount of output can be generated. Examples include random data generation and some table joins in a database.
To reach an accurate model step by step, we first build the model for the first type of workload, and then verify and extend the model to incorporate the other two types. The workload we choose to set up the basic model is a random SQL query performing an equi-join of two tables with projection and selection, shown in Listing 1. The two chosen input tables are large, and the size of the output is negligible. In this query the value of u_name is chosen randomly so that the queries differ from each other, the way a real workload would look.
Listing 1 Workload query
3.3 Experimental Setup and Measured Results

All experiments are run on an in-house cluster with at most 72 working nodes, although we do not use all of them all the time. Each node has a 4-core CPU, 8GB of main memory, a 400GB disk and a Gigabit Ethernet network card.
In order to get a complete overview of MapReduce's performance, we arrange the experiments in the following systematic way. First, the limit on concurrent tasks is raised high enough to support more concurrent jobs. Then, in each experiment, the number of concurrently running jobs is fixed, and as the jobs run we measure the numbers of maps, reduces and shuffles, and the amount of time spent in each phase. Each experiment may be run multiple times and averaged to improve accuracy. We then vary the job concurrency to obtain the whole curve, which shows how the performance changes with the workload. After getting the curve for one setting, we change the number of nodes used, or the system parameters, to observe the effects of different settings.
The usual patterns for the throughput, the time spent in each phase and the number of tasks in each phase are shown in Figures 6, 7 and 8 respectively. For different settings the specific numbers in these curves may differ, but the shapes are similar. In the throughput curve of Figure 6, the throughput first increases almost linearly, then gradually decelerates to reach a maximum, after which it drops a little and remains steady. The last changing point is at concurrency 40 in this example. If we dissect the running time into the four parts mentioned earlier, we get Figure 7. The first two parts (map and reduce) have a similar pattern: a linear increase followed by a steady constant. The ∆ part is different; it stabilizes after an exponential-like increase. The waiting part remains 0 and then increases linearly after a turning point. One thing to notice is that the turning points of these four parts coincide, at concurrency 40 in this example, which is also where the throughput turns. Finally, in Figure 8, which shows the numbers of concurrent tasks, the two phases have a similar pattern, a linear increase followed by a constant, and the turning points are the same as in Figures 7 and 6. However, not all workloads produce exactly the same curves; the extent of the performance drop depends on the system parameters and the characteristics of the workload, and may disappear in some scenarios. In the next chapter we explain why the curves have these shapes.
Figure 6 Example figure of throughput
Figure 7 Example figure of times of each phase
Figure 8 Example figure of numbers of tasks in each phase
4 Model Description
This section starts with the assumptions and reviews related theory in analytical performance modelling. The model is then developed, in three major parts, in the form of an open system. Finally the whole model is assembled, and a corresponding formula is derived for closed systems.

4.1 Assumptions
Assumption 4. Individual jobs are not too large, and their tasks can run at the same time. MapReduce jobs could be very large and take, for example, hours to finish, but it is not realistic to treat such long jobs as failure free. Because our model does not consider failures, due to limits of time and budget, our primary focus is on the throughput of short running jobs, which take only several minutes to finish.
4.2 Related Theory Overview

Before we delve deep into the model, we first review some related theory that will be used later. One of the most important theorems in performance analysis is Little's Law, which has the form given in Theorem 1.
Theorem 1 (Little's Law). In the steady state of a system, N = XT, where N is the average number of jobs in the system, X is the throughput (equal to the arrival rate in steady state), and T is the average time a job spends in the system.
Theorem 2. For an M/M/1 queue with arrival rate λ and service rate µ, where λ < µ, the expected response time is T = 1/(µ − λ).

An open queueing model assumes that jobs arrive from an unbounded external population at a given rate; if the system is instead meant to support a fixed, large number of concurrent users, a closed model may be more suitable.
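As a small numeric illustration of the two theorems (the rates below are made up): a node that serves tasks at µ = 10 tasks/s and receives them at λ = 8 tasks/s has an expected response time of half a second, and by Little's Law holds 4 tasks on average.

lam, mu = 8.0, 10.0            # arrival and service rates (tasks per second), assumed values
T = 1.0 / (mu - lam)           # Theorem 2: expected response time of an M/M/1 queue
N = lam * T                    # Theorem 1 (Little's Law): average number of tasks in the node
print(T, N)                    # 0.5 seconds per task, 4 tasks in the system on average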
The classic M/M/1 queue uses the First Come First Served (FCFS) discipline, which is simple and easy to study. Processor sharing is another useful discipline, in the sense that many systems behave in a processor sharing pattern, such as modern time-sharing operating systems. In such systems, the server divides its computing power evenly among all the customers present. Furthermore, if there are multiple classes of customers in a queueing system, the analysis becomes much more complicated. In [1] the authors summarize several models for processor sharing queues, and here we use the equations relating the response time to the arrival rate, unconditioned on the number of jobs in the queue, as described in Theorem 3.
Theorem 3. In a queueing system with K classes of customers in total and the processor sharing discipline, let T_k be the expected response time of a class-k customer, λ_k its arrival rate and µ_k its service rate. If the service requirements are exponentially distributed, then the T_k satisfy a system of K linear equations of the form

    T_k = B_k0 + Σ_{j=1}^{K} ( ... ),    k = 1, ..., K,    (5)

where the summation couples T_k to the arrival rates λ_j and the response times of the other classes; the full expression is given in [1].
Large and complex systems, such as the Internet with its countless routers, switches and end hosts, are usually hard to model with the aforementioned techniques because of the large number of sub-systems they contain. Bottleneck analysis [25] is helpful in this scenario. Among all sub-systems, such as all the links in the Internet, the one with the highest utilization is called the bottleneck, and the bottleneck defines a performance bound for the whole system. For example, if an end user A in the Internet sends data to another user B, the data transfer speed is limited by the speed of the bottleneck link between A and B. A model of the bottleneck sub-system is a good approximation of the whole system, accurate enough in many different cases.
4.3 Notation Table
Table 1 shows the symbols and their descriptions that will be used throughout the thesis. Several less important notation symbols are introduced where they are needed.
X Job arrival rate (in open systems) or throughput (in closed systems)
Table 1 Symbols and notations
4.4 Disassembled Sub-models

In this subsection we first model the average performance of individual tasks, and then use their random behavior to calculate the response time of an ordinary job. Finally we consider the waiting time incurred when existing jobs occupy all free slots and newly arriving jobs have to wait.
4.4.1 The Average Task Performance
Returning to the MapReduce framework, a job is decomposed into many tasks that run on slave nodes, and these tasks read from and write to HDFS files spread across all nodes. As a result, every job is potentially related to every node, which means that the busiest slave node is potentially the slowest point for all jobs. Intuitively, from Section 4.2 we know the tasks on the busiest node are the slowest tasks, so the busiest node is the bottleneck of the whole MapReduce framework; if we model the busiest node accurately, we have a model for the slave nodes. Therefore, we first focus on the model for a single node, then use the parameters of the bottleneck node to directly calculate the performance of that particular node, and indirectly the performance of the slave nodes as a whole. We introduce a parameter p to represent this imbalance, defined in Equation 8:
    p = (the amount of work on the slowest node) / (the average amount of work per node),    (8)
where the amount of work is the number of running operating system processes, including MapReduce mapper and reducer tasks, their management processes, and the distributed file system processes. The more running processes a machine has, the slower each of these processes gets. This cluster imbalance factor p is affected by the type of work and the cluster size, which will be discussed later.
When a slave node receives a new task, it sets up the environment, initializes the task, and then launches a new process to run it. Although all parts of a computer system, such as the CPU, memory, disk and network interface, are involved in the execution of tasks, we treat the node as a black box to simplify the problem; we will later validate this simplification using measured data. Slave nodes are usually managed by a modern time-sharing operating system, Linux in our case. There are two types of tasks running on a slave, as mentioned before, and therefore a multiclass processor sharing model is a reasonable fit. Theorem 3 gives the precise equations that will be used later. Although the equations in (5) do not necessarily imply that the performance curve is linear as in Figure 7, we will show later that a small modification validates this model for our system.
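As a simplified illustration of this node-level view (not the thesis's modified model), the sketch below uses the classical egalitarian processor sharing result, in which every class's mean response time is its isolated service time inflated by 1/(1 − ρ), with ρ the total utilization over all classes; the class names and rates are made up.

def ps_response_times(classes):
    """classes: dict name -> (arrival rate, service rate) for one slave node.
    Returns the expected response time per class under egalitarian processor sharing."""
    rho = sum(lam / mu for lam, mu in classes.values())   # total node utilization
    if rho >= 1.0:
        raise ValueError("node is saturated (utilization >= 1)")
    return {name: (1.0 / mu) / (1.0 - rho) for name, (lam, mu) in classes.items()}

# hypothetical per-node rates for the two task classes running on a slave
node_classes = {"map": (0.04, 0.10),      # tasks/s arriving, tasks/s served in isolation
                "reduce": (0.01, 0.05)}
print(ps_response_times(node_classes))    # every class is slowed by the same factor 1/(1 - rho)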