APPLICATION-SPECIFIC WORKLOAD SHAPING IN RESOURCE-CONSTRAINED MEDIA PLAYERS
BALAJI RAMAN
Master of Science, NUS
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
January 2009
Acknowledgments
Samarjit Chakraborty, my graduate advisor and guru, accepted me as his PhD student, proposed this thesis topic, and was substantially involved in my research, writing, and presentation. Samarjit’s empathy towards students, his tolerance for my annoying demands, and his patience with my tortoise pace deserve a standing ovation from heaven. Samarjit taught me to acquire excellence as a habit, and to reject mediocrity, especially in writing. His countless advice on both technical and non-technical matters resonates in my everyday academic life.
Wei Tsang Ooi, my co-advisor and mentor, taught and trained me in the fundamental skills that a research student should possess. This thesis benefited from Wei Tsang’s insistence on clarity in writing, correctness in results, and simplicity in style. His emphasis on research ethics was such that those rules are hammered into my head. Wei Tsang spent innumerable hours in meetings and in reviewing my writing. These countably infinite hours do not include the hours he spent devising small courses on writing, reading, and presentation, and pondering my research topics on his own. Not tiring of these labors, and being an excellent listener, he offered great career advice that suited me.
Tulika Mitra, my master's thesis advisor, paved the way for my doctoral studies. I enjoyed our weekly meetings, where I learned why and how to put in the effort to think and concentrate on a research problem. I practice the discipline and the integrity that Tulika taught, conveyed through her own actions. Apart from all this advice, I benefited greatly from Tulika’s teaching on diligence in writing, especially when presenting related work.
I had the good fortune that Paolo Ienne gave me an opportunity to do an internship at EPFL. The intense intellectual discussions on my thesis helped me to a great extent in writing my thesis after my internship. Paolo presented my thesis work in an important international forum, and explained its impact to the relevant audience. His advice on my career had a significant, positive impact on my application process for postdoctoral jobs.
I thank the numerous reviewers of my publications, who pointed out several improvements, and gave concrete suggestions. In particular, I thank my thesis committee members, Weng Fai Wong, Wang Ye, and Andy Pimentel.
Many people gave generously of their time, and helped me with the administration. I thank Loo Line Fong for responding to me promptly at critical times. I thank her as well for administrative support during my student years at NUS. I thank Chan Tim Fook, Embedded Systems laboratory in-charge, who provided me with all the computational resources I needed. I thank the following friends who helped me to communicate with staff at NUS when I came to France: Ankit Goel, Ashwin Nanjappa, and Deepak Gangadharan. Chantal Schneeberger, administrative staff at EPFL, went beyond her means to help during my internship in Lausanne, Switzerland.
My friends provided the needed rest and relaxation in the form of plays and movies. I thank Chanakya, Subramanian, and Sudharsanan for counseling me at difficult times, for loaning money when needed, and for providing company when the deadlines required working past midnight. I cherished the company of Ramkumar, Senthilnathan, Unmesh, Chandra, Vijaykumar, Pan Yu, Linh, Kathy, Yanhong, Satish, Cheng Wei, Ma Lin, and other friends.
I am profoundly grateful to my parents, who were tolerant when I was too busy for trips to India, who stayed with me in Singapore for many months, who responded with useful advice and counseling every week, and who energized me during my vacations in India. As though that were not enough, my father bore with me when I discussed all the technical details of my research work, and my mother sounded persuaded when I reasoned why I have been a student for so many years.
I am indebted to my sister Sudha Raman, whose confidence and success are infectious, and whose encouragement provided me with the essential moral support needed for my stay in Singapore. She provided me with partial financial support for attending conferences, and when my stipend arrived late. Sudha showed lots of patience whenever I stressed out over studies and vented at home. Sudha, from childhood, led me in my personal and academic life. While I will choose a different venue to completely state her positive influence on me, in brevity:
I dedicate this thesis to my sister Sudha Raman.
I thank you all and God.
Table of Contents
1 Introduction
1.1 What is Workload Shaping?
1.2 Shaping Techniques
1.3 Thesis
2 Background
2.1 Analytical Model: A Bird’s Eye Review
2.2 Tuning Scheduler Parameters
2.2.1 Methodologies
2.3 Preliminaries
2.3.1 Our System Model
3 Buffering for Smoothing
3.1 Buffering Vs Workload
3.1.1 Basic Intuition
3.1.2 Related Work
3.2 Frequency Estimation
3.3 Delay Redistribution
3.3.1 Motivation
3.3.2 Relation to Previous Work
3.3.3 Illustrative Example
3.3.4 Problem statement
3.3.5 Playout Delay Redistribution
3.3.6 Buffer Size Estimation
3.3.7 Results
4.1 Motivation
4.1.1 Our Contribution
4.1.2 Reference works
4.2 Illustrative Example
4.2.1 Problem Statement
4.3 Dynamic Buffering
4.3.1 Schedulability Analysis
4.4 Discussion
5 Buffering with Stochastic Guarantees
5.1 Basic Idea
5.2 Motivation
5.3 Illustrative Example
5.4 Minimizing Buffering
5.4.1 Buffer Underflow
5.5 Numerical Evaluation
5.5.1 Minimum playout delay
5.5.2 Validation
6 Future Work and Conclusions
6.1 Modeling Processor Waiting Time
6.2 General Stochastic Framework
6.2.1 A motivating example
6.3 Final Remarks
List of Figures
1.1 Shaping Techniques for Multimedia players
2.1 Dimensions of SoC Design
2.2 System Model
3.1 Our system model and technique. FIFO buffers connect PEs in pipeline. An application is partitioned and mapped onto the different PEs that run tasks concurrently. Buffer size reduces on redistributing playout delay.
3.2 Buffer fill levels with initial playout delay: (a) very small, (b) large, and (c) redistributed
3.3 System Model
3.4 Initial playout delay values as minimum required processor frequency drops and stabilizes
3.5 Change in buffer fill levels with redistributing playout delay
3.6 Playout delay estimation w.r.t. processing requirement of tasks (VLD and IQ) running in PE 1
4.1 Setup for dynamic workload shaping
4.2 Dynamically controlling the playout buffer fill level as two applications are being scheduled
4.3 Buffering time versus workload for a low bit rate and low resolution video stream
4.4 A schedulable system
4.5 Schedulable regions for different flow
4.6 A non-schedulable system
4.7 Schedulable regions of a periodic task (p = 600 ms, e = 80 × 10^6 cycles)
4.8 Schedulable region for a setup consisting of a periodic task along with an MPEG-2 decoder decoding a low bit rate and low resolution video stream
5.1 Processing requirement reduces with large initial delay. The production rate is high when playout starts after small delay.
5.2 Delay value reduces on relaxing buffer constraints. The output stream at times cannot catch up with consumption and playout buffer underflows.
5.3 Correlation among playout delay, buffer size, and buffer underflow. Increase in playout delay (and buffer size) decreases buffer underflow.
5.4 Playout buffer underflow over time. The variability in underflow substantially reduces with large increase in playout delay.
5.5 Meeting desired stochastic constraints. The probability that the playout buffer underflows is no more than the stochastic bounding function.
5.6 The cumulative distribution of processor frequency. Processor cycles/second allocated to the video decoding task and therefore the playout buffer underflow are probabilistic.
5.7 Accuracy of analytical model. Minimum playout delay estimated using mathematical model is close to the delay values obtained from simulation.
6.1 Multimedia SoC model
6.2 Case a: Buffer underflow due to processor latency. Case b: Playout constraint met with increase in processor share for decoding.
6.3 Model of communication
6.4 System architectures and models used for analysis in previous works. Memory latency modeled for architectures with off-chip memory, shared memory, and FIFO (right to left).
Abstract
Much research in system-level design for multimedia devices is based on analysis with system models, but how insightful are they? System simulation is the prime technique used in computer architecture and embedded system design to explore potential design solutions and validate design choices. Unfortunately, simulation seldom gives real insight or strong guarantees on the dynamic behavior of a system. On the other hand, existing analytical models cannot capture some important attributes of multimedia systems. Consequently, analysis with such mathematical models is not beneficial for efficient system design. A useful analysis with either simulation or analytical models should provide resource-saving techniques. These methods can exploit the key characteristic features of multimedia streams. The fluid nature of arrivals and the inconstant processing requirements of data items are multimedia’s inherent characteristic features. But these characteristic features are predictable. So, the foreseeable properties could be studied to yield techniques that can significantly save on-chip resources.
This thesis proposes techniques to shape multimedia workload so as to effectively utilize on-chip resources such as processor and memory. These shaping techniques attempt to solve the problem of providing guarantees for high-quality media output with minimal on-chip resources. The research approach is to use analytical models that accurately capture the variable characteristics in arrival and execution of items in multimedia streams. Such mathematical models, after analysis, yield deep insights to tune certain application parameters. Using this parameter tuning, it is possible to reshape variable media workloads to reduce processing and storage requirements. The central tenet of this parametric tuning is to adapt the workload such that only the average or minimum processor cycle time required for every multimedia data item is provided, and not the maximum.
Our results show that choosing the appropriate initial playout delay (after which the video starts) can lead to effective processor utilization. This delay parameter is typically arbitrarily chosen. Instead, we propose to estimate the value of the parameter such that it is sufficient to provide the average cycle time required for every data item. This delay, however, could be large and can lead to huge buffer sizes. Hence we propose two ways to reduce the buffer sizes: (1) in a multi-processor set-up this delay parameter could be redistributed to different processors, i.e., apart from the output device, the processors also start after some delay; and (2) allowing tolerable loss in quality. Both these methods show substantial reduction in buffer size. Our model estimates the delay parameter in all of the above-mentioned techniques.
Our mathematical framework is well suited to dealing with media streams in that it can express variability effortlessly and quickly explore cost-quality trade-offs. These essential attributes of our model substantially brought out the benefits of workload shaping. An important advantage of the workload fitting techniques comes from the stochastic models; relaxing constraints that guarantee full output quality yielded significant reductions in processing and memory requirements.
List of Publications and Talks
Published
• Balaji Raman and Samarjit Chakraborty. Application-specific workload shaping in multimedia-enabled personal mobile devices. ACM Transactions on Embedded Computing Systems, 7(2):10, February 2008.
• Balaji Raman, Samarjit Chakraborty, Wei Tsang Ooi, and Santanu Dutta. Reducing data-memory footprint of multimedia applications by delay redistribution. In Proceedings of the ACM/IEEE annual conference on Design automation (DAC), pages 738–743, June 2007.
• Balaji Raman and Samarjit Chakraborty. Application-specific workload shaping in multimedia-enabled personal mobile devices. In Proceedings of the international conference on Hardware/software codesign and system synthesis (CODES+ISSS), pages 4–9, New York, October 2006 (nominated for best-paper award, among top-2 papers).
• Balaji Raman, Samarjit Chakraborty, and Wei Tsang Ooi. Meeting CPU constraints by delaying playout of multimedia tasks. In Proceedings of the international workshop on Network and operating systems support for digital audio and video (NOSSDAV), pages 165–170, New York, June 2005.
Workshop Talks
• Analytical Models of Communications of MPSoCs, International Forum on Application-Specific Multi-Processor SoC (MPSOC), Aachen, Germany, June 2008 (an overview of my research was presented by Dr. Paolo Ienne, among top-5, most-relevant talks).
• Analytical Models of Communications for SoC Multimedia Design, Models of Computer and Communications (MoCC), Eindhoven, Netherlands, July 2008.
Chapter 1
Introduction
The usage of mobile devices is pervasive, and listening to music and watching videos on these media players have become commonplace. Although VLSI technology is advancing at an incredible rate, the processing and storage requirements of multimedia applications are still a dominant factor in the cost of a portable media device.
Naturally, system designers want to reduce processor capacity and memory size, and this is achieved, typically, with slight degradation in output quality. On the other hand, researchers try to improve processor utilization (with scheduling) and buffer management, while providing guarantees on the desired output. This research often involves using analytical or simulation models, and the accuracy of these models determines the benefits of proposed ideas; indeed, not capturing the inherent characteristics of multimedia applications can lead to losing valuable insights.
Instead, this thesis demonstrates that there is much room for improvement in portable device design. Our results show significant reduction in processing and memory requirements, with no loss in output quality. The insights that led to these resource savings were primarily due to the modeling of data sequence in multimedia streams, before and after processing. In addition, this report also proposes a model in which the constraints on quality could be relaxed. This analytical framework enables an informed trade-off between tolerable loss in output and device cost. Together, as explained soon, we term our techniques ‘workload shaping’.
The following section in this chapter defines workload shaping, and introduces three shaping techniques. Then follows a brief discussion on the novelty of the proposed research. The secondary objective of this first chapter is to establish the thesis goal and the research approach for the problem stated. Finally, the contributions and the organization of the document are presented.
1.1 What is Workload Shaping?
To define shaping, we must first understand the System-on-Chip (SoC) in a media player. Thus, we begin with an overview of the components in a SoC in portable players, and their main functions.
A SoC contains one or more processing elements, some buffer memory, and interfaces between memories and processors. Figure 1.1 shows this: the input and playout buffers are memories, and the processing element is linked to these buffers. Below, we look at how these elements function while processing a multimedia stream. The advantages of capturing the characteristics of a multimedia stream will become clear.
Figure 1.1: Shaping Techniques for Multimedia players (diagram; recoverable label: MPEG decoder)
The input buffer temporarily stores the data items from a multimedia stream. The multimedia application being executed in the processor fetches data from the input buffer, and stores output in the playout buffer. The output device displays items in the playout buffer at a constant rate. For example, a video decoding application decompresses the input stream and the decoded items are displayed at the required rate (say 30 fps). The workload can then be described as follows.
The load on the processor is to complete processing a certain number of data items per unit time, and the work the processor does is in providing the multimedia task with a sufficient number of processor cycles per unit time such that the given load can be handled. It has been shown that different data items take a varying number of processor cycles to completely execute. Therefore, the load, and consequently the work, varies over time. Note that the requirement that a certain number of data items be processed per unit time is constant; what varies is the number of processor cycles required to complete executing the pre-specified number of items. To provide guarantees on the output requirement, there is one naive, although inefficient, method to handle the workload variability.
If the processor allocates a constant number of cycles per unit time to the multimedia task, then the processor capacity required to provide a guarantee is higher than the cycle average of all data items; to always satisfy the requirement on output, every data item has to be allocated the worst-case processor cycles required for processing an item. It is then ensured that, irrespective of which data item is being processed, it is always completed within the desired time, thus guaranteeing display. Clearly, the processor is ineffectively utilized; the variability in execution requirement is very high for multimedia items, so a few data items require the maximum processor cycles, and the others close to the average. If the variability, however, is known a priori, then there is a possibility that the processor works for the necessary and sufficient time on the given load.
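To make the gap between the two allocations concrete, the following minimal sketch compares them for a per-item cycle trace. The trace values and the function names are illustrative assumptions, not measurements or code from this thesis; only the 30 fps display rate is taken from the example above.

```python
def worst_case_rate(cycles_per_item, items_per_second):
    """Cycles/second needed when every item is budgeted the worst-case cycles."""
    return max(cycles_per_item) * items_per_second


def average_rate(cycles_per_item, items_per_second):
    """Cycles/second needed when every item is budgeted the average cycles."""
    return sum(cycles_per_item) / len(cycles_per_item) * items_per_second


# Hypothetical per-item cycle demands: one expensive item among cheap ones.
trace = [2_000_000, 2_000_000, 9_000_000, 2_000_000,
         3_000_000, 2_000_000, 2_000_000, 2_000_000]
FPS = 30  # constant display rate, as in the 30 fps example

wc = worst_case_rate(trace, FPS)
avg = average_rate(trace, FPS)
print(f"worst-case allocation: {wc / 1e6:.0f} MHz")
print(f"average allocation:    {avg / 1e6:.0f} MHz")
print(f"over-provisioning factor: {wc / avg:.1f}x")
```

For this made-up trace a single expensive item triples the capacity that a worst-case allocation must reserve, which is the inefficiency the paragraph above describes.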
This thesis proposes techniques to utilize the variability in shaping the workload such that it is sufficient to allocate the average cycle time required for every multimedia data item, and not the maximum. The workload variability, besides causing ineffective processor utilization, can also lead to huge memory requirements. The reason discussed above for ineffective utilization of the processor cannot be directly extended to memory requirements; budgeting the worst-case processor cycle requirement for all multimedia items in the stream does not by itself lead to large space to store those data items. How variability in workload leads to a large data-memory footprint will be discussed in the following section. The techniques proposed, the reader should note, target both processor utilization and buffer memory requirement. Following this, we briefly discuss these workload shaping techniques.
1.2 Shaping Techniques
Below, we explain the shaping techniques, emphasizing the benefits that shaping provides in terms of resource utilization. It will become clear that the advantages of the proposed techniques are primarily based on the model’s accuracy, that is, in capturing the sequence of multimedia items in the stream. The mathematical framework used to represent input and output multimedia streams is intrinsically good at modeling the inherent variability of multimedia workloads. (The calculus that is used to construct these models is described in detail in the subsequent chapter.)
The three shaping techniques are: (1) smoothing, (2) squeezing, and (3) slashing. The first of these techniques, smoothing, shows the importance of tuning a key application parameter, namely the playout delay, that is, the initial delay after which the video is displayed. Now, we describe why the playout delay parameter has to be tuned and how it is done.
Smoothing: Our results show that with an appropriate playout delay for a stream, it is sufficient to provide the multimedia task with the average cycles required per unit time, rather than the maximum (Raman et al., 2005). The number of data items to be processed per unit time is given and the average cycles required per data item is known. Thus we compute the average processor cycles required per unit time.
Clearly, delaying playout leads to saving processor resources, and it is found that the gains are significant: there is a large difference between the maximum cycles required for a data item and the average, and the number of data items requiring worst-case processor cycles in a stream is relatively lower than the number of items that require close to the average number of cycles. In other words, there is a high variability in terms of the processor cycle requirement among data items in the multimedia stream. This initial buffering of processed items before playout, basically, has smoothened the work that the processor does on the given load; the processor cycles reserved for multimedia tasks do not vary over time.
Typically, the playout delay is arbitrarily chosen. Instead, this thesis proposes an analytical framework with which the delay can be precisely estimated. The delay computed corresponds to the scenario where there could be maximum saving in terms of processing requirements, that is, it is sufficient to just provide the media task with the average cycles that it requires per unit time. The inherent variability in the multimedia workload is captured using the analytical model, and that has led to precise computation of the playout delay.
Buffering is a powerful technique for reducing processing requirements for multimedia, but it is stymied by requiring large on-chip memory. Interestingly, the reason that we require a large buffer is again due to workload variability, and the variability in arrival of data items. The buffer size required is usually calculated as follows: the maximum fill-level of the buffer over time is noted, and that is the buffer size. The arrival of data items to the input buffer and the writing of data items to the output buffer vary over time, varying the fill-levels of the input and playout buffers. Hence, due to the high variability in the multimedia workload and in the arrival of input data items, we require a large buffer. In addition, if we have a significant initial playout delay, during which items are stored and the buffer is not emptied, we indeed need a very large buffer; the fill-level of the buffer during initial buffering may be the maximum.
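The relation between playout delay, a constant processor frequency, and the peak buffer fill-level can be illustrated with a small trace-driven sketch. This is not the analytical framework of the thesis; it is a simplified stand-in that assumes the input is always available, the processor starts at time zero and runs at a constant frequency, every item needs a positive number of cycles, and time is measured in processor cycles (so the display period in cycles is the frequency divided by the display rate). The trace and function names are hypothetical.

```python
from itertools import accumulate


def min_playout_delay(cycles_per_item, period_cycles):
    """Smallest delay (in cycles) so that item i is ready by delay + i * period."""
    finish = accumulate(cycles_per_item)  # completion time of each item
    return max(0, max(f - i * period_cycles for i, f in enumerate(finish)))


def max_fill_level(cycles_per_item, period_cycles, delay):
    """Peak number of items in the playout buffer, sampled at each write.

    Assumes delay >= min_playout_delay(...), i.e. the buffer never underflows.
    """
    peak = 0
    for i, f in enumerate(accumulate(cycles_per_item)):
        elapsed = f - delay
        # Items displayed strictly before this write: k with delay + k*period < f.
        shown = 0 if elapsed <= 0 else -(-elapsed // period_cycles)  # ceiling
        peak = max(peak, (i + 1) - shown)
    return peak


# Hypothetical per-item cycle demands (average is 3,000,000 cycles).
trace = [2_000_000, 2_000_000, 9_000_000, 2_000_000,
         3_000_000, 2_000_000, 2_000_000, 2_000_000]
PERIOD = 3_000_000  # display period in cycles when running at the average rate

delay = min_playout_delay(trace, PERIOD)
buffer_items = max_fill_level(trace, PERIOD, delay)
print(f"minimum playout delay: {delay} cycles")
print(f"playout buffer size:   {buffer_items} items")
```

In this sketch the buffer size is read off exactly as described above, as the maximum fill-level observed over time; a longer delay would raise that maximum because more items accumulate before the first one is displayed.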
To reduce the storage requirements, this thesis proposes another technique using the playout delay. This, too, is a smoothing technique in that all processing elements, including the one nearest to the playout buffer, are considered; the processor work for the given variable load is smoothened irrespective of its position in the pipeline of processing elements and memories (for example, in a multi-processor SoC). In the smoothing technique discussed for a single-processor SoC, the output device starts after a certain delay, and the processor starts without any delay.
Instead, if the processor itself starts after a certain delay, which is a small fraction of the actual playout delay, then our results show that the total buffer size required is reduced. This is explained as follows. The variability in the buffer fill, as described earlier, is the reason for large buffer requirements. In a pipeline of processing elements and memories, if the buffer fill variability propagates from one buffer to the other, the size of each buffer increases, and hence the total buffer requirement (sum of all buffers) is consequently large. But if the processor starts after a certain delay, this variability in buffer fill stops propagating, reducing the memory requirements (Raman et al., 2007). The delay after which the processor should start can be exactly computed using the mathematical framework. In the case where there are multiple processors, each of the processing elements runs a part of the multimedia application. The delay associated with each processor then corresponds to the variability of the task that the processor runs.
Squeezing: The squeezing technique proposes scheduling mechanisms to effectively utilize processor bandwidth for multimedia tasks and other periodic tasks concurrently executed on a processor. Since the processor cycles allocated to the multimedia task over a time interval are adjusted such that other tasks can fit in, we term this technique squeezing. Consider a situation in which the multimedia task running on the processor consumes most of the processor bandwidth, so that no other task can run. Thus an incoming periodic task has to be shed because meeting the deadlines of both the periodic and the multimedia task is infeasible. Now, we explain how, with a slight increase in buffer space, the multimedia task and the periodic task can run concurrently and still meet their deadlines.
With a slight increase in buffer space, the multimedia task can pre-decode some data items before the periodic task starts executing. Note that this would require slightly higher processing capacity than the processing resources allocated normally (which correspond to the average processor cycle requirement per time unit). Once some extra data items are decoded, the execution of the periodic task is started. This is facilitated by reducing the processor share (to less than normal) for the multimedia task. During this time period, that is, when the multimedia task is running at a lower speed than normal, the extra items that have been previously produced are being consumed. After some pre-specified time, the periodic task is suspended and the multimedia task is provided with a higher processor share. This cycle of lowering and raising the processing share of the multimedia task is repeated until the execution of the periodic task is complete. The usage of our model in this set-up enables the designer to decide a priori all scheduling parameters.
Evidently, modeling the variability in the workload has helped to estimate the processing requirements to decode the extra stream objects. In addition, the time required to fill the playout buffer in excess and the time required to drain the buffer can also be estimated. During the buffer fill, the periodic task has not started or is suspended, and during the buffer drain the periodic task is in execution. Hence the period and the deadline of the periodic task lie within one buffer fill and drain. The deadline of the multimedia task, that is, to display the multimedia stream at the required rate, is met, and the deadline of the periodic task is also met. The analysis using the mathematical framework thus enhances schedulability of these concurrent tasks.
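The fill and drain times mentioned above can be approximated with simple arithmetic if one assumes, purely for illustration, that the decoder produces items at a constant rate in each phase. This constant-rate assumption deliberately ignores the workload variability that the thesis's model captures, so the sketch below gives only a rough intuition; the function name and all numbers are hypothetical.

```python
def squeeze_cycle(extra_items, display_rate, high_rate, low_rate):
    """Return (fill_time, drain_time) in seconds for one fill/drain cycle.

    Rates are in items/second; extra_items is the additional buffer space.
    """
    if not (low_rate < display_rate < high_rate):
        raise ValueError("need low_rate < display_rate < high_rate")
    # Fill phase: periodic task suspended, decoder runs faster than the display.
    fill_time = extra_items / (high_rate - display_rate)
    # Drain phase: periodic task runs, decoder runs slower than the display.
    drain_time = extra_items / (display_rate - low_rate)
    return fill_time, drain_time


fill, drain = squeeze_cycle(extra_items=6, display_rate=30,
                            high_rate=36, low_rate=20)
print(f"fill for {fill:.2f} s, then run the periodic task for {drain:.2f} s")
```

Under these assumptions the periodic task would need to complete one job, or a slice of it, within each drain phase, and its period would have to accommodate one full fill-and-drain cycle.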
Slashing: Towards maximizing resource utilization, the slashing technique takes a different approach altogether. The workload is reduced or cut in this technique, and hence we term this slashing. While smoothing and squeezing proposed methods to provide guarantees on display quality, they always required that the full output quality be met. But if the constraints on the output are relaxed, then there could be significant resource savings. Also, studies have shown that multimedia applications can tolerate a certain loss in quality, and this deterioration in quality is not perceivable up to some extent. Such quality degradations have previously been utilized in saving on-chip resources, albeit there were no guarantees on the design, and SoCs were built to handle only average-case scenarios. Instead, our technique proposes a framework where loss in quality can be represented and guarantees on throughput can be obtained. To illustrate this, along similar lines to the previous two techniques, we tuned the playout delay parameter with relaxed constraints. Now we explain in detail the benefit of accepting loss in quality with small delays.
Consider the playout delay estimated using our mathematical framework. This delay is the minimum delay required such that there is no loss in quality and the processing requirement is minimal. No loss in quality corresponds to the case where the buffer never underflows. Now if we relax this buffer underflow constraint, that is, the buffer can underflow at times, it corresponds to choosing a smaller delay than actually required. With a smaller delay, the output requirement is not met; the buffer underflows, meaning that the consumption of items is at a faster rate than the production. The playout delay in slashing, however, can be smaller than required. This is because the buffer can underflow to some extent and the loss in quality due to this is acceptable. But then what is the benefit of lowering the delay? A legitimate question. Recall that the initial playout delay is in fact the one that determines the buffer size, and hence any reduction in delay consequently leads to a smaller buffer. The amount of reduction in the playout delay from the value required for no loss in quality can be precisely computed, again using the models.
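The trade-off between a shorter playout delay and occasional underflow can be illustrated by counting, for a given trace, how many items are not yet decoded at their display instant. This is a trace-driven sketch and not the stochastic framework of the thesis; it assumes a constant processor frequency, time measured in processor cycles, a decoder that never skips late items, and a display that keeps to its schedule. The trace and names are hypothetical.

```python
from itertools import accumulate


def missed_deadlines(cycles_per_item, period_cycles, delay):
    """Count items that are not ready at their display instant delay + i*period."""
    finish = accumulate(cycles_per_item)  # completion time of each item
    return sum(1 for i, f in enumerate(finish) if f > delay + i * period_cycles)


# Hypothetical per-item cycle demands (average is 3,000,000 cycles).
trace = [2_000_000, 2_000_000, 9_000_000, 2_000_000,
         3_000_000, 2_000_000, 2_000_000, 2_000_000]
PERIOD = 3_000_000  # display period in cycles when running at the average rate

for delay in (7_000_000, 6_000_000, 5_000_000, 0):
    misses = missed_deadlines(trace, PERIOD, delay)
    print(f"delay {delay:>9} cycles: {misses}/{len(trace)} items late")
```

A designer who can tolerate a small fraction of late items could pick the smallest delay whose miss count stays within that tolerance, and size the buffer for that shorter delay.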
There have been several efforts using the same framework that we use to estimate the buffer size and processor requirements (Liu et al., 2004; Wandeler et al., 2005), but there has been little effort to use the models to effectively utilize resources such as the processor and memory. Also, there are techniques that have been used with other models as well, but either they do not provide any guarantees on output or these models do not accurately capture the variable characteristics (Nandi and Marculescu, 2001). In the following section, the goal of this thesis, the problem tackled, and the research approach used are discussed.
1.3 Thesis
Having introduced the motivation and title terminology in the previous sections, we are now ready to describe the thesis itself, both its content and […]
In summary, the motivation of this thesis is as follows: there are several opportunities for better design of portable media players; while providing the desired output quality, on-chip resources have to be minimal and therefore effectively utilized; designing media players, given their multitude of constraints and unique needs, can be better handled using analytical frameworks; if the mathematical framework used captures the inherent characteristics of the multimedia application, then it can lead to efficient designs; and flexibility of the analytical model is desired, in particular, in accounting for the soft real-time nature of the application. So, what then is the goal of this thesis?
Goal and Problem: The primary goal of this thesis is to tune certain application parameters, which can act as resource managers, so as to effectively utilize the on-chip resources. In this thesis, we propose insights to shape media application workloads using such design parameters (e.g., playout delay) so as to significantly reduce the on-chip resource requirements. The shaping techniques, the reader understands, exploit the inherent characteristics of the multimedia streams, that is, the variability.
The problem, though, is in predicting the resource requirements with accuracy; if the input data items arrive in variable sizes and executing them takes varying time, then the storage and processing capacity required is variable, too. But, fortunately, these characteristics are predictably variable. Hence we model this variability.
Research Approach: Our methodology, to be precise, is a combination of system simulation and analysis of mathematical models. The inputs to the models are traces obtained from simulation: not a complete simulation of the entire system, but the functional simulation of the individual components in a SoC (such as the processor). For example, an instruction-set processor simulator is used to provide the processor cycles required for executing data items in a multimedia stream.
The models constructed after this one-time simulation provide bounds on the arrived and processed data items over any time interval. These bounds are, for example, the maximum and minimum number of data items that arrive over any time interval of 1 second. The mathematical framework is a calculus (based on an algebra with min and max operators) that, given bounds on arrival and processor capacities as inputs, provides bounds on the output. Thus output constraints such as display rates can be formulated in terms of the input, and the service provided can be formulated in terms of processor capacities. This way of modeling then enables calculation of the minimum service and the maximum storage required.
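To make the idea of interval bounds concrete, the following is a minimal sketch rather than the framework developed in this thesis. It assumes a hypothetical trace given as the number of data items arriving in each fixed-length time slot, and the function names `arrival_curves` and `backlog_bound` are illustrative. The backlog bound is the standard difference between the upper arrival bound and the lower service bound, and here it is evaluated only over the window lengths that were computed.

```python
from itertools import accumulate


def arrival_curves(arrivals_per_slot, max_window):
    """Upper and lower bounds on arrivals in any window of k slots.

    Returns (upper, lower), where upper[k] and lower[k] are the maximum
    and minimum number of items seen in any k consecutive slots of the trace.
    """
    n = len(arrivals_per_slot)
    if not 1 <= max_window <= n:
        raise ValueError("max_window must be between 1 and the trace length")
    prefix = [0] + list(accumulate(arrivals_per_slot))
    upper, lower = [0], [0]
    for k in range(1, max_window + 1):
        # Sum of every window of k consecutive slots, via prefix sums.
        sums = [prefix[i + k] - prefix[i] for i in range(n - k + 1)]
        upper.append(max(sums))
        lower.append(min(sums))
    return upper, lower


def backlog_bound(upper_arrival, lower_service):
    """Largest gap between the most that can arrive and the least that is served."""
    return max(a - s for a, s in zip(upper_arrival, lower_service))


# Hypothetical trace: items arriving in each of eight time slots.
trace = [3, 0, 2, 5, 1, 0, 4, 2]
upper, lower = arrival_curves(trace, max_window=3)
# Assumed service guarantee: at least 3 items processed per slot.
service = [3 * k for k in range(4)]
print(upper, lower, backlog_bound(upper, service))
```

For this trace the sketch gives an upper bound of [0, 5, 7, 8], a lower bound of [0, 0, 1, 5], and a buffer requirement of 2 items for the windows considered.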
Contributions: Listed below are the key contributions of this thesis; they are primarily the insights obtained towards saving on-chip resources.
• the observation that an increase in playout delay decreases the processor cycles required to meet a certain output rate has given opportunities to precisely estimate this delay and save processing resources;
• to reduce the memory requirements in delaying playout, the redistribution has provided significant gains, especially in multi-processor set-ups;
• with a slight increase in buffer space, the schedulability of the multimedia and the periodic task is enhanced;
• a new stochastic framework to model tolerable loss in quality is a significant advancement in terms of reducing on-chip resources;
• finally, a major limitation in the existing model has been removed, namely, in modeling the processor to memory latency.
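As a rough illustration of the first observation, consider the following sketch. It is not the analysis developed in later chapters: it assumes, hypothetically, that all compressed frames are already buffered at time zero, that the processor runs at a constant frequency, and that frame k must be fully decoded when it is displayed at time `playout_delay + k * frame_period`. The function name and the cycle counts are made up for illustration.

```python
from itertools import accumulate


def min_frequency(cycles_per_frame, frame_period, playout_delay):
    """Smallest constant frequency (cycles/second) that meets every display deadline.

    Frame k (0-indexed) is displayed at playout_delay + k * frame_period, and
    all frames up to and including k must have been decoded by then.
    """
    if playout_delay <= 0:
        raise ValueError("playout_delay must be positive")
    cumulative = accumulate(cycles_per_frame)
    return max(
        demand / (playout_delay + k * frame_period)
        for k, demand in enumerate(cumulative)
    )


# Hypothetical per-frame decoding demand in processor cycles (one expensive frame first).
cycles = [9_000_000, 2_000_000, 3_000_000, 8_000_000]
for delay in (0.04, 0.08):
    print(delay, min_frequency(cycles, frame_period=0.04, playout_delay=delay))
```

Under these assumptions, doubling the playout delay from 40 ms to 80 ms roughly halves the required frequency, because the expensive first frame gets twice as long to decode.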
Organization: Following this chapter, the thesis consists of three main parts. Chapter 3, buffering for smoothing, details the technique in which the play-out delay parameter is tuned to reduce the processing requirement of multimedia applications. Then the scheduling part in Chapter 4 explains how different tasks can run concurrently with the multimedia application even when scheduling these tasks would otherwise not be feasible. Finally, the methodology to provide stochastic guarantees that exploit the soft real-time nature of multimedia applications is introduced in Chapter 5.
In this chapter, we established the aim of the thesis, which is the effective utilization of on-chip resources in devices running multimedia applications. Towards achieving this goal, three workload shaping techniques are proposed in this thesis: smoothing, squeezing, and slashing. The insights from each of these techniques, if applied to system design, would potentially save significant resources and provide guarantees on the output quality of the multimedia applications. The mathematical theory behind these shaping techniques accurately captures the stream characteristics, in particular the arrival and execution variability. The resource-saving insights that we observed exploit this variable nature of multimedia streams, and the analytical model that captures this variability tunes appropriate design parameters to implement those insights.
Chapter 2
Background
The previous chapter introduced the main goal of the thesis: to develop workload shaping techniques so as to effectively utilize processing and memory resources on-chip. In achieving this goal, the preceding chapter also stated the research approach, that is, to predict variability in the multimedia workload. It was argued that the variation in the number of data items arriving and the variation in the execution requirement of the data items are the major factors that lead to huge requirements in processing and memory resources. This chapter provides the reader with an understanding of other existing research approaches, which mainly fall into two topics: estimating the on-chip resource requirements for multimedia processing, and proposing ideas to reduce the on-chip resource needs.
The primary aim of this chapter is to understand the existing literature at two levels: (1) a broad perspective on the existing performance modeling techniques for SoC design, and (2) a close look at the methods proposed for scheduling and buffer management for multimedia devices. The former, the broader view, highlights the contribution of this thesis in that the mathematical framework proposed is appropriate for multimedia applications. The latter, a close-up study, shows the effectiveness of deriving resource-saving insights from mathematical models rather than from other techniques such as simulation.
Following this, we first present the broader view on existing mathematical frameworks for SoC design. Second, we zoom in on the techniques for tuning scheduler parameters for effective processor utilization and memory management. Third, we present the fundamentals of the mathematical model proposed in this thesis. In discussing the basics of the model, the MPEG2 application is described. The application details provide information on how multimedia streams are modeled.
2.1 Analytical Model: A Bird’s Eye Review
First, we will look at the initial classification of methods for SoC design and discuss the pros and cons of the approaches. Existing approaches for SoC design can be broadly classified as follows: (1) analytical models, and (2) simulation. The main disadvantage of simulation-based techniques is that they are slow for any design that involves a large number of iterations (for example, designs that involve identifying several design parameters). Moreover, simulation techniques do not provide any special insights that can lead to resource savings. Advantages and disadvantages of simulation and mathematical modeling are summarized in Table 2.1. More importantly, for designs with throughput requirements, cycle-accurate simulation techniques only provide a guarantee on the correctness of the results, but do not provide any guarantee involving the dynamic behavior of the system.
[Figure 2.1: Dimensions of SoC Design. Labels: SoC architecture design; analytical models; event arrivals; throughput guarantees; design methods.]
Now we detail other dimensions of SoC design. Figure 2.1 sketches various existing design methodologies. We discussed earlier the pitfalls of simulation-based techniques; now we note the advantages of the alternative methodology, namely, performance modeling techniques. System-level design using a mathematical framework involves fast exploration of design parameters. This thesis shows that valuable insights can be obtained from such analysis, and that these insights provide significant resource savings. Moreover, guarantees on throughput can be formulated in mathematical frameworks. With this understanding of the benefits of mathematical models, we now look at the second level of classification in performance modeling.
In general, analytical models can be divided into deterministic and probabilistic frameworks. Mostly, deterministic models are for worst-case analysis of the systems. The worst-case analysis is suitable for hard real-time systems.
Table 2.1: Architecture Simulation vs. Mathematical Modeling
 | Architecture Simulation | Mathematical Modeling
 | design parameters (e.g., buffer size) | relative accuracy to simulation (e.g., playout delay)
 | design parameters could be observed, e.g., buffer size vs processor frequency | ideas towards saving significant resources, e.g., reducing buffer size with playout delay
 | simulator could model functionality of the processor and memory | capture the data flow on a given architecture at a desired time granularity
Re-usability | Specific: system-level simulators could | General: mathematical model can be
Limitations | Middle: use of simulator tools early | Early: The mathematical models cannot replace
 | designers to use the same design | analytical modeling and system-level design
For soft real-time systems, however, a probabilistic framework is appropriate. Existing probabilistic mathematical frameworks, though, only analyze average-case or most-probable scenarios. We have seen so far two levels of classification in methods for SoC design: (1) whether they are simulation based or analytical, and (2) within mathematical modeling, whether they are deterministic or probabilistic. Now we look at the third level.
The granularity at which the application is modeled, more precisely whether it is modeled as tasks or as events, is the basis of this classification. Further, event-based techniques can again be divided into those that model standardized events and those that model general events. Here we discuss in detail some of the popular analytical models. The reader will be able to map these models to the classification discussed. Analytical methods that have gained attention are: (1) synchronous data flow graphs, (2) stochastic automata networks, and (3) event adaptation functions.
Synchronous Data Flow graphs (SDF): These models are currently being used in industry to analyze multimedia systems, in particular, to represent data flows in DSP kernels (Stuijk et al., 2006). The primary benefit of this model is that it naturally captures the concurrent behavior of the application in a multi-processor architecture. The type of analysis that is usually done using SDFs is the derivation of static schedules (i.e., at application compile time). Thus the timing behavior of the application is predicted, and the throughput constraints and storage space requirements of a multimedia system can be studied.
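A common first step in deriving a static SDF schedule is to solve the balance equations for the repetition vector, i.e., how often each actor fires in one schedule period. The sketch below is illustrative only and is not taken from the cited work: the actor names and token rates are hypothetical, the graph is assumed to be connected, and `math.lcm` requires Python 3.9 or later.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive integer firing counts that balance every edge.

    Each edge is (src, dst, produced, consumed): src produces `produced`
    tokens per firing and dst consumes `consumed` tokens per firing.
    """
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        # Propagate relative firing rates along edges until no new actor is reached.
        changed = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
    if len(rate) != len(actors):
        raise ValueError("graph is not connected")
    for src, dst, produced, consumed in edges:
        if rate[src] * produced != rate[dst] * consumed:
            raise ValueError("inconsistent rates: no periodic schedule exists")
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}


# Hypothetical three-actor chain: A produces 2 tokens, B consumes 3, and so on.
edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
print(repetition_vector(["A", "B", "C"], edges))
```

For this chain the result is 3 firings of A, 2 of B, and 1 of C per period, which balances both edges (3·2 = 2·3 and 2·1 = 1·2).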
Stochastic Automata Networks (SAN): A SAN is a network of automata (Zamora et al., 2007). For example, an MPSoC could be modeled as a SAN where each automaton describes the state of an application running on a single processor. The edges in the automaton represent the communication. The edges between the nodes in the SAN represent the transition rates. The mathematical theory behind the analysis of SANs uses Markovian models. The problem of state-space explosion is avoided since the transition matrix is neither stored nor generated.
Event Adaptation Functions (EAF): EAFs are used to represent a heterogeneous MPSoC with processing elements and components having different behavior (Richter and Ernst, 2002a). For example, one component (say an ASIC) may generate periodic output, but another component (say a processor) may generate output with period and jitter. The outputs of these two components are inputs to another component. To analyze such systems, EAFs provide functions to couple local event models using buffering and time-triggering, which is how systems with heterogeneous components are designed in practice. Schedulability, timing, and buffer memory requirements can be studied with EAFs.
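The two standard event models mentioned above can be compared through the number of events they admit in a time window. The sketch below uses the formulation commonly found in the compositional analysis literature for a periodic stream with jitter; that this matches the cited work exactly is an assumption, and the function names and integer time units are illustrative.

```python
def eta_plus(dt, period, jitter=0):
    """Maximum number of events in any window of length dt (integer time units)."""
    if dt <= 0:
        return 0
    # Integer ceiling of (dt + jitter) / period.
    return -(-(dt + jitter) // period)


def eta_minus(dt, period, jitter=0):
    """Minimum number of events in any window of length dt (integer time units)."""
    return max(0, (dt - jitter) // period)


# Strictly periodic source versus a source with the same period plus jitter.
print(eta_plus(80, 40), eta_minus(80, 40))
print(eta_plus(80, 40, jitter=10), eta_minus(80, 40, jitter=10))
```

With a period of 40 and a window of 80, the strictly periodic source yields between 2 and 2 events, while a jitter of 10 widens this to between 1 and 3. A component fed by both sources has to be dimensioned for the wider range.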
The mathematical model we propose differs from other existing models in the following ways:
• The arrival of items in our model is not limited to standard input models such as periodic, Poisson, and so on.
• Our model captures the variability in the processing of data items such that this variable nature of the media stream can itself be exploited for efficient system design.
• The analytical framework that we present can be applied at any level of granularity, i.e., each data item in the stream can be a bit, a macroblock, or a frame.
• In contrast to the average-case analysis of the existing probabilistic models, we show that guarantees on the output requirement can still be provided with our stochastic mathematical framework.
• Our framework applies to any kind of multimedia streaming application. In other words, it does not rely on the specific characteristics of the application.
2.2 Tuning Scheduler Parameters
This section takes a closer look at the techniques for scheduling and buffer management in SoC design for multimedia applications. As mentioned before, multimedia applications constitute sizable workloads in today’s hand-held devices such as PDAs, MP3 players and video players. The characteristic feature of media workloads is that they show high data-dependent variability (Hughes et al., 2001; Liu et al., 2004), and this variability makes it difficult to predict the resource demands of multimedia applications. Hence, in multimedia devices, it is a challenging job for the RTOS, which is actually designed for general-purpose real-time use, to gracefully and efficiently allocate resources to various tasks. The job of the RTOS is complicated primarily by: (1) resource conflicts between the real-time media and other non-real-time applications, and (2) unpredictability in the processing resources required by the media workload (Patil and Audsley, 2005).
So, there is a continuing interest in the OS research community in designing application-specific operating systems (Plagemann et al., 2000) such that, in devices with limited processing capacity and buffer space, resources can be allocated with knowledge of the application requirements. This effort in building application-specific operating systems has opened a new research direction in designing embedded systems, RTOS co-design, which deals with exploring methodologies to build a customized RTOS along with the hardware/software elements. In this thesis, we address building one customizable portion of the RTOS, namely, the scheduler of the operating system.
In the real-time systems literature, very few techniques directly address the problem of building a scheduler or choosing scheduler parameters (Maxiaguine et al., 2004), so analytical models for designing schedulers are yet to be extensively looked at. In designing schedulers for embedded systems, application and architecture models should assist system design architects in realizing efficient static/dynamic scheduling techniques. Since designers currently use off-the-shelf RTOSs, they do not use models for evaluating their system or for tuning the design-time parameters. Hence, for designing customized schedulers for RTOSs in multimedia embedded systems, we propose a mathematical framework. We illustrate the model's usability in evaluating scheduling policies and its efficiency in predetermining scheduler parameters, such as the size of the play-out buffer (read by the real-time video/audio output devices).
2.2.1 Methodologies
In this section, we will study some recent work that proposes techniques to model and evaluate RTOSs for embedded systems design. We first classify previous studies based on the level of abstraction used in designing the RTOS with other hardware/software modules in system design. Only a few techniques in the past focus on abstract models of the RTOS, and even those techniques do not fully utilize the potential of formal models in evaluating several parameters of the components (such as schedulers) in the RTOS.
Co-simulation: Gerstlauer et al. (Andreas Gerstlauer, 2003) built an RTOS model on top of the existing system-level design language SpecC. The RTOS model allows the designer to model the dynamic behavior of multi-tasking systems at higher abstraction levels. The main tasks of the RTOS, such as system management, task management, event handling and time modeling, are modeled using special routines, which extensively use existing SpecC primitives and libraries. Using the routines defined for the RTOS interface, the application models (task, synchronization) are appropriately refined. In order to evaluate the complete system design, a co-simulation is performed using refined system-level design models integrated with the RTOS models. Moigne et al. (Moigne et al., 2004) model the RTOS behavior and its timing properties using SystemC primitives. The model reflects the scheduling policy and the preemptive/non-preemptive modes of the RTOS behavior. After simulation, the models report the overhead incurred in making a scheduling decision and in context switching among tasks.
He et al. (He et al., 2005) built a configurable RTOS model on top of the SystemC framework. The main contribution of this work is that a timed simulation can be performed with the configured RTOS. Timing parameters such as the time required to create a task, hardware interrupt latency, software interrupt enable/disable costs, etc., are obtained from OS benchmarks, and this timing information is embedded into the RTOS models as delay annotations. Using these delay annotations, the authors have implemented novel algorithms to predict the next OS-wide event and estimate its time-stamp, and so an accurate timed simulation of the RTOS is achieved. Other similar techniques exist (Chevalier et al., 2004; Yoo et al., 2002).
Performance Models: Madsen et al. (Madsen et al., 2003) proposed a framework to study the effects of running multi-threaded applications on a multiprocessor system-on-chip (SoC) platform with abstract RTOSs. Each task in the application is modeled as a finite state machine comprising states such as idle, ready, running and preempted. The execution order of the tasks (scheduling model), the synchronization among the tasks and the resources requested by the tasks are modeled based on the finite state machine of the task. Thus the modeling framework is composed of independent basic RTOS service models, namely the scheduling, synchronization and resource allocation models.
Paul et al. (Paul et al., 2003) introduce an approach to designing schedulers called model-based scheduling for programmable heterogeneous multiprocessors (PHM). The custom schedulers in a PHM, unlike the schedulers in a traditional OS, dynamically decide on the next thread to be executed or the next packet to be sent based on the way the applications will execute on the underlying hardware. Also, some functionality of the schedulers in a PHM is derived statically from design-time models, which are developed together with other design elements in the PHM. Model-based scheduling allows customization of schedulers to make dynamic, data-dependent scheduling decisions, which in turn leads to optimized PHM performance.
Implementation Techniques: Unlike performance models, there are numerous research efforts that propose OS software generation, synthesis and RTOS implementation techniques (Desmet et al., 2000; Gauthier et al., 2001; Cho et al., 2005; Patil and Audsley, 2005). Patil and Audsley implemented an application-specific OS using the reflection mechanism (Patil and Audsley, 2005). In particular, they proposed a reflective scheduler, which modifies the current schedule of the tasks based on the reified data (data that the application sends to the scheduler). Cho et al. (Cho et al., 2005) implement a static scheduler for multi-processor SoCs, which is implemented with a predefined schedule of tasks/communications; the scheduler is used to explore the trade-offs between a centralized and a distributed scheduling mechanism. The methodology to automate the construction of an OS specific to an embedded system's software, and the automatic targeting of the software to the OS, is detailed by Gauthier et al. (Gauthier et al., 2001).
We now discuss the practical utility of our technique and briefly list some of its merits. We also discuss some limitations of our technique and highlight possible directions for future work. Currently, SoC designers evaluate their designs by directly building a generic SoC simulation model, such as transaction-level models using SystemC (Rutten et al., 2002; Dutta et al., 2001). They then reuse this simulation template in all their future designs. The main drawbacks of this approach are: (1) it is inefficient to simulate all possible designs using the template, and (2) the template framework is not flexible enough to try out new and different designs, and designers are reluctant to build a new and different template. High-level models of some components of an RTOS and other hardware/software elements would ease complex SoC design and lead to efficient, high-performance designs. We present one such model in this thesis.
Our technique can also be viewed as a preceding step to detailed simulation. Hence, our methodology could provide inputs (design parameters) to simulation templates, which designers typically use during the design phase. In particular, the RTOS in embedded devices should exploit the characteristics of the application, which leads us to the topic of RTOS co-design. In this context of designing an application-specific RTOS, previous work has focused mostly on software generation, whereas in this work we present high-level models to evaluate parameters of certain components, such as schedulers, in the RTOS. The inherent limitation of any such high-level model is its abstract characterization of the application and of the behavior of the system under design. Hence it is essential to strike a balance between providing models useful for the actual design of the system and formulating modeling frameworks with less detail compared to their simulation counterparts.
As mentioned earlier, high-level models for components of an RTOS have not been extensively studied. So, we want to follow this line of research and build a complete model suite for all possible components of an RTOS, such as a resource allocator, a synchronizer, etc. Some previous work (Madsen et al., 2003) has followed a similar methodology, but it has tightly coupled the task models to the RTOS component models. The task models are very generic and hence do not exploit characteristic features of the application. Some studies have completely modeled the RTOS behavior as a state machine (He et al., 2005).
We intend to preserve the system model and the stream models that we present and hope to mathematically evaluate parameters corresponding to other components of the RTOS, such as the synchronizer and allocator. Our immediate goal, however, is to construct models for evaluating application-specific hierarchical schedulers for heterogeneous SoCs. It would also be interesting to study, using high-level models, the trade-off between centralized and distributed customized schedulers for complex MPSoCs. In this section, we identified where this thesis work fits in. Now, with this overall picture, we detail the real-time calculus system model so as to illustrate its advantages over other existing methods.
2.3 Preliminaries
In this section, we first provide details on the mathematical framework and the multimedia application used in our case studies, namely, MPEG2 decoding. Second, the SoC architecture, with components such as the processor and memory, is presented. The mapping of the multimedia application onto this architecture is also discussed. These details help in understanding how the mathematical framework is constructed for the SoC application/architecture model. For example, modeling the arrival of multimedia stream objects at the input buffer is shown.
MPEG2 Decoding: In this section, we discuss the following topics on the MPEG2 application: the reasons for choosing MPEG2 decoding as the multimedia application, and the processing and memory requirements for a multimedia stream. The primary reasons that we chose MPEG2 decoding are as follows: (1) the reference implementation for MPEG2 decoding was widely available, (2) the MPEG2 reference implementations were also optimized for speed and for multimedia architectures (such as MMX), and (3) the application partitioning and mapping to the hardware architecture is well studied in the literature.
Now we describe some of the application details. The multimedia stream we consider for analysis constitutes a sequence of data items. These items are also known as stream objects. They could be, for example, macroblocks, frames, pictures, and so on. The granularity of the multimedia stream that we use in our analysis is a macroblock. A macroblock consists of bits, and each macroblock can consist of a different number of bits. For example, in a video decoding application, the input stream is a compressed sequence of macroblocks, and after processing it is a sequence of decompressed macroblocks. The number of macroblocks varies for each frame in a compressed video stream, and the number of macroblocks is constant for each frame in a decompressed video stream. The required output rate is usually specified in terms of frames to be decompressed per unit time.
As mentioned earlier, the processing and memory requirements are high