DESIGN AND MANAGEMENT OF SCALABLE FPGA ARCHITECTURES


DESIGN AND MANAGEMENT OF SCALABLE FPGA

ARCHITECTURES

RIZWAN SYED

(M.Sc Elect & Comp Engg, NUS)

A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

NATIONAL UNIVERSITY OF SINGAPORE

2013


I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis.

This thesis has also not been submitted for any degree in any university previously.

RIZWAN SYED

DECEMBER 02, 2013


First of all, I would like to thank Dr. Yajun Ha, my supervisor, for mentoring me all through the journey of my PhD. He has always motivated me throughout my research and has constantly made me think of how I can improve my ideas and apply them in a more practical way. His eye for detail helped me maintain a high quality in my research. Despite being a very busy person, he always ensured that we had enough time for regular discussions. Whenever I needed something done urgently, whether it was feedback on a draft or filling in some form, he always gave it the utmost priority. He often worked on holidays and weekends to give me feedback on my work in time.

Further, I would like to thank A/P Bharadwaj Veeravalli for all the support and supervision during all the years of my PhD. He was always an inspiration for me and always gave me useful insight into research methodology, and critical comments on my publications throughout my PhD project. I was very fortunate to have two supervisors who were both very hard working and motivating.


My thanks also extend to all my friends, most notably Xiaolei Chen, Zhao Wenfeng, Yu Heng, and Rajesh Panicker, for all the support, the nice discussions and the constructive feedback that I got from them.

I would like to thank my family and friends for their interest in my project and the much needed support. I would especially like to thank my parents, without whom I would not have been able to achieve this result. Last but not least, I would like to thank my wife for her ongoing support and for enduring all the troubles during my PhD.

Rizwan Syed


Field Programmable Gate Arrays (FPGAs) have been demonstrated in various applications to deliver significant performance speed-ups as compared to software-centric platforms, and in some cases with significant improvement in efficiency as well. Although FPGAs are slower and less efficient than Application-Specific Integrated Circuits (ASICs), this disadvantage is offset by their in-the-field programmability, low non-recurring engineering costs and fast turnaround times. However, designing larger FPGAs is getting more difficult even with the ongoing advancement in semiconductor manufacturing technology. Global wire delay has emerged as a problem for both ASICs and FPGAs. FPGAs conventionally implement global interconnects using long wires and switch blocks, which make them slow and inefficient. Another issue is clock skew in the distribution of the global clock for larger FPGAs. Furthermore, the increasing area overheads due to the larger programmable routing interconnect required to support increasing computation resources make scalability an issue for existing FPGA architectures. Another issue is decreasing yields with increasing FPGA die sizes. This adversely affects the manufacturing cost of the device.

On the other hand, as the resources in FPGAs increase, the runtime of the implementation tools also increases. This can be mitigated by adopting a hierarchical design methodology and reusing implemented sub-components. The latter requires only the new components to be implemented and thus takes less time. However, this would require runtime resource management of the FPGA. Existing tools that do support resource management of FPGAs impose various constraints and overheads that make them less useful in practice. Furthermore, to improve the reliability of the system, such a resource manager should be self-aware by measuring physical quantities such as the temperature profile of the FPGA die. This would allow the manager to perform thermal management by adjusting the performance of running tasks.

Envisioning that reconfigurable computing will be a part of mainstream computing, this thesis presents a scalable FPGA architecture to mitigate the issues with scalability of single FPGAs and multi-FPGA systems. Furthermore, it proposes an abstract architecture for resource management of the scalable FPGA architecture, supporting the abstraction of both computing and communication resources. This thesis makes the following contributions.

First, to solve the issues with the scalability of FPGAs, we propose a novel scalable FPGA architecture and its design methodology. This architecture allows us to model both single FPGAs and multi-FPGA systems under a single architecture. In this architecture, we partition an FPGA into multiple tiles. We assume a tile to be a current-generation FPGA. These tiles use a hierarchical network for routing between them and thus separate local routing from global routing. Due to the use of a hierarchical network, the architecture can be easily scaled just by the addition of more tiles and higher levels.

Second, to enable efficient resource management on the scalable FPGA architecture, we initially develop a low-overhead framework that supports important features such as dynamically sized reconfigurable regions, abstraction of communication among hardware applications, clock network management and in-circuit debugging for hardware applications. This allows us to abstract the computational and communication resources of our scalable FPGA architecture.

Third, to extend our framework to multi-FPGA systems, we develop a scalable AXI interconnect for existing FPGAs. The scalable AXI interconnect uses high-speed serial links for inter-FPGA communication. Packet switching is used to deliver packets from the source FPGA to the destination FPGA using one or more hops. The interconnect has low area and transport overheads while offering high bandwidth and low latency as compared to other interconnects.

Finally, we implement a LUT-based temperature sensor that has a resolution of 0.5°C and an accuracy of ±0.5°C using two-point calibration. It utilizes 52% fewer resources than the state-of-the-art. We use an array of such sensors to monitor the thermal profile of the FPGA die and then use it to adjust the performance of running applications as required through dynamic frequency scaling.


1 Trends and Challenges in Scalable Reconfigurable Computing
1.1 Trends in Scalability
1.2 Trends in Productivity
1.3 Trends in Reliability
1.4 Key Challenges in Reconfigurable Computing
1.4.1 Scalability of FPGAs
1.4.2 Efficient Resource Management
1.4.3 Scalable Interconnect
1.4.4 Self-Awareness
1.5 Scope of the Thesis
1.6 Key Contributions and Thesis Overview


2.1 FPGA Architecture
2.1.1 Design Flow
2.2 Runtime Resource Management of FPGA
2.3 Related Work
2.4 Closing Remarks
3 Scalable GALS FPGA Architecture
3.1 Problem Definition
3.2 sFPGA2 Architecture
3.2.1 Tiles
3.2.2 Switch
3.3 Design Methodology for sFPGA2
3.3.1 Routing Path Generation
3.4 Case Study
3.4.1 JPEG Encoder
3.4.2 Target sFPGA2 Architecture
3.4.3 Design Flow
3.5 Experimental Results
3.5.1 Setup
3.5.2 Results
3.5.3 Discussion
4 Abstract Architecture for FPGA Resource Management
4.1 AARP Architecture
4.1.1 Reconfigurable Regions
4.1.2 Network-on-Chip
4.1.3 Shared Resources
4.2 Design Methodology for AARP
4.3.1 Setup
4.3.2 Results
4.3.3 Discussion


5.1 Scalable AXI Interconnect
5.1.1 Stream End-Point
5.1.2 Circuit-Switched Network
5.1.3 Arbiter
5.1.4 Packet-Switched Router
5.1.5 Links
5.2 Routing Methodology
5.2.1 Abstraction
5.3 Extending AARP
5.3.1 Smaller Reconfigurable Region
5.3.2 Scalable AXI Interconnect
5.3.3 Extensible Message Bus
5.4 Abstract Model for AARP
5.5.1 Setup
5.5.2 Results
5.5.3 Discussion
6 Thermal Management for Reconfigurable Platforms
6.1 LUT-Based Temperature Sensor
6.1.1 Main Sensor
6.1.2 Reading Circuit
6.1.3 Calibration Circuit
6.2 Calibration Methodology
6.3 Case Study
6.3.1 Sensor Array
6.3.2 FFT Cores
6.4.1 Setup
6.4.2 Results
6.4.3 Discussion


List of Figures

1.1 Basic Structure of FPGA
1.2 Progress in Silicon Manufacturing Technology [1]
1.3 Scaling in Virtex and Stratix FPGAs w.r.t. Technology Node
1.4 The Design Productivity Gap [2]
1.5 Factors Affecting Device Reliability
2.1 Architectures of different programmable devices
2.2 Simplified architecture of a modern-day FPGA
2.3 Generic Design Flow
2.4 Configuration Architectures of FPGAs
3.2 Proposed sFPGA2 Implementation
3.3 sFPGA2 Architecture Block Diagram
3.4 IO Transceiver Design
3.5 Terminal Switch Block Diagram
3.6 Routing Switch Block Diagram
3.7 Design flow for sFPGA2
3.8 Routing Examples
3.9 Internal details of the JPEG Encoder
3.10 Design partitioning for JPEG encoder


3.11 Block Diagram of Emulation Prototype
3.12 Illustration for Results
4.1 AARP Block Diagram
4.2 Block Diagram of a task mapped to Reconfigurable Regions
4.3 Block Diagram of the Network-on-Chip
4.4 Hardware Application Design Flow for AARP
4.5 Floorplan of Implementation
4.6 NoC Performance Overhead
5.1 Overview of Commercial Multi-FPGA Systems
5.2 Block Diagram of Scalable AXI Interconnect
5.3 Frame Format
5.4 Block Diagram of Circuit-Switched Network
5.5 Block Diagram of Arbiter
5.6 Block Diagram of Packet-Switched Router
5.7 Block Diagram of a Link
5.9 Block Diagram of AARP with the Scalable AXI Interconnect
5.10 Block Diagram of CS Network in AARP
5.11 Illustration to show abstraction in AARP
5.12 Benefit of Abstraction
5.13 Network Topology of Implementation
6.1 Block Diagram of Proposed Temperature Sensor
6.2 Retriggerable Ring Oscillator
6.3 Timing Diagram
6.4 Layout of Sensors and FFT Cores
6.5 Period of the Ring Oscillator Period in Sensors
6.6 Variation in period of the Ring Oscillators w.r.t. location


List of Tables

3.1 Device utilization summary of our JPEG encoder
3.2 Results and Projected Values
4.1 Implementation summary
5.1 Conversion table for Control Codes and 8b/10b Codes
5.2 Implementation summary
5.3 On-chip NoC Comparison
5.4 Off-chip Network Comparison
6.1 Comparative Study


1 TRENDS AND CHALLENGES IN SCALABLE RECONFIGURABLE COMPUTING

[...] in seconds and can be reconfigured if there is a mistake in the design. The cost of development may vary from a few dollars to a few thousand dollars. However, this flexibility in FPGAs comes at a significant cost in area, speed and power consumption. Typically, FPGAs require around 20 to 35 times more area than standard cell ASICs and have a performance penalty of roughly 3 to 4 times while consuming approximately 10 times more dynamic power [3]. This is largely due to the programmable routing fabric, which trades area, speed, and power in return for instant fabrication capability. Despite these disadvantages, FPGAs present a compelling substitute for digital system implementation based on their fast turnaround and low volume cost. Furthermore, an FPGA allows a design to be improved or modified while it is deployed in a product. This may be required to remove a bug or perhaps to support new features.

Fig. 1.1: Basic Structure of FPGA (labelled elements: Logic Block, Routing Interconnect, I/O Cells)

The first commercially viable field programmable gate array (FPGA), the XC2064, was invented by Xilinx® [...] available by 1991. It had 64 logic cells arranged in an 8×8 matrix and provided 58 user inputs/outputs. At that time programmable devices were mainly used for glue logic. As Moore's law progressed, FPGAs evolved from simple regular structures to complex heterogeneous devices containing fixed-function hardware as well as multi-core CPUs. For example, the largest device (Virtex™ 7 XC7V2000T) from Xilinx® [...] DSP blocks [5]. In addition to large devices, Xilinx® [...] device, the Zynq™-7000 [6], with a 1.0 GHz dual-core ARM® Cortex™-A9 MPCore processor embedded within the FPGA's logic fabric. The device has up to 444K logic cells and 2,020 DSP blocks. However, keeping up with Moore's law [7] can be a daunting task. Designing larger and larger devices is getting more and more difficult. This is mainly due to two reasons: global wire delay and the increasing area overhead of routing interconnects. Even in ASICs and modern-day processors, global wire delay is a significant issue, and it generally takes more than one clock cycle to deliver a signal across the die. In the case of FPGAs, the delays are even more severe due to the slow and inefficient routing fabric. Another issue is clock distribution, because of skew. Yet another issue is decreasing yields. Although the physical layout of FPGAs is fairly regular, it is hard to fabricate larger FPGAs due to the decreasing yields associated with larger die sizes. In other words, the current FPGA architecture is not scalable.

As the logic density of FPGAs increases along with the inclusion of dedicated fixed-function blocks and processor cores, the role of FPGAs has shifted from implementing glue logic and prototyping towards implementing complete reconfigurable systems-on-chip and hardware accelerators. FPGAs have been shown to perform better and/or more efficiently than GPUs and multi-CPU systems in many cases [8–12]. However, coding in a hardware description language (HDL) is known to be more difficult than either GPU programming or multi-core CPU programming [13]. This difficulty can be reduced to a great extent by high-level synthesis, for example from OpenCL [14–16]. However, as the size of these devices increases, the time required by FPGA implementation tools also increases dramatically. A solution to this issue is to develop libraries of implemented sub-modules that can be placed and connected within the FPGA as required by large applications. In this case, only a subset of the design would be implemented and the rest would be imported from the library, thereby reducing the runtime of the implementation tools. This, however, requires a supporting framework to manage the resources of the FPGA. Such a framework can also be used to manage the resources at runtime. However, existing tools that support runtime resource management of FPGAs are very restrictive and inefficient.

In this thesis, we propose a novel scalable architecture for FPGAs. This architecture addresses both issues related to the scalability of FPGAs. In addition, it allows us to model single- and multi-FPGA systems under a single framework. In this architecture, we partition an FPGA into multiple tiles. We assume a tile to be a current-generation FPGA. These tiles use a hierarchical network for routing between them, and thus we separate local routing from global routing. Due to the use of a hierarchical network, the architecture can be easily scaled just by the addition of more tiles and higher levels. The largest FPGAs from Xilinx in the Virtex 7 family use 3D fabrication to connect multiple FPGA die slices using a silicon interposer [17] and can be considered an instance of this architecture.
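To make the scaling argument concrete, the following minimal sketch models the hierarchical network as a balanced tree whose leaves are tiles. The tree shape, the `arity` parameter and all function names are assumptions made for illustration only; the actual switch organization is defined in Chapter 3.

```python
def switch_levels(tile_a: int, tile_b: int, arity: int) -> int:
    """Number of hierarchy levels a route must climb to connect two tiles.

    Hypothetical model: tiles are the leaves of a balanced tree in which
    every switch has `arity` children. A result of 0 means both indices
    name the same tile, so only local (intra-tile) routing is involved.
    """
    if arity < 2:
        raise ValueError("arity must be at least 2")
    level = 0
    while tile_a != tile_b:
        # Move both endpoints up to their parent switch.
        tile_a //= arity
        tile_b //= arity
        level += 1
    return level


def switches_on_path(tile_a: int, tile_b: int, arity: int) -> int:
    """Switches traversed: climb to the common ancestor, then descend."""
    level = switch_levels(tile_a, tile_b, arity)
    return 0 if level == 0 else 2 * level - 1


def levels_needed(num_tiles: int, arity: int) -> int:
    """Smallest number of switch levels that can connect `num_tiles` tiles."""
    levels, capacity = 0, 1
    while capacity < num_tiles:
        capacity *= arity
        levels += 1
    return levels
```

Under this model, adding tiles beyond the capacity of the current tree adds one more level, so the number of levels grows logarithmically with the number of tiles.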

We also propose an efficient framework for managing the resources of our scalable FPGA architecture at runtime. It abstracts both the computation and communication resources of the FPGA. Additionally, we provide support for features like clock management and debugging of hardware tasks. The framework will enable multiple applications to run on the FPGA without requiring reimplementation of the complete system. We then improve this framework with a scalable interconnect to support multi-FPGA systems while maintaining the same abstract high-level architecture of the system. Lastly, we also add support for real-time thermal management of the complete system by developing a low-overhead LUT-based temperature sensor and controlling the power dissipation of running tasks using dynamic frequency scaling.

Fig. 1.2: Progress in Silicon Manufacturing Technology [1]

This chapter is organized as follows. In Section 1.1, we take a closer look at the trends in FPGA architectures from the perspective of scalability. In Section 1.2, we look at the trends in productivity. In Section 1.3, we look at the trends in reliability. Section 1.4 summarizes the key challenges that remain to be solved as seen from these trends. Section 1.5 explains the overall scope of this thesis. Section 1.6 lists the key contributions that led to this thesis, and their organization.

Fig. 1.3: Scaling in Virtex and Stratix FPGAs w.r.t. Technology Node (plotted series: maximum BRAM, DSP, FF and LC for the Stratix and Virtex families)

1.1 Trends in Scalability

[...] density in a chip with a constant chip size. We find a similar trend in FPGAs as well, and FPGA vendors are doubling the resources with each technology node. Fig. 1.3 demonstrates the evolution of the Virtex® [...] to Virtex-7) [5, 18–21] and the Stratix® [...] respect to technology node. Furthermore, with each generation, FPGAs are getting faster and more heterogeneous. They are now equipped with better and faster memory controllers and high-speed transceivers, while some FPGAs even have complete processing sub-systems with multi-core CPUs. This narrows the gap between the performance and efficiency of FPGAs as compared to ASICs [3]. However, scaling of FPGAs is not possible without increasing the routing resources of the FPGA. The increase in switching requirement is asymptotically bounded below by Eq. 1.1 [23] and is superlinear in the number of logic resources.

1.2 Trends in Productivity

Managing the growing circuit complexity is one of the greatest challenges in modern IC design. This is known as the so-called productivity gap [2] and is shown in Fig. 1.4. The productivity gap is a comparison between the manufacturing complexity (i.e. the number of transistors that we can manufacture) and the designer productivity (i.e. the number of transistors that we can design). Although this trend is for ASICs, we find a similar trend in the design of circuits on FPGAs. The complexity of designing digital systems can be reduced by moving towards higher abstraction levels [24], i.e. from gate level to register transfer level (RTL), algorithmic level, system level, etc. Furthermore, efficient design space exploration methodologies can also improve productivity, as shown by Xydis et al. [24]. However, this does not solve the issue of the increasing runtime of implementation tools due to increasing design complexity. This can be mitigated to some extent by reusing sub-components in the design and implementing only the new sub-components, as discussed by Jantsch et al. [25]. In the case of FPGAs, this is possible by using hierarchical design methodologies and flows supporting partitions [26]. However, in current design tools, design reuse is limited to partitions that have been implemented in that design; they cannot be exported to other designs. A solution to this issue is dynamic partial reconfiguration¹. Again, tools that do support partial reconfiguration, either academic or commercial, lack important features for hardware applications, such as dynamically sized regions, abstract communication, multiple clock support and in-circuit debugging, and also impose constraints on hardware designs, making these solutions less useful in practice.

¹ Partial Reconfiguration is the ability of an FPGA to dynamically modify blocks of logic by downloading partial bit files (programming files) while the remaining logic continues to operate without interruption.

Fig. 1.4: The Design Productivity Gap [2] (plotted series: growth in transistors per chip under Moore's law, 58% per year; growth in design productivity, 21% per year)

Fig. 1.5: Factors Affecting Device Reliability; a) Failure Rate vs. Time, b) Failure Rate as a Function of Junction Temperature [27]
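The two growth rates annotated in Fig. 1.4 (58% per year for transistors per chip and 21% per year for design productivity) imply that the gap itself compounds. The short sketch below works through that arithmetic. It assumes both rates stay constant over time, and the function names are illustrative.

```python
import math


def productivity_gap(years: float,
                     transistor_growth: float = 0.58,
                     productivity_growth: float = 0.21) -> float:
    """Ratio of manufacturable to designable transistors after `years`.

    Assumes both quantities start equal and grow at constant annual rates.
    """
    yearly_ratio = (1.0 + transistor_growth) / (1.0 + productivity_growth)
    return yearly_ratio ** years


def years_until_gap(target_ratio: float,
                    transistor_growth: float = 0.58,
                    productivity_growth: float = 0.21) -> float:
    """Years needed for the gap to widen to `target_ratio`."""
    yearly_ratio = (1.0 + transistor_growth) / (1.0 + productivity_growth)
    return math.log(target_ratio) / math.log(yearly_ratio)
```

With these rates the gap widens by a factor of 1.58/1.21, or about 1.31, every year, so under this constant-rate assumption it takes between eight and nine years to reach a tenfold gap.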

1.3 Trends in Reliability

The failure rate as a function of time of all system hardware, including integrated circuits, conforms to Fig. 1.5(a), also known as the bathtub curve [27]. There are various factors affecting the failure rate, which include pressure, mechanical stress, thermal cycling, and electrical stress. However, the die temperature of the device during its useful life plays an important role in triggering the onset of wearout. Continued technology scaling imposes an ever-increasing temperature stress on digital circuit design due to transistor density. As chips get smaller and denser, dissipating power becomes more difficult [28]. It is becoming increasingly important to be able to monitor as well as manage the temperature of a device in order to ensure optimum performance as well as longevity [29]. The power dissipated in an integrated circuit (IC) is a function of the switching activity. The switching activity at different points in an IC can be different. If the activity in some region is very high, then that area will heat up more than others. These localized hot areas are called hotspots. Studies have shown that an increase in junction temperature of 30°C decreases the mean time between failures (MTBF) by a factor of 10 [27], as shown in Fig. 1.5(b). In the case of FPGAs, the authors observed that under high power dissipation, the FPGA would stop operating. Typically, conservative designs or cooling solutions are used to solve this issue, but there may be significant cost and performance implications due to such solutions. Therefore, these reconfigurable platforms must be self-aware² and must be able to adjust their performance to stay within a thermal envelope.
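The cited rule of thumb (a tenfold drop in MTBF for every 30°C rise in junction temperature) can be turned into a simple scaling function. The sketch below assumes the relation is exponential between and beyond the cited points, which is an extrapolation made here for illustration and not a claim from [27].

```python
def mtbf_scale(delta_t_celsius: float, decade_step: float = 30.0) -> float:
    """Relative MTBF after the junction temperature changes by delta_t.

    Assumption: every `decade_step` degrees of heating divides MTBF by 10,
    and the same exponential law holds for intermediate values and cooling.
    """
    if decade_step <= 0:
        raise ValueError("decade_step must be positive")
    return 10.0 ** (-delta_t_celsius / decade_step)
```

For example, `mtbf_scale(30.0)` gives one tenth of the original MTBF, while a negative argument models cooling and returns a factor greater than one.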

1.4 Key Challenges in Reconfigurable Computing

The trends outlined in the previous sections indicate the issues in the scalability, productivity and reliability of FPGAs. The existing FPGA architecture does not scale well due to increasing area overheads. On the other hand, the implementation tools for FPGAs fall short in terms of productivity. They are unable to keep up with the increasing density of FPGAs. Implementation tools that support the reuse of implemented sub-components do not allow this across different designs, and tools using runtime resource management that do allow reuse across different designs are less practical. Furthermore, the lack of self-awareness in reconfigurable platforms leads to reliability issues caused by the long-term effects of high temperature resulting from high power dissipation in localized regions. In short, the following are the major challenges that remain in scalable reconfigurable computing and are addressed in this thesis:

• Scalability in FPGA Architecture: Design an FPGA architecture along with its interconnect to allow scalability without significant performance and area overheads (Scalability in FPGAs, Scalable Interconnect)

• Efficiency in Resource Management: Design and develop a framework to allow reuse of implemented sub-components across designs efficiently so as to reduce design and implementation times (Efficient Resource Management)

• Abstraction in Resource Management: Reduce the complexity associated with resource management in order to improve productivity (Efficient Resource Management, Scalable Interconnect)

² Self-Awareness is the capability of a system to adapt its behavior and resources repeatedly based on changing environmental conditions and demands. This allows such systems to automatically find the best way to accomplish a given goal with the resources at hand.

• Flexibility in Resource Management: Support features required by real hardware design without imposing constraints that impact performance as well as productivity (Efficient Resource Management)

• Self-Awareness in Resource Management: Design and develop a thermal monitoring solution to adjust the runtime power dissipation of running hardware applications (Efficient Resource Management, Self-Awareness)

1.4.1 Scalability of FPGAs

In this thesis, we present a scalable FPGA architecture to overcome the issues related to scalability in reconfigurable architectures. We introduce a hierarchy in routing. Unlike the Altera FPGAs Flex10K [30], Apex [31], and Apex II [31], we use a hybrid interconnect that is composed of an island-style routing interconnect locally and a hierarchical interconnect globally. Furthermore, we improve global connectivity by multiplexing multiple wires onto a single wire using TDMA³. Lastly, we use circuit switching instead of packet switching in the hierarchical interconnect to avoid the area and performance overheads associated with packet switching⁴. The island-style routing regions, called tiles, in our architecture are similar to existing FPGAs and are connected with each other using our hierarchical global interconnect. Our architecture uses the globally asynchronous locally synchronous (GALS) methodology. Inside a tile, all the signals are assumed to be synchronous using local clocks, while communications among different tiles are asynchronous. This reduces the impact of delays associated with global routing. The proposed architecture not only solves the issue related to global wire delay but also solves the issue of reduction in yields with increasing die sizes. In our architecture, each tile can essentially be fabricated as a separate piece of silicon and integrated together on a larger substrate using either 2.5D or 3D-IC fabrication. As proof of this concept, Virtex 7 [5] uses a similar methodology, called stacked silicon interconnect [17], to integrate multiple dies onto a single substrate. If the multiplexing factor in our architecture is assumed to be 1 (i.e. no multiplexing), then the result is the silicon interposer of Virtex 7. Thus, Virtex 7 can be considered an instance of our architecture.

³ Time division multiple access (TDMA) is a channel access method for shared channels. It allows several users to share the same channel by dividing the access of the channel into different time slots.

⁴ Packet switching typically allows better bandwidth utilization because it allows reuse of the same channel. Packet switching can also be more reliable. However, circuit switching is usually better in cases where the cost of not using channels is low. Also, packet-switched routers are typically bigger than circuit-switched routers for the same required performance.
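The idea of multiplexing several logical wires onto one physical wire with TDMA can be illustrated in software. The following is a behavioural sketch only, assuming fixed round-robin slot assignment and equal-length sample streams. The function names are hypothetical and it does not model the thesis's hardware implementation.

```python
from typing import List, Sequence


def tdma_multiplex(wires: Sequence[Sequence[int]]) -> List[int]:
    """Interleave samples of N logical wires onto one physical wire.

    In every frame, time slot k carries the current sample of wire k.
    """
    if not wires:
        return []
    length = len(wires[0])
    if any(len(w) != length for w in wires):
        raise ValueError("all logical wires must carry the same number of samples")
    return [wires[k][t] for t in range(length) for k in range(len(wires))]


def tdma_demultiplex(stream: Sequence[int], factor: int) -> List[List[int]]:
    """Recover the N logical wires from the shared physical wire."""
    if factor < 1 or len(stream) % factor != 0:
        raise ValueError("stream length must be a multiple of the factor")
    return [list(stream[k::factor]) for k in range(factor)]
```

With a multiplexing factor of 1 the stream is passed through unchanged, which matches the no-multiplexing case described above. With a factor of N, the physical wire has to carry N samples for every sample of a logical wire, so either the physical wire runs N times faster or each logical wire runs N times slower.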

1.4.2 Efficient Resource Management

As mentioned in earlier sections, more resources will lead to longer implementation runtimes if we do not reuse sub-components across designs. Therefore, to solve this issue, we propose a low-overhead resource management framework for FPGAs. This will allow sub-components to be reused as partial programming files, called bitstreams, and our infrastructure will connect these sub-components as required by the overall hardware application. The area allocated to each sub-component depends upon its size and is not fixed. In other words, our resource management framework abstracts both the computational and communication resources of the FPGA. Furthermore, we support multiple clocks for each sub-component and also support in-circuit debugging. Lastly, our framework does not impose any constraints on the design of hardware applications. Due to its low area and performance overheads, as well as the reasons mentioned earlier, it is a good solution to the issues associated with the long runtimes of implementation tools.

1.4.3 Scalable Interconnect

Our scalable FPGA architecture presented earlier solves the issue of scalability for single-FPGA systems. However, for multi-FPGA systems, the solution presented earlier is too expensive in terms of the required off-chip infrastructure. This is because the hierarchical interconnect requires a large number of switching units. To solve this issue, we develop a scalable off-chip interconnect to supplement our hierarchical on-chip interconnect. We maintain the abstraction of computation and communication resources in the multi-FPGA system. This makes the complete solution uniform and thus allows resource managers to manage multi-FPGA resources in the same way as they would for a single-FPGA system.

1.4.4 Self-Awareness

[...] to be self-aware and improve the reliability of the complete system.

1.5 Scope of the Thesis

Key challenges highlighted earlier can be summarized into three categories: (1) Scalability, (2) Productivity, and (3) Reliability. For an engineering solution to be practical and usable, we need to address the major issues in all three categories. This will make the solution acceptable in a holistic way. As mentioned earlier, existing solutions in reconfigurable computing with respect to scalability, productivity and reliability address individual issues and do not take into account the global constraints. Therefore, we approach these issues in a holistic way by addressing the major issues in all three categories, so that our solutions are all the more useful for engineering design. We therefore limit our focus to only the major issues in the scalability, productivity and reliability of reconfigurable systems, as highlighted in the subsequent paragraphs.

We will first analyze the difficulties in the scalability of FPGAs and then attempt to solve them by designing a scalable FPGA architecture. We will limit ourselves to the development and study of an emulation platform as well as the estimation of the performance and overheads of an actual implementation of an FPGA based on the proposed architecture. However, as the actual implementation of the FPGA is a huge engineering endeavor, it is beyond the scope of this thesis.


Second, we will study the issues with productivity in large reconfigurable systems. As mentioned earlier, reuse of implemented sub-components can improve productivity. To enable reuse, one solution is to develop a runtime resource management framework. In this thesis, we will limit ourselves to the design and development of a practical resource management framework for scalable reconfigurable systems and the evaluation of its performance and overheads. However, the study of other methods to improve productivity, like better algorithms and design methodologies, is beyond the scope of this thesis.

Third, to improve the reliability of reconfigurable systems, we will study existing works and understand the difficulties in real-time thermal monitoring of existing FPGAs. Then, we will design and develop a low-overhead temperature sensor using the resources within existing FPGAs. This will enable us to monitor the temperature profile of the die of existing FPGAs and use this information to control the regional power dissipation of the FPGA die. However, the study of other effects on the reliability of FPGAs is beyond the scope of this thesis.
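The thermal control loop described in the third point, reading a per-region temperature and adjusting regional power dissipation through frequency scaling, might be sketched as a simple hysteresis controller. The thresholds, region names and frequency table below are hypothetical values chosen for illustration, and the policy is a generic one, not necessarily the one developed in Chapter 6.

```python
from typing import Dict, Sequence


def next_frequency_index(temp_c: float, current: int, num_levels: int,
                         hot_c: float = 85.0, cool_c: float = 75.0) -> int:
    """Pick the next clock level for one region (index 0 is the slowest).

    Step down one level when the region is at or above `hot_c`, step up
    one level when it is at or below `cool_c`, otherwise hold. The gap
    between the two thresholds avoids oscillating between levels.
    """
    if temp_c >= hot_c and current > 0:
        return current - 1
    if temp_c <= cool_c and current < num_levels - 1:
        return current + 1
    return current


def adjust_regions(temps_c: Dict[str, float], indices: Dict[str, int],
                   levels_mhz: Sequence[float]) -> Dict[str, int]:
    """Apply the policy to every region monitored by the sensor array."""
    return {region: next_frequency_index(temps_c[region], idx, len(levels_mhz))
            for region, idx in indices.items()}
```

A resource manager could call `adjust_regions` once per sampling period and program the clock of each region with `levels_mhz[index]`.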

The main aim of this thesis is to study the problems associated with bility in FPGAs and its resource management and propose practical and novelsolutions that can improve the scalability of FPGAs and also improve the pro-ductivity of designers by providing better tools and framework Following aresome of the major contributions that have been achieved during the course ofthis research and have led to this thesis

scala-• A novel Scalable FPGA Architecture to significantly reduce the overheadsassociated with designing larger FPGAs This work was published in [32]

• A design methodology to allow designers to use our Scalable FPGA Architecture. This work was published as part of [32].

• An Abstract Architecture for FPGA Resource Management to allow efficient reuse of implemented sub-components across designs and to reduce the complexity associated with resource management. This work was published in [33].

• A low-overhead and high-performance Network-on-Chip to allow deterministic communication among hardware tasks. This work was published as a part of [33].

• A Scalable AXI Interconnect to enable fast and efficient communication among hardware tasks in multi-FPGA systems. It extends our previous work in [33].

• A Low Overhead Temperature Sensor for Reconfigurable Platforms to measure the die temperature. This work was published in [34].

• A Thermal Aware Resource Manager to monitor the temperature profile of the FPGA die and adjust the performance of running applications. This work extends our previous work in [34].

The thesis is organized as follows. Chapter 2 gives an overview of existing FPGA architectures and explains the design flow associated with implementing hardware applications on FPGAs. It then discusses runtime resource management for reconfigurable platforms and presents related works and their shortcomings. In Chapter 3, we introduce our scalable FPGA architecture and its design methodology. We then discuss the merits and demerits of such an architecture by means of a case study. In Chapter 4, we present our framework for resource management of FPGAs and demonstrate its effectiveness experimentally using a prototype. Chapter 5 introduces our scalable off-chip interconnect and evaluates its performance and overheads on a multi-FPGA system. In Chapter 6, we present our LUT-based temperature sensor and compare its performance with the state of the art. We then build a platform using an array of our sensors to demonstrate self-awareness in runtime resource management. Finally, Chapter 7 concludes this thesis and gives directions for future work.


CHAPTER 2

FPGAs and Resource Management

Historically, Texas Instruments introduced the first programmable logic array (PLA) device in 1970. This device, the TMS2000 [35], had 17 inputs and 18 outputs with 8 JK flip-flops. These devices had to be programmed during production and were therefore termed mask-programmable. They were used in many VLSI devices, mostly for implementing glue/control logic. These devices could implement any set of sum-of-product (SOP) logic equations. Many terms in the equations could be shared by the outputs, as shown in Fig.2.1(a). Later, in 1978, Monolithic Memories Inc. (MMI) introduced programmable array logic (PAL) devices. The primary difference between PALs and PLAs was that PALs had a fixed OR array and a programmable AND array, while PLAs had both the OR and AND arrays programmable, as shown in Fig.2.1(b). This made PALs simpler, cheaper and faster. Furthermore, PALs were one-time programmable using on-chip fuses and thus allowed the end user to program them, unlike the majority of PLAs, which were mask-programmable. Later, field-programmable devices called generic array logic (GAL), which were pin-compatible with PALs, were also introduced. Unlike PALs, they could be electrically erased and reprogrammed. These devices grew in flexibility and size. Later, in 1990, more advanced devices called complex programmable logic devices (CPLDs) were introduced. They had an array of PALs with a central routing network to provide

Fig 2.1: Architectures of different programmable devices; a) Architecture of PLA, b) Architecture of PAL, c) Architecture of CPLD, and d) Architecture of FPGA

even greater flexibility, as shown in Fig.2.1(c). Although these devices were quite flexible, they had limited storage elements. To resolve this issue, an alternative architecture was developed, which consisted of logic blocks surrounded by programmable interconnect, as shown in Fig.2.1(d). This class of devices was termed field-programmable gate arrays (FPGAs). Xilinx® was the first company to introduce a commercial FPGA, the XC2064, in 1985. This device had only 64 configurable logic blocks (CLBs), each with two 3-input lookup tables (LUTs). In the 1990s, FPGAs evolved rapidly in both sophistication and volume of production. By the 2000s, FPGAs like Virtex and Spartan 2 had small static memory blocks called block RAMs (BRAMs) in the fabric. Later, in 2005, even larger BRAMs with hardware multipliers were added to Virtex-2 and Spartan-3 FPGAs. By this time, FPGAs were used not only for implementing control logic but also as hardware accelerators. Later, in 2007, FPGAs had on-chip DSP capabilities and even embedded processors. This allowed the implementation of a complete reconfigurable system-on-chip. Furthermore, they allowed portions of the device to be configured while the rest remained operational. This capability allowed designers to reuse the resources of the FPGA at runtime.

In this chapter, we introduce the architecture of modern-day FPGAs in Section 2.1 and take a closer look at the routing resources in terms of scalability. We also look at the processes in the design flow for FPGA implementation and touch upon their runtime complexity. Then we introduce some of the techniques and issues in the runtime resource management of FPGAs in Section 2.2. We present some of the related works in Section 2.3, and Section 2.4 concludes this chapter.

As mentioned earlier, FPGAs are integrated circuits that can be programmed after fabrication to function as almost any kind of digital circuit. Modern-day FPGAs, as illustrated in Fig.2.2, consist of an array of programmable logic blocks of potentially different types, including general logic, memory and multiplier blocks, surrounded by a programmable routing fabric that allows the blocks to be interconnected by simply reconfiguring the routing fabric. The array also contains programmable input/output blocks that allow the chip to interface with the outside world. As shown in the figure, the major components of an FPGA are: (1) Configurable Logic Blocks, (2) Memory Blocks (Block RAM), (3) DSP Blocks, (4) Input/Output Blocks, and (5) Routing Interconnect (includes Switch Blocks). Each component is briefly explained in the following paragraphs.

Configurable Logic Blocks (CLBs) comprise look-up tables (LUTs), RAM, shift registers (SRL), multiplexers, carry logic, and registers (DFF). LUTs are addressable arrays of SRAM cells in which the inputs are connected to the address lines. Thus the inputs select a particular SRAM bit from the array. In other words, we can load the truth table of any logic function of the LUT's inputs into the LUT, and the LUT will behave as that logic function. Therefore, by cascading such LUTs, we can implement any Boolean equation. The number of inputs of a LUT is an architecture parameter and varies across FPGA families. For example, FPGAs developed prior to Virtex 5 [20] had 4-input LUTs, while later FPGAs have 6-input LUTs. However, LUTs alone are insufficient when state information must be held, especially for recursive/iterative computations that depend on the results from previous states. Therefore, CLBs also have registers (DFF) to store the previous state information. Carry logic is used to implement the look-ahead carry chains of adders and subtractors. Multiplexers are used to combine the outputs of the LUTs to create even bigger LUTs. RAM and shift register are alternate operating modes of LUTs. As the names suggest, they allow a more compact implementation of shift registers and RAMs than using the individual registers in the CLBs. The truth table in each LUT, the initial values of registers, the operating mode of each component, and the configuration of the interconnection within a CLB are stored in SRAM cells associated with that CLB. When an FPGA is programmed, the SRAM cells for each CLB are loaded with appropriate values and the internals of the CLB operate as required.
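The LUT behaviour described above can be illustrated with a small software model. The following Python sketch is not taken from any vendor tool; the class name LUT and its methods are hypothetical, and it assumes that the first input drives the most significant address line. It shows how loading a truth table into the configuration bits makes the LUT behave as the corresponding logic function.

```python
from itertools import product


class LUT:
    """Hypothetical model of a k-input LUT: 2**k SRAM bits addressed by the inputs."""

    def __init__(self, num_inputs, config_bits):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, num_inputs, func):
        """Fill the SRAM cells with the truth table of func (first input = MSB)."""
        bits = [int(bool(func(*inputs)))
                for inputs in product((0, 1), repeat=num_inputs)]
        return cls(num_inputs, bits)

    def evaluate(self, *inputs):
        """Treat the inputs as address lines and return the selected SRAM bit."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.config_bits[address]


# A 1-bit full adder built from two 3-input LUTs (sum and carry-out).
sum_lut = LUT.from_function(3, lambda a, b, cin: a ^ b ^ cin)
carry_lut = LUT.from_function(3, lambda a, b, cin: (a & b) | (cin & (a ^ b)))

for a, b, cin in product((0, 1), repeat=3):
    print(a, b, cin, "->", carry_lut.evaluate(a, b, cin), sum_lut.evaluate(a, b, cin))
```

In this model, reprogramming the device amounts to replacing config_bits; the evaluate logic itself never changes, which mirrors how the same physical LUT implements different functions after reconfiguration.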

Memory Blocks, commonly referred to as Block RAM, are larger RAM components that can be used if required. They are used to implement large addressable arrays of storage efficiently in the FPGA fabric. Like the CLBs, they are also configured during the programming of the FPGA.

DSP Blocks are configurable multipliers and adders. They enable more efficient and faster implementations of arithmetic operations than implementations using CLBs only. These units are particularly useful for implementing applications in the signal processing, image processing and compute domains. For this reason, these blocks are named digital signal processing (DSP) blocks.


Input and output blocks allow the FPGA to interface with the outside world. They typically support multiple I/O standards (e.g. LVDS, CMOS, etc.) so as to allow the FPGA to communicate with most devices. However, high-speed serial interfaces require special blocks called Multi-Gigabit Transceivers (MGTs). Transceivers allow FPGAs to adopt standards like PCIe, SATA, SGMII, etc.

The interconnect consists of channels of wires connected using switch boxes. A switch box connects wires in one channel to wires in other channels to create an electrical path between them. Using this methodology, we can connect different components together to form any digital circuit. The exact connectivity offered by a switch box is also an architecture parameter and has evolved over different families of FPGAs. Like the CLBs, the switch box also uses SRAM cells to store its configuration.
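To make the configuration storage of a switch box concrete, the following Python sketch counts the switch points of one deliberately simplified switch box. It assumes a full crossbar between the horizontal and vertical tracks and one SRAM cell per switch point; real switch boxes typically offer sparser connectivity, and the function names here are hypothetical.

```python
def crossbar_switch_points(width_x, width_y):
    """Switch points in a full crossbar joining width_x horizontal
    tracks to width_y vertical tracks (simplifying assumption)."""
    return width_x * width_y


def crossbar_config_bits(width_x, width_y, bits_per_switch=1):
    """SRAM cells needed to configure the crossbar, assuming a fixed
    number of configuration bits per switch point."""
    return crossbar_switch_points(width_x, width_y) * bits_per_switch


# Grow both channels proportionally and observe the storage requirement.
for width in (8, 16, 32, 64):
    print(width, crossbar_config_bits(width, width))
```

Under these assumptions, doubling the channel width in both directions quadruples the number of switch points and configuration cells.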

Scalability: As the resources increase, the channel size required for a reasonably routable FPGA also grows. This implies a larger switch box. For example, in the case of a crossbar switch box, the number of switches in a switch box grows quadratically with the channel width (assuming that the channels in both X and Y increase proportionally). This also means that more SRAM cells are required to store the configuration of the switch box. This area overhead limits the scalability of the FPGA. The channel widths, the type of switch boxes, the distribution of different types of wire, and the number/organization of logic elements are architectural parameters and evolve with each FPGA family. In addition to the area overheads introduced by the routing interconnect as a result of scaling the FPGA, scaling the clocking network also becomes an issue. This is due to the fact that clock skew makes it harder to keep resources that are far apart from one another synchronized with the same clock. Lastly, it is a well-known fact that a larger die manufactured using a particular process will have lower yields than a smaller die manufactured using the same process. A smaller die size essentially means that a defect will affect fewer dies as compared

Fig 2.3: Generic Design Flow

consisting of gates, registers and wires as shown in stage-I of Fig.2.3. Such a circuit diagram is called a netlist. The process of converting a high-level description into a netlist is called synthesis. Once the design is synthesized, each component in the netlist is converted into equivalent constructs, called primitives, available on the FPGA (as shown in stage-II of the figure). This process is called translate. The database of primitives used by the translate stage is always generated by the manufacturer of the FPGA. After the translate stage, a gate is converted into a LUT, a register into a flip-flop within a CLB, and so on. In some cases, a vendor-specific synthesis tool may also use the primitives database and directly generate a translated netlist. In this case, the translate stage only merges the various translated netlists and constraints into one global translated design. Then each component in the translated design is placed at a specific location in the FPGA (as shown in stage-III). This process is called placement. Then the placed components are connected together using the interconnect (in stage-IV). This process is called routing. Finally, the programming file, called the bitstream, is generated (in stage-V) and downloaded into the FPGA. These processes are computationally expensive, especially placement and routing.

Fig 2.4: Configuration Architectures of FPGAs; (a) Virtex 2 Pro Configuration Architecture, and (b) Configuration Architecture of Virtex 4 and newer FPGAs

Runtime Complexity: FPGA placement has been shown to be an NP-hard combinatorial optimization problem, and therefore no polynomial-time algorithm is known to produce an exact solution [36]. Most placement tools use simulated annealing, which typically requires large computation times. Additionally, these tools also consider timing constraints, which makes them even more computationally expensive. Routing is also an NP-complete problem [37]. Typically, most routers use two-pass routing, in which they first perform global routing that balances the densities of all routing channels, followed by detailed routing that assigns specific wiring segments to each connection. Detailed routing is typically performed by directed graph search algorithms that are derivatives of Dijkstra's algorithm [38]. Like the placement algorithms, these routing algorithms also have to take timing into consideration and typically take large computation times to generate good results. This is why vendors are continuously trying to improve these stages, as well as the rest of the flow, in terms of both result quality and runtime, so as to improve overall productivity.
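The simulated-annealing approach to placement mentioned above can be sketched on a toy problem. The Python code below is only an illustration and not the algorithm of any particular vendor tool: the function names, the square grid of sites, the half-perimeter wirelength cost, the geometric cooling schedule and all parameter values are assumptions made for this example, and timing constraints are ignored.

```python
import math
import random


def wirelength(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal_placement(blocks, nets, grid_size, seed=0, start_temp=10.0,
                     cooling=0.95, moves_per_temp=200, min_temp=0.01):
    """Place blocks on a grid_size x grid_size array of sites."""
    blocks = list(blocks)
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid_size) for y in range(grid_size)]
    if len(blocks) > len(sites):
        raise ValueError("not enough sites for all blocks")
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))          # random initial placement
    occupant = {site: block for block, site in placement.items()}
    cost = initial_cost = wirelength(placement, nets)
    best, best_cost = dict(placement), cost
    temp = start_temp
    while temp > min_temp:
        for _ in range(moves_per_temp):
            block = rng.choice(blocks)
            old_site = placement[block]
            new_site = (rng.randrange(grid_size), rng.randrange(grid_size))
            if new_site == old_site:
                continue
            other = occupant.get(new_site)        # block to swap with, if any
            placement[block], occupant[new_site] = new_site, block
            if other is not None:
                placement[other], occupant[old_site] = old_site, other
            else:
                del occupant[old_site]
            delta = wirelength(placement, nets) - cost
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature drops.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost += delta
                if cost < best_cost:
                    best, best_cost = dict(placement), cost
            else:                                 # undo the rejected move
                placement[block], occupant[old_site] = old_site, block
                if other is not None:
                    placement[other], occupant[new_site] = new_site, other
                else:
                    del occupant[new_site]
        temp *= cooling
    return best, best_cost, initial_cost


example_blocks = list("ABCDEFGH")
example_nets = [("A", "B"), ("B", "C", "D"), ("D", "E"),
                ("E", "F", "G"), ("G", "H"), ("H", "A")]
final, final_cost, start_cost = anneal_placement(example_blocks, example_nets, 4)
print("initial wirelength:", start_cost, "final wirelength:", final_cost)
```

Even this toy version evaluates tens of thousands of candidate moves for eight blocks, which hints at why annealing-based placement of a full design, with timing analysis added to the cost function, is computationally expensive.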

Initially, FPGAs only supported static reconfiguration. During this kind of configuration, the device is not active and is only brought up after configuration. Although it was still possible to send partial configurations, the configuration process itself caused the entire device to be inactive. Starting from Virtex 2/2P
