DESIGN AND MANAGEMENT OF SCALABLE FPGA
ARCHITECTURES
RIZWAN SYED
(M.Sc Elect & Comp Engg, NUS)
A THESIS SUBMITTED FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2013
I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis.
This thesis has also not been submitted for any degree in any university previously.
RIZWAN SYED
DECEMBER 02, 2013
First of all, I would like to thank Dr. Yajun Ha, my supervisor, for mentoring me all through the journey of my PhD. He has always motivated me throughout my research and has constantly made me think of how I can improve my ideas and apply them in a more practical way. His eye for detail helped me maintain a high quality of research. Despite being a very busy person, he always ensured that we had enough time for regular discussions. Whenever I needed something done urgently, whether it was feedback on a draft or filling in some form, he always gave it the utmost priority. He often worked on holidays and weekends to give me feedback on my work in time.
Further, I would like to thank A/P Bharadwaj Veeravalli for all the support and supervision during all the years of my PhD. He was always an inspiration for me and always gave me useful insight into research methodology, and critical comments on my publications throughout my PhD project. I was very fortunate to have two supervisors who were both very hard-working and motivating.
My thanks also extend to all my friends, most notably Xiaolei Chen, Zhao Wenfeng, Yu Heng, and Rajesh Panicker, for all the support, the nice discussions, and the constructive feedback that I got from them.
I would like to thank my family and friends for their interest in my project and the much-needed support. I would especially like to thank my parents, without whom I would not have been able to achieve this result. Last but not least, I would like to thank my wife for her ongoing support and for enduring all the troubles during my PhD.
Rizwan Syed
Field Programmable Gate Arrays (FPGAs) have been demonstrated in various applications to deliver significant performance speed-ups compared to software-centric platforms, and in some cases with a significant improvement in efficiency as well. Although FPGAs are slower and less efficient than Application-Specific Integrated Circuits (ASICs), this disadvantage is offset by their in-field programmability, low non-recurring engineering costs, and fast turnaround times. However, designing larger FPGAs is getting more difficult even with the ongoing advancement in semiconductor manufacturing technology. Global wire delay has emerged as a problem for both ASICs and FPGAs. FPGAs conventionally implement global interconnects using long wires and switch blocks, which make them slow and inefficient. Another issue is clock skew in the distribution of the global clock for larger FPGAs. Furthermore, the increasing area overhead of the larger programmable routing interconnect required to support increasing computation resources makes scalability an issue for the existing FPGA architecture. Yet another issue is decreasing yield with increasing FPGA die size, which adversely affects the manufacturing cost of the device.
On the other hand, as the resources in FPGAs increase, the runtime of the implementation tools also increases. This can be mitigated by adopting a hierarchical design methodology and reusing implemented sub-components. The latter requires only the new components to be implemented and thus takes less time. However, this requires runtime resource management of the FPGA. Existing tools that do support resource management of FPGAs impose various constraints and overheads that make them less useful in practice. Furthermore, to improve the reliability of the system, such a resource manager should be self-aware by measuring physical quantities such as the temperature profile of the FPGA die. This would allow the manager to perform thermal management by adjusting the performance of running tasks.
Envisioning that reconfigurable computing will be a part of mainstream computing, this thesis presents a scalable FPGA architecture to mitigate the issues with the scalability of single FPGAs and multi-FPGA systems. Furthermore, it proposes an abstract architecture for resource management of the scalable FPGA architecture, supporting the abstraction of both computing and communication resources. This thesis makes the following contributions.
First, to solve the issues with the scalability of FPGAs, we propose a novel scalable FPGA architecture and its design methodology. This architecture allows us to model both single FPGAs and multi-FPGA systems under a single architecture. In this architecture, we partition an FPGA into multiple tiles. We assume a tile to be a current-generation FPGA. These tiles use a hierarchical network for routing between them and thus separate local routing from global routing. Due to the use of a hierarchical network, the architecture can be easily scaled just by adding more tiles and higher levels.
Second, to enable efficient resource management on the scalable FPGA architecture, we initially develop a low-overhead framework that supports important features such as dynamically sized reconfigurable regions, abstraction in communication among hardware applications, clock network management, and in-circuit debugging for hardware applications. This allows us to abstract the computational and communication resources of our scalable FPGA architecture.
Third, to extend our framework to multi-FPGA systems, we develop a scalable AXI interconnect for existing FPGAs. The scalable AXI interconnect uses high-speed serial links for inter-FPGA communication. Packet switching is used to deliver packets from the source FPGA to the destination FPGA using one or more hops. The interconnect has low area and transport overheads while offering high bandwidth and low latency compared to other interconnects.
Finally, we implement a LUT-based temperature sensor that has a resolution of 0.5 °C and an accuracy of ±0.5 °C using two-point calibration. It uses 52% fewer resources than the state of the art. We use an array of such sensors to monitor the thermal profile of the FPGA die and then use it to adjust the performance of running applications as required through dynamic frequency scaling.
1 Trends and Challenges in Scalable Reconfigurable Computing
1.1 Trends in Scalability
1.2 Trends in Productivity
1.3 Trends in Reliability
1.4 Key Challenges in Reconfigurable Computing
1.4.1 Scalability of FPGAs
1.4.2 Efficient Resource Management
1.4.3 Scalable Interconnect
1.4.4 Self-Awareness
1.5 Scope of the Thesis
1.6 Key Contributions and Thesis Overview
2.1 FPGA Architecture
2.1.1 Design Flow
2.2 Runtime Resource Management of FPGA
2.3 Related Work
2.4 Closing Remarks
3 Scalable GALS FPGA Architecture
3.1 Problem Definition
3.2 sFPGA2 Architecture
3.2.1 Tiles
3.2.2 Switch
3.3 Design Methodology for sFPGA2
3.3.1 Routing Path Generation
3.4 Case Study
3.4.1 JPEG Encoder
3.4.2 Target sFPGA2 Architecture
3.4.3 Design Flow
3.5 Experimental Results
3.5.1 Setup
3.5.2 Results
3.5.3 Discussion
3.6 Closing Remarks
4 Abstract Architecture for FPGA Resource Management
4.1 AARP Architecture
4.1.1 Reconfigurable Regions
4.1.2 Network-on-Chip
4.1.3 Shared Resources
4.2 Design Methodology for AARP
4.3 Experimental Results
4.3.1 Setup
4.3.2 Results
4.3.3 Discussion
4.4 Closing Remarks
5.1 Scalable AXI Interconnect
5.1.1 Stream End-Point
5.1.2 Circuit-Switched Network
5.1.3 Arbiter
5.1.4 Packet-Switched Router
5.1.5 Links
5.2 Routing Methodology
5.2.1 Abstraction
5.3 Extending AARP
5.3.1 Smaller Reconfigurable Region
5.3.2 Scalable AXI Interconnect
5.3.3 Extensible Message Bus
5.4 Abstract Model for AARP
5.5 Experimental Results
5.5.1 Setup
5.5.2 Results
5.5.3 Discussion
5.6 Closing Remarks
6 Thermal Management for Reconfigurable Platforms
6.1 LUT-Based Temperature Sensor
6.1.1 Main Sensor
6.1.2 Reading Circuit
6.1.3 Calibration Circuit
6.2 Calibration Methodology
6.3 Case Study
6.3.1 Sensor Array
6.3.2 FFT Cores
6.4 Experimental Results
6.4.1 Setup
6.4.2 Results
6.4.3 Discussion
6.5 Closing Remarks
List of Figures
1.1 Basic Structure of FPGA
1.2 Progress in Silicon Manufacturing Technology [1]
1.3 Scaling in Virtex and Stratix FPGAs w.r.t. Technology Node
1.4 The Design Productivity Gap [2]
1.5 Factors Affecting Device Reliability
2.1 Architectures of different programmable devices
2.2 Simplified architecture of a modern-day FPGA
2.3 Generic Design Flow
2.4 Configuration Architectures of FPGAs
3.1 Routing Methodology
3.2 Proposed sFPGA2 Implementation
3.3 sFPGA2 Architecture Block Diagram
3.4 IO Transceiver Design
3.5 Terminal Switch Block Diagram
3.6 Routing Switch Block Diagram
3.7 Design flow for sFPGA2
3.8 Routing Examples
3.9 Internal details of the JPEG Encoder
3.10 Design partitioning for JPEG encoder
3.11 Block Diagram of Emulation Prototype
3.12 Illustration for Results
4.1 AARP Block Diagram
4.2 Block Diagram of a task mapped to Reconfigurable Regions
4.3 Block Diagram of the Network-on-Chip
4.4 Hardware Application Design Flow for AARP
4.5 Floorplan of Implementation
4.6 NoC Performance Overhead
5.1 Overview of Commercial Multi-FPGA Systems
5.2 Block Diagram of Scalable AXI Interconnect
5.3 Frame Format
5.4 Block Diagram of Circuit-Switched Network
5.5 Block Diagram of Arbiter
5.6 Block Diagram of Packet-Switched Router
5.7 Block Diagram of a Link
5.8 Routing Methodology
5.9 Block Diagram of AARP with the Scalable AXI Interconnect
5.10 Block Diagram of CS Network in AARP
5.11 Illustration to show abstraction in AARP
5.12 Benefit of Abstraction
5.13 Network Topology of Implementation
6.1 Block Diagram of Proposed Temperature Sensor
6.2 Retriggerable Ring Oscillator
6.3 Timing Diagram
6.4 Layout of Sensors and FFT Cores
6.5 Period of the Ring Oscillator Period in Sensors
6.6 Variation in period of the Ring Oscillators w.r.t. location
List of Tables
3.1 Device utilization summary of our JPEG encoder
3.2 Results and Projected Values
4.1 Implementation summary
5.1 Conversion table for Control Codes and 8b/10b Codes
5.2 Implementation summary
5.3 On-chip NoC Comparison
5.4 Off-chip Network Comparison
6.1 Comparative Study
1 TRENDS AND CHALLENGES IN SCALABLE RECONFIGURABLE COMPUTING
in seconds and can be reconfigured if there is a mistake in the design. The cost of development may vary from a few dollars to a few thousand dollars. However, this flexibility in FPGAs comes at a significant cost in area, speed, and power consumption. Typically, FPGAs require around 20 to 35 times more area than standard-cell ASICs and have a performance penalty of roughly 3 to 4 times while consuming approximately 10 times more dynamic power [3]. This is largely due to the programmable routing fabric, which trades area, speed, and power in return for instant fabrication capability. Despite these disadvantages, FPGAs present a compelling substitute for digital system implementation based on their fast turnaround and low volume cost. Furthermore, they allow a design to be improved or modified while it is deployed in a product. This may be required to remove a bug or perhaps to support new features.
Fig. 1.1: Basic Structure of FPGA (labels: Logic Block, Routing Interconnect, I/O Cells)
The first commercially viable field programmable gate array (FPGA), the XC2064, was invented by Xilinx®
able by 1991. It had 64 logic cells arranged in an 8×8 matrix and provided 58 user inputs/outputs. At that time, programmable devices were mainly used for glue logic. As Moore's law progressed, FPGAs evolved from simple regular structures to complex heterogeneous devices containing fixed-function hardware as well as multi-core CPUs. For example, the largest device (Virtex™-7 XC7V2000T) from Xilinx®
DSP blocks [5]. In addition to large devices, Xilinx®
device, the Zynq™-7000 [6], with a 1.0 GHz dual-core ARM® Cortex™-A9 MPCore processor embedded within the FPGA's logic fabric. The device has up to 444K logic cells and 2,020 DSP blocks. However, keeping up with Moore's law [7] can be a daunting task. Designing larger and larger devices is getting more and more difficult. This is mainly due to two reasons: global wire delay and the increasing area overhead of routing interconnects. Even in ASICs and modern-day processors, global wire delay is a significant issue, and it generally takes more than one clock cycle to deliver a signal across the die. In the case of FPGAs, the delays are even more severe due to the slow and inefficient routing fabric. Another issue is with clock distribution due to skew. Yet another issue is decreasing yields. Although the physical layout of FPGAs is fairly regular, it is hard to fabricate larger FPGAs due to the decreasing yields associated with larger die sizes. In other words, the current FPGA architecture is not scalable.
As the logic density of FPGAs increases along with the inclusion of dedicated fixed-function blocks and processor cores, the role of FPGAs has shifted from implementing glue logic and prototyping towards implementing complete reconfigurable systems-on-chip and hardware accelerators. FPGAs have been shown to perform better and/or more efficiently than GPUs and multi-CPU systems in many cases [8–12]. However, coding in a hardware description language (HDL) is known to be more difficult than either GPU programming or multi-core CPU programming [13]. This difficulty can be reduced to a great extent by high-level synthesis approaches such as OpenCL [14–16]. However, as the size of these devices increases, the time required by FPGA implementation tools also increases dramatically. A solution to this issue is to develop libraries of implemented sub-modules that can be placed and connected within the FPGA as required by large applications. In this case, only a subset of the design would be implemented and the rest would be imported from the library, thereby reducing the runtime of the implementation tools. This, however, requires a supporting framework to manage the resources of the FPGA. Such a framework can also be used to manage the resources at runtime. However, tools that do support runtime resource management of FPGAs are very restrictive and inefficient.
In this thesis, we propose a novel scalable architecture for FPGAs. This architecture addresses both issues related to the scalability of FPGAs. In addition, it allows us to model single- and multi-FPGA systems under a single framework. In this architecture, we partition an FPGA into multiple tiles. We assume a tile to be a current-generation FPGA. These tiles use a hierarchical network for routing between them, and thus we separate local routing from global routing. Due to the use of a hierarchical network, the architecture can be easily scaled just by adding more tiles and higher levels. The largest FPGAs from Xilinx in the Virtex 7 family use 3D fabrication to connect multiple FPGA die slices using a silicon interposer [17] and can be considered an instance of this architecture.
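To make the separation of local and global routing more concrete, the following is a minimal sketch of how a connection between two tiles could be routed through a hierarchical network. It assumes, purely for illustration, a tree of switches with a fixed arity in which tile `t` attaches to level-1 switch `t // arity`; the function name `route` and the default arity of 4 are hypothetical and are not taken from the architecture proposed in this thesis.

```python
from typing import List, Tuple


def route(src: int, dst: int, arity: int = 4) -> List[Tuple[int, int]]:
    """Return the (level, switch_index) hops between two tiles.

    Assumption: tiles are the leaves of a tree of switches, numbered from 0,
    and the parent of node i at any level is i // arity.
    """
    if src == dst:
        return []  # same tile: only local routing is involved
    up: List[Tuple[int, int]] = []
    down: List[Tuple[int, int]] = []
    a, b, level = src, dst, 0
    # Climb from both tiles until the paths meet at a common switch.
    while a != b:
        level += 1
        a //= arity
        b //= arity
        up.append((level, a))
        down.append((level, b))
    # The last entry of `down` is the shared switch, already listed in `up`.
    return up + list(reversed(down[:-1]))


if __name__ == "__main__":
    print(route(0, 1))   # neighbours under the same level-1 switch
    print(route(0, 5))   # tiles under different level-1 switches
```

In this sketch, adding more tiles only adds leaves and, when a level fills up, one more level of switches, so the hop count between any two tiles grows with the number of levels rather than with the number of tiles.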
We also propose an efficient framework for managing the resources of our scalable FPGA architecture at runtime. It abstracts both the computation and communication resources of the FPGA. Additionally, we provide support for features like clock management and debugging of hardware tasks. The framework enables multiple applications to run on the FPGA without requiring reimplementation of the complete system. We then improve this framework with a scalable interconnect to support multi-FPGA systems while maintaining the same abstract high-level architecture of the system. Lastly, we add support for real-time thermal management of the complete system by developing a low-overhead LUT-based temperature sensor and controlling the power dissipation of running tasks using dynamic frequency scaling.
Fig. 1.2: Progress in Silicon Manufacturing Technology [1]
This chapter is organized as follows. In Section 1.1, we take a closer look at the trends in FPGA architectures from the perspective of scalability. In Section 1.2, we look at the trends in productivity. In Section 1.3, we look at the trends in reliability. Section 1.4 summarizes the key challenges that remain to be solved, as seen from these trends. Section 1.5 explains the overall scope of this thesis. Section 1.6 lists the key contributions that led to this thesis, and their organization.
Fig. 1.3: Scaling in Virtex and Stratix FPGAs w.r.t. Technology Node (series: maximum BRAM, DSP, FF, and LC counts for the Stratix and Virtex families)
sity in a chip with a constant chip size. We find a similar trend in FPGAs as well, and FPGA vendors are doubling the resources with each technology node. Fig. 1.3 demonstrates the evolution of the Virtex®
to Virtex-7) [5, 18–21] and the Stratix®
respect to technology node. Furthermore, with each generation, FPGAs are getting faster and more heterogeneous. They are now equipped with better and faster memory controllers and high-speed transceivers, while some FPGAs even have complete processing sub-systems with multi-core CPUs. This narrows the gap between the performance and efficiency of FPGAs and those of ASICs [3]. However, scaling of FPGAs is not possible without increasing the routing resources of the FPGA. The increase in switching requirement is asymptotically bounded below by Eq. 1.1 [23] and is superlinear in the number of logic resources.
Managing the growing circuit complexity is one of the greatest challenges in modern IC design. This is known as the productivity gap [2] and is shown in Fig. 1.4. The productivity gap is a comparison between the manufacturing complexity (i.e., the number of transistors that we can manufacture) and the designer productivity (i.e., the number of transistors that we can design). Although this trend is for ASICs, we find a similar trend in the design of circuits on FPGAs. The complexity of designing digital systems can be reduced by moving towards higher abstraction levels [24], i.e., from gate level to register transfer level (RTL), algorithmic level, system level, etc. Furthermore, efficient design space exploration methodologies can also improve productivity, as shown by Xydis et al. [24]. However, this does not solve the issue of the increasing runtime of implementation tools due to increasing design complexity. This can be mitigated to some extent by reusing sub-components in the design and implementing only the newer sub-components, as discussed by Jantsch et al. [25]. In the case of FPGAs, this is possible by using hierarchical design methodologies and flows that support partitions [26]. However, in current design tools, design reuse is limited to partitions that have been implemented in that design, and these cannot be exported to other designs. A solution to this issue is dynamic partial reconfiguration¹. However, tools that do support partial reconfiguration, whether academic or commercial, lack important features for hardware applications such as dynamically sized regions, abstract communication, multiple-clock support, and in-circuit debugging. They also impose constraints on hardware designs, making these solutions less useful in practice.
Fig. 1.4: The Design Productivity Gap [2] (log-scale plot; transistors per chip (Moore's Law) grow at 58% per year, design productivity at 21% per year)
Fig. 1.5: Factors Affecting Device Reliability; a) Failure Rate vs. Time, b) Failure Rate as a Function of Junction Temperature [27]
1 Partial reconfiguration is the ability of an FPGA to dynamically modify blocks of logic by downloading partial bit files (programming files) while the remaining logic continues to operate without interruption.
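As a rough worked example of how quickly the productivity gap widens, the sketch below compounds the two growth rates labelled in Fig. 1.4. It assumes that the 58% per year label belongs to transistors per chip and the 21% per year label to design productivity, and that both rates stay constant; the function name `gap_factor` is hypothetical.

```python
def gap_factor(years: float,
               complexity_growth: float = 0.58,
               productivity_growth: float = 0.21) -> float:
    """Ratio by which manufacturable complexity outgrows design productivity.

    Assumption: both quantities grow at constant compound annual rates,
    taken here from the labels of Fig. 1.4.
    """
    per_year = (1.0 + complexity_growth) / (1.0 + productivity_growth)
    return per_year ** years


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"after {n:2d} years the gap has grown by {gap_factor(n):.2f}x")
```

Under these assumptions the gap widens by roughly 30% every year, which compounds to more than an order of magnitude over a decade.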
The failure rate as a function of time of all system hardware, including integrated circuits, conforms to Fig. 1.5(a), also known as the bathtub curve [27]. Various factors affect the failure rate, including pressure, mechanical stress, thermal cycling, and electrical stress. However, the die temperature of the device during its useful life plays an important role in triggering the onset of wearout. Continued technology scaling imposes an ever-increasing temperature stress on digital circuit design due to transistor density. As chips get smaller and denser, dissipating power becomes more difficult [28]. It is becoming increasingly important to be able to monitor as well as manage the temperature of a device in order to ensure optimum performance as well as longevity [29]. The power dissipated in an integrated circuit (IC) is a function of the switching activity. The switching activity at different points in an IC can be different. If the activity in some regions is very high, those areas will heat up more than others. These localized hot areas are called hotspots. Studies have shown that an increase in junction temperature of 30 °C decreases the mean time between failures (MTBF) by a factor of 10 [27], as shown in Fig. 1.5(b). In the case of FPGAs, the authors observed that under high power dissipation, the FPGA would stop operating. Typically, conservative designs or cooling solutions are used to solve this issue, but such solutions may have significant cost and performance implications. Therefore, these reconfigurable platforms must be self-aware² and must be able to adjust their performance to stay within a thermal envelope.
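The cited rule of thumb (a tenfold drop in MTBF for every 30 °C rise in junction temperature) can be turned into a small calculation. The sketch below assumes that the rule extrapolates exponentially to other temperature differences, which is an illustrative simplification rather than a result from [27]; the function name `mtbf_scale` is hypothetical.

```python
def mtbf_scale(delta_t_c: float, decade_step_c: float = 30.0) -> float:
    """Relative MTBF after the junction temperature changes by delta_t_c.

    Assumption: MTBF falls by a factor of 10 for every `decade_step_c`
    degrees Celsius of temperature rise, and this holds for any delta.
    """
    return 10.0 ** (-delta_t_c / decade_step_c)


if __name__ == "__main__":
    for delta in (-15.0, 0.0, 15.0, 30.0):
        print(f"dT = {delta:+5.1f} C -> MTBF x {mtbf_scale(delta):.3f}")
```

Read this way, even a 15 °C reduction in hotspot temperature would roughly triple the expected MTBF, which is the motivation for managing temperature at runtime rather than relying only on conservative design.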
The trends outlined in the previous sections indicate the issues in scalability, productivity, and reliability of FPGAs. The existing FPGA architecture does not scale well due to increasing area overheads. On the other hand, the implementation tools for FPGAs fall short in terms of productivity: they are unable to keep up with the increasing density of FPGAs. Implementation tools that support the reuse of implemented sub-components do not allow this across different designs, and tools using runtime resource management that do allow design reuse across different designs are less practical. Furthermore, the lack of self-awareness in reconfigurable platforms leads to reliability issues caused by the long-term effects of high temperature resulting from high power dissipation in localized regions. In short, the following are the major challenges that remain in scalable reconfigurable computing and are addressed in this thesis.
• Scalability in FPGA Architecture: Design an FPGA architecture along with its interconnect to allow scalability without significant performance and area overheads. (Scalability of FPGAs, Scalable Interconnect)
• Efficiency in Resource Management: Design and develop a framework to allow efficient reuse of implemented sub-components across designs so as to reduce design and implementation times. (Efficient Resource Management)
• Abstraction in Resource Management: Reduce the complexity associated with resource management in order to improve productivity. (Efficient Resource Management, Scalable Interconnect)
• Flexibility in Resource Management: Support features required by real hardware designs without imposing constraints that impact performance as well as productivity. (Efficient Resource Management)
• Self-Awareness in Resource Management: Design and develop a thermal monitoring solution to adjust the runtime power dissipation of running hardware applications. (Efficient Resource Management, Self-Awareness)
2 Self-awareness is the capability of a system to adapt its behavior and resources repeatedly based on changing environmental conditions and demands. This allows such systems to automatically find the best way to accomplish a given goal with the resources at hand.
In this thesis, we present a scalable FPGA architecture to overcome the issues related to scalability in reconfigurable architectures. We introduce a hierarchy in routing. Unlike the Altera FPGAs Flex10K [30], Apex [31], and Apex II [31], we use a hybrid interconnect that is composed of an island-style routing interconnect locally and a hierarchical interconnect globally. Furthermore, we improve global connectivity by multiplexing multiple wires onto a single wire using TDMA³. Lastly, we use circuit switching instead of packet switching in the hierarchical interconnect to avoid the area and performance overheads associated with packet switching⁴. The island-style routing regions in our architecture, called tiles, are similar to existing FPGAs and are connected with each other using our hierarchical global interconnect. Our architecture uses the globally asynchronous locally synchronous (GALS) methodology. Inside a tile, all signals are assumed to be synchronous to local clocks, while communication among different tiles is asynchronous. This reduces the impact of delays associated with global routing. The proposed architecture solves not only the issue of global wire delay but also the issue of decreasing yields with increasing die sizes. In our architecture, each tile can essentially be fabricated as a separate piece of silicon and integrated with the others on a larger substrate using either 2.5D or 3D-IC fabrication. As proof of this concept, Virtex 7 [5] uses a similar methodology, called stacked silicon interconnect [17], to integrate multiple dies onto a single substrate. If the multiplexing factor in our architecture is assumed to be 1 (i.e., no multiplexing), then the result is the silicon interposer of Virtex 7. Thus, Virtex 7 can be considered an instance of our architecture.
3 Time division multiple access (TDMA) is a channel access method for shared channels. It allows several users to share the same channel by dividing access to the channel into different time slots.
4 Packet switching typically allows better bandwidth utilization because it allows reuse of the same channel. Packet switching can also be more reliable. However, circuit switching is usually better in cases where the cost of not using channels is low. Also, packet-switched routers are typically bigger than circuit-switched routers for the same required performance.
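The idea of multiplexing several logical wires onto one physical wire with TDMA can be illustrated with a small behavioural model. This is only a sketch under simple assumptions: every logical wire carries one sample per logical clock cycle, slots are assigned round-robin, and the names `tdma_mux` and `tdma_demux` are hypothetical rather than part of the proposed design.

```python
from typing import List, Sequence


def tdma_mux(wires: Sequence[Sequence[int]]) -> List[int]:
    """Interleave samples of several logical wires onto one physical wire.

    Slot k of every frame carries the current sample of logical wire k.
    """
    if not wires:
        return []
    length = len(wires[0])
    if any(len(w) != length for w in wires):
        raise ValueError("all logical wires must carry the same number of samples")
    return [wires[slot][t] for t in range(length) for slot in range(len(wires))]


def tdma_demux(stream: Sequence[int], factor: int) -> List[List[int]]:
    """Recover the logical wires from a stream with the given multiplexing factor."""
    if factor <= 0 or len(stream) % factor != 0:
        raise ValueError("stream length must be a multiple of the multiplexing factor")
    return [list(stream[slot::factor]) for slot in range(factor)]


if __name__ == "__main__":
    logical = [[1, 0, 1], [0, 0, 1]]
    physical = tdma_mux(logical)
    print(physical)                 # samples of both wires share one wire
    print(tdma_demux(physical, 2))  # original logical wires recovered
```

With a multiplexing factor of 1 the physical stream is identical to the single logical wire, which corresponds to the no-multiplexing case mentioned in the text. For larger factors, the physical wire has to carry that many slots per logical cycle, so the saving in wires is paid for in link speed or per-wire throughput.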
As mentioned in earlier sections, more resources will lead to longer implementation runtimes if we do not reuse sub-components across designs. Therefore, to solve this issue, we propose a low-overhead resource management framework for FPGAs. This allows sub-components to be reused as partial programming files, called bitstreams, and our infrastructure connects these sub-components as required by the overall hardware application. The area allocated to each sub-component depends on its size and is not fixed. In other words, our resource management framework abstracts both the computational and communication resources of the FPGA. Furthermore, we support multiple clocks for each sub-component as well as in-circuit debugging. Lastly, our framework does not impose any constraints on the design of hardware applications. Given its low area and performance overheads and the features mentioned above, it is a good solution to the issues associated with long runtimes of implementation tools.
Our scalable FPGA architecture presented earlier solves the issue of scalability for single-FPGA systems. However, for multi-FPGA systems, that solution is too expensive in terms of the required off-chip infrastructure, because the hierarchical interconnect requires a large number of switching units. To solve this issue, we develop a scalable off-chip interconnect to supplement our hierarchical on-chip interconnect. We maintain the abstraction of computation and communication resources in the multi-FPGA system. This makes the complete solution uniform and thus allows resource managers to manage multi-FPGA resources in the same way as they would for a single-FPGA system.
to be self-aware and improve reliability of the complete system
The key challenges highlighted earlier can be grouped into three categories: (1) scalability, (2) productivity, and (3) reliability. For an engineering solution to be practical and usable, we need to address the major issues in all three categories. This makes the solution acceptable in a holistic way. As mentioned earlier, existing solutions in reconfigurable computing with respect to scalability, productivity, and reliability address individual issues and do not take the global constraints into account. We therefore approach these issues holistically by addressing the major issues in all three categories, so that our solutions are all the more useful for engineering design. Accordingly, we limit our focus to the major issues in the scalability, productivity, and reliability of reconfigurable systems, as highlighted in the following paragraphs.
First, we analyze the difficulties in the scalability of FPGAs and then attempt to solve them by designing a scalable FPGA architecture. We limit ourselves to the development and study of an emulation platform and to estimating the performance and overheads of an actual implementation of an FPGA based on the proposed architecture. As the actual implementation of the FPGA is a huge engineering endeavor, it is beyond the scope of this thesis.
Second, we study the issues with productivity in large reconfigurable systems. As mentioned earlier, reuse of implemented sub-components can improve productivity. To enable reuse, one solution is to develop a runtime resource management framework. In this thesis, we limit ourselves to the design and development of a practical resource management framework for scalable reconfigurable systems and to the evaluation of its performance and overheads. The study of other methods to improve productivity, such as better algorithms and design methodologies, is beyond the scope of this thesis.
Third, to improve the reliability of reconfigurable systems, we study existing works and the difficulties in real-time thermal monitoring of existing FPGAs. Then, we design and develop a low-overhead temperature sensor using the resources within existing FPGAs. This enables us to monitor the temperature profile of the die of existing FPGAs and use this information to control the regional power dissipation of the FPGA die. The study of other effects on the reliability of FPGAs is beyond the scope of this thesis.
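The monitoring and control loop described in the last paragraph can be sketched in a few lines. The example below is illustrative only: it assumes that the raw sensor reading (for instance a ring-oscillator period count) varies linearly with temperature between two calibration points, and it uses a simple threshold rule with hysteresis for dynamic frequency scaling. All names and numeric values (`two_point_calibration`, `next_frequency`, the 85 °C limit, the 25 MHz step) are hypothetical and are not taken from the implementation in this thesis.

```python
from typing import Callable, Sequence


def two_point_calibration(count_lo: float, temp_lo_c: float,
                          count_hi: float, temp_hi_c: float) -> Callable[[float], float]:
    """Build a count-to-temperature converter from two reference readings.

    Assumption: the sensor reading is linear in temperature over the range.
    """
    slope = (temp_hi_c - temp_lo_c) / (count_hi - count_lo)

    def to_celsius(count: float) -> float:
        return temp_lo_c + slope * (count - count_lo)

    return to_celsius


def next_frequency(temps_c: Sequence[float], freq_mhz: float,
                   limit_c: float = 85.0, hysteresis_c: float = 5.0,
                   step_mhz: float = 25.0,
                   f_min_mhz: float = 50.0, f_max_mhz: float = 200.0) -> float:
    """Pick the next clock frequency from the hottest sensor in the array."""
    hottest = max(temps_c)
    if hottest > limit_c:
        return max(f_min_mhz, freq_mhz - step_mhz)   # too hot: slow down
    if hottest < limit_c - hysteresis_c:
        return min(f_max_mhz, freq_mhz + step_mhz)   # headroom: speed up
    return freq_mhz                                  # inside the hysteresis band


if __name__ == "__main__":
    to_celsius = two_point_calibration(1000, 25.0, 1100, 75.0)
    readings = [1040, 1130, 1075]                    # hypothetical raw counts
    temps = [to_celsius(c) for c in readings]
    print(temps, next_frequency(temps, 200.0))
```

The hysteresis band keeps the controller from toggling the frequency on every sample when the hottest sensor sits close to the limit.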
The main aim of this thesis is to study the problems associated with scalability in FPGAs and their resource management, and to propose practical and novel solutions that can improve the scalability of FPGAs and also improve the productivity of designers by providing better tools and frameworks. The following are some of the major contributions that have been achieved during the course of this research and have led to this thesis.
• A novel Scalable FPGA Architecture to significantly reduce the overheads associated with designing larger FPGAs. This work was published in [32].
• A design methodology to allow designers to use our Scalable FPGA Architecture. This work was published as part of [32].
• An Abstract Architecture for FPGA Resource Management to allow reuse of implemented sub-components across designs efficiently and also to reduce the complexity associated with resource management. This work was published in [33].
• A low overhead and high-performance Network-on-Chip to allow deterministic communication among hardware tasks. This work was published as a part of [33].
• A Scalable AXI Interconnect to enable fast and efficient communication among hardware tasks in multi-FPGA systems. It extends our previous work in [33].
• A Low Overhead Temperature Sensor for Reconfigurable Platforms to measure the die temperature. This work was published in [34].
• A Thermal Aware Resource Manager to monitor the temperature profile of the FPGA die and adjust the performance of running applications. This work extends our previous work in [34].
The organization of this thesis is as follows. Chapter 2 gives an overview of the existing FPGA architectures and explains the design flow associated with implementing hardware applications on FPGAs. It then discusses runtime resource management for reconfigurable platforms and presents related works and their shortcomings. In Chapter 3, we introduce our scalable FPGA architecture and its design methodology. We then discuss the merits and demerits of such an architecture by means of a case study. In Chapter 4, we present our framework for resource management of FPGAs and demonstrate its effectiveness experimentally using a prototype. Chapter 5 introduces our scalable off-chip interconnect and evaluates its performance and overheads on a multi-FPGA system. In Chapter 6, we present our LUT-based temperature sensor and compare its performance with the state of the art. We later build a platform using an array of our sensors to demonstrate self-awareness in runtime resource management. Finally, Chapter 7 concludes this thesis and gives directions for future work.
CHAPTER 2
FPGAs and Resource Management
Historically, Texas Instruments introduced the first programmable logic array (PLA) device in 1970. This device, the TMS2000 [35], had 17 inputs and 18 outputs with 8 JK flip-flops. These devices had to be programmed during production; such devices were termed mask-programmable. They were used in many VLSI devices, mostly for implementing glue/control logic. These devices could implement any set of sum-of-product (SOP) logic equations, and many terms in the equations could be shared by the outputs, as shown in Fig. 2.1(a). Later, in 1980, Monolithic Memories Inc. (MMI) introduced programmable array logic (PAL) devices. The primary difference between PALs and PLAs was that PALs had a fixed OR array and a programmable AND array, while PLAs had both OR and AND arrays programmable, as shown in Fig. 2.1(b). This made the PALs simpler, cheaper and faster. Furthermore, PALs were one-time programmable using on-chip fuses and thus allowed the end user to program them, unlike the majority of PLAs, which were mask-programmable. Later, field-programmable devices called generic array logic (GAL), which were pin-compatible with PALs, were also introduced. Unlike PALs, they could be electrically erased and reprogrammed. These devices grew in flexibility and size. Later, in 1990, more advanced devices called complex programmable logic devices (CPLDs) were introduced. They had an array of PALs with a central routing network to provide even greater flexibility, as shown in Fig. 2.1(c). Although these devices were quite flexible, they had limited storage elements. To resolve this issue, an alternate architecture was developed which consisted of logic blocks surrounded by programmable interconnect, as shown in Fig. 2.1(d). This class of devices was termed field-programmable gate arrays (FPGAs). Xilinx® was the company to introduce the first commercial FPGA, the XC2064, in 1985. This device had only 64 configurable logic blocks (CLBs), with two 3-input lookup tables (LUTs). In the 1990s, FPGAs evolved rapidly in both sophistication and volume of production. By the 2000s, FPGAs like Virtex and Spartan 2 had small static memory blocks called block RAMs (BRAMs) in the fabric. Later, in 2005, even larger BRAMs with hardware multipliers were added to Virtex-2 and Spartan-3 FPGAs. By this time, FPGAs were used not only for implementing control logic but also as hardware accelerators. Later, in 2007, FPGAs had on-chip DSP capabilities and even embedded processors. This allowed the implementation of a complete reconfigurable system-on-chip. Furthermore, they allowed portions of the device to be configured while the rest remained operational. This capability allowed designers to reuse the resources of the FPGA at runtime.
Fig. 2.1: Architectures of different programmable devices; a) Architecture of PLA, b) Architecture of PAL, c) Architecture of CPLD, and d) Architecture of FPGA
In this chapter, we introduce the architecture of modern-day FPGAs in Section 2.1 and take a closer look at the routing resources in terms of scalability. We also look at the processes in the design flow for FPGA implementation and touch upon their runtime complexity. Then we introduce some of the techniques and issues in the runtime resource management of FPGAs in Section 2.2. We then present some of the related works in Section 2.3, and Section 2.4 concludes this chapter.
As mentioned earlier, FPGAs are integrated circuits that can be programmed after fabrication to function as almost any kind of digital circuit. Modern-day FPGAs, as illustrated in Fig. 2.2, consist of an array of programmable logic blocks of potentially different types, including general logic, memory and multiplier blocks, surrounded by a programmable routing fabric that allows the blocks to be interconnected by simply reconfiguring the routing fabric. The array also contains programmable input/output blocks that allow the chip to interface with the outside world. As shown in the figure, the major components of an FPGA are: (1) Configurable Logic Blocks, (2) Memory Blocks (Block RAM), (3) DSP Blocks, (4) Input/Output Blocks, and (5) Routing Interconnect (includes Switch Blocks). Each component is briefly explained in the following paragraphs.
Configurable Logic Blocks (CLBs) themselves comprise Look-up tables (LUTs), RAM, Shift Registers (SRL), Multiplexers, Carry Logic, and Registers (DFF). LUTs are addressable arrays of SRAM cells in which the inputs are connected to the address lines. Thus the inputs select a particular SRAM bit from the array. In other words, we can load the truth table of any logic function into the LUT and the LUT will behave as that logic function. Therefore, by cascading such LUTs, we can implement any boolean equation. The number of inputs of a LUT is an architecture parameter and varies across FPGA families. For example, FPGAs developed prior to Virtex 5 [20] had 4-input LUTs, while later FPGAs have 6-input LUTs. However, LUTs alone are insufficient when we consider the need to hold state information, especially for recursive/iterative computations that depend on the results from previous states. Therefore, we also have registers (DFF) to store the previous state information. Carry logic is used to implement the look-ahead carry chains of adders and subtractors. Multiplexers are used to combine the outputs of the LUTs to create even bigger LUTs. RAM and shift register are alternate operating modes of LUTs. As the names suggest, they allow a more compact implementation of shift registers and RAMs than using the individual registers in the CLBs. The truth table in each LUT, the initial values of registers, the operating mode of each component, and the configuration of the interconnection within a CLB are stored in SRAM cells associated with that CLB. When an FPGA is programmed, the SRAM cells for each CLB are loaded with appropriate values and the internals of the CLB operate as required.
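The LUT behaviour described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `LUT`, its methods, and the choice of treating the first input as the most significant address bit are hypothetical and do not correspond to any vendor primitive.

```python
from itertools import product


class LUT:
    """A k-input lookup table modeled as 2**k one-bit SRAM cells (illustrative only)."""

    def __init__(self, k, truth_table):
        if len(truth_table) != 2 ** k:
            raise ValueError("truth table must have 2**k entries")
        self.k = k
        self.cells = [int(bool(b)) for b in truth_table]

    @classmethod
    def from_function(cls, k, func):
        # Enumerate all input combinations in address order (first input = MSB)
        # and store the function's output for each one.
        table = [func(*bits) for bits in product((0, 1), repeat=k)]
        return cls(k, table)

    def read(self, *inputs):
        # The inputs drive the address lines and select one SRAM bit.
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | int(bool(bit))
        return self.cells[address]


# Two 2-input LUTs cascaded to realise a 3-input function: (a AND b) XOR c.
and_lut = LUT.from_function(2, lambda a, b: a & b)
xor_lut = LUT.from_function(2, lambda a, b: a ^ b)


def and_then_xor(a, b, c):
    return xor_lut.read(and_lut.read(a, b), c)
```

Programming the device corresponds to filling `cells`; evaluating the logic is then a plain memory read, which is why any function of k inputs fits in one k-input LUT.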
Memory Blocks, commonly referred to as Block RAM, are larger RAM components that can be used if required. They are used to implement large addressable arrays of storage efficiently in the FPGA fabric. Like the CLBs, they are also configured during the programming of the FPGA.
DSP Blocks are configurable multipliers and adders. They enable more efficient and faster implementations of arithmetic operations than implementations using CLBs only. These units are particularly useful for implementing applications in signal processing, image processing and compute domains. For this reason, these blocks are named digital signal processing (DSP) blocks.
Input and output blocks allow the FPGA to interface with the outside world. They typically support multiple I/O standards (e.g., LVDS, CMOS) so as to allow the FPGA to communicate with most devices. However, high-speed serial interfaces require special blocks called Multi-Gigabit Transceivers (MGTs). Transceivers allow FPGAs to adopt standards like PCIe, SATA and SGMII.
The interconnect consists of channels of wires connected using switch boxes. A switch box connects wires in one channel to wires in other channels to create an electrical path between them. By using this methodology, we can connect different components together to form any digital circuit. The exact connectivity offered by a switch box is also an architecture parameter and has evolved over different families of FPGAs. Like the CLBs, the switch box also uses SRAM cells to store its configuration.
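As a rough illustration of how a switch box stores its connectivity in SRAM cells, the sketch below models a full crossbar between two channels of equal width, with one configuration bit per crosspoint. The class `CrossbarSwitchBox` and its methods are hypothetical names introduced here, and a full crossbar is an assumption made for simplicity; practical switch boxes typically offer sparser connectivity.

```python
class CrossbarSwitchBox:
    """Full crossbar between a horizontal and a vertical channel (illustrative only).

    Each crosspoint has one SRAM configuration bit, so a box joining two
    channels of `width` wires needs width * width bits.
    """

    def __init__(self, width):
        self.width = width
        self.config = [[0] * width for _ in range(width)]

    def connect(self, h_wire, v_wire):
        # Setting the bit closes the switch between the two wires.
        self.config[h_wire][v_wire] = 1

    def is_connected(self, h_wire, v_wire):
        return self.config[h_wire][v_wire] == 1

    def config_bits(self):
        return self.width * self.width


box = CrossbarSwitchBox(width=4)
box.connect(0, 2)  # route horizontal wire 0 onto vertical wire 2

# Configuration storage for a few channel widths.
bits_by_width = {w: CrossbarSwitchBox(w).config_bits() for w in (4, 8, 16)}
```

In this model, doubling the channel width quadruples the number of configuration bits per switch box.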
Scalability: As the resources increase, the channel size required for a reasonably routable FPGA also grows. This implies a larger switch box. For example, in the case of a crossbar switch box, the number of switches in a switch box will grow quadratically with the increase in channel width (assuming that the channels in both X and Y increase proportionally). This also means that more SRAM cells will be required to store the configuration of the switch box. This area overhead limits the scalability of the FPGA. The channel widths, type of switch boxes, distribution of different types of wire, and number/organization of logic elements are architectural parameters and are evolving with each FPGA family. In addition to the area overheads introduced by the routing interconnect as a result of scaling the FPGA, scaling the clocking network also becomes an issue. This is due to the fact that it becomes harder to keep resources that are far apart from one another synchronized with the same clock due to clock skew. Lastly, it is a well-known fact that a larger die size using a particular manufacturing process will have lower yields than a smaller die size using the same process. A smaller die size essentially means that a defect will affect fewer dies as compared
Fig. 2.3: Generic Design Flow
consisting of gates, registers and wires as shown in stage-I of Fig. 2.3. Such a circuit diagram is called a netlist. The process of converting a high-level description into a netlist is called synthesis. Once the design is synthesized, each component in the netlist is converted into equivalent constructs, called primitives, available on the FPGA (as shown in stage-II of the figure). This process is called translate. The database of primitives used by the translate stage is always generated by the manufacturer of the FPGA. After the translate stage, a gate would be converted into a LUT, a register into a flip-flop within a CLB, and so on. In some cases, vendor-specific synthesis tools may also use the primitives database and directly generate a translated netlist. In this case, the translate stage will only merge the various translated netlists and constraints into one global translated design. Then each component in the translated design is placed at a specific location in the FPGA (as shown in stage-III). This process is called placement. Then the placed components are connected together using the interconnect (in stage-IV). This process is called routing. Finally, the programming file, called the bitstream, is generated (in stage-V) and downloaded into the FPGA. These processes are computationally expensive, especially placement and routing.
Fig. 2.4: Configuration Architectures of FPGAs; (a) Virtex 2 Pro Configuration Architecture, and (b) Configuration Architecture of Virtex 4 and newer FPGAs
Runtime Complexity: FPGA placement has been shown to be an NP-hard combinatorial optimization problem, and therefore no polynomial-time algorithm is known to produce an exact solution [36]. Most placement tools use simulated annealing, which typically requires large computation times. Additionally, these tools also consider timing constraints, which make them even more computationally expensive. Routing is also an NP-complete problem [37]. Typically, most routers use two-pass routing, in which they first perform global routing that balances the densities of all routing channels, followed by detailed routing that assigns specific wiring segments to each connection. Detailed routing is typically performed by directed graph search algorithms which are derivatives of Dijkstra's algorithm [38]. Like the placement algorithms, these routing algorithms also have to take timing into consideration and typically take large computation times to generate good results. This is why vendors are continuously trying to improve these stages, as well as the rest of the flow, both in terms of the result and the runtime, so as to improve overall productivity.
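To give a feel for why placement by simulated annealing is expensive, the following is a minimal sketch of the technique, not the algorithm used by any particular tool. It assumes a square grid of sites, a cost equal to the sum of half-perimeter bounding boxes of the nets, moves that swap two blocks, and a geometric cooling schedule; the function names and all parameter values are hypothetical, and timing constraints are ignored.

```python
import math
import random


def wirelength(placement, nets):
    """Sum of half-perimeter bounding boxes over all nets."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def anneal_placement(blocks, nets, grid, seed=0, t_start=5.0, t_end=0.01,
                     cooling=0.95, moves_per_temp=50):
    """Return (best_placement, initial_cost, best_cost)."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))  # random legal initial placement
    cost = wirelength(placement, nets)
    initial_cost = cost
    best, best_cost = dict(placement), cost
    temp = t_start
    while temp > t_end:
        for _ in range(moves_per_temp):
            # Candidate move: swap the sites of two randomly chosen blocks.
            a, b = rng.sample(blocks, 2)
            placement[a], placement[b] = placement[b], placement[a]
            new_cost = wirelength(placement, nets)
            delta = new_cost - cost
            # Always accept improvements; accept worse moves with a
            # probability that shrinks as the temperature falls.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cost = new_cost
                if cost < best_cost:
                    best, best_cost = dict(placement), cost
            else:
                placement[a], placement[b] = placement[b], placement[a]  # undo
        temp *= cooling
    return best, initial_cost, best_cost


blocks = ["b%d" % i for i in range(9)]
nets = [("b0", "b1"), ("b1", "b2", "b3"), ("b3", "b4"), ("b4", "b5", "b6"),
        ("b6", "b7"), ("b7", "b8"), ("b0", "b8")]
best, initial_cost, best_cost = anneal_placement(blocks, nets, grid=3)
```

Every candidate move re-evaluates the cost, and many moves are tried at each temperature step, so the runtime grows quickly with the number of blocks and nets even in this toy setting.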
Initially, FPGAs only supported static reconfiguration. During this kind of configuration, the device is not active and is only brought up after configuration. Although it was still possible to send partial configurations, the configuration process itself caused the entire device to be inactive. Starting from Virtex 2/2P