FIGURE 3.18 Noise-reduction output circuit. (From Izumikawa, M. et al., IEEE J. Solid-State Circuits, 32, 1, 52, 1997. With permission.)
FIGURE 3.19 Waveforms of the noise-reduction output circuit (solid line) and a conventional output circuit: (a) gate bias, (b) data output, and (c) GND bounce. (From Miyaji, F. et al., IEEE J. Solid-State Circuits, 24, 5, 1213, 1989. With permission.)
inductance of the GND line. Therefore, the address buffer and the ATD circuit are influenced by the GND bounce, and unnecessary signals are generated.
Figure 3.18 shows a noise-reduction output circuit. The waveforms of the noise-reduction output circuit and a conventional output circuit are shown in Fig. 3.19. In the conventional circuit, nodes A and B are connected directly, as shown in Fig. 3.18; its operation and characteristics are shown by the dotted lines in Fig. 3.19. Due to the high-speed driving of transistor M4, the GND potential goes up, and the valid data is delayed by the output ringing. The noise-reduction output circuit consists of one PMOS transistor, two NMOS transistors, one NAND gate, and a delay element (its characteristics are shown by the solid lines in Fig. 3.19). The operation of this circuit is as follows. In the read operation, the control signals CE and OE are at high level and signal WE is at low level. When the data zero output of logical high level is transferred to node C, transistor M1 is cut off, and M2 raises node A to the middle level. Therefore, the peak current that flows into the GND line through transistor M4 is reduced to less than one half that of the conventional circuit, because M4 is driven by the middle level. After a 5-ns delay from the beginning of the middle level, transistor M3 raises node A to the VDD level. As a result, the conductance of M4 becomes maximum, but the peak current is small because of the low output voltage. Therefore, the increase of the GND potential is small, and the output ringing does not appear.
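To give a rough feel for why the two-step gate drive reduces the GND bounce, the following Python sketch models the output pull-down current of M4 and the resulting upward L·di/dt bounce for a single-step drive and for the two-step (middle level, then VDD) drive described above. This is a behavioral illustration only; all element values and the simple transistor model are assumptions, not figures from the cited papers.

# Behavioral sketch (not from the source) comparing the peak GND bounce of a
# conventional single-step gate drive of M4 with the two-step drive described
# above (node A raised to a middle level first, then to VDD).
# All element values below are illustrative assumptions.

L_GND = 5e-9    # assumed GND-line inductance [H]
C_OUT = 10e-12  # assumed output load capacitance [F]
K     = 4e-3    # assumed transconductance factor of M4 [A/V^2]
VDD, VMID, VT = 3.3, 1.8, 0.7
DT    = 1e-9    # time step [s]

def peak_bounce(gate_waveform):
    """Peak upward L*di/dt on the GND line for a given node-A voltage sequence."""
    vout, i_prev, peak = VDD, 0.0, 0.0
    for vgs in gate_waveform:
        vov = max(vgs - VT, 0.0)
        i_sat = K * vov ** 2              # crude square-law saturation current
        i_tri = 2 * K * vov * vout        # crude triode limit at drain voltage vout
        i = min(i_sat, i_tri)             # whichever region conducts less
        peak = max(peak, L_GND * max(i - i_prev, 0.0) / DT)  # positive di/dt only
        vout = max(vout - i * DT / C_OUT, 0.0)               # output node discharges
        i_prev = i
    return peak

conventional = [0.0] + [VDD] * 8               # node A jumps straight to VDD
two_step     = [0.0] + [VMID] * 5 + [VDD] * 3  # middle level for ~5 ns, then VDD

print("conventional: %.0f mV" % (1e3 * peak_bounce(conventional)))
print("two-step:     %.0f mV" % (1e3 * peak_bounce(two_step)))
# With these assumed values the two-step drive roughly halves the peak bounce,
# consistent with the "less than one half" peak current quoted in the text.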
3. Chen, C.-W. et al., "A Fast 32K×8 CMOS Static RAM with Address Transition Detection," IEEE J. Solid-State Circuits, vol. SC-22, no. 4, pp. 533–537, Aug. 1987.
4. Miyaji, F. et al., "A 25-ns 4-Mbit CMOS SRAM with Dynamic Bit-Line Loads," IEEE J. Solid-State Circuits, vol. 24, no. 5, pp. 1213–1217, Oct. 1989.
5. Matsumiya, M. et al., "A 15-ns 16-Mb CMOS SRAM with Interdigitated Bit-Line Architecture," IEEE J. Solid-State Circuits, vol. 27, no. 11, pp. 1497–1502, Nov. 1992.
6. Mizuno, H. and Nagano, T., "Driving Source-Line Cell Architecture for Sub-1-V High-Speed Low-Power Applications," IEEE J. Solid-State Circuits, vol. 31, no. 4, pp. 552–557, Apr. 1996.
7. Morimura, H. and Shibata, N., "A Step-Down Boosted-Wordline Scheme for 1-V Battery-Operated Fast SRAMs," IEEE J. Solid-State Circuits, vol. 33, no. 8, pp. 1220–1227, Aug. 1998.
8. Yoshimoto, M. et al., "A Divided Word-Line Structure in the Static RAM and Its Application to a 64K Full CMOS RAM," IEEE J. Solid-State Circuits, vol. SC-18, no. 5, pp. 479–485, Oct. 1983.
9. Hirose, T. et al., "A 20-ns 4-Mb CMOS SRAM with Hierarchical Word Decoding Architecture," IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1068–1074, Oct. 1990.
10. Itoh, K., Sasaki, K., and Nakagome, Y., "Trends in Low-Power RAM Circuit Technologies," Proceedings of the IEEE, pp. 524–543, Apr. 1995.
11. Nambu, H. et al., "A 1.8-ns Access, 550-MHz, 4.5-Mb CMOS SRAM," IEEE J. Solid-State Circuits, vol. 33, no. 11, pp. 1650–1657, Nov. 1998.
12. Caravella, J.S., "A Low Voltage SRAM for Embedded Applications," IEEE J. Solid-State Circuits, vol. 32, no. 3, pp. 428–432, Mar. 1997.
13. Prince, B., Semiconductor Memories: A Handbook of Design, Manufacture, and Application, 2nd edition, John Wiley & Sons, 1991.
14. Minato, O. et al., "A 20-ns 64K CMOS RAM," in ISSCC Dig. Tech. Papers, pp. 222–223, Feb. 1984.
15. Sasaki, K. et al., "A 9-ns 1-Mbit CMOS SRAM," IEEE J. Solid-State Circuits, vol. 24, no. 5, pp. 1219–1224, Oct. 1989.
16. Seki, T. et al., "A 6-ns 1-Mb CMOS SRAM with Latched Sense Amplifier," IEEE J. Solid-State Circuits, vol. 28, no. 4, pp. 478–482, Apr. 1993.
17. Kushiyama, N. et al., "An Experimental 295 MHz CMOS 4K × 256 SRAM Using Bidirectional Read/Write Shared Sense Amps and Self-Timed Pulse Word-Line Drivers," IEEE J. Solid-State Circuits, vol. 30, no. 11, pp. 1286–1290, Nov. 1995.
18. Izumikawa, M. et al., "A 0.25-µm CMOS 0.9-V 100-MHz DSP Core," IEEE J. Solid-State Circuits, vol. 32, no. 1, pp. 52–60, Jan. 1997.
4
Embedded Memory

Chung-Yu Wu
National Chiao Tung University

4.1 Introduction
4.2 Merits and Challenges
    On-Chip Memory Interface • System Integration • Memory Size
4.3 Technology Integration and Applications
4.4 Design Methodology and Design Space
    Design Methodology
4.5 Testing and Yield
4.6 Design Examples
    A Flexible Embedded DRAM Design • Embedded Memories in MPEG Environment • Embedded Memory Design for a 64-bit Superscalar RISC Microprocessor

4.1 Introduction
As CMOS technology progresses rapidly toward the deep submicron regime, the integration level, performance, and fabrication cost increase tremendously. Thus, low-integration, low-performance small circuits or system chips designed using deep submicron CMOS technology are not cost-effective. Only high-performance system chips that integrate CPUs (central processing units), DSP (digital signal processing) processors or multimedia processors, memories, logic circuits, analog circuits, etc. can afford the deep submicron technology. Such system chips are called system-on-a-chip (SOC) or system-on-silicon (SOS).1,2 A typical example of an SOC chip is shown in Fig. 4.1.
Embedded memory has become a key component of SOC and is more practical than ever for at least two reasons:3
1. Deep submicron CMOS technology affords a reasonable trade-off for integrating large memories with other circuits. It can provide ULSI (ultra large-scale integration) chips with over 10⁹ elements on a single chip. This scale of integration is large enough to build an SOC system. Circuitry of this size inevitably contains different kinds of circuits and technologies. Data processing and storage are the most primitive and basic components of digital circuits, so memory implementation on logic chips has the highest priority. Currently, in quarter-micron CMOS technology, chips with up to 128 Mbits of DRAM and 500 Kgates of logic circuit, or 64 Mbits of DRAM and 1 Mgates of logic circuit, are feasible.
2. Memory bandwidth is now one of the most serious bottlenecks to system performance. Memory bandwidth is one of the performance determinants of current von Neumann-type MPU (microprocessing unit) systems. The speed gap between MPUs and memory devices has widened over the past decade. As shown in Fig. 4.1, MPU speed has improved by a factor of 4 to 20 in the past decade. On the other hand, in spite of exponential progress in storage capacity, minimum access times for each quadrupled storage capacity have improved only by a factor of two, as shown in Fig. 4.2. This is partly due to the I/O speed limitation and to the fact that major efforts in semiconductor memory development have focused on density
and bit cost improvements. This speed gap creates a strong demand for memory integration with the MPU on the same chip. In fact, many MPUs with cycle times better than 60 ns have on-chip memories. The new trend in MPUs (i.e., RISC architecture) is another driving force for embedded memory, especially for cache applications.4 RISC architecture is strongly dependent on memory bandwidth, so high-performance, non-ECL-based RISC MPUs with more than 25 to 50 MHz operation must be equipped with embedded cache on the chip.
4.2 Merits and Challenges
The main characteristics of embedded memories can be summarized as follows.5
4.2.1 On-Chip Memory Interface
Advantages include:
1. Replacing off-chip drivers with smaller on-chip drivers can reduce power consumption significantly, as large board wire capacitive loads are avoided. For instance, consider a system which needs a 4-Gbyte/s bandwidth and a bus width of 256 bits. A memory system built with discrete SDRAMs (16-bit interface at 100 MHz) would require about 10 times the power of an embedded DRAM with an internal 256-bit interface.
2. Embedded memories can achieve much higher fill frequencies6 than discrete memories. The fill frequency is defined as the bandwidth (in Mbit/s) divided by the memory size in Mbit; that is, the fill frequency is the number of times per second a given memory can be completely filled with new data. This is because the on-chip interface can be up to 512 bits wide, whereas discrete memories are limited to 16 to 64 bits. Continuing the above example, it is possible to make a 4-Mbit embedded DRAM with a 256-bit interface. In contrast, it would take 16 discrete 4-Mbit chips (256 K × 16) to achieve the same width, so the granularity of such a discrete system is 64 Mbits; but the application may only call for, say, 8 Mbits of memory (see the sketch after this list).
3. As interface wire lengths can be optimized for the application in embedded memories, lower propagation times and thus higher speeds are possible. In addition, noise immunity is enhanced.
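As a back-of-the-envelope illustration of the fill-frequency and granularity arguments in point 2 above, the following Python sketch compares an embedded DRAM with a wide on-chip interface against a system built from discrete ×16 parts. The numeric values follow the example in the text; the helper function name is ours.

# Illustrative sketch; the assumed values follow the example in the text.

def fill_frequency(bandwidth_mbit_s, size_mbit):
    """Fill frequency = bandwidth / size: how many times per second the
    whole memory can be rewritten with new data."""
    return bandwidth_mbit_s / size_mbit

# Embedded DRAM: 4 Mbit with a 256-bit interface at 100 MHz.
edram_bw = 256 * 100                          # Mbit/s
print(fill_frequency(edram_bw, 4))            # -> 6400 fills per second

# Discrete solution: each 4-Mbit chip has a 16-bit interface at 100 MHz.
# Reaching the same 256-bit width takes 256 / 16 = 16 chips, so the minimum
# (granularity) of such a system is 16 * 4 = 64 Mbit, even if the application
# only needs, say, 8 Mbit.
chips = 256 // 16
discrete_bw = chips * 16 * 100
print(fill_frequency(discrete_bw, chips * 4)) # -> 400 fills per second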
Challenges and disadvantages include:
FIGURE 4.1 An example of system-on-a-chip (SOC).
1. Although the power consumption per system decreases, the power consumption per chip may increase. Therefore, the junction temperature may increase and the memory retention time may decrease. However, it should be noted that memories are usually low-power devices.
2. Some sort of minimal external interface is still needed in order to test the embedded memory. The hybrid chip is neither a memory nor a logic chip. Should it be tested on a memory or logic tester, or on both?
4.2.2 System Integration
Advantages include:
1. Higher system integration saves board space, packages, and pins, and yields better form factors.
2. A pad-limited design may be transformed into a non-pad-limited one by choosing an embedded solution.
3. Better speed scalability, along with CMOS technology scaling.
Challenges and disadvantages include:
1. More expensive packages may be needed. Also, memories and logic circuits require different power supplies. Currently, the DRAM power supply (2.5 V) is less than the logic power supply (3.3 V), but this situation will reverse in the future due to the back-biasing problem in DRAMs.
2. The embedded memory process adds another technology for which libraries must be developed and characterized, macros must be ported, and design flows must be tuned.
3. Memory transistors are optimized for low leakage currents, yielding low transistor performance, whereas logic transistors are optimized for high saturation currents, yielding high leakage currents. If a compromise is not acceptable, expensive extra manufacturing steps must be added.
4. Memory processes have fewer layers of metal than do logic circuit processes. Layers can be added at the expense of fabrication cost.
5. Memory fabs are optimized for large-volume production of identical products, for high capacity utilization, and for high yield. Logic fabs, while sharing these goals, are slanted toward lower batch sizes and faster turnaround time.
4.2.3 Memory Size
The advantage is that:
• Memory size can be customized and memory architecture can be optimized for dedicated applications.
Challenges and disadvantages include:
• The system designer must know the exact memory requirement at the time of design. Later extensions are not possible, as there is no external memory interface. From the customer's point of view, the memory component goes from a commodity to a highly specialized part that may command premium pricing. As memory fabrication processes are quite different, second-sourcing problems abound.
4.3 Technology Integration and Applications3,5
The memory technologies for embedded memories vary widely, from ROM to RAM, as listed in Table 4.1.3 In choosing among these technologies, one of the most important figures of merit is compatibility with the logic process.
1. Embedded ROM: ROM technology has the highest compatibility with the logic process. However, its application is rather limited. PLA, or ROM-based logic design, is a well-used but rather special case of the embedded ROM category. Other applications are limited to storage for microcode or well-debugged control code. A large ROM for table or dictionary applications may be implemented in generic ROM chips with lower bit cost.
2. Embedded EPROM/E²PROM: EPROM/E²PROM technology includes high-voltage devices and/or thin tunneling insulators, which require two to three additional mask and processing steps over the logic process. Due to its unique functionality, PROM-embedded MPUs7 are widely used. To minimize process overhead, a single-poly E²PROM cell has been developed.8 Counterparts to this approach are piggy-back packaged EPROM/MPUs or battery-backed SRAM/MPUs. However, considering process technology innovation, on-chip PROM implementation is winning the game.
3. Embedded SRAM is one of the memories most frequently embedded in logic chips. Major applications are high-speed on-chip buffers such as TLBs, caches, register files, etc. Table 4.2 gives a comparison of some approaches to SRAM integration. A six-transistor cell approach may be the most process-compatible, unless special structures used in standard 6-Tr SRAMs are employed; its bit density is not very high. Polysilicon-resistor-load 4-Tr cells provide higher bit density at the cost of the process complexity associated with the additional polysilicon-layer resistors. The process complexity and storage density may be compromised to some extent by using a single layer of polysilicon. In the case of a polysilicon-resistor-load SRAM, which may have relaxed specifications with respect to data-holding current, the requirement on the substrate structure to achieve good soft-error immunity is more relaxed as compared to low-standby generic SRAMs. Therefore, the TFT (thin-film transistor) load cell may not be required for several generations, due to its complexity.
4. Embedded DRAM (eDRAM) is not as widely used as SRAM. Its high-density features, however, are very attractive. Several different embedded DRAM approaches are listed in Table 4.3. A trench or stacked cell, as used in commodity DRAMs, has the highest density, but the complexity is also high. The cost is seldom attractive when compared to a multi-chip approach using standard DRAMs, which is the ultimate in achieving low bit cost. This type of cell is well suited for ASM (application-specific memory), which will be described in the next section.
TABLE 4.2 Embedded SRAM Options
TABLE 4.1 Embedded Memory Technologies and Applications
A planar cell with multiple (double) polysilicon structures is also suitable for memory-rich applications.9 A gate-capacitor storage cell approach can be fully compatible with the logic process while providing relatively high density.10 The four-Tr cell (a 4-Tr SRAM cell minus the resistive load) provides the same speed and density as SRAM and full compatibility with the logic process, but requires refresh operation.11
4.4 Design Methodology and Design Space3,5
4.4.1 Design Methodology
The design style of an embedded memory should be selected according to the application. This choice is critically important for the best balance of performance and cost. Figure 4.2 shows the various design styles used to implement embedded memories.
The most primitive semi-custom design style is based on the memory cell. It provides high flexibility in memory architecture and short design TAT (turnaround time). However, its memory density is the lowest among the various approaches.
The structured array is a kind of gate array that has a dedicated memory array region in the master chip, configurable into several memory organizations by metal-layer customization. Therefore, it provides relatively high density and short TAT. Configurability and a fixed maximum memory area are the limitations of this approach.
TABLE 4.3 Embedded DRAM Technology Options
FIGURE 4.2 Various design styles for embedded memories.
The standard cell design has high flexibility to the extent that the cell library contains a variety of embedded memory designs; but in many cases, a new system design requires new memory architectures. The memory performance and density are high, but the mask-to-chip TAT tends to be long.
Super-integration is an approach that integrates existing chip designs, including I/O pads, so the design TAT is short and proven designs can be used. However, the available memory architectures are limited and the mask-to-chip TAT is long.
Hand-craft design (which does not necessarily mean the literal use of human hands, but heavily interactive design) provides the most flexibility, high performance, and high density; but its design TAT is the longest. Thus, the design cost is the highest, so the applications are limited to high-volume and/or high-end systems. Standard memories and well-defined ASMs, such as video memories,12 integrated cache memories,13 and high-performance MPU-embedded memories, are good examples.
An eDRAM (embedded DRAM) designer faces a design space that contains a number of dimensions not found in standard ASICs, some of which are reviewed below. The designer has to choose from a wide variety of memory cell technologies, which differ in the number of transistors and in performance. Also, both a DRAM technology and a logic technology can serve as the starting point for embedding DRAM. Choosing a DRAM technology as the base technology will result in high memory density but suboptimal logic performance. On the other hand, starting with a logic technology will result in poor memory density but fast logic circuits. To some extent, one can therefore trade logic speed against logic area. Finally, it is also possible to develop a process that gives the best of both worlds, most likely at higher expense. Furthermore, the designer can trade logic area for memory area in a way heretofore impossible.
Large memories can be organized in very different ways. Free parameters include the number of memory banks (which allows different pages to be open at the same time), the length of a single page, the word width, and the interface organization. Since eDRAM allows one to integrate both SRAM and DRAM, the decision between on/off-chip DRAM and SRAM/DRAM partitioning must be made.
In particular, the following problems must be solved at the system level:
• Optimizing the memory allocation
• Optimizing the mapping of the data into memory such that the sustainable memory bandwidth approaches the peak bandwidth
• Optimizing the access scheme to minimize the latency for the memory clients and thus minimize the necessary FIFO depth (a rough sizing sketch follows this list)
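As a rough illustration of the last point, the sketch below estimates the FIFO depth a memory client needs in order to ride out the worst-case access latency. All parameter values are invented for the example; they are not taken from the text.

import math

# Hypothetical client: drains data at a steady rate while the shared eDRAM
# serves other clients, so a request may be delayed (assumed numbers).
consume_rate_words_per_cycle = 0.5    # assumed client drain rate
worst_case_latency_cycles    = 24     # assumed worst-case access latency
burst_length_words           = 8      # assumed words returned per access

# The FIFO must cover what the client drains while waiting for a burst,
# rounded up to whole bursts so that refills stay burst-aligned.
words_drained = worst_case_latency_cycles * consume_rate_words_per_cycle
fifo_depth = math.ceil(words_drained / burst_length_words) * burst_length_words
print(fifo_depth)   # -> 16 words for these assumed numbers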
These goals are to some extent independent of whether or not the memory is embedded. However, the number of free parameters available to the system designer is much larger in an embedded solution, and the possibility of approaching the optimal solution is correspondingly greater. On the other hand, the complexity is also increased. It is therefore incumbent upon eDRAM suppliers to make the trade-offs transparent and to quantize the design space into a set of understandable, if slightly suboptimal, solutions.
4.5 Testing and Yield3,5
Although embedded memory occupies a minor portion of the total chip area, the device density in the embedded memory area is generally overwhelming. The failure distribution is naturally localized in the memory areas. In other words, embedded memory is a determinant of total chip yield to the extent that the memory portion has higher device density weighted by its silicon area.
For a large memory-embedded VLSI, memory redundancy helps enhance the chip yield. Therefore, embedded-memory testing, combined with the redundancy scheme, is an important issue. Means for direct measurement of the embedded memory, both on the wafer and in assembled samples, must be implemented.
In addition to off-chip measurement, on-chip measurement circuitry is essential for accurate AC evaluation and debugging. Testing DRAMs is very different from testing logic. The main points to note are discussed below.
• The fault models explicitly tested for in DRAMs are much richer. They include bit-line and word-line failures, crosstalk, retention time failures, etc.
• The test patterns and test equipment are highly specialized and complex. As DRAM test programs include a lot of waiting, DRAM test times are quite long, and test costs are a significant fraction of the total cost.
• As DRAMs include redundancy, the order of testing is: (1) pre-fuse testing, (2) fuse blowing, (3) post-fuse testing. There are thus two wafer-level tests.
The implication for eDRAMs is that a high degree of parallelism is required in order to reduce test costs. This necessitates on-chip manipulation and compression of test data in order to reduce the off-chip interface width. For instance, Siemens Corp. offers a synthesizable test controller supporting algorithmic test pattern generation (ATPG) and expected-value comparison [partial built-in self-test (BIST)].
Another important aspect of eDRAM testing is the target quality and reliability. If eDRAM is used for graphics applications, occasional "soft" problems, such as a too-short retention time of a few cells, are much more acceptable than if the eDRAM is used for program data. The test concept should take this cost-reduction potential into account, ideally in conjunction with the redundancy concept.
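To make the idea of algorithmic test pattern generation concrete, here is a minimal sketch of a classic march-style memory test. March C- is used only as a familiar example; the text does not say which algorithm the Siemens controller implements, and the tiny fault-injected memory model is purely illustrative.

# Minimal sketch of an algorithmic (march-style) memory test.
# March C- elements: up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); down(r0)

def march_c_minus(mem, size):
    """Return a list of (address, expected, read) mismatches."""
    errors = []
    def sweep(addresses, ops):
        for a in addresses:
            for op, val in ops:
                if op == "w":
                    mem.write(a, val)
                else:                       # read and compare
                    got = mem.read(a)
                    if got != val:
                        errors.append((a, val, got))
    up, down = range(size), range(size - 1, -1, -1)
    sweep(up,   [("w", 0)])
    sweep(up,   [("r", 0), ("w", 1)])
    sweep(up,   [("r", 1), ("w", 0)])
    sweep(down, [("r", 0), ("w", 1)])
    sweep(down, [("r", 1), ("w", 0)])
    sweep(down, [("r", 0)])
    return errors

class FaultyRAM:
    """Toy memory model with one stuck-at-0 cell, for demonstration only."""
    def __init__(self, size, stuck_at_zero=None):
        self.cells = [0] * size
        self.stuck = stuck_at_zero
    def write(self, addr, bit):
        self.cells[addr] = 0 if addr == self.stuck else bit
    def read(self, addr):
        return self.cells[addr]

print(march_c_minus(FaultyRAM(16, stuck_at_zero=5), 16))
# -> mismatches reported at address 5, since the stuck-at-0 cell fails every r1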
A final aspect is that a number of business models are common in eDRAM, from foundry business to ASIC-type business. The test concept should thus support testing the memory from either a logic tester or a memory tester, so that the customer can do memory testing on a logic tester if required.
4.6 Design Examples
Three examples of embedded memory designs are described. The first is a flexible embedded DRAM design from Siemens Corp.5 The second is the embedded memories in an MPEG environment from Toshiba Corp.14 The last is the embedded memory design for a 64-bit superscalar RISC microprocessor from Toshiba Corp. and Silicon Graphics, Inc.15
4.6.1 A Flexible Embedded DRAM Design5
There is an increasing gap between processor and DRAM speed: processor performance increases by 60% per year, in contrast to only a 10% improvement in the DRAM core. Deep cache structures are used to alleviate this problem, albeit at the cost of increased latency, which limits the performance of many applications. Merging a microprocessor with DRAM can reduce the latency by a factor of 5 to 10, increase the bandwidth by a factor of 50 to 100, and improve the energy efficiency by a factor of 2 to 4.16
Developing memory is a time-consuming task and cannot be compared with a high-level logic design methodology that allows fast design cycles. Thus, a flexible memory concept is a prerequisite for a successful application of eDRAM. Its purpose is to allow fast construction of application-specific memory blocks that are customized in terms of bandwidth, word width, memory size, and the number of memory banks, while guaranteeing first-time-right designs accompanied by all views, test programs, etc.
A powerful eDRAM approach that permits fast and safe development of embedded memory modules is described. The concept, developed by Siemens Corp. for its customers, uses a 0.24-µm technology based on its 64/256-Mbit SDRAM process.5 Key features of the approach include:
• Two building-block sizes, 256 Kbit and 1 Mbit; memory modules with these granularities can be constructed
• Large memory modules, from 8 to 16 Mbit upwards, achieving an area efficiency of about 1 Mbit/mm²
• Embedded memory sizes up to at least 128 Mbits
• Interface widths ranging from 16 to 512 bits per module
• Flexibility in the number of banks as well as the page length
• Different redundancy levels, in order to optimize the yield of the memory module for the specific chip
• Cycle times better than 7 ns, corresponding to clock frequencies better than 143 MHz
• A maximum bandwidth per module of about 9 Gbyte/s (see the sketch after this list)
• A small, synthesizable BIST controller for the memory (see next section)
• Test programs, generated in a modular fashion
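The peak-bandwidth figure in the list above follows directly from the widest interface and the cycle time; the small sketch below reproduces the arithmetic, using only the maxima quoted in the list.

# Peak bandwidth of one memory module, from the maxima quoted above.
interface_bits = 512          # widest per-module interface (bits)
cycle_time_ns  = 7.0          # cycle times better than 7 ns

clock_mhz   = 1e3 / cycle_time_ns                    # ~143 MHz
bytes_per_s = (interface_bits / 8) * clock_mhz * 1e6
print("%.0f MHz, %.1f Gbyte/s" % (clock_mhz, bytes_per_s / 1e9))
# -> roughly 143 MHz and about 9.1 Gbyte/s, matching the figures in the list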
Siemens Corp. has made eDRAMs since 1989 and has a number of possible applications of its eDRAM approach in the pipeline, including TV scan-rate converters, TV picture-in-picture chips, modems, speech-processing chips, hard-disk drive controllers, graphics controllers, and networking switches. These applications cover the full range of memory sizes (from a few Mbits to 128 Mbits), interface widths (from 32 to 512 bits), and clock frequencies (from 50 to 150 MHz), which demonstrates the versatility of the concept.
4.6.2 Embedded Memories in MPEG Environment14
Recently, multimedia LSIs, including MPEG decoders, have been drawing attention. The key requirements in realizing multimedia LSIs are their low-power and low-cost features. This example presents embedded-memory-related techniques to achieve these requirements, which can be considered a review of state-of-the-art embedded memory macro techniques applicable to other logic LSIs.
Figure 4.3 shows the embedded memory macros associated with the MPEG2 decoder. Most of the functional blocks use their own dedicated memory blocks and, consequently, the memory macros are rather small and distributed over the chip. The memory blocks are also connected to a central address/data bus to implement a direct test mode.
FIGURE 4.3 Block diagram of MPEG2 decoder LSI.
An input buffer for the IDCT is shown in Fig. 4.4. Eight 16-bit data words, D0 to D7, come sequentially from the inverse quantization block. The stored data must then be read out as 4-bit chunks orthogonal to the input sequence. The 4-bit data is used to address a ROM in the IDCT to realize a distributed-arithmetic algorithm.
The circuit diagram of an orthogonal memory is shown in Fig. 4.5. It realizes the above-mentioned functionality with 50% of the area and the power that would be needed if the IDCT input buffer were built with flip-flops. In the orthogonal memory, word-lines and bit-lines run both vertically and horizontally to achieve this functionality. The macro size of the orthogonal memory is 420 µm × 760 µm, with a memory cell size of 10.8 µm × 32.0 µm.
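The access pattern of the orthogonal buffer, writing whole 16-bit words row by row and then reading 4-bit column slices across the words, can be illustrated with a small Python model. The bit ordering and the grouping of D0–D3 and D4–D7 into the two nibbles are assumptions made for the illustration, not details taken from the chip.

# Toy model of the IDCT input buffer's orthogonal access:
# write eight 16-bit words D0..D7, then read 4-bit slices "across" the words.

WORDS, BITS = 8, 16

def write_rows(data):
    """Store each 16-bit word as a row of bits (bit 0 = LSB)."""
    assert len(data) == WORDS
    return [[(d >> b) & 1 for b in range(BITS)] for d in data]

def read_column_nibbles(rows, bit_position):
    """Read orthogonally: take one bit position from words D0..D3 and D4..D7,
    forming two 4-bit values that address the distributed-arithmetic ROM."""
    column = [rows[w][bit_position] for w in range(WORDS)]
    low  = sum(bit << i for i, bit in enumerate(column[:4]))
    high = sum(bit << i for i, bit in enumerate(column[4:]))
    return low, high

rows = write_rows([0x1234, 0xABCD, 0x0F0F, 0xFFFF, 0x0001, 0x8000, 0x5555, 0xAAAA])
for b in range(BITS):
    print(b, read_column_nibbles(rows, b))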
FIGURE 4.4 Input buffer structure for IDCT.
FIGURE 4.5 Circuit diagram of orthogonal memory.
FIFOs and other dual-port memories are designed using a single-port RAM operated twice in one clock cycle to reduce area, as shown in Fig. 4.6, since a dual-port memory cell is twice as large as a single-port memory cell.
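A minimal behavioral sketch of this time-multiplexing idea is shown below: one single-port array is accessed on two phases of each clock cycle, one phase serving the write port and one the read port. The phase assignment and the FIFO wrapper are our own illustration, not details of the Toshiba design.

# Sketch: emulating a dual-port FIFO with a single-port RAM accessed twice
# per clock cycle (phase 0 = write port, phase 1 = read port).

class SinglePortRAM:
    def __init__(self, depth):
        self.mem = [0] * depth
    def access(self, addr, data=None):            # one port: read or write
        if data is None:
            return self.mem[addr]
        self.mem[addr] = data

class TimeMultiplexedFIFO:
    def __init__(self, depth):
        self.ram, self.depth = SinglePortRAM(depth), depth
        self.wp = self.rp = 0
    def clock(self, write_data=None, do_read=False):
        """One clock cycle = two sequential accesses to the same RAM."""
        read_data = None
        if write_data is not None:                # phase 0: write access
            self.ram.access(self.wp % self.depth, write_data)
            self.wp += 1
        if do_read and self.rp < self.wp:         # phase 1: read access
            read_data = self.ram.access(self.rp % self.depth)
            self.rp += 1
        return read_data

fifo = TimeMultiplexedFIFO(8)
fifo.clock(write_data=11)
fifo.clock(write_data=22)
print(fifo.clock(write_data=33, do_read=True))    # -> 11 (oldest entry)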
All memory blocks are synchronous self-timed macros and contain address pipeline latches; otherwise, the timing design would take more time, since the lengths of the interconnections between the latches and a decoder vary from bit to bit. Memory power management is carried out using a Memory Macro Enable signal when a memory macro is not accessed, which reduces the total memory power to 60%.
The flip-flop (F/F) is one of the memory elements in logic LSIs. Since digital video LSIs tend to employ several thousand F/Fs on a chip, the design of the F/F is crucial for small area and low power. The optimized F/F with hold capability is shown in Fig. 4.7. Due to the optimized smaller transistor sizes, especially for the clock input transistors, and a minimized layout accommodating a multiplexer and a D-F/F in one cell, 40% smaller power and area are realized compared with a normal ASIC F/F.
Establishing full testability of on-chip memories without much overhead is another important issue. Table 4.4 compares three on-chip memory test strategies: built-in self-test (BIST), scan test, and direct test. The direct test mode, where all memories can be directly accessed from outside in a test mode, is implemented because of its inherently small area. In test mode, the DRAM interface pads are turned into test pins and can access each memory block through the internal buses, as shown in Figs. 4.3 and 4.8.
FIGURE 4.6 Realizing dual-port memory with a single-port memory (FIFO case).
FIGURE 4.7 Optimized flip-flop.
The present MPEG2 decoder contains a RISC whose firmware is stored in an on-chip ROM. In order to make debugging easy and extensive, an instruction RAM is placed outside the pads in parallel with the instruction ROM and activated by an Al masterslice in the initial debugging stage, as shown in Fig. 4.9. For a sample chip mounted in a plastic package, the instruction RAM is cut off by a scribe line. This scheme enables extensive debugging and early sampling at the same time for firmware-ROM-embedded LSIs.
4.6.3 Embedded Memory Design for a 64-bit Superscalar RISC Microprocessor15
High-performance embedded memory is a key component in VLSI systems because of its high speed and wide bus width, eliminating inter-chip communication. In addition, multi-ported buffer memories are often demanded on a chip. Furthermore, a dedicated memory architecture that meets the special constraints of the system can neatly reduce the system critical path.
On the other hand, there are several issues in embedded RAM implementation. The specialty or variety of the memories could increase design cost and chip cost. Reading very wide data causes large power dissipation. The test time of the chip could be increased because of the large memory. Therefore, design efficiency, careful power bus design, and careful design for testability are necessary.
TABLE 4.4 Comparison of Various Memory Test Strategies
FIGURE 4.8 Direct test architecture for embedded memories.
(○: Good, △: Fair, ×: Poor)
TFP is a high-speed, highly concurrent 64-bit superscalar RISC microprocessor that can issue up to four instructions per cycle.17,18 Very wide bandwidth of the on-chip caches is vital in this architecture. The design of the embedded RAMs, especially the caches and TLB, is reported here.
The TFP integer unit (IU) chip implements two integer ALU pipelines and two load/store pipelines. The block diagram is shown in Fig. 4.10, and the five-stage pipeline in Fig. 4.11. In the TFP IU chip, RAM blocks occupy a dominant part of the real estate. The die size is 17.3 mm × 17.3 mm. In addition to the caches, TLB, and register file, the chip also includes two buffer queues: the SAQ (store address queue) and the FPQ (floating-point queue). Seventy-one percent of the overall 2.6 million transistors are used for memory cells. The transistor counts of each block are listed in Table 4.5.
The first generation of the TFP chip was fabricated using Toshiba's high-speed 0.8-µm CMOS technology: double poly-Si, triple metal, and triple well. A deep n-well was used in the PLL and the cache cell arrays in order to decouple these circuits from the noisy substrate and power lines of the CMOS logic part. The chip operates up to 75 MHz at 3.1 V and 70°C, and the peak performance reaches 300 MIPS.
Features of each embedded memory are summarized in Table 4.6. The instruction, branch, and data caches are direct-mapped because of the faster access time. High-resistive poly-Si load cells are used for these caches, since packing density is crucial for performance.
FIGURE 4.9 Instruction RAM masterslice for code debugging.
FIGURE 4.10 Block diagram of TFP IU.
The instruction cache (ICACHE) is 16 KB of virtually addressed memory. It provides four instructions (128 bits wide) per cycle. The branch cache (BCACHE) contains branch target addresses with one flag bit to indicate a predicted branch. BCACHE contains 1-K entries and is virtually indexed in parallel with ICACHE. The data cache (DCACHE) is 16 KB, dual-ported, and supports two independent memory instructions (two loads, or one load and one store) per cycle. The total memory bandwidth of ICACHE and DCACHE reaches 2.4 GB/s at 75 MHz. Floating-point load/store data bypass DCACHE and go directly to the bigger external global cache.17,19 DCACHE is virtually indexed and physically tagged.
The TLB is a dual-ported, three-set-associative memory containing 384 entries. A unique address comparison scheme is employed here, which will be described later in this section. The TLB supports several different page sizes, ranging from 4 KB to 16 MB. It is indexed by the low-order 7 bits of the virtual page number (VPN). The index is hashed by exclusive-OR with low-order ASID (address space identifier) bits so that many processes can coexist in the TLB at one time.
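A small sketch of this indexing scheme is given below: the set index comes from the low 7 bits of the VPN, XOR-hashed with low-order ASID bits. Field widths other than the 7-bit index (the base page size used for extraction, the ASID width) are assumptions made for the illustration.

# Sketch of the TFP-style hashed TLB index (field widths partly assumed).

PAGE_SHIFT = 12                 # assume the 4-KB base page for index extraction
INDEX_BITS = 7                  # TLB is indexed by the low-order 7 bits of the VPN
INDEX_MASK = (1 << INDEX_BITS) - 1

def tlb_index(virtual_address, asid):
    vpn = virtual_address >> PAGE_SHIFT
    # XOR the low VPN bits with low-order ASID bits so that the same page of
    # different processes lands in different sets, letting many processes
    # coexist in the TLB at one time.
    return (vpn ^ asid) & INDEX_MASK

print(tlb_index(0x00403000, asid=0x05))   # same page, two processes ...
print(tlb_index(0x00403000, asid=0x22))   # ... map to different TLB sets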
Since several different RAMs are used in the TFP chip, design efficiency is important. Consistent circuit schemes are used for each of the caches and the TLB RAMs. Layout is started from the block that has the tightest area restriction, and the created layout modules are exported to the other blocks with small modifications.
The basic block diagram of the cache blocks is shown in Fig. 4.12, and the timing diagram is shown in Fig. 4.13. Unlike a register file or other smaller queue buffers, these blocks employ dual-railed bit-lines. To achieve 75-MHz operation in the worst-case condition, the block should operate at 110 MHz under typical conditions. In this targeted 9-ns cycle time, address generation is done about 3 ns before the end of the cycle, as shown in Fig. 4.11. To take advantage of this large address setup time, the address is received by a transparent latch, TLAT_N (transparent while the clock is low), instead of a flip-flop. Thus, decode is started as
FIGURE 4.11 TFP IU pipelining.
TABLE 4.5 Transistor Counts
TABLE 4.6 Summary of Embedded RAM Features
FIGURE 4.12 Basic RAM block diagram.
soon as address generation is done and is finished before the end of the cycle. Another transparent latch, TLAT_P (transparent while the clock is high), is placed after the sense amplifier, and it holds the read data while the clock is low.
The word-line (WL) is enabled while the clock is high. Since the decode is already finished, the WL can be driven high as fast as possible. The sense amplifier is enabled (SAE) with a certain delay after the word-line. The paired current-mirror sense amplifier is chosen since it provides good performance without overly strict SAE timing. The bit-line is precharged and equalized while the clock is low. The clock-to-data delay of DCACHE, which is the biggest array, is 3.7 ns under typical conditions: clock-to-WL is 0.9 ns and WL-to-data is 2.8 ns. Since the on-chip PLL provides a 50%-duty clock, timing pulses such as SAE or WE (write enable) are created from the system clock by delaying the positive and negative edges appropriately.
As both the word-line and the sense amplifier are enabled in just half of one cycle, the current dissipation is reduced by half. However, the power dissipation and current spike are still an issue because the read/write data width is extremely large. A robust power bus matrix is applied in the cache and TLB blocks so that the dc voltage drop at the worst place is limited to 60 mV inside the block.
From a minimum-cycle-time viewpoint, write is more critical than read because write needs a bigger bit-line swing, and the bit-line must be precharged before the next read. To speed up the precharge, precharge circuitry is placed on both the top and bottom of the bit-line. In addition, the write circuitry dedicated to cache refill is placed on the top side of DCACHE and ICACHE to minimize the wire delay of the write data from the input pad. A write data bypass selector is implemented so that the write data is available as read data in the same cycle with no timing penalty.
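Behaviorally, the write data bypass selector simply forwards the incoming write data to the read port whenever the read and write addresses collide in the same cycle. A minimal sketch with our own naming is shown below.

# Sketch of a write-data bypass on a simple RAM model: a read that hits the
# address being written in the same cycle returns the new data directly,
# instead of waiting for the array to be updated and re-read.

class BypassedRAM:
    def __init__(self, depth):
        self.array = [0] * depth

    def cycle(self, read_addr, write_enable=False, write_addr=None, write_data=None):
        if write_enable and write_addr == read_addr:
            read_data = write_data              # bypass selector path
        else:
            read_data = self.array[read_addr]   # normal array read
        if write_enable:
            self.array[write_addr] = write_data
        return read_data

ram = BypassedRAM(16)
print(hex(ram.cycle(read_addr=3, write_enable=True, write_addr=3, write_data=0xAB)))  # 0xab, bypassed
print(hex(ram.cycle(read_addr=3)))                                                    # 0xab, from the array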
Virtual-to-physical address translation and the subsequent cache hit check are almost always on one of the critical paths in a microprocessor. This is because the cache tag comparison has to wait for the VTLB (the RAM that contains the virtual address tags) search operation and the subsequent physical address selection from the PTLB (the RAM that contains the physical addresses).20 A timing example of the conventional scheme is shown in Fig. 4.14. In TFP, the DCACHE tag is directly compared with all three sets of PTLB data in parallel, which at this stage are merely candidate physical addresses, without waiting for the VTLB hit results. The block diagram and timing are shown in Figs. 4.15 and 4.16. By the time this hit check of the cache tag is done, the VTLB hit results are just ready, and they select the PTLB hit result immediately. The "ePmatch" signal in Fig. 4.16 is the overall cache hit result. Although three times more comparators are needed, this scheme saves about 2.8 ns as compared to the conventional one.
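The difference between the two hit-check orders can be sketched in a few lines of Python: the conventional path waits for the VTLB match before comparing the cache tag, while the TFP path compares the cache tag against all three PTLB candidates in parallel and lets the late VTLB result pick among precomputed match bits. The set count and the "ePmatch" name follow the text; everything else is illustrative.

# Sketch of conventional vs. TFP-style physical cache hit check (3-way TLB).

def conventional_hit(vtag, vtlb_set, ptlb_set, dcache_tag):
    # 1) search the VTLB, 2) select the one physical address, 3) compare the cache tag
    for way in range(3):
        if vtlb_set[way] == vtag:
            return ptlb_set[way] == dcache_tag
    return False

def tfp_hit(vtag, vtlb_set, ptlb_set, dcache_tag):
    # Compare the DCACHE tag with ALL three PTLB candidates in parallel
    # (these comparisons do not wait for the VTLB search) ...
    pmatch = [ptlb_set[way] == dcache_tag for way in range(3)]
    # ... then the late VTLB hit result merely selects among the match bits.
    vmatch = [vtlb_set[way] == vtag for way in range(3)]
    return any(v and p for v, p in zip(vmatch, pmatch))   # overall "ePmatch"

vtlb, ptlb = [0x11, 0x22, 0x33], [0xA0, 0xB0, 0xC0]
print(conventional_hit(0x22, vtlb, ptlb, 0xB0))   # True
print(tfp_hit(0x22, vtlb, ptlb, 0xB0))            # True: same result, shorter critical path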
FIGURE 4.13 RAM timing diagram.
In the TLB, the sense amplifiers of each port are placed separately on the top and bottom of the array to mitigate the tight layout pitch of the circuit. The large amount of wiring creates problems around the VTLB, PTLB, and DTAG (DCACHE tag RAM) from both layout and critical-path viewpoints. This was solved by piling them up to build a data path (APATH: address data path), making the most of the metal-3 vertical interconnections. Although this metal-3 signal line runs over the TLB arrays in parallel with the metal-1 bit-line, the TLB access time is not degraded, since the horizontal metal-2 word-line shields the bit-line from the coupling noise. The data fields of the three sets are scrambled to make the data path design tidy; the 39-bit (in VTLB) and 28-bit (in PTLB) comparators of each set consist of an optimized AND-tree. Wired-OR-type comparators were rejected because a longer wired-OR node in this array configuration would incur a speed penalty.
FIGURE 4.14 Conventional physical cache hit check.
FIGURE 4.15 TFP physical cache hit check.
As TFP supports different page sizes, the VPN and PFN (page frame number) fields change, depending on the page size. The index and comparison fields of the TLB are thus made selectable by control signals.
The 32-bit DCACHE data are qualified by one valid bit. A valid bit needs a read-modify-write operation based on the cache hit results. However, this cannot be realized in a one-cycle access because of tight timing. Therefore, two write ports are added to the valid bit, and the write access is moved to the next cycle: the W-stage. The write data bypass selector is essential here to avoid data hazards.
To minimize the hardware overhead of the VRAM (valid-bit RAM) row decoder, two schemes are applied. First, the row decoders of the read ports are shared with DCACHE by pitch-matching one VRAM cell height with two DCACHE cells. Second, the write word-line drivers are made of shift registers that have the read word-lines as inputs. The schematic is shown in Fig. 4.17.
Although the best way to verify the whole chip layout is to run DRC (design rule check) and LVS (layout versus schematic) checks that include all sections of the chip, this was not possible in TFP, since the transistor count is too large for the CAD tools to handle. Thus, it was necessary to exclude a large part of the memory cells from the verification flow. To avoid possible mistakes around the boundary of a memory cell array, a few rows and columns were sometimes retained on each of the four sides of a cell array. In cases where this breaks signal continuity, text is added on the top level of the layout to make
FIGURE 4.16 Block diagram of TLB and DTAG.
FIGURE 4.17 VRAM row decoder.