7. Algorithms and Heuristics in VLSI Design
7.4 Heuristics for Optimizing OBDD Applications
The quality of the partitioning is crucial for the efficiency of the reachable states computation. The image computation is iterated over the partitions and includes costly computations. Therefore, maintaining a large number of partitions is time consuming. A small number of partitions, on the other hand, may lead to unmanageably large OBDDs. One extreme of this trade-off is the partitioning where each latch forms a partition; each partition is usually small, but many iterations are required. The other extreme is a monolithic transition relation (TR), which can be computed in one iteration but has a large OBDD size. Furthermore, the ordering of latches and clusters is crucial for an efficient AndExist operation. Using a poor order may lead to extremely large intermediate OBDD sizes that could make a complete image computation impossible.
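The role of the partitions in the image computation can be illustrated with a toy sketch. It is not the OBDD-based implementation: relations are stored as explicit sets of rows, the names Rel, and_exist, latch_relation and image are hypothetical, and the convention that present state variables start with "x" is an assumption made only for this example. Each partition is conjoined with the current result, and a present state variable is quantified out as soon as no later partition refers to it.

```python
from itertools import product


class Rel:
    """Explicit relation: variable names plus a set of value rows (toy stand-in for an OBDD)."""

    def __init__(self, names, rows):
        self.names = tuple(names)
        self.rows = set(rows)


def and_exist(a, b, quantify):
    """Conjoin a and b (natural join), then existentially quantify the variables in `quantify`."""
    shared = [n for n in a.names if n in b.names]
    all_names = a.names + tuple(n for n in b.names if n not in a.names)
    keep = tuple(n for n in all_names if n not in quantify)
    rows = set()
    for row_a in a.rows:
        val_a = dict(zip(a.names, row_a))
        for row_b in b.rows:
            val_b = dict(zip(b.names, row_b))
            if all(val_a[n] == val_b[n] for n in shared):
                merged = {**val_a, **val_b}
                rows.add(tuple(merged[n] for n in keep))
    return Rel(keep, rows)


def latch_relation(support, next_name, f):
    """Relation of a single latch: next_name == f(support variables)."""
    rows = {bits + (f(*bits),) for bits in product((0, 1), repeat=len(support))}
    return Rel(tuple(support) + (next_name,), rows)


def image(states, partitions):
    """Iterate AndExist over the partitions in the given order."""
    current = states
    for i, part in enumerate(partitions):
        # Variables still needed by partitions that are processed later.
        later = set().union(*(p.names for p in partitions[i + 1:]))
        # Assumption of this sketch: present state variables are named x...
        present = {n for n in current.names + part.names if n.startswith("x")}
        current = and_exist(current, part, present - later)
    return current


# Two-bit example: y0 = not x0, y1 = x0 xor x1, one partition per latch.
t0 = latch_relation(("x0",), "y0", lambda x0: 1 - x0)
t1 = latch_relation(("x0", "x1"), "y1", lambda x0, x1: x0 ^ x1)
start = Rel(("x0", "x1"), {(0, 0), (1, 1)})
result = image(start, [t0, t1])
print(result.names, sorted(result.rows))
```

With one relation per latch the loop runs once per latch. Conjoining both latch relations into a single monolithic relation beforehand would need only one step, at the price of a larger relation.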
Table 7.1. Overview of computation and reordering effort for the benchmarks using standard sifting

Benchmark     Vars  CPU-time/s  Reorder-time/s  % of CPU-time  Reord.  avg. Size Red.  Reord. >5%  Peakn. in 1000
dartes         198         504             441            87%       3              8%           1             583
dme2-16        586        3757            2331            62%       5             18%           3            5151
dpd75          600        4574            2676            58%       5              0%           0            3296
ftp3           100        1119             588            52%       4              1%           0            3126
furnace17      184        3938            1328            33%       5             21%           1            2373
key10          140         846             643            76%       6             24%           3            1099
mmgt20         264        1610             860            53%       4              2%           0            2904
motors-stuck   172         265             142            53%       4             36%           3             670
over12         174        3002            2526            84%       6              7%           2            4725
phone-async     86        2604            1094            42%       5              8%           1            6118
valves-gates   172         268             200            74%       5             35%           5             542
sum           2686       22487           12829            57%      52            160%          19           30587
avg            244        2044            1166            61%     4.7             15%         1.7            2781
Table 7.2. Comparison of CPU-time for standard sifting and sample sifting

              Sifting  Sample Sifting
Sample Size                      30%            40%
Benchmark      time/s     %   time/s     %   time/s
dartes            504   +70      149   +62      194
dme2-16          3757   +45     2073   +53     1765
dpd75            4574   +28     3304    +9     4144
ftp3             1119   +43      635   +34      742
furnace17        3938   +41     2341   +35     2545
key10             846   +33      568   +28      610
mmgt20           1610    -9     1770   -17     1961
motors-stuck      265   +44      147   +38      164
over12           3002   +51     1475   +39     1831
phone-async      2604   +13     2268   +13     2273
valves-gates      268   +24      202   +14      220
sum             22487   +34    14934   +27    16458
avg                     +35            +28

Table 7.3. Comparison of peaknodes in thousands for standard sifting and sample sifting

              Sifting  Sample Sifting
Sample Size                      30%            40%
Benchmark       nodes     %    nodes     %    nodes
dartes            583   -17      707   -17      707
dme2-16          5151   -12     5824   -13     5945
dpd75            3296    -9     3633    -8     3566
ftp3             3126    +4     2986   +10     2806
furnace17        2373   -16     2841    -3     2439
key10            1099   -51     2236   -51     2236
mmgt20           2904    -1     2945    -1     2944
motors-stuck      670   -38     1073   -37     1058
over12           4725    +4     4550    +4     4543
phone-async      6118    -7     6603   -24     8080
valves-gates      542   -43      950   -42      941
sum             30593   -11    34353   -13    35270
avg                     -17            -16

In the following we will describe the standard partitioning strategy, followed by a description of the RTL partitioning heuristic.
7.4.1 Common Partitioning Strategy
A common strategy for partitioning the TR, as used, e.g., by VIS [7.3, 7.23], proceeds in three steps:
1. Order latches. First, the latches are ordered by using a benefit heuristic [7.13] that performs a structural analysis of the latches' transition functions to enable an efficient AndExist operation. During the iterated image computation next state variables are added while present state variables are quantified out. The benefit heuristic uses a greedy scheme to minimize the balance of introduced next state variables and quantified present state variables. Additionally, the heuristic takes into account the highest index of a variable to be quantified out, resulting in a more efficient AndExist.
2. Cluster latches. The single latch relations are clustered by following a greedy strategy. Latches are added to an OBDD (i.e., by performing AND) until the size of the OBDD exceeds a given threshold.
3. Order clusters. In the last step the clusters are ordered similarly to the latches by using a benefit heuristic (VIS uses the same heuristic as in Step 1).
Figure 7.5a gives a schematic overview of this process.
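Step 2 can be sketched as follows. This is only an illustration of the greedy scheme as described above, not the VIS code: greedy_cluster and product_bound are hypothetical names, the latch sizes are invented, and the OBDD size of a conjunction is replaced by the crude upper bound given by the product of the operand sizes. The threshold of 5000 mirrors the default partition cluster size used in the experiments.

```python
from math import prod


def greedy_cluster(latches, size, threshold):
    """Add latch relations to the current cluster until its size exceeds the threshold."""
    clusters, current = [], []
    for latch in latches:
        current.append(latch)
        if size(current) > threshold:
            # The cluster has grown past the threshold: close it and start a new one.
            clusters.append(current)
            current = []
    if current:
        clusters.append(current)
    return clusters


def product_bound(cluster):
    """Assumed size model: product of the individual sizes (upper bound for an AND of OBDDs)."""
    return prod(nodes for _name, nodes in cluster)


# Invented (latch name, OBDD size) pairs, already ordered by Step 1.
latches = [("l0", 10), ("l1", 20), ("l2", 30), ("l3", 5), ("l4", 40), ("l5", 50), ("l6", 7)]
clusters = greedy_cluster(latches, product_bound, threshold=5000)
print([[name for name, _ in cluster] for cluster in clusters])
```

Note that the cluster borders depend only on the latch order and on the threshold; the sketch has no notion of which latches belong together in the design.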
7.4.2 RTL Based Partitioning Heuristic
Since modern complex designs require a structured hierarchical description to be feasible, they are currently written in a hardware description language (HDL) at register transfer level (RTL). The term RTL is used for an HDL description style that utilizes a combination of data flow and behavioral constructs. Logic synthesis tools take the RTL HDL description to produce an optimized gate level netlist, and high level synthesis tools at the behavioral level output RTL HDL descriptions. Verilog [7.31] and VHDL [7.15] are the most popular HDLs used for describing the functionality at RTL. Within the design cycle of optimization and verification the RTL level is an important and frequently used part.
Fig. 7.5. Schematic of partitioning strategies. a) Standard method: 1. Order latches, 2. Cluster latches, 3. Order clusters. b) RTL method: 1. Group latches acc. to RTL modules, 2. Cluster latches within groups, 3. Order clusters.
The design methodology in Verilog is a top-down hierarchical modeling concept based on modules, which are the basic building blocks. The experimental work for the following heuristic is based on designs written in this language, but our approach can easily be extended to any HDL or hierarchical FSM representation as it is provided, e.g., by state space decomposition algorithms (see, e.g., [7.18]).
As mentioned above, the way to build a complex design is to break it into modules, each with a dedicated functionality and a smaller complexity. For example, communication protocols contain transmitters and receivers that represent independent modules. These modules are usually not too complex, thus the complexity of their TRs will be small. If a partition contains state variables of several modules, we need to represent the Cartesian product of these modules, leading to a much more complex TR. The main reason for the efficiency of the partitioned TR approach is that state variables not appearing in other partitions are quantified out during the AndExist operation. This leads to much smaller OBDD sizes and a faster computation. If the state variables of a module are spread over several partitions, the quantification takes effect only late in the image computation. Therefore, most of the computation has to be done with large OBDDs.
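The effect of spreading a module over several partitions can be made concrete with a small sketch that only tracks variable supports, not OBDDs. The function names and the two-module example (a transmitter with variables t0, t1 and a receiver with r0, r1) are hypothetical. A present state variable can be quantified out at the last partition whose support contains it, so the sketch reports how many present state variables are still alive after each step.

```python
def quantification_schedule(cluster_supports):
    """For each step, the present state variables whose last use is in that cluster."""
    last_use = {}
    for step, support in enumerate(cluster_supports):
        for var in support:
            last_use[var] = step
    schedule = [set() for _ in cluster_supports]
    for var, step in last_use.items():
        schedule[step].add(var)
    return schedule


def live_after_each_step(cluster_supports):
    """Number of present state variables that are not yet quantified after each step."""
    remaining = set().union(*cluster_supports)
    profile = []
    for quantified in quantification_schedule(cluster_supports):
        remaining -= quantified
        profile.append(len(remaining))
    return profile


# Partitions that follow the module borders: each cluster depends on one module only.
aligned = [{"t0", "t1"}, {"r0", "r1"}]
# Partitions that mix latches of both modules: every cluster depends on all variables.
mixed = [{"t0", "t1", "r0", "r1"}, {"t0", "t1", "r0", "r1"}]

print(live_after_each_step(aligned))
print(live_after_each_step(mixed))
```

In the aligned case half of the variables disappear after the first step, whereas in the mixed case nothing can be quantified before the last step.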
RTL description languages like Verilog [7.31] or VHDL [7.15] support a hierarchical design methodology by providing module constructs. As can be seen, this modularization has effects on the image computation that should not be neglected.
Although the standard method optimizes the partitioning twice, its main disadvantage is that it uses only structural information to obtain an efficient order for the AndExist operation during the image computation.
The RTL heuristic improves this optimization by including additional semantic information about the represented functions. As the experimental results show, there is a close connection between the RTL description and an efficient image computation.
The RTL heuristic proceeds in three steps:
1. Group latches. The latches are grouped according to the modules given in the top module of the RTL description in Verilog. Within the groups the latches are ordered by a lexicographic order that takes into account submodule names and bit numbers (names of latches from submodules are prefixed by the submodule name). Also, the bits of a register are named by the register and the bit number. The effect of this sorting is that latches of a submodule within the group stay adjacent without being grouped explicitly. The same holds for the bits of a register.
2. Cluster groups. The groups represent borders for the clusters. There is no cluster containing latches from different groups. To control the OBDD size of the clusters, the greedy partitioning strategy is applied within the groups. The clustering given by the groups lowers the influence of the arbitrary clustering produced by the OBDD-size threshold, thus resulting in a more natural partitioning.
3. Order clusters (optional). In the last step the clusters may be ordered by using the benefit heuristic from the standard method.
Figure 7.5b gives an overview of this strategy.
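Steps 1 and 2 of the RTL heuristic can be sketched as follows. The latch naming scheme (module prefix separated by ".", bit numbers in angle brackets), the "<top>" group for unprefixed latches and all function names are assumptions of this sketch and are not taken from vl2mv or VIS. The size function is again a placeholder for the OBDD size of a cluster.

```python
import re


def natural_key(name):
    """Sort key that compares embedded numbers numerically, so bit 10 follows bit 2."""
    return [int(tok) if tok.isdigit() else tok for tok in re.split(r"(\d+)", name)]


def group_latches(latch_names, sep="."):
    """Step 1: group latches by top-level module prefix and sort each group."""
    groups = {}
    for name in latch_names:
        # Assumed convention: the text before the first separator is the module instance.
        module = name.split(sep, 1)[0] if sep in name else "<top>"
        groups.setdefault(module, []).append(name)
    return {module: sorted(names, key=natural_key) for module, names in groups.items()}


def cluster_within_groups(groups, size, threshold):
    """Step 2: greedy clustering that never crosses a group border."""
    clusters = []
    for names in groups.values():
        current = []
        for name in names:
            current.append(name)
            if size(current) > threshold:
                clusters.append(current)
                current = []
        if current:
            # Close the last cluster of the group even if it is below the threshold.
            clusters.append(current)
    return clusters


latches = ["tx.cnt<10>", "rx.buf<1>", "tx.cnt<2>", "rx.buf<0>", "tx.fsm.state<0>"]
groups = group_latches(latches)
# Placeholder size model: number of latches in the cluster.
clusters = cluster_within_groups(groups, size=len, threshold=1)
print(clusters)
```

Because the sort key keeps submodule prefixes and register names together, the bits of tx.cnt stay adjacent and the latch of the submodule tx.fsm follows them, without any explicit grouping below the top level.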
Modifications of this strategy are possible:
– Step 1a) As an additional step the benefit heuristic of the standard method may be applied to order the latches within the single groups. It emerged that in our case the lexicographic order of the latches preserves more of the structure of the design and leads to better results.
– Step 2a) One may allow clusters that cross a group border. This leads to a more compact representation of the TR with fewer clusters. Although the representation is more compact, the image computation does not perform as efficiently as with the strict group borders.
7.4.3 Experiments
We implemented our strategy in the VIS package [7.3] (version 1.3) using the underlying CUDD package [7.29] (version 2.3.0). VIS is a popular verification and synthesis package in academic research. It incorporates state-of-the-art techniques for OBDD manipulation, image and reachable states computation as well as formal verification techniques. Together with the vl2mv translator, VIS provides the Verilog front-end needed for our heuristic.
For our experiments we used Verilog designs from the Texas97 benchmark suite [7.1]. This publicly available benchmark suite contains real life designs including:
– MSI Cache Coherence Protocol
– PCI Local BUS
– PI BUS Protocol
– MESI Cache Coherence Protocol
– MPEG System Decoder
– DLX
– PowerPC 60x Bus Interface
The benchmark suite also contains properties given in CTL formulas for verification.
We left all parameters of VIS and CUDD unchanged. The most important default values are:
– Partition cluster size = 5000
– Partition method for MDDs = inout
– OBDD variable reordering method = sifting
– First reordering threshold = 4004 nodes
The reachable states computation or the model checking was preceded by an explicitly triggered variable reordering. The CPU time was limited to 2 CPU hours and memory usage was limited to 200 MB. All experiments were performed on Linux Pentium III 500 MHz workstations.
Results. For results see Table 7.4 and Table 7.5. Img.comp. is the sum of all image and pre-image computations performed during the analysis. Part gives the number of partitions of the transition relation. The OBDD size of the transition relation clusters and the peak number of live nodes are given by TRn and Peakn, respectively. The CPU time is measured in seconds and given as Time.
The columns denoted with % describe the improvement in percent.
At the bottom of Table 7.5 you can find the sums of all numbers of partitions, BDD sizes and CPU times. Also, the average of the relative improvement is given as well as the total improvement.
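The difference between the two summary figures can be illustrated with the numbers of Table 7.2 (standard sifting versus sample sifting with a sample size of 30%), since Tables 7.4 and 7.5 are not reproduced here. The sketch below is only an illustration with hypothetical function names: the total improvement is computed from the column sums, while the average improvement is the mean of the per-benchmark percentages, which are taken as printed in the table rather than recomputed.

```python
# (benchmark, sifting time/s, sample sifting time/s at 30%, listed improvement in %)
table_7_2 = [
    ("dartes", 504, 149, 70),
    ("dme2-16", 3757, 2073, 45),
    ("dpd75", 4574, 3304, 28),
    ("ftp3", 1119, 635, 43),
    ("furnace17", 3938, 2341, 41),
    ("key10", 846, 568, 33),
    ("mmgt20", 1610, 1770, -9),
    ("motors-stuck", 265, 147, 44),
    ("over12", 3002, 1475, 51),
    ("phone-async", 2604, 2268, 13),
    ("valves-gates", 268, 202, 24),
]


def total_improvement(rows):
    """Improvement of the summed CPU times in percent."""
    before = sum(row[1] for row in rows)
    after = sum(row[2] for row in rows)
    return 100.0 * (before - after) / before


def average_improvement(rows):
    """Mean of the per-benchmark improvements in percent."""
    return sum(row[3] for row in rows) / len(rows)


print(round(total_improvement(table_7_2)), round(average_improvement(table_7_2)))
```

The total improvement is dominated by the benchmarks with long run times, whereas the average gives every benchmark the same weight. This is why the two figures can differ considerably.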
The experiments show significant improvements in time and space: the CPU time decreased by 67% overall and by 40% on average. The RTL method outperforms the standard method in 45 of the 47 benchmarks. The decrease in computation time ranges up to 90%. The OBDD peak sizes were lowered by 62% overall and by 25% on average. Interestingly, the RTL method results in 5% fewer partitions without requiring more OBDD nodes for the transition relation. This also demonstrates the improved quality of the partitioning.