Heuristics for Optimizing OBDD Applications

7. Algorithms and Heuristics in VLSI Design


The quality of the partitioning is crucial for the efficiency of the reachable states computation. The image computation is iterated over the partitions and includes costly computations. Therefore, maintaining a large number of partitions is time consuming, while a small number of partitions may lead to unmanageably large OBDDs. One extreme of this trade-off is the partitioning in which each latch forms its own partition: the partitions are usually small, but many iterations are required. The other extreme is a monolithic transition relation (TR), which can be processed in one iteration but has a large OBDD size. Furthermore, the ordering of latches and clusters is crucial for an efficient AndExist operation.

Using a poor order may lead to extremely large intermediate OBDD sizes that could make a complete image computation impossible.
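The role of the partition order can be illustrated with a small sketch of image computation over a partitioned TR. It is a hedged illustration under stated assumptions, not VIS or CUDD code: explicit Python sets of 0/1 rows stand in for OBDDs, and the hypothetical helpers `conjoin` (AND), `exists` (existential quantification), and `image` (iterated AndExist) are introduced only for this example. Each present-state variable is quantified out right after the last partition that mentions it, which is why the order of the partitions determines how early the intermediate results can shrink.

```python
from typing import FrozenSet, List, Set, Tuple

# A relation is (variable names, set of satisfying 0/1 rows). Explicit sets
# stand in for OBDDs; only the AND / EXISTS structure of AndExist matters here.
Rel = Tuple[Tuple[str, ...], FrozenSet[Tuple[int, ...]]]


def conjoin(a: Rel, b: Rel) -> Rel:
    """Boolean AND of two relations (a natural join over shared variables)."""
    va, ra = a
    vb, rb = b
    shared = [v for v in va if v in vb]
    extra = [v for v in vb if v not in va]
    rows = set()
    for x in ra:
        ax = dict(zip(va, x))
        for y in rb:
            by = dict(zip(vb, y))
            if all(ax[v] == by[v] for v in shared):
                rows.add(x + tuple(by[v] for v in extra))
    return va + tuple(extra), frozenset(rows)


def exists(r: Rel, drop: Set[str]) -> Rel:
    """Existential quantification of the variables in drop (a projection)."""
    vs, rows = r
    keep = [i for i, v in enumerate(vs) if v not in drop]
    return (tuple(vs[i] for i in keep),
            frozenset(tuple(row[i] for i in keep) for row in rows))


def image(states: Rel, parts: List[Rel], present: Set[str]) -> Rel:
    """Iterated AndExist with early quantification: each present-state
    variable is dropped right after the last partition that mentions it."""
    acc = states
    for i, part in enumerate(parts):
        acc = conjoin(acc, part)
        later = {v for p in parts[i + 1:] for v in p[0]}
        acc = exists(acc, {v for v in acc[0] if v in present and v not in later})
    return acc


states = (("x0", "x1"), frozenset({(0, 0)}))                      # S = {x0=0, x1=0}
t0 = (("x0", "y0"), frozenset({(0, 1), (1, 0)}))                  # y0 = not x0
t1 = (("x0", "x1", "y1"),
      frozenset((a, b, a ^ b) for a in (0, 1) for b in (0, 1)))   # y1 = x1 xor x0
print(image(states, [t0, t1], {"x0", "x1"}))
# (('y0', 'y1'), frozenset({(1, 0)}))
```

Conjoining everything first and quantifying once at the end (the monolithic case) gives the same image; the partitioned loop only changes when the present-state variables disappear, which is exactly what makes intermediate results smaller or larger.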

Table 7.1. Overview of computation and reordering effort for the benchmarks using standard sifting

Benchmark      Vars     CPU  Reord. Reord. Reord.    Avg. Reord.     Peak
                       time    time   time  count    size    >5%    nodes
                        (s)     (s)    (%)           red.       (in 1000)
dartes          198     504     441    87%      3      8%      1      583
dme2-16         586    3757    2331    62%      5     18%      3     5151
dpd75           600    4574    2676    58%      5      0%      0     3296
ftp3            100    1119     588    52%      4      1%      0     3126
furnace17       184    3938    1328    33%      5     21%      1     2373
key10           140     846     643    76%      6     24%      3     1099
mmgt20          264    1610     860    53%      4      2%      0     2904
motors-stuck    172     265     142    53%      4     36%      3      670
over12          174    3002    2526    84%      6      7%      2     4725
phone-async      86    2604    1094    42%      5      8%      1     6118
valves-gates    172     268     200    74%      5     35%      5      542
sum            2686   22487   12829    57%     52    160%     19    30587
avg             244    2044    1166    61%    4.7     15%    1.7     2781

Table 7.2. Comparison of CPU time for standard sifting and sample sifting

               Sifting          Sample sifting
Sample size                  30%             40%
Benchmark       time/s       %  time/s       %  time/s
dartes             504     +70     149     +62     194
dme2-16           3757     +45    2073     +53    1765
dpd75             4574     +28    3304      +9    4144
ftp3              1119     +43     635     +34     742
furnace17         3938     +41    2341     +35    2545
key10              846     +33     568     +28     610
mmgt20            1610      -9    1770     -17    1961
motors-stuck       265     +44     147     +38     164
over12            3002     +51    1475     +39    1831
phone-async       2604     +13    2268     +13    2273
valves-gates       268     +24     202     +14     220
sum              22487     +34   14934     +27   16458
avg                        +35             +28

Table 7.3. Comparison of peak nodes in thousands for standard sifting and sample sifting

               Sifting          Sample sifting
Sample size                  30%             40%
Benchmark        nodes       %   nodes       %   nodes
dartes             583     -17     707     -17     707
dme2-16           5151     -12    5824     -13    5945
dpd75             3296      -9    3633      -8    3566
ftp3              3126      +4    2986     +10    2806
furnace17         2373     -16    2841      -3    2439
key10             1099     -51    2236     -51    2236
mmgt20            2904      -1    2945      -1    2944
motors-stuck       670     -38    1073     -37    1058
over12            4725      +4    4550      +4    4543
phone-async       6118      -7    6603     -24    8080
valves-gates       542     -43     950     -42     941
sum              30593     -11   34353     +13   35270
avg                        -17             +16

In the following, we describe the standard partitioning strategy, followed by a description of the RTL partitioning heuristic.

7.4.1 Common Partitioning Strategy

A common strategy for partitioning the TR, as used, e.g., by VIS [7.3, 7.23], proceeds in three steps:

1. Order latches. First, the latches are ordered using a benefit heuristic [7.13] that performs a structural analysis of the latches' transition functions to enable an efficient AndExist operation. During the iterated image computation, next-state variables are introduced while present-state variables are quantified out. The benefit heuristic uses a greedy scheme to minimize the balance of introduced next-state variables and quantified present-state variables. Additionally, the heuristic takes into account the highest index of a variable to be quantified out, resulting in a more efficient AndExist.

2. Cluster latches. The single-latch relations are clustered following a greedy strategy: latch relations are conjoined (AND) into a cluster OBDD until the size of the OBDD exceeds a given threshold.

3. Order clusters. In the last step, the clusters are ordered in the same way as the latches, using a benefit heuristic (VIS uses the same heuristic as in Step 1).
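A simplified version of the greedy ordering used in Steps 1 and 3 can be sketched as follows. This is an assumption-laden illustration, not the exact cost function of [7.13] or of VIS: it scores each candidate only by the number of present-state variables that become quantifiable minus the number of next-state variables it introduces, and it breaks ties by name instead of using the highest variable index. The function name `benefit_order`, the helper `balance`, and the input format are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical input: name -> (present-state support, next-state variables introduced).
Items = Dict[str, Tuple[Set[str], Set[str]]]


def balance(name: str, remaining: Items) -> int:
    """Quantifiable present-state variables minus introduced next-state variables."""
    present, nxt = remaining[name]
    others = set().union(*(remaining[o][0] for o in remaining if o != name))
    quantifiable = present - others          # no later item depends on these
    return len(quantifiable) - len(nxt)


def benefit_order(items: Items) -> List[str]:
    """Greedy ordering of latches (or clusters) by the balance above."""
    remaining = dict(items)
    order: List[str] = []
    while remaining:
        # Highest balance first; ties broken alphabetically (a simplification).
        best = min(remaining, key=lambda n: (-balance(n, remaining), n))
        order.append(best)
        del remaining[best]
    return order


latches = {
    "a": ({"x0"}, {"y0"}),
    "b": ({"x0", "x1"}, {"y1"}),
    "c": ({"x2"}, {"y2"}),
}
print(benefit_order(latches))   # ['b', 'a', 'c']
```

Latch `b` is scheduled first because it is the only remaining user of `x1`, so that variable can be quantified out immediately after it; `a` then frees `x0`.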

Figure 7.5a gives a schematic overview of this process.
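Step 2 can be sketched as a single pass over the ordered latches. The size function is a stand-in: in a real implementation it would be the node count of the conjoined OBDD, whereas the example uses the sum of the individual relation sizes as a crude, hypothetical proxy. The threshold 5000 mirrors the default partition cluster size listed in the experiments; closing a cluster just before the threshold would be exceeded is an assumption of this sketch, and `greedy_clusters` is a hypothetical name.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def greedy_clusters(items: Sequence[T],
                    size: Callable[[List[T]], int],
                    threshold: int) -> List[List[T]]:
    """Conjoin items in order; start a new cluster when adding the next item
    would push the cluster size over the threshold."""
    clusters: List[List[T]] = []
    current: List[T] = []
    for item in items:
        candidate = current + [item]
        if current and size(candidate) > threshold:
            clusters.append(current)   # close the cluster before it grows too large
            current = [item]
        else:
            current = candidate
    if current:
        clusters.append(current)
    return clusters


# Hypothetical per-latch relation sizes; the proxy simply sums them.
sizes = {"a": 3000, "b": 1500, "c": 2000, "d": 4000}
print(greedy_clusters(["a", "b", "c", "d"], lambda cl: sum(sizes[x] for x in cl), 5000))
# [['a', 'b'], ['c'], ['d']]
```

Because the pass only looks at the running size, the cluster boundaries depend on the threshold rather than on the structure of the design.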

7.4.2 RTL Based Partitioning Heuristic

Since modern complex designs require a structured, hierarchical description to be feasible, they are currently written in a hardware description language (HDL) at the register transfer level (RTL). The term RTL denotes an HDL description style that combines data flow and behavioral constructs. Logic synthesis tools take the RTL HDL description and produce an optimized gate-level netlist, and high-level synthesis tools at the behavioral level output RTL HDL descriptions. Verilog [7.31] and VHDL [7.15] are the most popular HDLs for describing functionality at RTL. Within the design cycle of optimization and verification, the RTL is an important and frequently used level.

[Figure: (a) Standard method: latches -> 1. order latches -> 2. cluster latches -> 3. order clusters -> relations (BDDs); (b) RTL method: latches -> 1. group latches -> 2. cluster latches -> 3. order clusters within groups according to RTL modules -> relations (BDDs)]

Fig. 7.5. Schematic of partitioning strategies

The design methodology in Verilog is a top-down hierarchical modeling concept based on modules, which are the basic building blocks. The experimental work for the following heuristic is based on designs written in this language, but our approach can easily be extended to any HDL or to hierarchical FSM representations such as those provided by state space decomposition algorithms (see, e.g., [7.18]).

As mentioned above, a complex design is built by breaking it into modules, each with a dedicated functionality and a smaller complexity. For example, communication protocols contain transmitters and receivers that are represented by independent modules. These modules are usually not too complex, so their TRs are small. If a partition contains state variables of several modules, the Cartesian product of these modules must be represented, which leads to a much more complex TR. The main reason for the efficiency of the partitioned TR approach is that state variables that do not appear in the remaining partitions can be quantified out early during the AndExist operation. This leads to much smaller OBDDs and a faster computation. If the state variables of a module are spread over several partitions, the quantification takes effect only late in the image computation. Therefore, most of the computation has to be done with large OBDDs.
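The effect of spreading a module's state variables over the partition order can be made concrete by counting, after each AndExist step, how many present-state variables are still needed by a later partition and therefore cannot be quantified yet. The sketch below uses this count only as a rough proxy for intermediate OBDD size (actual sizes depend on the functions and the variable order); the modules `tx` and `rx` and the function `live_profile` are hypothetical.

```python
from typing import Dict, List, Set


def live_profile(parts: List[Set[str]]) -> List[int]:
    """For each AndExist step, count present-state variables that cannot be
    quantified yet because a later partition still depends on them."""
    last_use: Dict[str, int] = {}
    for i, support in enumerate(parts):
        for v in support:
            last_use[v] = i
    return [sum(1 for last in last_use.values() if last > i)
            for i in range(len(parts))]


tx, rx = {"t0", "t1"}, {"r0", "r1"}    # two hypothetical modules
print(live_profile([tx, tx, rx, rx]))  # module-wise order: [4, 2, 2, 0]
print(live_profile([tx, rx, tx, rx]))  # interleaved order: [4, 4, 2, 0]
```

With the module-wise order, the variables of `tx` are quantified after the second step; interleaving the partitions keeps them alive one step longer, so more of the computation runs on larger intermediate products.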

RTL description languages like Verilog [7.31] or VHDL [7.15] support a hierarchical design methodology by providing module constructs. As discussed above, this modularization has effects on the image computation that should not be neglected.

Although the standard method optimizes the order twice (for the latches in Step 1 and for the clusters in Step 3), its main disadvantage is that it uses only structural information to optimize the partitioning for the AndExist operation during the image computation.

The RTL heuristic improves this optimization by including additional semantic information about the represented functions. As the experimental results show, there is a close connection between the RTL description and an efficient image computation.

The RTL heuristic proceeds in three steps:

1. Group latches. The latches are grouped according to the modules instantiated in the top module of the Verilog RTL description. Within each group, the latches are sorted in a lexicographic order that takes submodule names and bit numbers into account: names of latches from submodules are prefixed by the submodule name, and the bits of a register are named by the register name and the bit number. The effect of this sorting is that the latches of a submodule stay adjacent within the group without being grouped explicitly. The same holds for the bits of a register.

2. Cluster groups. The groups form borders for the clusters: no cluster contains latches from different groups. To control the OBDD size of the clusters, the greedy partitioning strategy is applied within each group. The clustering given by the groups lowers the influence of the arbitrary cluster boundaries produced by the OBDD-size threshold, which results in a more natural partitioning.

3. Order clusters (optional). In the last step, the clusters may be ordered using the benefit heuristic from the standard method.
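Step 1 of the RTL heuristic can be sketched as a grouping followed by a sort. The latch naming scheme is an assumption of this example (`<top-level instance>.<submodule>...<register><bit>`, e.g. `tx.ctr.cnt<3>`); actual latch names produced by vl2mv and VIS may differ. Comparing bit numbers numerically, so that `buf<10>` follows `buf<2>`, is also a choice of this sketch, since a plain string sort would place `buf<10>` first. The helpers `sort_key` and `group_latches` are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical naming scheme: "<top-module>.<submodule>...<register><bit>",
# e.g. "tx.ctr.cnt<3>"; real vl2mv/VIS latch names may differ.
BIT = re.compile(r"^(.*)<(\d+)>$")


def sort_key(name: str) -> Tuple[Tuple[str, ...], str, int]:
    """Order by submodule path, then register name, then numeric bit index."""
    *path, last = name.split(".")
    m = BIT.match(last)
    if m:
        return tuple(path), m.group(1), int(m.group(2))
    return tuple(path), last, -1          # single-bit latch without an index


def group_latches(names: List[str]) -> Dict[str, List[str]]:
    """One group per top-level module; sorting keeps the latches of a
    submodule and the bits of a register adjacent."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for n in names:
        groups[n.split(".")[0]].append(n)
    return {g: sorted(members, key=sort_key) for g, members in sorted(groups.items())}


latches = ["rx.buf<10>", "tx.state<1>", "rx.buf<2>",
           "tx.ctr.cnt<0>", "rx.valid", "tx.state<0>"]
print(group_latches(latches))
# {'rx': ['rx.buf<2>', 'rx.buf<10>', 'rx.valid'], 'tx': ['tx.state<0>', 'tx.state<1>', 'tx.ctr.cnt<0>']}
```

Latches declared directly in a module sort before those of its submodules, because a shorter path prefix compares smaller; within a submodule, register bits remain contiguous.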

Figure 7.5b gives an overview of this strategy.

Modiﬁcations of this strategy are possible:

– Step 1a) As an additional step, the benefit heuristic of the standard method may be applied to order the latches within each group. In our case, however, the lexicographic order of the latches preserved more of the structure of the design and led to better results.

– Step 2a) Clusters may be allowed to cross group borders. This leads to a more compact representation of the TR with fewer clusters. Although this representation is more efficient, the image computation does not perform as well as with strict group borders.
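The difference between strict group borders (Step 2) and clusters that may cross them (Step 2a) can be seen in a small sketch that runs the greedy size-bounded clustering inside each group. As before, the sum of individual relation sizes is a crude, hypothetical proxy for the OBDD size of a cluster, and `cluster_within_groups`, the groups, and the sizes are illustrative only.

```python
from typing import Dict, List


def cluster_within_groups(groups: Dict[str, List[str]],
                          size: Dict[str, int],
                          threshold: int) -> List[List[str]]:
    """Greedy size-bounded clustering that never lets a cluster cross a
    group (module) border."""
    clusters: List[List[str]] = []
    for members in groups.values():
        current: List[str] = []
        current_size = 0
        for latch in members:
            s = size[latch]                     # crude proxy: sum of relation sizes
            if current and current_size + s > threshold:
                clusters.append(current)
                current, current_size = [], 0
            current.append(latch)
            current_size += s
        if current:                             # a group border always closes a cluster
            clusters.append(current)
    return clusters


groups = {"rx": ["r0", "r1"], "tx": ["t0", "t1", "t2"]}
sizes = {"r0": 1000, "r1": 1000, "t0": 2000, "t1": 2500, "t2": 1000}
print(cluster_within_groups(groups, sizes, 5000))
# [['r0', 'r1'], ['t0', 't1'], ['t2']]
```

Treating all latches as a single group, which mimics Step 2a, yields `[['r0', 'r1', 't0'], ['t1', 't2']]`: one cluster fewer, but the first cluster mixes latches of both modules.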

7.4.3 Experiments

We implemented our strategy in the VIS package [7.3] (version 1.3) using the underlying CUDD package [7.29] (version 2.3.0). VIS is a popular verification and synthesis package in academic research. It incorporates state-of-the-art techniques for OBDD manipulation, image and reachable states computation, and formal verification. Together with the vl2mv translator, VIS provides the Verilog front end needed for our heuristic.

For our experiments, we used Verilog designs from the Texas97 benchmark suite [7.1]. This publicly available benchmark suite contains real-life designs, including:

– MSI Cache Coherence Protocol
– PCI Local BUS
– PI BUS Protocol
– MESI Cache Coherence Protocol
– MPEG System Decoder
– DLX
– PowerPC 60x Bus Interface

The benchmark suite also contains properties given in CTL formulas for veriﬁcation.

We left all parameters of VIS and CUDD unchanged. The most important default values are:

– Partition cluster size = 5000
– Partition method for MDDs = inout
– OBDD variable reordering method = sifting
– First reordering threshold = 4004 nodes

The reachable states computation or the model checking was preceded by an explicitly triggered variable reordering. The CPU time was limited to 2 CPU hours and memory usage to 200 MB. All experiments were performed on Linux Pentium III 500 MHz workstations.

Results. The results are given in Table 7.4 and Table 7.5. Img.comp. is the total number of image and pre-image computations performed during the analysis. Part gives the number of partitions of the transition relation. The OBDD size of the transition relation clusters and the peak number of live nodes are given by TRn and Peakn, respectively. The CPU time is measured in seconds and given as Time.

The columns denoted with % give the improvement in percent.¹

At the bottom of Table 7.5, you can find the sums of the numbers of partitions, BDD sizes, and CPU times. The average of the relative improvements is given as well, together with the total improvement.
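The difference between the average of the relative improvements and the total improvement can be reproduced with the CPU times of Table 7.2 (standard sifting versus sample sifting with a 30% sample size). The sketch assumes that a % column is computed as (old - new) / old; under that assumption it recovers the +35 average and +34 total reported in Table 7.2. The names `improvement` and `average_and_total` are hypothetical.

```python
from typing import Dict, Tuple

# CPU times in seconds from Table 7.2: standard sifting vs. sample sifting (30%).
SIFTING = {"dartes": 504, "dme2-16": 3757, "dpd75": 4574, "ftp3": 1119,
           "furnace17": 3938, "key10": 846, "mmgt20": 1610, "motors-stuck": 265,
           "over12": 3002, "phone-async": 2604, "valves-gates": 268}
SAMPLE_30 = {"dartes": 149, "dme2-16": 2073, "dpd75": 3304, "ftp3": 635,
             "furnace17": 2341, "key10": 568, "mmgt20": 1770, "motors-stuck": 147,
             "over12": 1475, "phone-async": 2268, "valves-gates": 202}


def improvement(old: float, new: float) -> float:
    """Relative improvement in percent; positive means the new run is better."""
    return 100.0 * (old - new) / old


def average_and_total(old: Dict[str, float], new: Dict[str, float]) -> Tuple[float, float]:
    per_benchmark = [improvement(old[k], new[k]) for k in old]
    average = sum(per_benchmark) / len(per_benchmark)                 # each benchmark weighs the same
    total = improvement(sum(old.values()), sum(new[k] for k in old))  # long runs dominate
    return average, total


avg, total = average_and_total(SIFTING, SAMPLE_30)
print(f"average {avg:.1f}%, total {total:.1f}%")   # average 34.8%, total 33.6%
```

The average treats a 265 s benchmark and a 4574 s benchmark equally, while the total is dominated by the long runs; reporting both, as in Tables 7.2 and 7.5, shows whether an improvement comes from many small designs or from a few large ones.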

The experiments show significant improvements in time and space: the overall CPU time decreased by 67% in total and by 40% on average. The method outperforms the standard method in 45 of the 47 benchmarks, and the decrease in computation time ranges up to 90%. The OBDD peak sizes were lowered by 62% in total and by 25% on average. Interestingly, the RTL method results in 5% fewer partitions without requiring more OBDD nodes for the transition relation. This also indicates the improved quality of the partitioning.
