Figure 15.15 Point range accessed by edge groups
15.2.4 SWITCHING ALGORITHM
A typical one-processor register-to-register vector machine will work at its peak efficiency if the vector lengths are multiples of the length of the vector registers. For a traditional Cray, this number was 64. An easy way to achieve higher rates of performance without a large change in architecture is through the use of multiple vector pipes. A Cray-C90, for example, has two vector pipes per processor, requiring vector lengths that are multiples of 128 for optimal hardware use. For a NEC-SX4, which has four vector pipes per processor, this number rises to 256, and for the NEC-SX8 to 512. The most common way to use multiple processors in these systems is through autotasking, i.e. by splitting the work at the DO-loop level. Typically, this is done by simply setting a flag in the compiler. The acceptable vector length for optimal machine utilization is now required to increase again according to the number of processors. For the Cray-C90, with 16 processors, the vector lengths have to be multiples of 2048. The message is clear: for shared- and fixed-memory machines, make the vectors as long as possible while achieving lengths that are multiples of a possibly large machine-dependent number.
Consider a typical unstructured mesh. Suppose that the maximum number of edges surrounding a particular point is mesup. The minimum number of groups required for a vectorizable scatter-add pass over the edges is then given by mesup. On the other hand, the average number of edges surrounding a point, aesup, will in most cases be lower. Typical numbers are aesup=6 for linear triangles, aesup=22 for linear tetrahedra, aesup=4 for bilinear quads and aesup=8 for trilinear bricks. For a traditional edge renumberer like the one presented in the previous section, the result will be a grouping or colouring of edges consisting of aesup large groups of approximately nedge/aesup edges, and some very small groups for the remaining edges. While this way of renumbering is optimal for memory-to-memory machines, better ways are possible for register-to-register machines.
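To make the grouping idea concrete, a minimal greedy edge-colouring sketch is given below. It follows the array conventions used throughout this chapter (lnoed for the edge connectivity, lgrou for the last edge of each group), but the routine itself, including the names lrenu, ldone and lpoin as used here, is only an illustration of how such a renumberer could be organized, not the renumbering of the previous section.

    subroutine colour_edges(nedge,npoin,lnoed,lrenu,lgrou,ngrou)
    ! Greedy edge colouring (illustrative sketch): edges are gathered into
    ! groups such that no two edges of a group share a point, making the
    ! scatter-add over each group vectorizable.
    ! lrenu(iedge): new position of edge iedge; lgrou(igrou): last (new)
    ! edge of group igrou (cumulative storage, as used in the text).
      implicit none
      integer, intent(in)  :: nedge,npoin
      integer, intent(in)  :: lnoed(2,nedge)
      integer, intent(out) :: lrenu(nedge),lgrou(*),ngrou
      integer :: lpoin(npoin),ldone(nedge)
      integer :: iedge,ienew,ip1,ip2
      ldone=0
      lrenu=0
      ienew=0
      ngrou=0
      do while(ienew.lt.nedge)
        ngrou=ngrou+1
        lpoin=0                              ! no point touched yet in this group
        do iedge=1,nedge
          if(ldone(iedge).eq.0) then
            ip1=lnoed(1,iedge)
            ip2=lnoed(2,iedge)
            if(lpoin(ip1).eq.0 .and. lpoin(ip2).eq.0) then
              lpoin(ip1)=1                   ! edge joins the current group
              lpoin(ip2)=1
              ldone(iedge)=1
              ienew=ienew+1
              lrenu(iedge)=ienew
            endif
          endif
        enddo
        lgrou(ngrou)=ienew                   ! last new edge of this group
      enddo
    end subroutine colour_edges

The edges would subsequently be reordered according to lrenu so that each group occupies a contiguous range.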
Consider the following 'worst-case scenario' for a triangular mesh of nedge=391 edges: a first colouring will typically produce about aesup=6 groups of roughly nedge/aesup=65 edges each, i.e. vector lengths that are not multiples of 64. The desired grouping of edges is

lgrou(1:ngrou)=64,64,64,64,64,64,7,

which is possible due to the first grouping. On the other hand, a simple forward or even a backward–forward pass renumbering will in most cases not yield this desired grouping. One possible way of achieving this optimal grouping is the following switching algorithm, which is run after a first colouring has been obtained (see Figure 15.16).
Figure 15.16 Switching algorithm
The idea is to interchange edges at the end of the list with some edge inside one of the vectorizable groups without incurring a multiple point access. Algorithmically, this is accomplished as follows (a sketch of the central interchange step is given after the list).
S1 Determine the last group of edges with acceptable length; this group is denoted
by ngro0; the edges in the next group will be nedg0=lgrou(ngro0)+1
to nedg1=lgrou(ngro0+1), and the remaining edges will be at locations nedg1+1 to nedge;
S2 Transcribe edges nedg1+1 to nedge into an auxiliary storage list;
S3 Set a maximum desired vector length mvecl;
S4 Initialize a point array: lpoin=0;
S5 For all the points belonging to edges nedg0 to nedg1: set lpoin=1;
S6 Set the current group vector length nvecl=nedg1-nedg0+1;
S7 Set remaining edge counter ierem=nedg1;
S8 Loop over the large groups of edges igrou=1,ngro0;
S9 Initialize a second point array: lpoi1=0;
S10 For all the points belonging to the edges in this group: set lpoi1=1;
S11 Loop over the edges iedge in this group:
If lpoin=0 for all the points touching iedge: then
Set lpoi1=0 for the points touching iedge;
If lpoi1=0 for all the points touching ierem: then
- Set lpoin=1 for the points touching iedge;
- Set lpoi1=1 for the points touching ierem;
- Interchange edges
- Update remaining edge counter: ierem=ierem+1
- Update current vector length counter: nvecl=nvecl+1;
- If nvecl.eq.mvecl: exit the edge and group loops (Goto S12);
Else
- Re-set lpoi1=1 for the points touching iedge;
Endif
End of loop over the large groups of edges
S12 Store the group counter: lgrou(ngrou)=ierem;
S13 If unmarked edges remain (ierem.ne.nedge): reset the counters and Goto S1.
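A compact sketch of the interchange step (S8–S11) is given below. It assumes that lgrou stores the cumulative last edge of each group (with lgrou(0)=0), that the next leftover edge sits at position ierem+1, and that a helper swap_edges exchanges two edge entries of lnoed together with the associated edge data; these details, as well as the names ip1, ip2, jp1 and jp2, are assumptions made for illustration rather than the original implementation.

    do igrou=1,ngro0                             ! S8: loop over the large groups
      lpoi1(1:npoin)=0                           ! S9
      do iedge=lgrou(igrou-1)+1,lgrou(igrou)     ! S10: mark the points of this group
        lpoi1(lnoed(1,iedge))=1
        lpoi1(lnoed(2,iedge))=1
      enddo
      do iedge=lgrou(igrou-1)+1,lgrou(igrou)     ! S11: loop over the edges of the group
        if(ierem.eq.nedge) exit                  ! no leftover edges left to swap in
        ip1=lnoed(1,iedge)
        ip2=lnoed(2,iedge)
        if(lpoin(ip1).eq.0 .and. lpoin(ip2).eq.0) then
          lpoi1(ip1)=0                           ! tentatively take iedge out of the group
          lpoi1(ip2)=0
          jp1=lnoed(1,ierem+1)                   ! next leftover edge
          jp2=lnoed(2,ierem+1)
          if(lpoi1(jp1).eq.0 .and. lpoi1(jp2).eq.0) then
            lpoin(ip1)=1                         ! iedge joins the group being extended
            lpoin(ip2)=1
            lpoi1(jp1)=1                         ! the leftover edge takes its place
            lpoi1(jp2)=1
            ierem=ierem+1
            call swap_edges(iedge,ierem)         ! interchange the two edges
            nvecl=nvecl+1
            if(nvecl.eq.mvecl) exit              ! desired vector length reached
          else
            lpoi1(ip1)=1                         ! undo the tentative removal
            lpoi1(ip2)=1
          endif
        endif
      enddo
      if(nvecl.eq.mvecl .or. ierem.eq.nedge) exit   ! current group complete (S12 follows)
    enddo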
All of the renumberings described run in linear time and are straightforward to code. Thus, they are ideally suited for applications requiring frequent mesh changes, e.g. adaptive h-refinement (Löhner and Baum (1992)) or remeshing (Löhner (1990)) for transient problems. Furthermore, although the ideas were exemplified on edge-based solvers, they carry over to element- or face-based solvers.
15.2.5 REDUCED I/A LOOPS
Suppose the edges are defined in such a way that the first point always has a lower point number than the second point. Furthermore, assume that the first point of each edge increases with stride one as one loops over the edges. In this case, Loop 1 (or the inner loop of Loop 2) may be rewritten as follows.
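As an illustration, a reduced i/a loop of this kind could look as follows; the offset ipoi0 between the edge number and its first point, and the group limits nedg0 and nedg1, are assumed bookkeeping quantities, while lnoed, geoed, unkno and rhspo are the arrays used in the other loops of this chapter.

    ! Sketch of a reduced i/a edge loop: within the group nedg0:nedg1 the first
    ! point of each edge increases with stride one, so only the second point
    ! requires indirect addressing.
    do iedge=nedg0,nedg1
      ipoi1=ipoi0+iedge                   ! first point: direct, stride-one access
      ipoi2=lnoed(2,iedge)                ! second point: indirect access (i/a)
      redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
      rhspo(ipoi1)=rhspo(ipoi1)+redge     ! direct store
      rhspo(ipoi2)=rhspo(ipoi2)-redge     ! single i/a store
    enddo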
Compared to Loop 1, the number of i/a fetch and store operations has been halved, while the number of FLOPs remains unchanged. However, unlike stars, superedges and chains, the basic connectivity arrays remain unchanged, implying that a progressive rewrite of codes is possible. In the following, a point and edge renumbering algorithm is described that seeks to maximize the number of loops of this kind that can be obtained for tetrahedral grids.
15.2.5.1 Point renumbering

From the edge-connectivity array lnoed:
Obtain the points that surround each point;
Store the number of points surrounding each point: lpsup(1:npoin);
Set npnew=0;
Point Renumbering:
- while(npnew.ne.npoin):
- Obtain the point ipmax with the maximum value of lpsup(ip);
- do: for all points jpoin surrounding ipmax:
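A plausible completion of this renumbering loop is sketched below: the points surrounding the most connected point receive consecutive new numbers, so that many of the edges emanating from it obtain a stride-one first point. The arrays lpnew (the new point numbers) and psup1/psup2 (the linked list of points surrounding points), as well as the bookkeeping details, are assumptions made for illustration.

    ! Illustrative sketch of the reduced i/a point renumbering
    lpnew(1:npoin)=0                        ! new point numbers (0 = not yet assigned)
    npnew=0
    do while(npnew.ne.npoin)
      ipmax=maxloc(lpsup(1:npoin),1)        ! point with the largest surrounding count
      if(lpnew(ipmax).eq.0) then            ! number ipmax itself first
        npnew=npnew+1
        lpnew(ipmax)=npnew
      endif
      do istor=psup2(ipmax)+1,psup2(ipmax+1)
        jpoin=psup1(istor)                  ! point surrounding ipmax
        if(lpnew(jpoin).eq.0) then          ! give it the next consecutive number
          npnew=npnew+1
          lpnew(jpoin)=npnew
        endif
      enddo
      lpsup(ipmax)=0                        ! do not select ipmax again
    enddo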
15.2.5.2 Edge renumbering
Once the points have been renumbered, the edges are reordered according to point numbers as described above (section 15.1.4). Thereafter, they are grouped into vector groups to avoid memory contention (sections 15.2.1–15.2.3). In order to achieve the maximum (ordered) vector length possible, the highest point number is processed first. In this way, memory contention is delayed as much as possible. The resulting vector groups obtained in this way for the small 2-D example considered above are shown in Figure 15.18.
It is clear that not all of these groups will lead to a uniform stride access of the first point. These loops are still processed as before in Loop 1. The final form for the edge loop then takes the following form.
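As an illustration, the final edge loop could be organized as sketched below, with the usual i/a groups and the reduced i/a groups distinguished by a flag. The arrays edpas (cumulative last edge of each group, with edpas(1)=0), lredu (reduced i/a flag) and lpoi0 (first-point offset of each group) are assumed bookkeeping arrays, not necessarily the ones used in the original code.

    ! Sketch of the final edge loop: groups flagged as reduced i/a address the
    ! first point directly; the remaining groups are processed as in Loop 1.
    do ipass=1,npass
      nedg0=edpas(ipass)+1
      nedg1=edpas(ipass+1)
      if(lredu(ipass).eq.1) then               ! reduced i/a group
        do iedge=nedg0,nedg1
          ipoi1=lpoi0(ipass)+iedge             ! direct, stride-one first point
          ipoi2=lnoed(2,iedge)
          redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
          rhspo(ipoi1)=rhspo(ipoi1)+redge
          rhspo(ipoi2)=rhspo(ipoi2)-redge
        enddo
      else                                     ! usual i/a group (as in Loop 1)
        do iedge=nedg0,nedg1
          ipoi1=lnoed(1,iedge)
          ipoi2=lnoed(2,iedge)
          redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
          rhspo(ipoi1)=rhspo(ipoi1)+redge
          rhspo(ipoi2)=rhspo(ipoi2)-redge
        enddo
      endif
    enddo

A per-group test of this kind is presumably what is referred to below as the if-test in Loop 2a.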
Table 15.10 shows the percentage of edges processed in reduced i/a mode as a function of the desired vector length chosen. The table contains two values: the first is obtained if one insists on the vector length chosen; the second is obtained if the usual i/a vector groups are examined further, and snippets of sufficiently long (>64) reduced i/a edge groups are extracted from them. Observe that, even with considerable vector lengths, more than 90% of the edges can be processed in reduced i/a mode.
15.2.5.3 Avoidance of cache-misses
Renumbering the points according to their maximum connectivity, as required for the reduced i/a point renumbering described above, can lead to very large jumps in the point index for an edge (or, equivalently, to a large bandwidth of the resulting matrix). One can discern that for the structured mesh shown in Figure 15.20 the maximum jump in point index for edges is nbmax = O(Np/2), where Np is the number of points in the mesh.
Figure 15.19 F117: surface discretization
Table 15.10 F117 configuration: nedge=4,179,771

mvecl    % reduced i/a    nvecl    % reduced i/a    nvecl
Figure 15.20 Point jumps per edge for a structured grid
In general, the maximum jump will be nbmax = O((1 − 2/ncmax)Np), where ncmax is the maximum number of neighbours for a point. For tetrahedral grids, the average number of neighbours for a point is approximately nc = 14, implying nb = O(6Np/7). This high jump in point index per edge in turn leads to a large number of cache-misses and a consequent loss of performance for RISC-based machines. In order to counter this effect, a two-step procedure is employed. In a first pass, the points are renumbered for optimum cache performance using a bandwidth minimization technique (Cuthill–McKee, wave front, recursive bisection, space-filling curve, bin, coordinate-based, etc.). In a second pass, the points are renumbered for optimal i/a reduction using the algorithm outlined above. However, the algorithm is applied progressively on point groups npoi0:npoi0+ngrou, until all points have been covered. The size of the group ngrou corresponds approximately to the average bandwidth. In this way, the point renumbering operates on only a few hyperplanes or local zones at a time, avoiding large point jumps per edge and the associated cache-misses.
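Any of the bandwidth-minimization orderings listed above can serve for the first pass. As a minimal illustration, a breadth-first (wavefront-type) renumbering is sketched below; the seed point iseed, the queue array lorde and the psup1/psup2 linked list are assumptions (lpnew again denotes the new point numbers), and a full Cuthill–McKee ordering would additionally sort each level by degree.

    ! Minimal breadth-first (wavefront) point renumbering for the first pass:
    ! points are numbered level by level from a seed point, which keeps the
    ! bandwidth, and hence the point jumps per edge, small.
    lpnew(1:npoin)=0
    lpnew(iseed)=1                    ! seed point, e.g. a corner of the mesh
    lorde(1)=iseed                    ! old point numbers stored in their new order
    npnew=1
    inext=1                           ! next queue entry to be expanded
    do while(npnew.lt.npoin)
      if(inext.le.npnew) then
        ipoin=lorde(inext)            ! expand the next point of the current front
        do istor=psup2(ipoin)+1,psup2(ipoin+1)
          jpoin=psup1(istor)
          if(lpnew(jpoin).eq.0) then  ! unnumbered neighbour: append to the front
            npnew=npnew+1
            lpnew(jpoin)=npnew
            lorde(npnew)=jpoin
          endif
        enddo
        inext=inext+1
      else                            ! queue exhausted: seed a new (disconnected) zone
        do jpoin=1,npoin
          if(lpnew(jpoin).eq.0) then
            npnew=npnew+1
            lpnew(jpoin)=npnew
            lorde(npnew)=jpoin
            exit
          endif
        enddo
      endif
    enddo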
15.2.6 ALTERNATIVE RHS FORMATION
A number of test cases (Löhner and Galle (2002)) were run on both the Cray-SV1 and NEC-SX5 using the conventional Loop 2 and Loop 2a. The results were rather disappointing: Loop 2a was slightly more expensive than Loop 2, even for moderate (>256) vector lengths and more than 90% of the edges processed in reduced i/a mode. Careful analysis on the NEC-SX5 revealed that the problem was not in the fetches, but rather in the stores. Removing one of the stores almost doubled CPU performance. This observation led to the unconventional formation of the RHS with two vectors.
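One way to read this remark is that splitting the right-hand side over two arrays decouples the two store streams, so that the '+' scatter and the '−' scatter no longer reference the same array. A sketch along these lines, a dual-RHS counterpart of Loop 2 (presumably the Loop 2b referred to below), is given here; the names rhsp1 and rhsp2, and the assumption that the final right-hand side is obtained as rhsp1(ip)+rhsp2(ip) in the loops that consume it, are illustrative only.

    ! Sketch of an edge loop that forms the RHS with two vectors: the '+'
    ! contributions are accumulated into rhsp1 and the '-' contributions into
    ! rhsp2, so that the two scatter streams reference different arrays.
    do iedge=nedg0,nedg1
      ipoi1=lnoed(1,iedge)
      ipoi2=lnoed(2,iedge)
      redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
      rhsp1(ipoi1)=rhsp1(ipoi1)+redge     ! first RHS vector
      rhsp2(ipoi2)=rhsp2(ipoi2)-redge     ! second RHS vector
    enddo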
As the loops that initialize and use the right-hand side can be modified to work with both arrays, no additional initialization or summation of the two arrays is required. The same use of dual RHS vectors as in Loop 2b, applied to the reduced i/a loop, is denoted as Loop 2br in what follows. Finally, the if-test in Loop 2a may be removed by reordering the edge groups in such a way that all usual i/a groups are treated first and all reduced i/a groups thereafter. This loop is denoted as Loop 2bg.
As an example of the kind of speedup that can be achieved with this type of modified, reduced i/a loop, a sphere close to a wall is considered, typical of low-Reynolds-number flows.
Table 15.11 Sphere close to the wall: nedge=328,634

mvecl    % reduced i/a    nvecl    % reduced i/a    nvecl
Table 15.11 shows the percentage of edges processed in reduced i/a mode as a function of the desired vector length chosen.
Figure 15.21 Sphere in the wall proximity: mesh in the cut plane
Table 15.12 shows the relative timings recorded for a desired edge group length of 2048 on the SGI Origin2000, Cray-SV1 and NEC-SX5. One can see that gains are achieved in all cases, even though these machines vary in speed by approximately an order of magnitude, and the SGI has an L1 and L2 cache, i.e. no direct memory access. The biggest gains are achieved on the NEC-SX5 (almost 30% speedup).
Table 15.12 Laplacian RHS evaluation (relative timings)

Loop        SGI Origin2000    Cray-SV1    NEC-SX5
Loop 2      1.0000            1.0000      1.0000
Loop 2a     0.9563            1.0077      0.8362
Loop 2b     0.9943            0.8901      0.7554
Loop 2br    0.9484            0.8416      0.7331
Table 15.13 Speedups obtainable (Amdahl's Law)

Rp/Rs      50%     90%     99%      99.9%
100        1.98    9.90    50.25    90.99
1000       2.00    9.91    90.99    500.25
15.3 Parallel machines: general considerations
With the advent of massively parallel machines, i.e. machines in excess of 500 nodes, the exploitation of parallelism in solvers has become a major focus of attention. According to Amdahl's Law, the speedup s obtained by parallelizing a portion α of all operations required is given by

s = 1/(α · (Rs/Rp) + (1 − α)),        (15.1)

where Rs and Rp denote the scalar and parallel processing rates (speeds), respectively. Table 15.13 shows the speedups obtained for different percentages of parallelization and different numbers of processors.
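For example, with Rp/Rs = 100 and a parallelizable fraction of α = 0.99, equation (15.1) gives s = 1/(0.99/100 + 0.01) ≈ 50.25, the value listed in Table 15.13.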
Note that even on a traditional shared-memory, multi-processor vector machine, such as the Cray-T90 with 16 processors, the maximum achievable speedup between scalar code and parallel vector code is a staggering Rp/Rs = 240. What is important to note is that, as one migrates to higher numbers of processors, only the embarrassingly parallel codes will survive. Most of the applications ported successfully to parallel machines to date have followed the single program multiple data (SPMD) paradigm. For grid-based solvers, a spatial sub-domain was stored and updated in each processor. For particle solvers, groups of particles were stored and updated in each processor. For obvious reasons, load balancing (Williams (1990), Simon (1991), Mehrota et al. (1992), Vidwans et al. (1993)) has been a major focus of activity.
Despite the striking successes reported to date, only the simplest of all solvers, explicit timestepping or implicit iterative schemes, perhaps with multigrid added on, have been ported without major changes and/or problems to massively parallel machines with distributed memory. Many code options that are essential for realistic simulations are not easy to parallelize on this type of machine. Among these, we mention local remeshing (Löhner (1990)), repeated h-refinement, such as that required for transient problems (Löhner and Baum (1992)), contact detection and force evaluation (Haug et al. (1991)), some preconditioners (Martin and Löhner (1992), Ramamurti and Löhner (1993)), applications where particles, flow and chemistry interact, and other applications with rapidly varying load imbalances. Even if 99% of all operations required by these codes can be parallelized, the maximum achievable gain will be restricted to 1:100. If we accept as a fact that for most large-scale codes we may not be able to parallelize more than 99% of all operations, the shared-memory paradigm, discarded for a while as non-scalable, may make a comeback. It is far easier to parallelize some of the more complex algorithms, as well as cases with large load imbalance, on a shared-memory machine. Moreover, it is within present technological reach to achieve a 100-processor, shared-memory machine.
15.4 Shared-memory parallel machines
The alternative of having less expensive RISC chips linked via shared memory is currently being explored by a number of vendors. One example of such a machine is the SGI Altix, which at the time of writing allows up to 256 processors to work in shared-memory mode on a problem. In order to obtain proper performance from such a machine, the codes must be written so as to avoid:
(a) cache-misses (in order to perform well on each processor);
(b) cache overwrite (in order to perform well in parallel); and
(c) memory contention (in order to allow pipelining).
Thus, although in principle a good compromise, shared-memory, RISC-based parallel machines actually require a fair degree of knowledge and reprogramming for codes to run optimally.
The reduction of cache-misses and the avoidance of memory contention have already been discussed before. Let us concentrate on the avoidance of cache overwrite. If one desires to port the element loop described above to a parallel, shared-memory machine, the easiest way to proceed is to let the auto-parallelizing compiler simply split the inner loop across processors. It would then appear that increasing the vector length to a sufficiently large value would offer a satisfactory solution. However, this is not advisable for the following reasons.

(a) Every time a parallel do-loop is launched, a start-up time penalty, equivalent to several hundred FLOPs, is incurred. This implies that if scalability to even 16 processors is to be achieved, the vector loop lengths would have to be 16 × 1000. Typical tetrahedral grids yield approximately 22 maximum vector length groups, indicating that one would need at least 22 × 16 × 1000 = 352 000 edges to run efficiently.

(b) Because the range of points in each group increases at least linearly with vector length, so do cache-misses. This implies that, even though one may gain parallelism, the individual processor performance would degrade. The end result is a very limited, non-scalable gain in performance.

(c) Because the points in a split group access a large portion of the edge array, different processors may be accessing the same cache-line. When a 'dirty cache-line' overwrite occurs, all processors must update this line, leading to a large increase of interprocessor communication, severe performance degradation and non-scalability. Experiments on an 8-processor SGI Power Challenge showed a maximum speedup of only 1:2.5 when using this option. This limited speedup was attributed, to a large extent, to cache-line overwrites.
In view of these consequences, additional renumbering strategies have to be implemented. In the following, two edge group agglomeration techniques that minimize cache-misses, allow for pipelining on each processor and avoid cache overwrite across processors are discussed. Both techniques operate on the premise that the points accessed within each parallel inner edge loop (the 1600 loop) do not overlap.
Before going on, we define edmin(1:npass) and edmax(1:npass) to be the minimum and maximum points accessed within each group, respectively, nproc the number of processors, and [edmin(ipass),edmax(ipass)] the point range of each group ipass.
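As an illustration, these ranges could be obtained with a loop of the following kind, assuming (as before) that edpas stores the cumulative last edge of each group:

    ! Sketch: minimum and maximum point accessed by each edge group
    do ipass=1,npass
      edmin(ipass)=npoin
      edmax(ipass)=1
      do iedge=edpas(ipass)+1,edpas(ipass+1)
        edmin(ipass)=min(edmin(ipass),lnoed(1,iedge),lnoed(2,iedge))
        edmax(ipass)=max(edmax(ipass),lnoed(1,iedge),lnoed(2,iedge))
      enddo
    enddo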
15.4.1 LOCAL AGGLOMERATION
The first way of achieving pipelining and parallelization is by processing, in parallel, nproc independent vector groups whose individual point ranges do not overlap. The idea is to renumber the edges in such a way that nproc groups are joined together where, for each one of these groups, the point ranges do not overlap (see Figure 15.22).
Figure 15.22 Local agglomeration
As each one of the sub-groups has the same number of edges, the load is balanced across the processors. The actual loop is given by the following.
do 1600 iedge=nedg0,nedg1
ipoi1=lnoed(1,iedge)
ipoi2=lnoed(2,iedge)
redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
rhspo(ipoi1)=rhspo(ipoi1)+redge
rhspo(ipoi2)=rhspo(ipoi2)-redge
1600 continue
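The way the agglomerated groups are dispatched in parallel is not tied to a particular directive syntax. As an illustration, an OpenMP-style driver could look as follows, where nmacr, edpas (cumulative last edge of each sub-group, with edpas(1)=0) and the layout of nproc consecutive sub-groups per macro-group are assumptions rather than the original code.

    do imacr=1,nmacr                              ! sequential loop over macro-groups
!$omp parallel do private(ipass,nedg0,nedg1,iedge,ipoi1,ipoi2,redge)
      do ipass=(imacr-1)*nproc+1,imacr*nproc      ! nproc sub-groups with
        nedg0=edpas(ipass)+1                      ! non-overlapping point ranges
        nedg1=edpas(ipass+1)
        do iedge=nedg0,nedg1                      ! the 1600 loop of each sub-group
          ipoi1=lnoed(1,iedge)
          ipoi2=lnoed(2,iedge)
          redge=geoed(iedge)*(unkno(ipoi2)-unkno(ipoi1))
          rhspo(ipoi1)=rhspo(ipoi1)+redge
          rhspo(ipoi2)=rhspo(ipoi2)-redge
        enddo
      enddo
!$omp end parallel do
    enddo

Because the point ranges of the nproc sub-groups within a macro-group do not overlap, the updates of rhspo proceed without any race condition or cache-line overwrite across processors.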