L1 Initialize pointer lists for elements, points and receive lists;
L2 For each point ipoin:
   Get the smallest domain number idmin of the elements that surround it; store this number in lpmin(ipoin);
   For each element that surrounds this point:
      If the domain number of this element is larger than idmin:
         - Add this element to domain idmin;
L3 For the points of each sub-domain idomn:
      If lpmin(ipoin).ne.idomn:
         add this information to the receive list for this sub-domain;
      Endif
L4 Order the receive list of each sub-domain according to sub-domains;
L5 Given the receive lists, build the send list for each sub-domain.
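As an illustration of steps L2 and L3, a minimal sketch follows. The element connectivity array inpoel, the element-to-domain array eldom and all other names are assumptions of this sketch rather than the book's data structures, and the per-domain storage of the receive lists is only indicated.

  ! Illustrative sketch (all names assumed) of steps L2-L3 above:
  ! lpmin(ipoin) is the smallest domain number among the elements that
  ! surround point ipoin; a point appearing in a sub-domain whose number
  ! differs from lpmin(ipoin) has to be received from domain lpmin(ipoin).
  subroutine build_receive_flags(npoin, nelem, nnode, ndomn, &
                                 inpoel, eldom, lpmin, nrecv)
    implicit none
    integer, intent(in)  :: npoin, nelem, nnode, ndomn
    integer, intent(in)  :: inpoel(nnode, nelem)   ! element connectivity
    integer, intent(in)  :: eldom(nelem)           ! domain number of each element
    integer, intent(out) :: lpmin(npoin)           ! smallest surrounding domain
    integer, intent(out) :: nrecv(ndomn)           ! nr. of receive entries per domain
    integer :: ielem, inode, ipoin, idomn

    lpmin(1:npoin) = ndomn + 1
    do ielem = 1, nelem                            ! step L2: smallest surrounding domain
      do inode = 1, nnode
        ipoin = inpoel(inode, ielem)
        lpmin(ipoin) = min(lpmin(ipoin), eldom(ielem))
      end do
    end do

    nrecv(1:ndomn) = 0
    do ielem = 1, nelem                            ! step L3: count received points
      idomn = eldom(ielem)
      do inode = 1, nnode
        ipoin = inpoel(inode, ielem)
        if (lpmin(ipoin) /= idomn) nrecv(idomn) = nrecv(idomn) + 1
      end do
    end do
    ! (in a real code the point numbers would be stored in per-domain
    !  receive lists and duplicates removed; omitted here for brevity)
  end subroutine build_receive_flags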
Given the send and receive lists, the information transfer required for the parallel explicit flow solver is accomplished as follows:
- Send the updated unknowns of all nodes stored in the send list;
- Receive the updated unknowns of all nodes stored in the receive list;
- Overwrite the unknowns for these received points.
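A minimal sketch of such an exchange using non-blocking MPI calls is given below (the text mentions MPI as the message-passing library). The CSR-type storage of the send and receive lists, the array names and the use of a single message tag are assumptions of this sketch, not the book's implementation.

  ! Illustrative halo exchange for the unknowns unkno(nvar,npoin):
  ! pack the send-list points, exchange with each neighbour, and
  ! overwrite the receive-list points (sketch; names assumed).
  subroutine exchange_unknowns(nvar, npoin, unkno, nneig, neigh, &
                               ls2, ls1, lr2, lr1, comm)
    use mpi
    implicit none
    integer, intent(in)    :: nvar, npoin, nneig, comm
    integer, intent(in)    :: neigh(nneig)          ! neighbour ranks
    integer, intent(in)    :: ls2(nneig+1), ls1(*)  ! send list, CSR storage
    integer, intent(in)    :: lr2(nneig+1), lr1(*)  ! receive list, CSR storage
    real(8), intent(inout) :: unkno(nvar, npoin)
    real(8), allocatable   :: bufs(:,:), bufr(:,:)
    integer :: in, j, k, ierr
    integer :: req(2*nneig)

    allocate(bufs(nvar, ls2(nneig+1)), bufr(nvar, lr2(nneig+1)))
    req(:) = MPI_REQUEST_NULL

    do in = 1, nneig                                ! post receives
      k = lr2(in+1) - lr2(in)
      if (k > 0) call MPI_Irecv(bufr(1, lr2(in)+1), nvar*k, MPI_DOUBLE_PRECISION, &
                                neigh(in), 1, comm, req(in), ierr)
    end do
    do in = 1, nneig                                ! pack and send
      do j = ls2(in)+1, ls2(in+1)
        bufs(:, j) = unkno(:, ls1(j))
      end do
      k = ls2(in+1) - ls2(in)
      if (k > 0) call MPI_Isend(bufs(1, ls2(in)+1), nvar*k, MPI_DOUBLE_PRECISION, &
                                neigh(in), 1, comm, req(nneig+in), ierr)
    end do
    call MPI_Waitall(2*nneig, req, MPI_STATUSES_IGNORE, ierr)

    do in = 1, nneig                                ! overwrite received points
      do j = lr2(in)+1, lr2(in+1)
        unkno(:, lr1(j)) = bufr(:, j)
      end do
    end do
    deallocate(bufs, bufr)
  end subroutine exchange_unknowns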
In order to demonstrate the use of explicit flow solvers on MIMD machines, we consider the same supersonic inlet problem as described above for shared-memory parallel machines (see Figure 15.24). The solution obtained on a 6-processor MIMD machine after 800 timesteps is shown in Figure 15.28(a). The boundaries of the different domains can be clearly distinguished. Figure 15.28(b) summarizes the speedups obtained for a variety of platforms using MPI as the message passing library, as well as the shared memory option. Observe that an almost linear speedup is obtained. For large-scale industrial applications of domain decomposition in conjunction with advanced compressible flow solvers, see Mavriplis and Pirzadeh (1999).
15.7 The effect of Moore’s law on parallel computing
One of the most remarkable constants in a rapidly changing world has been the rate of growth of the number of transistors that are packaged onto a square inch. This rate, commonly known as Moore's Law, is approximately a factor of two every 18 months, which translates into a factor of 10 every 5 years (Moore (1965, 1999)). As one can see from Figure 15.29, this rate, which governs the increase in computing speed and memory, has held constant for more than three decades, and there is no end in sight for the foreseeable future (Moore (2003)). One may argue that the raw number of transistors does not translate into CPU performance. However, more transistors translate into more registers and more cache, both important elements to achieve higher throughput. At the same time, clock rates have increased, and pre-fetching and branch prediction have improved. Compiler development has also not stood still. Moreover, programmers have become conscious of the added cost of memory access, cache misses and dirty cache lines, employing the techniques described above to minimize their impact. The net effect, reflected in all current projections, is that CPU performance is going to continue advancing at a rate comparable to Moore's Law.
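The two rates quoted above are mutually consistent; with the 18-month doubling time assumed in the text,
\[
2^{\,60/18} = 2^{10/3} \approx 10.1 \ \text{per 5 years},
\qquad\text{and}\qquad
10^{\,20/5} = 10^{4}\ \text{over two decades},
\]
the latter being the factor of 1:10 000 used in the life-cycle discussion below.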
Figure 15.28 Supersonic inlet: (a) MIMD results (Mach number, usual vs. 6-processor run; min=0.825, max=3.000, incr=0.05); (b) speedup for different machines (ideal, SGI-O2K SHM, SGI-O2K MPI, IBM-SP2 MPI, HP-DAX MPI) versus number of processors (1 to 32)
Figure 15.29 Evolution of transistor density
15.7.1 THE LIFE CYCLE OF SCIENTIFIC COMPUTING CODES
Let us consider the effects of Moore's Law on the life cycle of typical large-scale scientific computing codes. The life cycle of these codes may be subdivided into the following stages.

In the conceptual stage, the basic purpose of the code is defined, the physics to be simulated identified and proper algorithms are selected and coded. The many possible algorithms are compared, and the best is kept. A run during this stage may take weeks or months to complete. A few of these runs may even form the core of a PhD thesis.

The demonstration stage consists of several large-scale runs that are compared to experiments or analytical solutions. As before, a run during this stage may take weeks or months to complete. Typically, during this stage the relevant time-consuming parts of the code are optimized for speed.

Once the basic code is shown to be useful, it may be adopted for production runs. This implies extensive benchmarking for relevant applications, quality assurance, bookkeeping of versions, manuals, seminars, etc. For commercial software, this phase is also referred to as industrialization of a code. It is typically driven by highly specialized projects that qualify the code for a particular class of simulations, e.g. air conditioning or external aerodynamics of cars.

If the code is successful and can provide a simulation capability not offered by competitors, the fourth phase, i.e. widespread use and acceptance, will follow naturally. An important shift is then observed: the 'missionary phase' (why do we need this capability?) suddenly transitions into a 'business as usual' phase (how could we ever design anything without this capability?). The code becomes an indispensable tool in industrial research, development, design and analysis. It forms part of the widely accepted body of 'best practices' and is regarded as commercial off-the-shelf (COTS) technology.

One can envision a fifth phase, where the code is embedded into a larger module, e.g. a control device that 'calculates on the fly' based on measurement input. The technology embodied by the code has then become part of the common knowledge and the source is freely available.
The time from conception to widespread use can span more than two decades. During this time, computing power will have increased by a factor of 1:10 000. Moreover, during a decade, algorithmic advances and better coding will improve performance by at least another factor of 1:10. Let us consider the role of parallel computing in light of these advances.
During the demonstration stage, runs may take weeks or months to complete on the largest machine available at the time. This places heavy emphasis on parallelization. Given that optimal performance is key, and massive parallelism seems the only possible way of solving the problem, distributed memory parallelism on O(10^3) processors is perhaps the only possible choice. The figure of O(10^3) processors is derived from experience: even as a high-end user with sometimes highly visible projects, the author has never been able to obtain a larger number of processors with consistent availability in the last two decades. Moreover, no improvement is foreseeable in the future. The main reason lies in the usage dynamics of large-scale computers: once online, a large audience requests time on them, thereby limiting the maximum number of processors available on a regular basis for production runs.
Once the code reaches production status, a shift in emphasis becomes apparent. More and more 'options' are demanded, and these have to be implemented in a timely manner. Another five years have passed, and by this time processors have become faster (and memory has increased) by a further factor of 1:10, implying that the same run that used to take O(10^3) processors can now be run on O(10^2) processors. Given this relatively small number of processors, and the time constraints for new options/variants, shared memory parallelism becomes the most attractive option.
The widespread acceptance of a successful code will only accentuate the emphasis on quick implementation of options and user-specific demands. Widespread acceptance also implies that the code will no longer run exclusively on supercomputers, but will migrate to high-end servers and ultimately PCs. The code has now been in production for at least 5 years, implying that computing power has increased again by another factor of 1:10. The same run that used to take O(10^3) processors in the demonstration stage can now be run using O(10^1) processors, and soon will be within reach of O(1) processors. Given that user-specific demands dominate at this stage, and that the developers are now catering to a large user base working mostly on low-end machines, parallelization diminishes in importance, even to the point of completely disappearing as an issue. As parallelization implies extra time devoted to coding, thereby hindering fast code development, it may be removed from consideration at this stage.
One could consider a fifth phase, 20 years into the life of the code. The code has become an indispensable commodity tool in the design and analysis process, and is run thousands of times per day. Each of these runs is part of a stochastic analysis or optimization loop, and is performed on a commodity chip-based, uni-processor machine. Moore's Law has effectively removed parallelism from the code.
Figure 15.30 summarizes the life cycle of typical scientific computing codes.

Figure 15.30 Life cycle of scientific computing codes (concept, demonstration, production, widespread use, COTS and embedded phases plotted against time)
15.7.2 EXAMPLES
Let us consider two examples where the life cycle of codes described above has become apparent.
15.7.2.1 External missile aerodynamics
The first example considers aerodynamic force and moment predictions for missiles. Worldwide, approximately 100 new missiles or variations thereof appear every year. In order to assess their flight characteristics, the complete force and moment data for the expected flight envelope must be obtained. Simulations of this type based on the Euler equations require approximately O(10^6–10^7) elements, special limiters for supersonic flows, semi-empirical estimation of viscous effects and numerous specific options such as transpiration boundary conditions, modelling of control surfaces, etc. The first demonstration/feasibility studies took place in the early 1980s. At that time, it took the fastest production machine of the day (Cray-XMP) a night to compute such flows. The codes used were based on structured grids (Chakravarthy and Szema (1987)), as the available memory was small compared to the number of gridpoints. The increase of memory, together with the development of codes based on unstructured (Mavriplis (1991b), Luo et al. (1994)) or adaptive Cartesian grids (Melton et al. (1993), Aftosmis et al. (2000)), as well as faster, more robust solvers (Luo et al. (1998)), allowed for a high degree of automation. At present, external missile aerodynamics can be accomplished on a PC in less than an hour, and runs are carried out daily by the thousands for envelope scoping and simulator input on PC clusters (Robinson (2002)). Figure 15.31 shows an example.
Figure 15.31 External missile aerodynamics
15.7.2.2 Blast simulations
The second example considers pressure loading predictions for blasts. Simulations of this type based on the Euler equations require approximately O(10^6–10^8) elements, special limiters for transient shocks, and numerous specific options such as links to damage prediction post-processors. The first demonstration/feasibility studies took place in the early 1990s (Baum and Löhner (1991), Baum et al. (1993, 1995, 1996)). At that time, it took the fastest available machine (Cray-C90 with special memory) several days to compute such flows. The increase of processing power via shared memory machines during the past decade has allowed for a considerable increase in problem size, physical realism via coupled CFD/CSD runs (Löhner and Ramamurti (1995), Baum et al. (2003)) and a high degree of automation. At present, blast predictions with O(2x10^6) elements can be carried out on a PC in a matter of hours (Löhner et al. (2004c)), and runs are carried out daily by the hundreds for maximum possible damage assessment on networks of PCs. Figure 15.32 shows the results of such a prediction based on genetic algorithms for a typical city environment (Togashi et al. (2005)). Each dot represents an end-to-end run (grid generation of approximately 1.5 million tetrahedra, blast simulation with an advanced CFD solver, damage evaluation), which takes approximately 4 hours on a high-end PC. The scale denotes the estimated damage produced by the blast at the given point. This particular run was done on a network of PCs and is typical of the migration of high-end applications to PCs due to Moore's Law.
Figure 15.32 Maximum possible damage assessment for inner city
15.7.3 THE CONSEQUENCES OF MOORE’S LAW
The statement that parallel computing diminishes in importance as codes mature is predicated
on two assumptions:
- the doubling of computing power every 18 months will continue;
- the total number of operations required to solve the class of problems the code was designed for has an asymptotic (finite) value.
The second assumption may seem the most difficult to accept. After all, a natural side effect of increased computing power has been the increase in problem size (grid points, material models, time of integration, etc.). However, for any class of problem there is an intrinsic limit for the problem size, given by the physical approximation employed. Beyond a certain point, the physical approximation does not yield any more information. Therefore, we may have to accept that parallel computing diminishes in importance as a code matures.
This last conclusion does not in any way diminish the overall significance of parallel computing. Parallel computing is an enabling technology of vital importance for the development of new high-end applications. Without it, innovation would seriously suffer.
On the other hand, without Moore's Law many new code developments would appear unjustified. If computing time did not decrease in the future, the range of applications would soon be exhausted. CFD developers worldwide have always subconsciously assumed Moore's Law when developing improved CFD algorithms and techniques.
16 SPACE-MARCHING AND DEACTIVATION
For several important classes of problems, the propagation behaviour inherent in the PDEs being solved can be exploited, leading to considerable savings in CPU requirements. Examples where this propagation behaviour can lead to faster algorithms include:
- detonation: no change to the flowfield occurs ahead of the detonation wave;
- supersonic flows: a change of the flowfield can only be influenced by upstream events, but never by downstream disturbances; and
- scalar transport: a change of the transported variable can only occur in the downstream region, and only if a gradient in the transported variable or a source is present.
The present chapter shows how to combine physics and data structures to arrive at faster solutions. Heavy emphasis is placed on space-marching, where these techniques have reached considerable maturity. However, the concepts covered are generally applicable.
16.1 Space-marching
One of the most efficient ways of computing supersonic flowfields is via so-called space-marching techniques. These techniques make use of the fact that in a supersonic flowfield no information can travel upstream. Starting from the upstream boundary, the solution is obtained by marching in the downstream direction, obtaining the solution for the next downstream plane (for structured (Kutler (1973), Schiff and Steger (1979), Chakravarthy and Szema (1987), Matus and Bender (1990), Lawrence et al. (1991)) or semi-structured (McGrory et al. (1991), Soltani et al. (1993)) grids), subregion (Soltani et al. (1993), Nakahashi and Saitoh (1996), Morino and Nakahashi (1999)) or block. In the following, we will denote as a subregion a narrow band of elements, and by a block a larger region of elements (e.g. one-fifth of the mesh). The updating procedure is repeated until the whole field has been covered, yielding the desired solution.

In order to estimate the possible savings in CPU requirements, let us consider a steady-state run. Using local timesteps, it will take an explicit scheme approximately O(n_s) steps to converge, where n_s is the number of points in the streamwise direction. The total number of operations will therefore be O(n_t · n_s^2), where n_t is the average number of points in the transverse planes. Using space-marching, we have, ideally, O(1) steps per active domain, implying a total work of O(n_t · n_s). The gain in performance could therefore approach O(1:n_s) for large n_s. Such gains are seldom realized in practice, but it is not uncommon to see gains in excess of 1:10.
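In symbols, with $C$ denoting the average work per point and per step (a sketch of the estimate just made, not a formal bound):
\[
W_{\text{time}} \sim C\,n_t n_s \cdot O(n_s) = O(n_t n_s^2),
\qquad
W_{\text{march}} \sim C\,n_t n_s \cdot O(1) = O(n_t n_s),
\qquad
\frac{W_{\text{time}}}{W_{\text{march}}} = O(n_s).
\]
For a mesh with $n_s \approx 100$ points in the streamwise direction the ideal gain would thus be two orders of magnitude, whereas the text quotes gains in excess of 1:10 as typical in practice.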
Of the many possible variants, the space-marching procedure proposed by Nakahashi and Saitoh (1996) appears to be the most general, and is treated here in detail. The method can be used with any explicit time-marching procedure, it allows for embedded subsonic regions and it is well suited for unstructured grids, enabling a maximum of geometrical flexibility. The method works with a subregion concept (see Figure 16.1). The flowfield is only updated in the so-called active domain. Once the residual has fallen below a preset tolerance, the active domain is shifted. Should subsonic pockets appear in the flowfield, the active domain is changed appropriately.
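The control logic just described can be sketched as a small driver. The two contained routines are dummies standing in for the remasking and explicit-update routines of an actual solver; all names and tolerances are assumptions of this sketch, not the book's code.

  ! Illustrative control loop for subregion space-marching: iterate the
  ! active domain until the monitored residual falls below a tolerance,
  ! then shift the active domain downstream.
  program march_control
    implicit none
    real(8), parameter :: toler = 1.0d-3   ! residual tolerance per subregion
    integer, parameter :: maxit = 500      ! safeguard on iterations per subregion
    integer :: idoma, iter
    real(8) :: resid
    logical :: done

    idoma = 0
    done  = .false.
    do while (.not. done)
      idoma = idoma + 1
      call update_active_domain(idoma, done)   ! remask points/edges (cf. Figure 16.1)
      if (done) exit
      do iter = 1, maxit
        call timestep_active(iter, resid)      ! explicit update of active points only
        if (resid < toler) exit                ! subregion converged: shift domain
      end do
      print '(a,i0,a,i0,a)', 'subregion ', idoma, ' converged after ', iter, ' steps'
    end do

  contains

    subroutine update_active_domain(idoma, done)
      integer, intent(in)  :: idoma
      logical, intent(out) :: done
      done = (idoma > 10)                      ! dummy: ten subregions in this toy
    end subroutine update_active_domain

    subroutine timestep_active(iter, resid)
      integer, intent(in)  :: iter
      real(8), intent(out) :: resid
      resid = 0.5d0**iter                      ! dummy residual decay
    end subroutine timestep_active

  end program march_control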
Figure 16.1 Masking of points (maskp = 0, ..., 6): uncomputed field, active domain with residual-monitor region, computed field, and flow direction
In the following, we consider computational aspects of Nakahashi and Saitoh's space-marching scheme and a blocking scheme, in order to make them as robust and efficient as possible without a major change in existing codes. The techniques are considered in the following order: masking of edges and points, renumbering of points and edges, grouping to avoid memory contention, extrapolation of the solution for new active points, treatment of subsonic pockets, proper measures for convergence, the use of space-marching within implicit, time-accurate solvers for supersonic flows and macro-blocking.
16.1.1 MASKING OF POINTS AND EDGES
As seen in the previous chapters, any timestepping scheme requires the evaluation of fluxes, residuals, etc. These operations typically fall into two categories:
(a) point loops, which are of the form
do ipoin=1,npoin
do work on the point level
enddo
(b) edge loops, which are of the form
do iedge=1,nedge
gather point information
do work on the edge level
scatter-add edge results to points
enddo
The first loop is typical of unknown updates in multistage Runge–Kutta schemes, initialization of residuals or other point sums, pressure, speed of sound evaluations, etc. The second loop is typical of flux summations, artificial viscosity contributions, gradient calculations and the evaluation of the allowable timestep. For cell-based schemes, point loops are replaced by cell loops and edge loops are replaced by face loops. However, the nature of these loops remains the same. The bulk of the computational effort of any scheme is usually carried out in loops of the second type.
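As an illustration of the two loop types, a sketch of a point loop and of a gather/scatter edge loop accumulating a toy edge flux into a point residual is given below; all array names are assumptions of this sketch.

  ! Illustrative point loop (a) and edge loop (b) of the two types
  ! described above (sketch; array names assumed).
  subroutine point_and_edge_loops(npoin, nedge, edges, geoed, unkno, rhspo)
    implicit none
    integer, intent(in)  :: npoin, nedge
    integer, intent(in)  :: edges(2, nedge)     ! endpoints of each edge
    real(8), intent(in)  :: geoed(nedge)        ! edge weight (e.g. dual-face area)
    real(8), intent(in)  :: unkno(npoin)        ! unknown at the points
    real(8), intent(out) :: rhspo(npoin)        ! accumulated residual
    integer :: ipoin, iedge, ip1, ip2
    real(8) :: flux

    do ipoin = 1, npoin                         ! (a) point loop: initialization
      rhspo(ipoin) = 0.0d0
    end do

    do iedge = 1, nedge                         ! (b) edge loop
      ip1 = edges(1, iedge)                     ! gather point information
      ip2 = edges(2, iedge)
      flux = geoed(iedge) * 0.5d0 * (unkno(ip1) + unkno(ip2))   ! edge-level work (toy flux)
      rhspo(ip1) = rhspo(ip1) + flux            ! scatter-add edge results to points
      rhspo(ip2) = rhspo(ip2) - flux
    end do
  end subroutine point_and_edge_loops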
In order to decide where to update the solution, points and edges need to be classified or 'masked'. Many options are possible here, and we follow the notation proposed by Nakahashi and Saitoh (1996) (see Figure 16.1):

maskp=0: point in downstream, uncomputed field;
maskp=1: point in downstream, uncomputed field, connected to active domain;
maskp=2: point in active domain;
maskp=3: point of maskp=2, with connection to points of maskp=4;
maskp=4: point in the residual-monitor subregion of the active domain;
maskp=5: point in the upstream computed field, with connection to active domain;
maskp=6: point in the upstream computed field.

The edges for which work has to be carried out then comprise all those for which at least one of the endpoints satisfies 0<maskp<6. These active edges are marked as maske=1, while all others are marked as maske=0.
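The determination of the active edges from the point markings can be coded directly; a sketch, with the edge array name assumed:

  ! Mark as active every edge for which at least one endpoint has
  ! 0 < maskp < 6 (sketch; edges/maskp/maske names assumed).
  subroutine mask_edges(nedge, edges, npoin, maskp, maske)
    implicit none
    integer, intent(in)  :: nedge, npoin
    integer, intent(in)  :: edges(2, nedge)
    integer, intent(in)  :: maskp(npoin)
    integer, intent(out) :: maske(nedge)
    integer :: iedge, m1, m2

    do iedge = 1, nedge
      m1 = maskp(edges(1, iedge))
      m2 = maskp(edges(2, iedge))
      if ((m1 > 0 .and. m1 < 6) .or. (m2 > 0 .and. m2 < 6)) then
        maske(iedge) = 1
      else
        maske(iedge) = 0
      end if
    end do
  end subroutine mask_edges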
The easiest way to convert a time-marching code into a space- or domain-marching code is by rewriting the point and edge loops as follows.

Loop 1a:
do ipoin=1,npoin
if(maskp(ipoin).gt.0.and.maskp(ipoin).lt.6) then
do work on the point level
endif
enddo

Loop 2a:
do iedge=1,nedge
if(maske(iedge).eq.1) then
gather point information
do work on the edge level
scatter-add edge results to points
endif
enddo
For typical aerodynamic configurations, resolution of geometrical detail and flow features will dictate the regions with smaller elements. In order to be as efficient as possible, the region being updated at any given time should be chosen as small as possible. This implies that, in regions of large elements, there may exist edges that connect points marked as maskp=4 to points marked as maskp=0. In order to leave at least one layer of points in the safety region, a pass over the edges is performed, setting the downstream point to maskp=4 for edges with point markings maskp=2,0.
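A literal transcription of this safety-layer rule might read as follows (a sketch; the edge and masking array names are assumed, and the maskp=0 endpoint of such an edge is taken to be the downstream one):

  ! For every edge whose endpoints carry the markings maskp=2 and
  ! maskp=0, raise the maskp=0 (downstream) endpoint to maskp=4
  ! (sketch only; array names assumed).
  subroutine add_safety_layer(nedge, edges, npoin, maskp)
    implicit none
    integer, intent(in)    :: nedge, npoin
    integer, intent(in)    :: edges(2, nedge)
    integer, intent(inout) :: maskp(npoin)
    integer :: iedge, ip1, ip2

    do iedge = 1, nedge
      ip1 = edges(1, iedge)
      ip2 = edges(2, iedge)
      if (maskp(ip1) == 2 .and. maskp(ip2) == 0) maskp(ip2) = 4
      if (maskp(ip2) == 2 .and. maskp(ip1) == 0) maskp(ip1) = 4
    end do
  end subroutine add_safety_layer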
16.1.2 RENUMBERING OF POINTS AND EDGES
For a typical space-marching problem, a large percentage of points in Loop 1a will not satisfy the if-statement, leading to unnecessary work. Renumbering the points according to the marching direction has the twofold advantage of a reduction in cache misses and the possibility of bounding the active point region locally. Defining
npami: the minimum point number in the active region,
npamx: the maximum point number in the active region,
npdmi: the minimum point number touched by active edges,
npdmx: the maximum point number touched by active edges,
Loop 1a may now be rewritten as follows.
Loop 1b:
do ipoin=npami,npamx
if(maskp(ipoin).gt.0.and.maskp(ipoin).lt.6) then
do work on the point level
endif
enddo

Defining, similarly,
neami: the minimum active edge number,
neamx: the maximum active edge number,
Loop 2a may now be rewritten as follows.
Loop 2b:
do iedge=neami,neamx
if(maske(iedge).eq.1) then
gather point information
do work on the edge level
scatter-add edge results to points
endif
enddo
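For completeness, the active point and edge ranges used in Loops 1b and 2b could be obtained with simple min/max passes over the renumbered points and edges; a sketch, with names as defined above:

  ! Sketch of how the active ranges used in Loops 1b and 2b might be
  ! determined once points and edges are renumbered along the marching
  ! direction (edge array name assumed).
  subroutine active_ranges(npoin, nedge, edges, maskp, maske, &
                           npami, npamx, neami, neamx)
    implicit none
    integer, intent(in)  :: npoin, nedge
    integer, intent(in)  :: edges(2, nedge)
    integer, intent(in)  :: maskp(npoin), maske(nedge)
    integer, intent(out) :: npami, npamx, neami, neamx
    integer :: ipoin, iedge

    npami = npoin + 1 ;  npamx = 0
    do ipoin = 1, npoin                       ! active point range
      if (maskp(ipoin) > 0 .and. maskp(ipoin) < 6) then
        npami = min(npami, ipoin)
        npamx = max(npamx, ipoin)
      end if
    end do

    neami = nedge + 1 ;  neamx = 0
    do iedge = 1, nedge                       ! active edge range
      if (maske(iedge) == 1) then
        neami = min(neami, iedge)
        neamx = max(neamx, iedge)
      end if
    end do
  end subroutine active_ranges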
16.1.3 GROUPING TO AVOID MEMORY CONTENTION
In order to achieve pipelining or vectorization, memory contention must be avoided. The enforcement of pipelining or vectorization is carried out using a compiler directive, as Loop 2b, which becomes an inner loop, still offers the possibility of memory contention. In this case, we have the following:
gather point information
do work on the edge level
scatter-add edge results to points
Figure 16.2 Near-optimal point-range access of edge groups
The loop structure is shown schematically in Figure 16.2. One is now in a position to remove the if-statement from the innermost loop, situating it outside. The inactive edge groups are marked, e.g. edpas(ipass)<0. This results in the following loop structure.
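A rough illustration of such a grouped loop is sketched below; it is not the book's listing, and the handling of the group pointers via abs(), the sign convention for edpas and the directive syntax are assumptions of this sketch.

  ! Grouped edge loop: within each edge group no point is accessed
  ! twice, the activity test sits outside the inner loop, and inactive
  ! groups are marked by a negative entry in edpas (sketch; names and
  ! conventions assumed).
  subroutine edge_loop_grouped(npass, edpas, nedge, edges, geoed, npoin, &
                               unkno, rhspo)
    implicit none
    integer, intent(in)    :: npass, nedge, npoin
    integer, intent(in)    :: edpas(npass+1)     ! group pointers; <0 marks an inactive group
    integer, intent(in)    :: edges(2, nedge)
    real(8), intent(in)    :: geoed(nedge), unkno(npoin)
    real(8), intent(inout) :: rhspo(npoin)
    integer :: ipass, iedge, nedg0, nedg1, ip1, ip2
    real(8) :: flux

    do ipass = 1, npass
      if (edpas(ipass) < 0) cycle                ! skip inactive group (test hoisted out)
      nedg0 = abs(edpas(ipass)) + 1              ! abs() recovers the pointer value
      nedg1 = abs(edpas(ipass+1))
  !dir$ ivdep                                    ! no memory contention within a group
      do iedge = nedg0, nedg1
        ip1  = edges(1, iedge)                   ! gather point information
        ip2  = edges(2, iedge)
        flux = geoed(iedge) * 0.5d0 * (unkno(ip1) + unkno(ip2))
        rhspo(ip1) = rhspo(ip1) + flux           ! scatter-add edge results to points
        rhspo(ip2) = rhspo(ip2) - flux
      end do
    end do
  end subroutine edge_loop_grouped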