L1 Initialize pointer lists for elements, points and receive lists;
L2 For each point ipoin:
   Get the smallest domain number idmin of the elements that surround it; store this number in lpmin(ipoin);
   For each element that surrounds this point:
      If the domain number of this element is larger than idmin:
         - Add this element to domain idmin;
L3 For the points of each sub-domain idomn:
      If lpmin(ipoin).ne.idomn:
         add this information to the receive list for this sub-domain;
      Endif
L4 Order the receive list of each sub-domain according to sub-domains;
L5 Given the receive lists, build the send list for each sub-domain.
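As an illustration of steps L2 and L3, a minimal sketch follows. The element connectivity array inpoel, the element-to-domain array eldom and all other names are assumptions of this sketch rather than the book's data structures, and the per-domain storage of the receive lists is only indicated.

  ! Illustrative sketch (all names assumed) of steps L2-L3 above:
  ! lpmin(ipoin) is the smallest domain number among the elements that
  ! surround point ipoin; a point appearing in a sub-domain whose number
  ! differs from lpmin(ipoin) has to be received from domain lpmin(ipoin).
  subroutine build_receive_flags(npoin, nelem, nnode, ndomn, &
                                 inpoel, eldom, lpmin, nrecv)
    implicit none
    integer, intent(in)  :: npoin, nelem, nnode, ndomn
    integer, intent(in)  :: inpoel(nnode, nelem)   ! element connectivity
    integer, intent(in)  :: eldom(nelem)           ! domain number of each element
    integer, intent(out) :: lpmin(npoin)           ! smallest surrounding domain
    integer, intent(out) :: nrecv(ndomn)           ! nr. of receive entries per domain
    integer :: ielem, inode, ipoin, idomn

    lpmin(1:npoin) = ndomn + 1
    do ielem = 1, nelem                            ! step L2: smallest surrounding domain
      do inode = 1, nnode
        ipoin = inpoel(inode, ielem)
        lpmin(ipoin) = min(lpmin(ipoin), eldom(ielem))
      end do
    end do

    nrecv(1:ndomn) = 0
    do ielem = 1, nelem                            ! step L3: count received points
      idomn = eldom(ielem)
      do inode = 1, nnode
        ipoin = inpoel(inode, ielem)
        if (lpmin(ipoin) /= idomn) nrecv(idomn) = nrecv(idomn) + 1
      end do
    end do
    ! (in a real code the point numbers would be stored in per-domain
    !  receive lists and duplicates removed; omitted here for brevity)
  end subroutine build_receive_flags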
Given the send and receive lists, the information transfer required for the parallel explicit flow solver is accomplished as follows:
- Send the updated unknowns of all nodes stored in the send list;
- Receive the updated unknowns of all nodes stored in the receive list;
- Overwrite the unknowns for these received points.
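A minimal sketch of such an exchange using non-blocking MPI calls is given below (the text mentions MPI as the message-passing library). The CSR-type storage of the send and receive lists, the array names and the use of a single message tag are assumptions of this sketch, not the book's implementation.

  ! Illustrative halo exchange for the unknowns unkno(nvar,npoin):
  ! pack the send-list points, exchange with each neighbour, and
  ! overwrite the receive-list points (sketch; names assumed).
  subroutine exchange_unknowns(nvar, npoin, unkno, nneig, neigh, &
                               ls2, ls1, lr2, lr1, comm)
    use mpi
    implicit none
    integer, intent(in)    :: nvar, npoin, nneig, comm
    integer, intent(in)    :: neigh(nneig)          ! neighbour ranks
    integer, intent(in)    :: ls2(nneig+1), ls1(*)  ! send list, CSR storage
    integer, intent(in)    :: lr2(nneig+1), lr1(*)  ! receive list, CSR storage
    real(8), intent(inout) :: unkno(nvar, npoin)
    real(8), allocatable   :: bufs(:,:), bufr(:,:)
    integer :: in, j, k, ierr
    integer :: req(2*nneig)

    allocate(bufs(nvar, ls2(nneig+1)), bufr(nvar, lr2(nneig+1)))
    req(:) = MPI_REQUEST_NULL

    do in = 1, nneig                                ! post receives
      k = lr2(in+1) - lr2(in)
      if (k > 0) call MPI_Irecv(bufr(1, lr2(in)+1), nvar*k, MPI_DOUBLE_PRECISION, &
                                neigh(in), 1, comm, req(in), ierr)
    end do
    do in = 1, nneig                                ! pack and send
      do j = ls2(in)+1, ls2(in+1)
        bufs(:, j) = unkno(:, ls1(j))
      end do
      k = ls2(in+1) - ls2(in)
      if (k > 0) call MPI_Isend(bufs(1, ls2(in)+1), nvar*k, MPI_DOUBLE_PRECISION, &
                                neigh(in), 1, comm, req(nneig+in), ierr)
    end do
    call MPI_Waitall(2*nneig, req, MPI_STATUSES_IGNORE, ierr)

    do in = 1, nneig                                ! overwrite received points
      do j = lr2(in)+1, lr2(in+1)
        unkno(:, lr1(j)) = bufr(:, j)
      end do
    end do
    deallocate(bufs, bufr)
  end subroutine exchange_unknowns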
In order to demonstrate the use of explicit flow solvers on MIMD machines, we consider the same supersonic inlet problem as described above for shared-memory parallel machines (see Figure 15.24). The solution obtained on a 6-processor MIMD machine after 800 timesteps is shown in Figure 15.28(a). The boundaries of the different domains can be clearly distinguished. Figure 15.28(b) summarizes the speedups obtained for a variety of platforms using MPI as the message passing library, as well as the shared memory option. Observe that an almost linear speedup is obtained. For large-scale industrial applications of domain decomposition in conjunction with advanced compressible flow solvers, see Mavriplis and Pirzadeh (1999).
15.7 The effect of Moore’s law on parallel computing
One of the most remarkable constants in a rapidly changing world has been the rate of growth of the number of transistors that are packaged onto a square inch. This rate, commonly known as Moore's Law, is approximately a factor of two every 18 months, which translates into a factor of 10 every 5 years (Moore (1965, 1999)). As one can see from Figure 15.29, this rate, which governs the increase in computing speed and memory, has held constant for more than three decades, and there is no end in sight for the foreseeable future (Moore (2003)). One may argue that the raw number of transistors does not translate into CPU performance. However, more transistors translate into more registers and more cache, both important elements to achieve higher throughput. At the same time, clock rates have increased, and pre-fetching and branch prediction have improved. Compiler development has also not stood still. Moreover, programmers have become conscious of the added cost of memory access, cache misses and dirty cache lines, employing the techniques described above to minimize their impact. The net effect, reflected in all current projections, is that CPU performance is going to continue advancing at a rate comparable to Moore's Law.
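The two rates quoted above are mutually consistent; with the 18-month doubling time assumed in the text,
\[
2^{\,60/18} = 2^{10/3} \approx 10.1 \ \text{per 5 years},
\qquad\text{and}\qquad
10^{\,20/5} = 10^{4}\ \text{over two decades},
\]
the latter being the factor of 1:10 000 used in the life-cycle discussion below.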
Figure 15.28 Supersonic inlet: (a) MIMD results (Mach number, usual vs. 6-processor run; min=0.825, max=3.000, incr=0.05); (b) speedup for different machines (ideal, SGI-O2K SHM, SGI-O2K MPI, IBM-SP2 MPI, HP-DAX MPI) versus number of processors (1 to 32)
Figure 15.29 Evolution of transistor density
15.7.1 THE LIFE CYCLE OF SCIENTIFIC COMPUTING CODES
Let us consider the effects of Moore's Law on the life cycle of typical large-scale scientific computing codes. The life cycle of these codes may be subdivided into the following stages.

In the conceptual stage, the basic purpose of the code is defined, the physics to be simulated identified and proper algorithms are selected and coded. The many possible algorithms are compared, and the best is kept. A run during this stage may take weeks or months to complete. A few of these runs may even form the core of a PhD thesis.

The demonstration stage consists of several large-scale runs that are compared to experiments or analytical solutions. As before, a run during this stage may take weeks or months to complete. Typically, during this stage the relevant time-consuming parts of the code are optimized for speed.

Once the basic code is shown to be useful, it may be adopted for production runs. This implies extensive benchmarking for relevant applications, quality assurance, bookkeeping of versions, manuals, seminars, etc. For commercial software, this phase is also referred to as industrialization of a code. It is typically driven by highly specialized projects that qualify the code for a particular class of simulations, e.g. air conditioning or external aerodynamics of cars.

If the code is successful and can provide a simulation capability not offered by competitors, the fourth phase, i.e. widespread use and acceptance, will follow naturally. An important shift is then observed: the 'missionary phase' (why do we need this capability?) suddenly transitions into a 'business as usual' phase (how could we ever design anything without this capability?). The code becomes an indispensable tool in industrial research, development, design and analysis. It forms part of the widely accepted body of 'best practices' and is regarded as commercial off-the-shelf (COTS) technology.

One can envision a fifth phase, where the code is embedded into a larger module, e.g. a control device that 'calculates on the fly' based on measurement input. The technology embodied by the code has then become part of the common knowledge and the source is freely available.
The time from conception to widespread use can span more than two decades. During this time, computing power will have increased by a factor of 1:10 000. Moreover, during a decade, algorithmic advances and better coding will improve performance by at least another factor of 1:10. Let us consider the role of parallel computing in light of these advances.
During the demonstration stage, runs may take weeks or months to complete on the largest machine available at the time. This places heavy emphasis on parallelization. Given that optimal performance is key, and massive parallelism seems the only possible way of solving the problem, distributed memory parallelism on O(10^3) processors is perhaps the only possible choice. The figure of O(10^3) processors is derived from experience: even as a high-end user with sometimes highly visible projects, the author has never been able to obtain a larger number of processors with consistent availability in the last two decades. Moreover, no improvement is foreseeable in the future. The main reason lies in the usage dynamics of large-scale computers: once online, a large audience requests time on them, thereby limiting the maximum number of processors available on a regular basis for production runs.
Once the code reaches production status, a shift in emphasis becomes apparent. More and more 'options' are demanded, and these have to be implemented in a timely manner. Another five years have passed, and by this time processors have become faster (and memory has increased) by a further factor of 1:10, implying that the same run that used to take O(10^3) processors can now be run on O(10^2) processors. Given this relatively small number of processors, and the time constraints for new options/variants, shared memory parallelism becomes the most attractive option.
The widespread acceptance of a successful code will only accentuate the emphasis on quick implementation of options and user-specific demands. Widespread acceptance also implies that the code will no longer run exclusively on supercomputers, but will migrate to high-end servers and ultimately PCs. The code has now been in production for at least 5 years, implying that computing power has increased again by another factor of 1:10. The same run that used to take O(10^3) processors in the demonstration stage can now be run using O(10^1) processors, and soon will be within reach of O(1) processors. Given that user-specific demands dominate at this stage, and that the developers are now catering to a large user base working mostly on low-end machines, parallelization diminishes in importance, even to the point of completely disappearing as an issue. As parallelization implies extra time devoted to coding, thereby hindering fast code development, it may be removed from consideration at this stage.
One could consider a fifth phase, 20 years into the life of the code. The code has become an indispensable commodity tool in the design and analysis process, and is run thousands of times per day. Each of these runs is part of a stochastic analysis or optimization loop, and is performed on a commodity chip-based, uni-processor machine. Moore's Law has effectively removed parallelism from the code.
Figure 15.30 summarizes the life cycle of typical scientific computing codes.

Figure 15.30 Life cycle of scientific computing codes (concept, demonstration, production, widespread use, COTS and embedded phases plotted against time)
15.7.2 EXAMPLES
Let us consider two examples where the life cycle of codes described above has become apparent.
15.7.2.1 External missile aerodynamics
The first example considers aerodynamic force and moment predictions for missiles. Worldwide, approximately 100 new missiles or variations thereof appear every year. In order to assess their flight characteristics, the complete force and moment data for the expected flight envelope must be obtained. Simulations of this type based on the Euler equations require approximately O(10^6–10^7) elements, special limiters for supersonic flows, semi-empirical estimation of viscous effects and numerous specific options such as transpiration boundary conditions, modelling of control surfaces, etc. The first demonstration/feasibility studies took place in the early 1980s. At that time, it took the fastest production machine of the day (Cray-XMP) a night to compute such flows. The codes used were based on structured grids (Chakravarthy and Szema (1987)), as the available memory was small compared to the number of gridpoints. The increase of memory, together with the development of codes based on unstructured (Mavriplis (1991b), Luo et al. (1994)) or adaptive Cartesian grids (Melton et al. (1993), Aftosmis et al. (2000)), as well as faster, more robust solvers (Luo et al. (1998)), allowed for a high degree of automation. At present, external missile aerodynamics can be accomplished on a PC in less than an hour, and runs are carried out daily by the thousands for envelope scoping and simulator input on PC clusters (Robinson (2002)). Figure 15.31 shows an example.
Figure 15.31 External missile aerodynamics
15.7.2.2 Blast simulations
The second example considers pressure loading predictions for blasts. Simulations of this type based on the Euler equations require approximately O(10^6–10^8) elements, special limiters for transient shocks, and numerous specific options such as links to damage prediction post-processors. The first demonstration/feasibility studies took place in the early 1990s (Baum and Löhner (1991), Baum et al. (1993, 1995, 1996)). At that time, it took the fastest available machine (Cray-C90 with special memory) several days to compute such flows. The increase of processing power via shared memory machines during the past decade has allowed for a considerable increase in problem size, physical realism via coupled CFD/CSD runs (Löhner and Ramamurti (1995), Baum et al. (2003)) and a high degree of automation. At present, blast predictions with O(2x10^6) elements can be carried out on a PC in a matter of hours (Löhner et al. (2004c)), and runs are carried out daily by the hundreds for maximum possible damage assessment on networks of PCs. Figure 15.32 shows the results of such a prediction based on genetic algorithms for a typical city environment (Togashi et al. (2005)). Each dot represents an end-to-end run (grid generation of approximately 1.5 million tetrahedra, blast simulation with an advanced CFD solver, damage evaluation), which takes approximately 4 hours on a high-end PC. The scale denotes the estimated damage produced by the blast at the given point. This particular run was done on a network of PCs and is typical of the migration of high-end applications to PCs due to Moore's Law.
Figure 15.32 Maximum possible damage assessment for inner city
15.7.3 THE CONSEQUENCES OF MOORE’S LAW
The statement that parallel computing diminishes in importance as codes mature is predicated
on two assumptions:
- the doubling of computing power every 18 months will continue;
- the total number of operations required to solve the class of problems the code was designed for has an asymptotic (finite) value.
The second assumption may seem the most difficult to accept. After all, a natural side effect of increased computing power has been the increase in problem size (grid points, material models, time of integration, etc.). However, for any class of problem there is an intrinsic limit for the problem size, given by the physical approximation employed. Beyond a certain point, the physical approximation does not yield any more information. Therefore, we may have to accept that parallel computing diminishes in importance as a code matures.
This last conclusion does not in any way diminish the overall significance of parallel computing. Parallel computing is an enabling technology of vital importance for the development of new high-end applications. Without it, innovation would seriously suffer.
On the other hand, without Moore's Law many new code developments would appear unjustified. If computing time did not decrease in the future, the range of applications would soon be exhausted. CFD developers worldwide have always subconsciously assumed Moore's Law when developing improved CFD algorithms and techniques.
16 SPACE-MARCHING AND DEACTIVATION
For several important classes of problems, the propagation behaviour inherent in the PDEs being solved can be exploited, leading to considerable savings in CPU requirements. Examples where this propagation behaviour can lead to faster algorithms include:
- detonation: no change to the flowfield occurs ahead of the detonation wave;
- supersonic flows: a change of the flowfield can only be influenced by upstream events, but never by downstream disturbances; and
- scalar transport: a change of the transported variable can only occur in the downstream region, and only if a gradient in the transported variable or a source is present.
The present chapter shows how to combine physics and data structures to arrive at faster solutions. Heavy emphasis is placed on space-marching, where these techniques have reached considerable maturity. However, the concepts covered are generally applicable.
16.1 Space-marching
One of the most efficient ways of computing supersonic flowfields is via so-called space-marching techniques. These techniques make use of the fact that in a supersonic flowfield no information can travel upstream. Starting from the upstream boundary, the solution is obtained by marching in the downstream direction, obtaining the solution for the next downstream plane (for structured (Kutler (1973), Schiff and Steger (1979), Chakravarthy and Szema (1987), Matus and Bender (1990), Lawrence et al. (1991)) or semi-structured (McGrory et al. (1991), Soltani et al. (1993)) grids), subregion (Soltani et al. (1993), Nakahashi and Saitoh (1996), Morino and Nakahashi (1999)) or block. In the following, we will denote as a subregion a narrow band of elements, and by a block a larger region of elements (e.g. one-fifth of the mesh). The updating procedure is repeated until the whole field has been covered, yielding the desired solution.

In order to estimate the possible savings in CPU requirements, let us consider a steady-state run. Using local timesteps, it will take an explicit scheme approximately O(n_s) steps to converge, where n_s is the number of points in the streamwise direction. The total number of operations will therefore be O(n_t · n_s^2), where n_t is the average number of points in the transverse planes. Using space-marching, we have, ideally, O(1) steps per active domain, implying a total work of O(n_t · n_s). The gain in performance could therefore approach O(1:n_s) for large n_s. Such gains are seldom realized in practice, but it is not uncommon to see gains in excess of 1:10.
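In symbols, with $C$ denoting the average work per point and per step (a sketch of the estimate just made, not a formal bound):
\[
W_{\text{time}} \sim C\,n_t n_s \cdot O(n_s) = O(n_t n_s^2),
\qquad
W_{\text{march}} \sim C\,n_t n_s \cdot O(1) = O(n_t n_s),
\qquad
\frac{W_{\text{time}}}{W_{\text{march}}} = O(n_s).
\]
For a mesh with $n_s \approx 100$ points in the streamwise direction the ideal gain would thus be two orders of magnitude, whereas the text quotes gains in excess of 1:10 as typical in practice.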
Of the many possible variants, the space-marching procedure proposed by Nakahashi and Saitoh (1996) appears to be the most general, and is treated here in detail. The method can be used with any explicit time-marching procedure, it allows for embedded subsonic regions and it is well suited for unstructured grids, enabling a maximum of geometrical flexibility. The method works with a subregion concept (see Figure 16.1). The flowfield is only updated in the so-called active domain. Once the residual has fallen below a preset tolerance, the active domain is shifted. Should subsonic pockets appear in the flowfield, the active domain is changed appropriately.
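The control logic just described can be sketched as a small driver. The two contained routines are dummies standing in for the remasking and explicit-update routines of an actual solver; all names and tolerances are assumptions of this sketch, not the book's code.

  ! Illustrative control loop for subregion space-marching: iterate the
  ! active domain until the monitored residual falls below a tolerance,
  ! then shift the active domain downstream.
  program march_control
    implicit none
    real(8), parameter :: toler = 1.0d-3   ! residual tolerance per subregion
    integer, parameter :: maxit = 500      ! safeguard on iterations per subregion
    integer :: idoma, iter
    real(8) :: resid
    logical :: done

    idoma = 0
    done  = .false.
    do while (.not. done)
      idoma = idoma + 1
      call update_active_domain(idoma, done)   ! remask points/edges (cf. Figure 16.1)
      if (done) exit
      do iter = 1, maxit
        call timestep_active(iter, resid)      ! explicit update of active points only
        if (resid < toler) exit                ! subregion converged: shift domain
      end do
      print '(a,i0,a,i0,a)', 'subregion ', idoma, ' converged after ', iter, ' steps'
    end do

  contains

    subroutine update_active_domain(idoma, done)
      integer, intent(in)  :: idoma
      logical, intent(out) :: done
      done = (idoma > 10)                      ! dummy: ten subregions in this toy
    end subroutine update_active_domain

    subroutine timestep_active(iter, resid)
      integer, intent(in)  :: iter
      real(8), intent(out) :: resid
      resid = 0.5d0**iter                      ! dummy residual decay
    end subroutine timestep_active

  end program march_control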
Figure 16.1 Masking of points (maskp = 0, ..., 6): uncomputed field, active domain with residual-monitor region, computed field, and flow direction
In the following, we consider computational aspects of Nakahashi and Saitoh's space-marching scheme and a blocking scheme, in order to make them as robust and efficient as possible without a major change in existing codes. The techniques are considered in the following order: masking of edges and points, renumbering of points and edges, grouping to avoid memory contention, extrapolation of the solution for new active points, treatment of subsonic pockets, proper measures for convergence, the use of space-marching within implicit, time-accurate solvers for supersonic flows and macro-blocking.
16.1.1 MASKING OF POINTS AND EDGES
As seen in the previous chapters, any timestepping scheme requires the evaluation of fluxes, residuals, etc. These operations typically fall into two categories:
(a) point loops, which are of the form
do ipoin=1,npoin
do work on the point level
enddo
(b) edge loops, which are of the form
do iedge=1,nedge
gather point information
do work on the edge level
scatter-add edge results to points
enddo
The first loop is typical of unknown updates in multistage Runge–Kutta schemes, initialization of residuals or other point sums, pressure, speed of sound evaluations, etc. The second loop is typical of flux summations, artificial viscosity contributions, gradient calculations and the evaluation of the allowable timestep. For cell-based schemes, point loops are replaced by cell loops and edge loops are replaced by face loops. However, the nature of these loops remains the same. The bulk of the computational effort of any scheme is usually carried out in loops of the second type.
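As an illustration of the two loop types, a sketch of a point loop and of a gather/scatter edge loop accumulating a toy edge flux into a point residual is given below; all array names are assumptions of this sketch.

  ! Illustrative point loop (a) and edge loop (b) of the two types
  ! described above (sketch; array names assumed).
  subroutine point_and_edge_loops(npoin, nedge, edges, geoed, unkno, rhspo)
    implicit none
    integer, intent(in)  :: npoin, nedge
    integer, intent(in)  :: edges(2, nedge)     ! endpoints of each edge
    real(8), intent(in)  :: geoed(nedge)        ! edge weight (e.g. dual-face area)
    real(8), intent(in)  :: unkno(npoin)        ! unknown at the points
    real(8), intent(out) :: rhspo(npoin)        ! accumulated residual
    integer :: ipoin, iedge, ip1, ip2
    real(8) :: flux

    do ipoin = 1, npoin                         ! (a) point loop: initialization
      rhspo(ipoin) = 0.0d0
    end do

    do iedge = 1, nedge                         ! (b) edge loop
      ip1 = edges(1, iedge)                     ! gather point information
      ip2 = edges(2, iedge)
      flux = geoed(iedge) * 0.5d0 * (unkno(ip1) + unkno(ip2))   ! edge-level work (toy flux)
      rhspo(ip1) = rhspo(ip1) + flux            ! scatter-add edge results to points
      rhspo(ip2) = rhspo(ip2) - flux
    end do
  end subroutine point_and_edge_loops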
In order to decide where to update the solution, points and edges need to be classified or 'masked'. Many options are possible here, and we follow the notation proposed by Nakahashi and Saitoh (1996) (see Figure 16.1):

maskp=0: point in downstream, uncomputed field;
maskp=1: point in downstream, uncomputed field, connected to active domain;
maskp=2: point in active domain;
maskp=3: point of maskp=2, with connection to points of maskp=4;
maskp=4: point in the residual-monitor subregion of the active domain;
maskp=5: point in the upstream computed field, with connection to active domain;
maskp=6: point in the upstream computed field.

The edges for which work has to be carried out then comprise all those for which at least one of the endpoints satisfies 0<maskp<6. These active edges are marked as maske=1, while all others are marked as maske=0.
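The determination of the active edges from the point markings can be coded directly; a sketch, with the edge array name assumed:

  ! Mark as active every edge for which at least one endpoint has
  ! 0 < maskp < 6 (sketch; edges/maskp/maske names assumed).
  subroutine mask_edges(nedge, edges, npoin, maskp, maske)
    implicit none
    integer, intent(in)  :: nedge, npoin
    integer, intent(in)  :: edges(2, nedge)
    integer, intent(in)  :: maskp(npoin)
    integer, intent(out) :: maske(nedge)
    integer :: iedge, m1, m2

    do iedge = 1, nedge
      m1 = maskp(edges(1, iedge))
      m2 = maskp(edges(2, iedge))
      if ((m1 > 0 .and. m1 < 6) .or. (m2 > 0 .and. m2 < 6)) then
        maske(iedge) = 1
      else
        maske(iedge) = 0
      end if
    end do
  end subroutine mask_edges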
The easiest way to convert a time-marching code into a space- or domain-marching code is by rewriting the point and edge loops as follows.

Loop 1a:
do ipoin=1,npoin
if(maskp(ipoin).gt.0.and.maskp(ipoin).lt.6) then
do work on the point level
endif
enddo

Loop 2a:
do iedge=1,nedge
if(maske(iedge).eq.1) then
gather point information
do work on the edge level
scatter-add edge results to points
endif
enddo
For typical aerodynamic configurations, resolution of geometrical detail and flow features will dictate the regions with smaller elements. In order to be as efficient as possible, the region being updated at any given time should be chosen as small as possible. This implies that, in regions of large elements, there may exist edges that connect points marked as maskp=4 to points marked as maskp=0. In order to leave at least one layer of points in the safety region, a pass over the edges is performed, setting the downstream point to maskp=4 for edges with point markings maskp=2,0.
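A literal transcription of this safety-layer rule might read as follows (a sketch; the edge and masking array names are assumed, and the maskp=0 endpoint of such an edge is taken to be the downstream one):

  ! For every edge whose endpoints carry the markings maskp=2 and
  ! maskp=0, raise the maskp=0 (downstream) endpoint to maskp=4
  ! (sketch only; array names assumed).
  subroutine add_safety_layer(nedge, edges, npoin, maskp)
    implicit none
    integer, intent(in)    :: nedge, npoin
    integer, intent(in)    :: edges(2, nedge)
    integer, intent(inout) :: maskp(npoin)
    integer :: iedge, ip1, ip2

    do iedge = 1, nedge
      ip1 = edges(1, iedge)
      ip2 = edges(2, iedge)
      if (maskp(ip1) == 2 .and. maskp(ip2) == 0) maskp(ip2) = 4
      if (maskp(ip2) == 2 .and. maskp(ip1) == 0) maskp(ip1) = 4
    end do
  end subroutine add_safety_layer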
16.1.2 RENUMBERING OF POINTS AND EDGES
For a typical space-marching problem, a large percentage of points in Loop 1a will not satisfy the if-statement, leading to unnecessary work. Renumbering the points according to the marching direction has the twofold advantage of a reduction in cache misses and the possibility of bounding the active point region locally. Defining
npami: the minimum point number in the active region,
npamx: the maximum point number in the active region,
npdmi: the minimum point number touched by active edges,
npdmx: the maximum point number touched by active edges,
Loop 1a may now be rewritten as follows.
Loop 1b:
do ipoin=npami,npamx
if(maskp(ipoin).gt.0.and.maskp(ipoin).lt.6) then
do work on the point level
endif
enddo

Defining, similarly,
neami: the minimum active edge number,
neamx: the maximum active edge number,
Loop 2a may now be rewritten as follows.
Loop 2b:
do iedge=neami,neamx
if(maske(iedge).eq.1) then
gather point information
do work on the edge level
scatter-add edge results to points
endif
enddo
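For completeness, the active point and edge ranges used in Loops 1b and 2b could be obtained with simple min/max passes over the renumbered points and edges; a sketch, with names as defined above:

  ! Sketch of how the active ranges used in Loops 1b and 2b might be
  ! determined once points and edges are renumbered along the marching
  ! direction (edge array name assumed).
  subroutine active_ranges(npoin, nedge, edges, maskp, maske, &
                           npami, npamx, neami, neamx)
    implicit none
    integer, intent(in)  :: npoin, nedge
    integer, intent(in)  :: edges(2, nedge)
    integer, intent(in)  :: maskp(npoin), maske(nedge)
    integer, intent(out) :: npami, npamx, neami, neamx
    integer :: ipoin, iedge

    npami = npoin + 1 ;  npamx = 0
    do ipoin = 1, npoin                       ! active point range
      if (maskp(ipoin) > 0 .and. maskp(ipoin) < 6) then
        npami = min(npami, ipoin)
        npamx = max(npamx, ipoin)
      end if
    end do

    neami = nedge + 1 ;  neamx = 0
    do iedge = 1, nedge                       ! active edge range
      if (maske(iedge) == 1) then
        neami = min(neami, iedge)
        neamx = max(neamx, iedge)
      end if
    end do
  end subroutine active_ranges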
16.1.3 GROUPING TO AVOID MEMORY CONTENTION
In order to achieve pipelining or vectorization, memory contention must be avoided. The enforcement of pipelining or vectorization is carried out using a compiler directive, as Loop 2b, which becomes an inner loop, still offers the possibility of memory contention. In this case, we have the following:
gather point information
do work on the edge level
scatter-add edge results to points
Figure 16.2 Near-optimal point-range access of edge groups
The loop structure is shown schematically in Figure 16.2. One is now in a position to remove the if-statement from the innermost loop, situating it outside. The inactive edge groups are marked, e.g. edpas(ipass)<0. This results in the following loop structure.
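A rough illustration of such a grouped loop is sketched below; it is not the book's listing, and the handling of the group pointers via abs(), the sign convention for edpas and the directive syntax are assumptions of this sketch.

  ! Grouped edge loop: within each edge group no point is accessed
  ! twice, the activity test sits outside the inner loop, and inactive
  ! groups are marked by a negative entry in edpas (sketch; names and
  ! conventions assumed).
  subroutine edge_loop_grouped(npass, edpas, nedge, edges, geoed, npoin, &
                               unkno, rhspo)
    implicit none
    integer, intent(in)    :: npass, nedge, npoin
    integer, intent(in)    :: edpas(npass+1)     ! group pointers; <0 marks an inactive group
    integer, intent(in)    :: edges(2, nedge)
    real(8), intent(in)    :: geoed(nedge), unkno(npoin)
    real(8), intent(inout) :: rhspo(npoin)
    integer :: ipass, iedge, nedg0, nedg1, ip1, ip2
    real(8) :: flux

    do ipass = 1, npass
      if (edpas(ipass) < 0) cycle                ! skip inactive group (test hoisted out)
      nedg0 = abs(edpas(ipass)) + 1              ! abs() recovers the pointer value
      nedg1 = abs(edpas(ipass+1))
  !dir$ ivdep                                    ! no memory contention within a group
      do iedge = nedg0, nedg1
        ip1  = edges(1, iedge)                   ! gather point information
        ip2  = edges(2, iedge)
        flux = geoed(iedge) * 0.5d0 * (unkno(ip1) + unkno(ip2))
        rhspo(ip1) = rhspo(ip1) + flux           ! scatter-add edge results to points
        rhspo(ip2) = rhspo(ip2) - flux
      end do
    end do
  end subroutine edge_loop_grouped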