(c)
Figure 14.9 Continued
14.5.2 SHOCK–OBJECT INTERACTION IN TWO DIMENSIONS
Figures 14.9(a)–(c) show a case taken from Baum and Löhner (1992). They show classic h-refinement for strongly unsteady flows at its best. For this class of problems a new mesh is required every five to seven timesteps, strict conservation of mass, momentum and energy during refinement is critical, and the introduction of dissipation due to information loss during interpolation when remeshing proves disastrous for accuracy. A maximum of six levels of refinement were specified for this case, yielding meshes that on average have 300 000 triangles and 100 000 points. Figures 14.9(a) and (b) show the mesh, mesh refinement levels and pressures for different times.
(a)
(b)
Figure 14.10 Shock–object interaction in three dimensions
Observe the detail in the physics that is achievable through adaptation. Notice furthermore the small extent of the regions that require refinement as compared to the overall domain. The equivalent uniform mesh run would have required more than two orders of magnitude more elements, CPU time and memory, pushing the limits of available supercomputers. The
(b)
(c)
Figure 14.11 Shock–structure interaction: (a) building definition; (b) surface mesh and pressure; (c) mesh and pressure in plane
comparison to experimental results, given in Figure 14.9(c), reveals that indeed very accurate results with a minimum of degrees of freedom are achieved using adaptive grid refinement for this class of problems.
Figure 14.12 Object falling into supersonic free stream
14.5.3 SHOCK–OBJECT INTERACTION IN THREE DIMENSIONS
Figures 14.10(a)–(b) show a case taken from Baum and Löhner (1991) and Löhner and Baum (1992). The object under consideration is a common main battlefield tank. A maximum of two layers of refinement were specified close to the tank, whereas only one level of refinement was employed farther away. The original, unrefined, but strongly graded mesh consisted of approximately 100 000 tetrahedra and 20 000 points. During the run, a mesh change (refinement and coarsening) occurred every five timesteps, and the mesh size increased to approximately 1.6 million tetrahedra and 280 000 points. This represents an increase factor of 1:16. Although seemingly high, the corresponding global h-refinement would have resulted in a 1:64 size increase. A second important factor is that most of the elements of the original mesh are close to the body, where most of the refinement is going to take place. Figures 14.10(a) and (b) show surface gridding and pressure contours at two selected times during the run. The extent of mesh refinement is clearly discernible, as well as the location and interaction of shocks.
14.5.4 SHOCK–STRUCTURE INTERACTION
Figures 14.11(a)–(c) show a typical shock–structure interaction case. The building under consideration is shown in Figure 14.11(a). One layer of refinement was specified wherever the physics required it. The pressures and grids obtained at the surface and at planes at a given time are shown in Figures 14.11(b) and (c). The mesh had approximately 60 million tetrahedra.
14.5.5 OBJECT FALLING INTO SUPERSONIC FREE STREAM TWO DIMENSIONS
The problem statement is as follows: an object is placed in a cavity surrounded by a free stream at M∞ = 1.5. After the steady-state solution is reached (time T = 0.0), a body motion is prescribed, and the resulting flowfield disturbance is computed. Adaptive remeshing was performed every 100 timesteps initially, while at later times the grid was modified every 50 timesteps. One level of global h-refinement was used to accelerate the grid regeneration. The maximum stretching ratio specified was S = 5.0. Figure 14.12 shows different stages during the computation at times T = 60 and T = 160. One can clearly see how the location and strength of the shocks change due to the motion of the object. Notice how the directionality of the flow features is reflected in the mesh.
15 EFFICIENT USE OF COMPUTER HARDWARE
However clever an algorithm may be, it has to run efficiently on the available computer hardware. Each type of computer, from the PC to the fastest massively parallel machine, has its own shortcomings that must be accounted for when developing both the algorithms and the simulation code. The present section assumes that the algorithm has been selected, and identifies the main issues that must be addressed in order to achieve good performance on the most common types of computers. The main types of computer platforms currently being used are as follows.
(a) Personal computers. Although perhaps not considered a serious analysis tool even a decade ago, personal computers can already be used cost-effectively for 3-D simulations. In fact, many applications where CPU time is not a constraining factor are currently being carried out on PCs. Most CFD software companies report higher revenues from PC platforms than from all other platforms combined. High-end PCs (4 Gbytes of RAM, 120 GFLOPS graphics card) are ideal tools for simulations. We see this as one more proof of the theme that has been repeated so often in this book: a CFD run is more than just CPU – if this were so, vector machines would have become the dominant type of computer. Rather, it consists of problem definition, grid generation, flow solver execution and visualization. High-end PCs combine a relatively fast CPU with good visualization hardware, making it possible to cut down the most expensive cost-component of any run: man-hours.
(b) Vector machines. These machines achieve higher speeds by splitting up arithmetic operations (fetch, align, add, multiply, store, etc.), performing each on different data items concurrently. The assumption made is that the same basic operation(s) have to be performed on a relatively large number of data items. These data items can be thought of as vectors, hence the name. As an example, consider the operation D=C*(A+B). While the central CPU fetches the data from memory for the ith item, it may align the data for item i + 1, add two numbers for item i + 2, multiply numbers for item i + 3 and store the results for item i + 4. This would yield a speedup of 1:4. In practice, many more operations than the ones described above are required even to add two numbers. Hence, speedups of about one order of magnitude are achievable (1:14 on the Cray-X or NEC-SX series).
(c) Single instruction multiple data (SIMD) machines. Here the assumption made is that all data items (e.g. elements, points, etc.) will be subject to the same arithmetic operations. In order to go beyond the one order of magnitude speedup of vector machines, thousands of processors are combined. Each processor performs the same task on a different piece of data. While this type of machine did not succeed when based on conventional chips, high-end graphics cards are increasingly being used in this mode (Hagen et al. (2006), LeGresley et al. (2007)).
Applied Computational Fluid Dynamics Techniques: An Introduction Based on Finite Element Methods, Second Edition.
(d) Multiple instruction multiple data (MIMD) machines. In this case different arithmetic operations may be performed on different processors. This circumvents some of the restrictions posed by the SIMD assumption that all processors are performing the same arithmetic operation. On the other hand, the operating system software required to keep these machines functioning is much more involved and sensitive than that required for SIMD machines. The emerging architecture for future machines is a generalization of the MIMD machine, where some processors may be based on commodity, general-purpose chips, others on reduced instruction set chips (RISC-chips), others on powerful vector-processors, and some have SIMD architecture. An example of such a machine is the Cray-T3E, which combines a Cray-T90 vector supercomputer with up to 2056 Alpha-Chip-based processors. An architecture like this, which combines scalar, vector and distributed memory parallel processing, requires the programmer to take into consideration all the individual aspects encountered in each of these architectures.
If the data required by the CPU for subsequent arithmetic operations is not close enough to fit into the cache, this piece of information will have to be fetched from memory or disk. This is called a cache-miss. Depending on the frequency of cache-misses versus CPU, a serious degradation in performance, often in excess of 1:10, can take place. The relative number of cache-misses invariably increases with problem size. The aim of the renumbering strategies considered in the present section is to minimize the frequency of cache-misses, i.e. to retard the degradation of performance with problem size. The main techniques considered are:
- array access in loops;
- renumbering of points to reduce the spread in memory of the items fetched by a single element or edge;
- reordering of the nodes in each element so that data is accessed in as uniform a way as possible within each element; and
- renumbering of elements, faces and edges so that data is accessed in as uniform a way as possible when looping over them.
15.1.1 ARRAY ACCESS IN LOOPS
Storing all the arrays required (elements, coordinates, unknowns, edges, etc.) in a way that is compatible with the way they are accessed within loops reduces cache-misses appreciably. To see why, consider the array containing the coordinates of the points: horizontal or flat storage à la coord(ndimn,npoin) is one option, whereas on some Crays the preferred choice would be vertical storage à la coord(npoin,ndimn).
Suppose that the difference vector (dx, dy, dz) of the two endpoints of an edge is required. This implies fetching six items and performing three arithmetic operations. For flat storage, the jump in memory is given by
whereas for vertical storage the jumps are
The difference in the number of large jumps is clearly visible from this comparison. For this reason, flat storage is recommended for any machine with cache. Note that, for codes written in C, the opposite holds, as the second index moves faster than the first one.
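The comparison of memory jumps can be sketched numerically. The following is a minimal illustration, assuming Fortran-style column-major storage with 1-based indices and assuming that the six coordinates are fetched in the order x, y, z of the first endpoint followed by x, y, z of the second; the function names offset_colmajor and edge_jumps are hypothetical and not taken from the text.

```python
def offset_colmajor(i, j, n_rows):
    """0-based linear offset of the 1-based entry a(i, j) of a column-major
    (Fortran-style) array whose first dimension has n_rows entries."""
    return (i - 1) + (j - 1) * n_rows


def edge_jumps(ip1, ip2, npoin, ndimn=3, flat=True):
    """Jumps in memory between successive coordinate fetches for one edge.

    flat=True  models coord(ndimn, npoin)  (horizontal or flat storage)
    flat=False models coord(npoin, ndimn)  (vertical storage)
    Assumed fetch order: all coordinates of ip1, then all coordinates of ip2.
    """
    offsets = []
    for ip in (ip1, ip2):
        for idim in range(1, ndimn + 1):
            if flat:
                offsets.append(offset_colmajor(idim, ip, ndimn))
            else:
                offsets.append(offset_colmajor(ip, idim, npoin))
    return [b - a for a, b in zip(offsets, offsets[1:])]


flat_jumps = edge_jumps(10, 12, npoin=1000, flat=True)
vertical_jumps = edge_jumps(10, 12, npoin=1000, flat=False)
```

Under these assumptions, flat storage produces a single jump that depends on the distance between the two point numbers, whereas every jump for vertical storage is of the order of npoin.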
15.1.2 POINT RENUMBERING
Consider the evaluation of an edge RHS (the same basic principle applies to element-based
or face-based solvers), given by the following loop
(a) gather point information into the edge;
(b) perform the required mathematical operations at edge level;
(c) scatter-add the edge RHS to the assembled point RHS.
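The three steps can be sketched as follows. This is only an illustration: the edge-level operation used here (a simple difference of the two endpoint values) is a placeholder assumption, not the flux of any particular solver, and the name assemble_edge_rhs is hypothetical.

```python
def assemble_edge_rhs(edges, unknowns):
    """Assemble a point RHS by looping over edges (0-based point numbers)."""
    rhs = [0.0] * len(unknowns)
    for ip1, ip2 in edges:
        # (a) gather point information into the edge
        u1 = unknowns[ip1]
        u2 = unknowns[ip2]
        # (b) perform the mathematical operations at edge level
        #     (placeholder: a plain difference, assumed for illustration)
        edge_rhs = u2 - u1
        # (c) scatter-add the edge RHS to the assembled point RHS
        rhs[ip1] += edge_rhs
        rhs[ip2] -= edge_rhs
    return rhs


rhs = assemble_edge_rhs([(0, 1), (1, 2)], [1.0, 2.0, 4.0])
```

Steps (a) and (c) are the ones that touch memory indirectly through the point numbers ip1 and ip2, which is why the spread of these numbers governs the cache behaviour of the loop.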
The transfer of information to and from memory required in steps (a), (c) is proportional to the number of nodes in the edge (element, face) and the number of unknowns per node. If the nodes within each edge (element, face) are widely spaced in memory, cache-misses are likely to occur. If, on the other hand, all the points within an element are ‘close’ in memory, cache-misses are minimized. From these considerations, it becomes clear that cache-misses are directly linked to the bandwidth of the equivalent matrix system (or graph). Point renumbering to reduce bandwidths has been an important theme for many years in traditional finite element applications (Pissanetzky (1984), Zienkiewicz (1991)). The aim was to reduce the cost of the matrix inversion, which was considered to be the most expensive part of any finite element simulation.
(b) (a)
Figure 15.1 Ordering of points for 2-D mesh
The optimal renumbering of points in such a way that spatial (or near-neighbour) locality is mirrored in memory is a problem of formidable algorithmic complexity. Fortunately, most of the benefits of renumbering points are already obtained from near-optimal heuristic renumbering techniques. To see how most of these fast, near-optimal techniques work, consider the rectangular domain with a structured mesh shown in Figure 15.1. Numbering the points in the horizontal (Figure 15.1(a)) and vertical (Figure 15.1(b)) directions yields an average bandwidth of nx and ny, respectively. One should therefore aim to number the points in the direction normal to the longest graph depth. Based on this observation, several point renumbering techniques have been developed. To exemplify these techniques, the simple mesh shown in Figure 15.2 is considered.
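The effect of the numbering direction on the bandwidth can be checked with a small sketch. It assumes a structured mesh of nx by ny points connected only along the grid lines, and it measures the maximum (rather than the average) difference in point numbers across an edge; the helper names are hypothetical.

```python
def structured_edges(nx, ny, point_id):
    """Edges of an nx-by-ny structured point grid under a given numbering."""
    edges = []
    for j in range(ny):
        for i in range(nx):
            if i + 1 < nx:
                edges.append((point_id(i, j), point_id(i + 1, j)))
            if j + 1 < ny:
                edges.append((point_id(i, j), point_id(i, j + 1)))
    return edges


def bandwidth(edges):
    """Largest difference in point numbers across any edge."""
    return max(abs(a - b) for a, b in edges)


nx, ny = 20, 4  # long in x, short in y


def horizontal(i, j):
    """Number along x first."""
    return i + j * nx


def vertical(i, j):
    """Number along y first."""
    return j + i * ny


bw_horizontal = bandwidth(structured_edges(nx, ny, horizontal))
bw_vertical = bandwidth(structured_edges(nx, ny, vertical))
```

For this elongated grid, numbering along the short direction first gives the smaller bandwidth, in line with the recommendation to number normal to the longest graph depth.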
15.1.2.1 Directional ordering
If the direction of maximal graph depth is known, one can simply order the points in this direction. This is perhaps the simplest (and fastest) possible renumbering, but implies that the problem class being addressed has a clear maximal graph depth direction that can easily be identified. Renumbering in the x-direction, this yields the numbering shown in Figure 15.3.
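Directional ordering amounts to a sort along one coordinate. A minimal sketch, assuming the points are given as coordinate tuples and using a hypothetical function name directional_order:

```python
def directional_order(coords, direction=0):
    """Return new_id, where new_id[old] is the new number of point `old`
    after ordering the points along the chosen coordinate direction."""
    # Python's sort is stable, so points with equal coordinates keep
    # their original relative order.
    old_ids = sorted(range(len(coords)), key=lambda ip: coords[ip][direction])
    new_id = [0] * len(coords)
    for new, old in enumerate(old_ids):
        new_id[old] = new
    return new_id


new_id = directional_order([(2.0, 0.0), (0.0, 1.0), (1.0, 0.0)], direction=0)
```

The cost is that of a single sort, which is why this renumbering is so fast when a dominant direction is known in advance.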
15.1.2.2 Bin ordering
Given an arbitrary distribution of points, one may first place the points in a bin of uniform size h. One can then identify, by ordering the number of bins in the x, y, z directions in
Figure 15.2 Original mesh
Figure 15.3 Renumbering in the x-direction
ascending size i, j, k, the plane k that traverses space yielding the lowest bandwidth, i.e. the closest proximity in memory. Bins offer the advantage of high speed (very few operations are required, and most of these are easy to vectorize/parallelize) and simplicity. After obtaining the overall dimensions of the computational domain, bin ordering may be realized in two ways:
(1) Obtain the bin each point falls into; store the points into bins (e.g. using a linked list); retrieve the points from the bins, renumbering points;
(2) Obtain the bin each point falls into; assign a number to the point based on the bin it falls into (e.g. inumb=ibinx+nbinx*(ibiny-1)+nbinx*nbiny*(ibinz-1)); store the points in a heap list (based on the assigned number); retrieve the points from the heap list, renumbering points.
Bins are mostly used for grids with modest changes in element size. Figure 15.4 shows the bin ordering of points for the mesh from Figure 15.2.
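The second variant can be sketched as follows. This is a minimal illustration, assuming 3-D points given as coordinate tuples and bins numbered from 1 as in the expression for inumb; the function name bin_order is hypothetical, and Python's heapq module stands in for the heap list.

```python
import heapq


def bin_order(coords, h):
    """Return the old point numbers in their new order, using bins of size h."""
    lo = [min(c[d] for c in coords) for d in range(3)]
    hi = [max(c[d] for c in coords) for d in range(3)]
    # Number of bins in each direction (at least one).
    nbinx, nbiny, _ = (int((hi[d] - lo[d]) // h) + 1 for d in range(3))

    heap = []
    for ipoin, c in enumerate(coords):
        # 1-based bin indices of the bin this point falls into.
        ibinx, ibiny, ibinz = (int((c[d] - lo[d]) // h) + 1 for d in range(3))
        inumb = ibinx + nbinx * (ibiny - 1) + nbinx * nbiny * (ibinz - 1)
        heapq.heappush(heap, (inumb, ipoin))

    # Retrieve the points from the heap in ascending bin number.
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order


order = bin_order(
    [(1.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], h=0.6
)
```

Points that share a bin are retrieved in order of their original number, since ties in inumb are broken by the second tuple entry.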
Figure 15.4 Renumbering using bins
subdivided space). This is easily accomplished using quadtrees (two dimensions) or octrees (three dimensions). These data structures have already been described in Chapter 2. Having stored all the points, the quad/octree is traversed as shown in Figure 15.5, renumbering the points. One can see that in this way spatial proximity is mirrored in memory in a near-optimal way.
Figure 15.5 Renumbering using quadtree
15.1.2.4 Space-filling curves
A very similar effect to that of quad/octree ordering can be achieved by using so-called space-filling curves. A typical curve that is often employed is the Peano–Hilbert–Morton curve shown in Figure 15.6 for two dimensions. Any point in space can be thought of as lying on this curve. This implies that, once the coordinate along this line ξ has been established for each point, the points can be renumbered in ascending order of ξ. One can see from Figure 15.6 the similarity with quad/octree renumbering, as well as the effectiveness of the procedure.
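One simple way to obtain a curve coordinate ξ is bit interleaving of integer grid coordinates, which produces the Morton (Z-order) variant of such curves. The sketch below uses this variant for two dimensions purely as an illustration; it is not claimed to reproduce the exact curve of Figure 15.6, and the names morton_key and space_filling_order are hypothetical.

```python
def morton_key(ix, iy, nbits=16):
    """Interleave the bits of two non-negative integers (x in even bits)."""
    key = 0
    for b in range(nbits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key


def space_filling_order(coords, nbits=16):
    """Return the old point numbers sorted in ascending curve coordinate."""
    xs = [c[0] for c in coords]
    ys = [c[1] for c in coords]
    x0, y0 = min(xs), min(ys)
    # Map the bounding square onto the integer range [0, 2**nbits - 1].
    size = max(max(xs) - x0, max(ys) - y0) or 1.0
    scale = (2 ** nbits - 1) / size

    def xi(ip):
        ix = int((xs[ip] - x0) * scale)
        iy = int((ys[ip] - y0) * scale)
        return morton_key(ix, iy, nbits)

    return sorted(range(len(coords)), key=xi)


order = space_filling_order([(1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0)])
```

Only the coordinates of the points are needed, so no tree has to be built or stored; the price is one sort over all points.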
15.1.2.5 Wave renumbering
All of the techniques discussed so far have only required the spatial location of points to achieve near-optimal renumberings. However, if a mesh is given, one can obtain from
Figure 15.6 Renumbering using space-filling curves
the connectivity table the nearest-neighbours for each point and construct renumbering techniques with this information. One of the most useful techniques is the so-called wave renumbering or advancing-front renumbering technique. Starting from a given point, new points are added in layers according to the smallest connectivity. The ‘front’ of renumbered points is advanced through the grid until all points have been covered (see Figure 15.7).
Figure 15.7 Wave front renumbering
The choice of the seed-point can have a significant effect on the total bandwidth obtained. Unfortunately, choosing the optimal starting point may be more expensive than the whole subsequent simulation. A very effective heuristic approach (all of the bandwidth minimization strategies are heuristic by nature) is to choose the last point of the renumbered mesh as the starting point for a new renumbering pass. This procedure is repeated until no further reduction in the bandwidth is achieved. Convergence is obtained in a relatively small number of passes, typically less than five, even for complex 3-D meshes.
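The wave renumbering with repeated passes can be sketched as follows. This is one possible reading of the procedure, assuming a connected mesh given as a table of nearest-neighbours, and assuming that within a layer the neighbours with the fewest connections are taken first; all function names are hypothetical.

```python
def wave_renumber(neighbours, seed):
    """Return the old point numbers in their new order, advancing a front
    of renumbered points layer by layer from `seed`."""
    order = [seed]
    seen = {seed}
    front = [seed]
    while front:
        new_front = []
        for ip in front:
            # Assumption: points with the smallest connectivity go first.
            for jp in sorted(neighbours[ip], key=lambda p: (len(neighbours[p]), p)):
                if jp not in seen:
                    seen.add(jp)
                    new_front.append(jp)
        order.extend(new_front)
        front = new_front
    return order


def bandwidth(neighbours, order):
    """Largest difference in new point numbers between neighbouring points."""
    new_id = {old: new for new, old in enumerate(order)}
    return max(abs(new_id[i] - new_id[j]) for i in neighbours for j in neighbours[i])


def renumber_with_restarts(neighbours, seed, max_passes=10):
    """Restart from the last renumbered point until the bandwidth stops improving."""
    best = wave_renumber(neighbours, seed)
    best_bw = bandwidth(neighbours, best)
    for _ in range(max_passes):
        trial = wave_renumber(neighbours, best[-1])
        trial_bw = bandwidth(neighbours, trial)
        if trial_bw >= best_bw:
            break
        best, best_bw = trial, trial_bw
    return best, best_bw


# A chain of five points, deliberately seeded from the middle.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
order, bw = renumber_with_restarts(chain, seed=2)
```

In this small example the first pass, started from the middle of the chain, gives a bandwidth of 2; restarting from the last renumbered point reduces it to 1, after which a further pass brings no improvement and the iteration stops.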
An improvement on the wave renumbering technique is the Cuthill–McKee (Cuthill and McKee 1969) or reverse Cuthill–McKee (RCM) reordering. At each stage, the node with the smallest number of surrounding unrenumbered nodes is added to the renumbering table. For meshes, which are characterized by having a bounded number of nearest-neighbours for each point, the improvement of RCM versus wave front is not considerable.