
(500 Kbytes is fixed to other processes' data) and passes 100 Kbytes of boundary data to its right neighbor. In the same way, when 25 processes are employed, each one computes 4×10^8 instructions and occupies 900 Kbytes in memory.

5.1.2 Results and Discussions

Table 1 presents the times when testing 10 processes. Firstly, we can observe that MigBSP's intrusiveness on application execution is small when comparing scenarios i and ii (overhead lower than 5%). The processes are balanced among themselves in this configuration, causing α to increase at each call for process rescheduling. This explains the low impact observed when comparing scenarios i and ii. Besides this, MigBSP decides that migrations are not viable at any moment, independently of the amount of executed supersteps. In this case, our model causes a loss of performance in application execution. We obtained negative values of PM whenever the rescheduling was tested, which resulted in an empty list of migration candidates.
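The empty candidate list follows directly from the sign of PM. As a minimal sketch of this selection step, assuming (as the Conclusion later suggests) that PM adds the Computation and Communication metrics, which act in favor of migration, and subtracts the Memory metric, which acts against it, and that only processes with a positive PM become candidates, the logic looks roughly like this; the names and the exact combination rule are illustrative, not MigBSP's actual code:

    # Hypothetical sketch of MigBSP-style candidate selection. PM(i, j) is
    # evaluated per process against each Set (site), not against every
    # individual resource, which keeps the rescheduling test cheap.
    def potential_of_migration(comp, comm, mem):
        """Larger PM means migrating the process to that Set looks better."""
        return comp + comm - mem

    def migration_candidates(processes, target_sets):
        candidates = []
        for p in processes:
            best_set, best_pm = max(
                ((s, potential_of_migration(p["comp"][s], p["comm"][s], p["mem"]))
                 for s in target_sets),
                key=lambda pair: pair[1])
            if best_pm > 0:  # a negative PM makes migration not worthwhile
                candidates.append((p["id"], best_set, best_pm))
        return candidates  # an empty list reproduces the 10-process case above

    if __name__ == "__main__":
        procs = [{"id": "p1", "mem": 5.0,
                  "comp": {"Aquario": 2.0, "Labtec": 1.0},
                  "comm": {"Aquario": 1.5, "Labtec": 0.5}}]
        print(migration_candidates(procs, ["Aquario", "Labtec"]))  # -> []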

Supersteps   Scenario i   Scen. ii (α=4)   Scen. iii (α=4)   Scen. ii (α=8)   Scen. iii (α=8)   Scen. ii (α=16)   Scen. iii (α=16)
2000         1344.09      1347.88          1347.88           1346.67          1346.67           1344.91           1344.91

Table 1. Evaluating 10 processes on three considered scenarios (time in seconds).

The results of the execution of 25 processes are presented in Table 2. In this context, the system remains stable and α grows at each rescheduling call. One migration, {(p21, a1)}, occurred when testing 10 supersteps and using α equal to 4. Our notation informs that process p21 was reassigned to run on node a1. A second and a third migration happened when considering 50 supersteps: {(p22, a2), (p23, a3)}. They happened in the next two calls for process rescheduling (at supersteps 12 and 28). When evaluating 2000 supersteps and maintaining this value of α, eight migrations take place: {(p21, a1), (p22, a2), (p23, a3), (p24, a4), (p25, a5), (p18, a6), (p19, a7), (p20, a8)}. We observed that all migrations occurred to the fastest cluster (Aquario). The first five migrations moved processes from cluster Corisco to Aquario. After that, three processes from Labtec were chosen for migration. Concluding, we obtained a profit of 14% after executing 2000 supersteps when α equal to 4 is used.

Supersteps   Scenario i   Scen. ii (α=4)   Scen. iii (α=4)   Scen. ii (α=8)   Scen. iii (α=8)   Scen. ii (α=16)   Scen. iii (α=16)

Table 2. Evaluating 25 processes on three considered scenarios (time in seconds).

Analyzing scenario iii with α equal to 16, we detected that the first migration is postponed, which results in a larger final time when compared with lower values of α. With α = 4, for instance, we have more calls for process rescheduling with migrations during the first supersteps. This causes a large overhead to be paid during this period. These penalty costs are amortized as the amount of executed supersteps increases. Thus, the configuration with α = 4 outperforms the other studied values of α when 2000 supersteps are evaluated. Figure 10 illustrates the frequency of process rescheduling calls when testing 25 processes and 2000 supersteps. We can observe that 6 calls are done with α = 16, while 8 are performed when the initial α changes to 4. Considering scenarios ii, we conclude that the greater α is, the lower the model's impact if migrations are not applied (the situation in which migration viability is false).

Fig. 10. Number of rescheduling calls when 25 processes and 2000 supersteps are evaluated.

Table 3 shows the results when the number of processes is increased to 50. The processes

are considered balanced and α increases at each rescheduling call. In this manner, we have the same configuration of calls as when testing 25 processes (see Figure 10). We achieved 8 migrations when 2000 supersteps are evaluated: {(p38, a1), (p40, a2), (p42, a3), (p39, a4), (p41, a5), (p37, a6), (p22, a7), (p21, a8)}. MigBSP moves all processes from cluster Frontal to Aquario and transfers two processes from Corisco to the fastest cluster. Using α = 4, 430.95s and 408.25s were obtained for scenarios i and iii, respectively. Besides this 5% gain with α = 4, we also achieve a gain when α is equal to 8. However, the final result when changing the initial α to 16 in scenario iii is worse than scenario i, since the migrations are delayed and more supersteps are needed to achieve a gain in this situation. Table 4 presents the execution of 100 processes over the tested infrastructure. As in the situations with 25 and 50 processes, the environment when 100 processes are evaluated is stable and the processes are balanced among the resources. Thus, α increases at each rescheduling call. The same migrations occurred when testing 50 and 100 processes, since the configuration with 100 just uses more nodes from cluster ICE. In general, the same percentage of gain was achieved with 50 and 100 processes.

The results of scenarios i, ii and iii with 200 processes are shown in Table 5. We have an unstable scenario in this situation, which explains the large overhead in scenario ii. Considering this scenario, α will begin to grow after ω calls for process rescheduling without migrations. Taking into account scenario iii and α equal to 4, 2 migrations are done when executing 10 supersteps: {(p195, a1), (p197, a2)}. Besides these, 10 migrations take place when 50 supersteps were tested: {(p196, a3), (p198, a4), (p199, a5), (p200, a6), (p38, a7), (p39, a8), (p37, a9), (p40, a10), (p41, a11), (p42, a12)}. Despite these migrations, the processes are still unbalanced with the adopted value of D and, therefore, α does not increase at each superstep.

Supersteps   Scenario i   Scen. ii (α=4)   Scen. iii (α=4)   Scen. ii (α=8)   Scen. iii (α=8)   Scen. ii (α=16)   Scen. iii (α=16)

Table 3. Evaluating 50 processes on three considered scenarios (time in seconds).

Supersteps   Scenario i   Scen. ii (α=4)   Scen. iii (α=4)   Scen. ii (α=8)   Scen. iii (α=8)   Scen. ii (α=16)   Scen. iii (α=16)

Table 4. Evaluating 100 processes on three considered scenarios (time in seconds).

Supersteps   Scenario i   Scen. ii (α=4)   Scen. iii (α=4)   Scen. ii (α=8)   Scen. iii (α=8)   Scen. ii (α=16)   Scen. iii (α=16)

Table 5. Evaluating 200 processes on three considered scenarios (time in seconds).

After these migrations, MigBSP does not indicate the viability of further ones. Thus, after ω calls without migrations, MigBSP enlarges the value of D and α begins to increase following adaptation 2 (see Subsection 3.2 for details).
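As a rough sketch of how these two adaptations could be wired together (the exact update rules are in Subsection 3.2, which is outside this excerpt, so the doubling and halving policy below is an assumption for illustration):

    # Illustrative controller for the rescheduling interval alpha, the
    # balance threshold D and the patience parameter omega; the update
    # rules are assumed, not taken from MigBSP's implementation.
    class ReschedulingController:
        def __init__(self, alpha, omega, d):
            self.alpha = alpha  # supersteps between rescheduling calls
            self.omega = omega  # calls without migration before backing off
            self.d = d          # balance threshold D
            self.calls_without_migration = 0

        def after_call(self, system_balanced, migrations_done):
            # Adaptation 1: postpone the next call while the system is
            # stable; make calls more frequent otherwise.
            if system_balanced:
                self.alpha *= 2
            else:
                self.alpha = max(1, self.alpha // 2)
            # Adaptation 2: after omega consecutive calls without any
            # migration, enlarge D so alpha can grow despite imbalance.
            if migrations_done:
                self.calls_without_migration = 0
            else:
                self.calls_without_migration += 1
                if self.calls_without_migration >= self.omega:
                    self.d *= 2
                    self.alpha *= 2

    if __name__ == "__main__":
        c = ReschedulingController(alpha=4, omega=3, d=0.5)
        c.after_call(system_balanced=True, migrations_done=False)
        print(c.alpha, c.calls_without_migration)  # 8 1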

Processes   Scenario i (without process migration)   Scenario iii (with process migration)

Table 6. Barrier times in two situations.

Table 6 presents the barrier times captured when 2000 supersteps were tested. More specifically, the time is captured when the last superstep is executed. We implemented a centralized master-slave approach for the barrier, where process 1 receives and sends a scheduling message from/to the other BSP processes. Thus, the barrier time is captured on process 1. The times shown in the third column of Table 6 do not include the scheduling messages and computation. Our idea is to demonstrate that the remapping of processes decreases the time to compute the BSP supersteps. Therefore, process 1 can reduce the waiting time for barrier computation since the processes reach this moment faster. Analyzing this table, we observed that a gain of 22% in time was achieved when comparing the barrier times of scenarios i and iii with 50 processes. The gain was reduced when 100 processes were tested. This occurs because, compared with the execution of 50 processes, the run with 100 processes just includes more nodes from cluster ICE.
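For illustration, the centralized barrier just described can be sketched with threads standing in for BSP processes; the message layout and the timing point are our assumptions, not the authors' code:

    # Sketch of a centralized master-slave barrier: every process sends a
    # message to process 1 (the main thread here) at the end of its
    # superstep and waits for the release. Timing the collection loop
    # yields a "barrier time" in the spirit of Table 6.
    import queue
    import threading
    import time

    def run_barrier(num_procs):
        to_master = queue.Queue()
        replies = [queue.Queue() for _ in range(num_procs)]

        def worker(rank, workload):
            time.sleep(workload)   # unequal superstep computation
            to_master.put(rank)    # "I reached the barrier"
            replies[rank].get()    # wait for the release from process 1

        threads = [threading.Thread(target=worker, args=(r, 0.01 * r))
                   for r in range(1, num_procs)]
        for t in threads:
            t.start()

        start = time.perf_counter()
        for _ in range(num_procs - 1):   # process 1 collects all arrivals
            to_master.get()
        barrier_time = time.perf_counter() - start
        for r in range(1, num_procs):    # ...and releases everyone
            replies[r].put("go")
        for t in threads:
            t.join()
        return barrier_time              # dominated by the slowest process

    print(run_barrier(8))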

5.2 Smith-Waterman Application

Our second application is based on dynamic programming (DP), which is a popular algorithm design technique for optimization problems (Low et al., 2007). DP algorithms can be classified according to the matrix size and the dependency relationship of each matrix cell. An algorithm for a problem of size n is called a tD/eD algorithm if its matrix size is O(n^t) and each matrix cell depends on O(n^e) other cells. 2D/1D algorithms are all irregular, with changes in load computation density along the matrix's cells. In particular, we studied the Smith-Waterman algorithm, a well-known 2D/1D algorithm for local sequence alignment (Smith, 1988).

5.2.1 Modeling the Problem

The Smith-Waterman algorithm proceeds in a series of wavefronts diagonally across the matrix. Figure 11 (a) illustrates the concept of the algorithm for a 4×4 matrix with a column-based process allocation. The more intense the shading, the greater the load computation density of the cell. Each wavefront corresponds to a BSP superstep. For instance, Figure 11 (b) shows a 4×4 matrix that presents 7 supersteps. The computation load is uniform inside a particular superstep, growing as the superstep number increases. The combination of diagonal-based superstep mapping and column-based process mapping brings the following conclusions: (i) 2n − 1 supersteps are crossed to compute a square matrix of order n; and (ii) each process will be involved in n supersteps. Figures 11 (b) and (c) show the communication actions among the processes. Considering that cell (x, y) (x denotes a matrix line, while y is a matrix column) needs data from cells (x, y − 1) and (x − 1, y), we will have an interaction from process p_y to process p_{y+1}. We do not have communication inside the same matrix column, since it corresponds to the same process.

The configuration of scenarios ii and iii depends on the Computation Pattern Pcomp(i) of each process i (see Subsection 3.3 for more details). Pcomp(i) increases or decreases depending on the prediction of the amount of performed instructions at each superstep. We consider a specific process as regular if the forecast is within a δ margin of fluctuation from the amount of instructions actually performed. In our experiments, we use 10^6 as the amount of instructions for the first superstep and 10^9 for the last one. The increase of computational load density among the supersteps is uniform. In other words, we take the difference between 10^9 and 10^6 and divide it by the number of supersteps involved in a specific execution. Considering this, we applied δ equal to 0.01 (1%) and 0.50 (50%) to scenarios ii and iii, respectively. This last value was used because I2(1) is 565×10^5 and PI2(1) is 287×10^5 when a 10×10 matrix is tested (see details about the notations in Subsection 3.3). The percentage of 50% enforces instruction regularity in the system. Both values of δ will influence the Computation metric, and consequently the choice of candidates for migration. Scenario ii tends to obtain negative values for PM since the Computation metric will be close to 0. Consequently, no migrations will happen in this scenario.


Fig. 11. Different views of the Smith-Waterman irregular application.

We tested the behavior of square matrices of order 10, 25, 50, 100 and 200. Each cell of a 10×10 matrix needs to communicate 500 Kbytes and each process occupies 1.2 Mbytes in memory (700 Kbytes comprise other application data). A cell of the 25×25 matrix communicates 200 Kbytes and each process occupies 900 Kbytes in memory, and so on.
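The numbers above can be checked directly. The sketch below assumes the linear interpolation between 10^6 and 10^9 instructions described earlier and reproduces why only the 50% margin classifies the process as regular:

    # Uniform load growth across supersteps and the delta-regularity test.
    def superstep_loads(num_supersteps, first=1e6, last=1e9):
        step = (last - first) / (num_supersteps - 1)
        return [first + s * step for s in range(num_supersteps)]

    def is_regular(predicted, observed, delta):
        """True if the forecast deviates less than delta (a fraction) from reality."""
        return abs(predicted - observed) <= delta * observed

    if __name__ == "__main__":
        loads = superstep_loads(19)            # 19 supersteps for a 10x10 matrix
        print(loads[0], loads[-1])             # 1e6 ... 1e9
        # With I2(1) = 565e5 observed and PI2(1) = 287e5 predicted, only
        # the 50% margin classifies the process as regular, as stated above.
        print(is_regular(287e5, 565e5, 0.01))  # False -> irregular at 1%
        print(is_regular(287e5, 565e5, 0.50))  # True  -> regular at 50%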

5.2.2 Results and Discussions

Table 7 presents the application evaluation. Nineteen supersteps were crossed when a 10×10 matrix was tested. Adopting this matrix size and α = 2, 13.34s and 14.15s were obtained for scenarios i and ii, which represents a cost of 8%. The higher the value of α, the lower the MigBSP overhead on application execution. This occurs because the system is stable (processes are balanced) and α always increases at each rescheduling call. Three calls for process relocation were done when testing α = 2 (at supersteps 2, 6 and 14). The rescheduling call at superstep 2 does not produce migrations. At this step, the load computational density is not enough to overcome the migration costs involved in the process transferring operation. The same occurred on the next call at superstep 6. The last call happened at superstep 14, which resulted in 6 migrations: {(p5, a1), (p6, a2), (p7, a3), (p8, a4), (p9, a5), (p10, a6)}. MigBSP indicated the migration of the processes that are responsible for computing the final supersteps. The execution with α equal to 4 implies a smaller overhead since only two calls were done (at supersteps 4 and 12). Observing scenario iii, we do not have migrations in the first call, but eight occurred in the second one. Processes 3 up to 10 migrated to cluster Aquario in this last call. α = 4 outperforms α = 2 for two reasons: (i) it does fewer rescheduling calls; and (ii) the call that causes process migration was done at a specific superstep at which MigBSP takes better decisions.

The system stays stable when the 25×25 matrix was tested. α = 2 produces a gain of 11% in performance when considering the 25×25 matrix and scenario iii. This configuration presents four calls for process rescheduling, two of which produce migrations. No migrations are indicated at supersteps 2 and 6. Nevertheless, processes 1 up to 12 are migrated at superstep 14, while processes 21 up to 25 are transferred at superstep 30. These transfer operations occurred to the fastest cluster. In this last call, the remaining execution presents 19 supersteps (from 31 to 49) to amortize the migration costs and to get better performance. The execution when considering α = 8 and scenario iii brings an overhead if compared with scenario i. Two calls for migrations were done, at supersteps 8 and 24.

               10×10   25×25   50×50   100×100   200×200
Scenario i
Scenario ii
Scenario iii

Table 7. Evaluation of scenarios i, ii and iii when varying the matrix size.

The first call causes the migration of just one process (number 1) to a1, and the second produces three migrations: {(p21, a2), (p22, a3), (p23, a4)}. We observed that processes p24 and p25 stayed on cluster Corisco. Despite the performed migrations, these two processes compromise the supersteps that include them. Both execute on a slower cluster and the barrier waits for the slowest process. Maintaining the matrix size and adopting α = 16, we have two calls: at supersteps 16 and 48. This last call migrates p24 and p25 to cluster Aquario. Although this movement is pertinent for performance, just one superstep is executed before the application ends.

Fifty processes were evaluated when the 50×50 matrix was considered. In this context, α also increases at each call for process rescheduling. We observed an overhead of 3% when scenarios i and ii were compared (using α = 2). In addition, we observed that all values of α achieved a performance gain in scenario iii. Especially when α = 2 was used, five calls for process rescheduling were done (at supersteps 2, 6, 14, 30 and 62). No migrations are indicated in the first three calls. The greater the matrix size, the greater the amount of supersteps needed to make migrations viable. This happens because our total load is fixed (independent of the matrix size) but the load partition increases uniformly along the supersteps (see Section 4 for details). Processes 21 up to 29 are migrated to cluster Aquario at superstep 30, while processes 37 up to 42 are migrated to this cluster at superstep 62. Using α equal to 4, 84.65s were obtained for scenario iii, which results in a gain of 9%. This gain is greater than that achieved with α = 2 because now the last rescheduling call is done at superstep 60. The same processes were migrated at this point. However, there are two more supersteps to execute using α equal to 4. Three rescheduling calls were done with α = 8 (at supersteps 8, 24 and 56). Only the last two produce migrations. Three processes are migrated at superstep 24: {(p21, a1), (p22, a2), (p23, a3)}. Processes 37 up to 42 are migrated to cluster Aquario at superstep 56. This last call is efficient since it transfers all processes from cluster Frontal to Aquario.

The execution with a 100×100 matrix shows good results with process migration. Six rescheduling calls were done when using α = 2. Migrations did not occur at the first three calls (at supersteps 2, 6 and 14). Processes 21 up to 29 are migrated to cluster Aquario after superstep 30. In addition, processes 37 to 42 are migrated to cluster Aquario at superstep 62. Finally, superstep 126 indicates 7 migrations, but just 5 occurred: p30 up to p36 to cluster Aquario. These migrations complete one process per node on cluster Aquario. MigBSP selected for migration those processes that belonged to clusters Corisco and Frontal, which are the slowest clusters in our infrastructure testbed. α equal to 16 produced 3 attempts at migration when a 100×100 matrix is evaluated (at supersteps 16, 48 and 112). All of them triggered migrations. In the first


call, the first 11 processes are migrated to cluster Aquario. All processes from cluster Frontal are migrated to Aquario at superstep 48. Finally, 15 processes are selected as candidates for migration after crossing 112 supersteps. They are: p21 to p36. This spectrum of candidates is equal to the set of processes that are running on Frontal. Considering this, only 3 processes were actually migrated: {(p34, a18), (p35, a19), (p36, a20)}.

Fig. 12. Migration behavior when testing a 200×200 matrix with initial α equal to 2.

Table 7 also shows the application performance when the 200×200 matrix was tested. Satisfactory results were obtained with process migration. The system stays stable during the whole application execution. Despite having more than one process mapped to one processor, sometimes just a portion of them is responsible for computation at a specific moment. This occurs because the processes are mapped to matrix columns, while supersteps comprise the anti-diagonals of the matrix. Figure 12 illustrates the migration behavior along the execution with α = 2. Using α = 2 and considering scenario iii, 8 calls for process rescheduling were done. Migrations were not done at supersteps 2, 6 and 14. Processes 21 up to 31 are migrated to cluster Aquario at superstep 30. Moreover, all processes from cluster Frontal are migrated to Aquario at superstep 62. Six processes are candidates for migration at superstep 126: p30 to p36. However, only p31 up to p36 are migrated to cluster Aquario. These migrations happen because the processes initially mapped to cluster Aquario do not yet collaborate with the BSP computation. Migrations are not viable at superstep 254. Finally, 12 processes (p189 to p200) are migrated to cluster Aquario when superstep 388 was crossed. At this time, all processes previously allocated to Aquario are inactive and the migrations are viable. However, just 10 remaining supersteps are executed to amortize the process migration costs.

5.3 LU Decomposition Application

Consider a system of linear equations A·x = b, where A is a given n×n nonsingular matrix, b a given vector of length n, and x the unknown solution vector of length n. One method for solving this system is the LU decomposition technique. It comprises the decomposition of the matrix A into a lower triangular matrix L and an upper triangular matrix U such that A = LU. An n×n matrix L is called unit lower triangular if l_{i,i} = 1 for all i, 0 ≤ i < n, and l_{i,j} = 0 for all i, j with 0 ≤ i < j < n. An n×n matrix U is called upper triangular if u_{i,j} = 0 for all i, j with 0 ≤ j < i < n.

Fig. 13. L and U matrices sharing the memory space of the original matrix A^0.

Algorithm 1 (producing L and U in stages):

    for k := 0 to n−1 do
        for j := k to n−1 do
            u_{k,j} := a^k_{k,j}
        endfor
        for i := k+1 to n−1 do
            l_{i,k} := a^k_{i,k} / a^k_{k,k}
        endfor
        for i := k+1 to n−1 do
            for j := k+1 to n−1 do
                a^{k+1}_{i,j} := a^k_{i,j} − l_{i,k} · u_{k,j}
            endfor
        endfor
    endfor

Algorithm 2 (the same computation in place, using only the elements of A):

    for k := 0 to n−1 do
        for i := k+1 to n−1 do
            a_{i,k} := a_{i,k} / a_{k,k}
        endfor
        for i := k+1 to n−1 do
            for j := k+1 to n−1 do
                a_{i,j} := a_{i,j} − a_{i,k} · a_{k,j}
            endfor
        endfor
    endfor

Fig. 14. Two algorithms to solve the LU decomposition problem.

On input, A contains the original matrix A^0, whereas on output it contains the values of L below the diagonal and the values of U above and on the diagonal, such that LU = A^0. Figure 13 illustrates the organization of the LU computation. The values of L and U computed so far and the computed sub-matrix A^k may be stored in the same memory space as A^0. Figure 14 presents the sequential algorithm for producing L and U in stages. Stage k first computes the elements u_{k,j}, j ≥ k, of row k of U and the elements l_{i,k}, i > k, of column k of L. Then, it computes A^{k+1} in preparation for the next stage. Figure 14 also shows, as the second algorithm, the functioning of the previous one using just the elements of matrix A. Figure 13 (b) presents the data that are necessary to compute a_{i,j}. Besides its own value, a_{i,j} is updated using a value from the same line and another from the same column.
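The in-place variant (the second algorithm of Figure 14) is short enough to run directly. The sketch below follows that pseudocode and, like it, performs no pivoting:

    # In-place LU decomposition: after stage k, column k below the diagonal
    # holds l_{i,k} and row k from the diagonal onward holds u_{k,j}, all
    # stored inside the single array a.
    def lu_in_place(a):
        n = len(a)
        for k in range(n):
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]                # l_{i,k} = a_{i,k} / a_{k,k}
            for i in range(k + 1, n):
                for j in range(k + 1, n):
                    a[i][j] -= a[i][k] * a[k][j]  # a_{i,j} -= l_{i,k} * u_{k,j}
        return a

    if __name__ == "__main__":
        a = [[4.0, 3.0], [6.0, 3.0]]
        lu_in_place(a)   # L = [[1, 0], [1.5, 1]], U = [[4, 3], [0, -1.5]]
        print(a)         # [[4.0, 3.0], [1.5, -1.5]]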

5.3.1 Modeling the Problem

This section explains how we modeled the sequential LU application as a BSP-based parallel one. Firstly, the bulk of the computational work in stage k of the sequential algorithm is the


modification of the matrix elements a_{i,j} with i, j ≥ k+1. Aiming to prevent communication of large amounts of data, the update a_{i,j} := a_{i,j} − a_{i,k}·a_{k,j} must be performed by the process that contains a_{i,j}. This implies that only the elements of column k and row k of A need to be communicated in stage k in order to compute the new sub-matrix A^k. An important observation is that the modification of the elements in row A(i, k+1 : n−1) uses only one value of column k of A, namely a_{i,k}. The notation A(i, k+1 : n−1) denotes the cells of line i varying from column k+1 to n−1. If we distribute each matrix row over a limited set of N processes, then the communication of an element from column k can be restricted to a multicast to N processes. Similarly, the change of the elements in A(k+1 : n−1, j) uses only one value from row k of A, namely a_{k,j}. If we divide each column over a set of M processes, the communication of an element of row k can be restricted to a multicast to M processes.

We use a Cartesian scheme for the distribution of matrices. The square cyclic distribution is used since it is particularly suitable for matrix computations (Bisseling, 2004). Thus, it is natural to organize the processes by two-dimensional identifiers P(s, t) with 0 ≤ s < M and 0 ≤ t < N, where the number of processes p = M·N. Figure 15 depicts a 6×6 matrix mapped to 6 processes, where M = 2 and N = 3. Assuming that M and N are factors of n, each process will store nc (number of cells) cells in memory, as given by Equation 10:

    nc = (n/M) · (n/N) = n² / (M·N)     (10)

Fig. 15. Cartesian distribution of a matrix over 2×3 (M×N) processes.
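As an illustration, assuming the usual square cyclic rule in which cell (i, j) is owned by process P(i mod M, j mod N), the per-process cell count matches Equation 10:

    # Square cyclic (Cartesian) distribution of an n x n matrix over
    # M x N processes; the ownership rule is the standard cyclic one.
    def owner(i, j, M, N):
        return (i % M, j % N)

    def cells_per_process(n, M, N):
        counts = {}
        for i in range(n):
            for j in range(n):
                p = owner(i, j, M, N)
                counts[p] = counts.get(p, 0) + 1
        return counts

    if __name__ == "__main__":
        n, M, N = 6, 2, 3                  # the 6x6 example of Figure 15
        counts = cells_per_process(n, M, N)
        print(counts)                      # every P(s, t) owns 36/6 = 6 cells
        assert all(c == n * n // (M * N) for c in counts.values())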

A parallel algorithm uses data parallelism for the computations and the need-to-know principle to design the communication phase of each superstep. Following the concepts of BSP, all communication performed during a superstep will be completed when it finishes, and the data will be available at the beginning of the next superstep (Bonorden, 2007). Considering this, we modeled our algorithm using three kinds of supersteps, which are explained in Table 8. The element a_{k,k} is passed to the processes that compute a_{i,k} in the first kind of superstep. The computation of a_{i,k} is expressed at the beginning of the second superstep. This superstep is also responsible for sending the elements a_{i,k} and a_{k,j} to a_{i,j}. First of all, we pass the element a_{i,k}, k+1 ≤ i < n, to the N−1 processes that execute on the respective row i. This kind of superstep also comprises the passing of a_{k,j}, k+1 ≤ j < n, to the M−1 processes that execute on the respective column j. The third superstep considers the computation of a_{i,j}, the increase of k (next stage of the algorithm) and the transmission of a_{k,k} to the a_{i,k} elements (k+1 ≤ i < n). The application will execute one superstep of type 1 and will follow with the interleaving of supersteps of types 2 and 3. Thus, an n×n matrix will trigger 2n+1 supersteps in our LU modeling, as the sketch after Table 8 verifies.

Type of superstep   Steps and explanation

First    Step 1.1: k = 0.
         Step 1.2: Pass the element a_{k,k} to the cells that will compute a_{i,k} (k+1 ≤ i < n).

Second   Step 2.1: Computation of a_{i,k} (k+1 ≤ i < n) by the cells that own them.
         Step 2.2: For each i (k+1 ≤ i < n), pass the element a_{i,k} to the other a_{i,j} elements in the same line (k+1 ≤ j < n).
         Step 2.3: For each j (k+1 ≤ j < n), pass the element a_{k,j} to the other a_{i,j} elements in the same column (k+1 ≤ i < n).

Third    Step 3.1: For each i and j (k+1 ≤ i, j < n), calculate a_{i,j} as a_{i,j} − a_{i,k}·a_{k,j}.
         Step 3.2: k = k + 1.
         Step 3.3: Pass the element a_{k,k} to the cells that will compute a_{i,k} (k+1 ≤ i < n).

Table 8. Modeling three types of supersteps for LU computation.

We modeled the Cartesian distribution M×N in the following manner: 5×5, 10×5, 10×10 and 20×10 for 25, 50, 100 and 200 processes, respectively. Moreover, we applied simulation over square matrices of orders 500, 1000, 2000 and 5000. Lastly, the tests were executed using α = 4, ω = 3, D = 0.5 and x = 80%.
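The superstep count can be verified by enumerating the schedule of superstep types derived above (one superstep of type 1, then one of type 2 and one of type 3 per stage k); a small sketch:

    # Enumerate the BSP superstep schedule of the LU modeling: one type-1
    # superstep followed by a (type-2, type-3) pair per stage k = 0..n-1,
    # giving 2n + 1 supersteps for an n x n matrix.
    def superstep_schedule(n):
        schedule = [("type1", 0)]             # Steps 1.1-1.2: broadcast a_{0,0}
        for k in range(n):
            schedule.append(("type2", k))     # compute a_{i,k}, multicast row/col
            schedule.append(("type3", k))     # update a_{i,j}, k += 1, broadcast
        return schedule

    if __name__ == "__main__":
        n = 500
        sched = superstep_schedule(n)
        print(len(sched))                     # 1001 = 2n + 1 supersteps
        assert len(sched) == 2 * n + 1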

5.3.2 Results and Discussions

Table 9 presents the results when evaluating the LU application. The tests with the first matrix size show the worst results. Firstly, the higher the number of processes, the worse the performance, as we can observe in scenario i. The reasons for the observed times are the overheads related to communication and synchronization. Secondly, MigBSP indicated that all migration attempts were not viable, owing to the low computation and communication loads when compared to the migration costs. Considering this, scenarios ii and iii have the same results.

Processes   500×500 matrix (i / ii / iii)   1000×1000 matrix (i / ii / iii)   2000×2000 matrix (i / ii / iii)

Table 9. First results when executing LU linked to MigBSP (time in seconds).

When testing a 1000×1000 matrix with 25 processes, the first rescheduling call does not cause migrations. After this call at superstep 4, the next one at superstep 11 informs the migration of

5 processes from cluster Corisco. They were all transferred to cluster Aquario, which has the highest computation power. MigBSP does not point out migrations in the subsequent calls. α always increases its value at each rescheduling call since the processes are balanced after the mentioned relocations. MigBSP obtained a performance gain of 12% with 25 processes when comparing scenarios i and iii. With the same matrix size and 50 processes, 6 processes from Frontal were migrated to Aquario at superstep 9. Although these migrations are profitable,


they do not provide stability to the system and the processes remain unbalanced among the resources. Migrations are not viable in the next 3 calls, at supersteps 15, 21 and 27. After that, MigBSP launches our second adaptation on the rescheduling frequency in order to alleviate its impact, and α begins to grow until the end of the application. The tests with 50 processes obtained gains of just 5% with process migration. This is explained by the fact that the computational load is decreased in this configuration when compared to the one with 25 processes. In addition, the larger the superstep number, the smaller the computational load required by it. Therefore, the more advanced the execution, the smaller the gain with migrations. The tests with 100 and 200 processes do not present migrations, owing to the fact that the forces that act in favor of migration are weaker than the Memory metric in all rescheduling calls.

The execution with a 2000×2000 matrix presents good results because the computational load is increased. We observed a gain of 15% with process relocation when testing 25 processes. All processes from cluster Corisco were migrated to Aquario in the first rescheduling call (at superstep 4). Thus, the application can profit from this relocation at its beginning, when it demands more computation. The time for concluding the LU application is reduced when passing from 25 to 50 processes, as we can see in scenario i. However, the use of MigBSP resulted in lower gains. Scenario i presented 60.23s while scenario iii achieved 56.18s (9% of profit). When considering 50 processes, 6 processes were transferred from cluster Frontal to Aquario at superstep 4. The next call occurs at superstep 9, where 16 processes from cluster Corisco were elected as migration candidates to Aquario. However, MigBSP indicated the migration of only 14 processes, since there were only 14 unoccupied processors in the target cluster.

Fig. 16. Performance graph with our three scenarios for a 5000×5000 matrix.

We observed that the higher the matrix order, the better the results with process migration. Considering this, the evaluation of a 5000×5000 matrix can be seen in Figure 16. The simple movement of all processes from cluster Corisco to Aquario represented a gain of 19% when executing 25 processes. The tests with 50 processes obtained 852.31s and 723.64s for scenarios i and iii, respectively. The same migration behavior found in the tests with the 2000×2000 matrix was observed in scenario iii. However, the increase of the matrix order represented a gain of 15% (order 5000) instead of 10% (order 2000). This analysis helps us to verify our previous hypothesis about performance gains when enlarging the matrix. Finally, the tests with 200 processes indicated the migration of 6 processes (p195 up to p200) from cluster Corisco to Aquario at superstep 4. Thus, the nodes that belong to Corisco execute just one BSP process, while the nodes from Aquario begin to treat 2 processes. The remaining rescheduling calls inform the processes from Labtec as those with the highest values of PM. However, their migrations are not considered profitable. The final execution with 200 processes achieved 460.85s and 450.33s for scenarios i and iii, respectively.

6 Conclusion

Scheduling schemes for multi-programmed parallel systems can be viewed at two levels (Frachtenberg & Schwiegelshohn, 2008). At the first level, processors are allocated to a job. At the second level, processes from a job are (re)scheduled using this pool of processors. MigBSP can be included in this last scheme, offering algorithms for load (BSP processes) rebalancing among the resources during the application runtime. To the best of our knowledge, MigBSP is the pioneering model in treating BSP process rescheduling with three metrics and adaptations on the remapping frequency. These features are enabled by MigBSP at the middleware level, without changing the application code.

Considering the spectrum of the three tested applications, we can draw the following conclusions in a nutshell: (i) the larger the computing grain, the better the gain with process migration; (ii) MigBSP does not indicate the migration of those processes that have high migration costs when compared to their computation and communication loads; (iii) MigBSP presented a low overhead on application execution when migrations are not applied; (iv) our tests prioritize migrations to cluster Aquario, since it is the fastest one among the considered clusters and the tested applications are CPU-bound; and (v) MigBSP does not work with previous knowledge about the application. Considering this last topic, MigBSP indicates migrations even when the application is close to finishing. In this situation, these migrations bring an overhead, since the remaining time for application conclusion is too short to amortize their costs.

The results showed that MigBSP presented a low overhead on application execution. The calculus of the PM (Potential of Migration) as well as our efficient adaptations were responsible for this feature. PM considers processes and Sets (different sites), not performing all processes-resources tests at the rescheduling moment. Meanwhile, our adaptations were crucial to enable MigBSP as a viable scheduler. Instead of performing the rescheduling call at each fixed interval, they manage a flexible interval between calls based on the behavior of the processes. The concepts of the adaptations are: (i) to postpone the rescheduling call if the system is stable (processes are balanced) or to make it more frequent otherwise; (ii) to delay this call if a pattern without migrations in ω calls is observed.

7 References

Bhandarkar, M. A., Brunner, R. & Kale, L. V. (2000). Run-time support for adaptive load balancing, IPDPS '00: Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing, Springer-Verlag, London, UK, pp. 1152-1159.

Bisseling, R. H. (2004). Parallel Scientific Computation: A Structured Approach Using BSP and MPI, Oxford University Press.

Bonorden, O. (2007). Load balancing in the bulk-synchronous-parallel setting using process migrations, 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), IEEE, pp. 1-9.

Bonorden, O., Gehweiler, J. & auf der Heide, F. M. (2005). Load balancing strategies in a web computing environment, Proceedings of the International Conference on Parallel Processing and Applied Mathematics (PPAM), Poznan, Poland, pp. 839-846.

Casanova, H., Legrand, A. & Quinson, M. (2008). Simgrid: A generic framework for large-scale distributed experiments, Tenth International Conference on Computer Modeling and Simulation (UKSim), IEEE Computer Society, Los Alamitos, CA, USA, pp. 126-131.

Casavant, T. L. & Kuhl, J. G. (1988). A taxonomy of scheduling in general-purpose distributed computing systems, IEEE Trans. Softw. Eng. 14(2): 141-154.
