(500 Kbytes are fixed for other process data) and passes 100 Kbytes of boundary data to its right neighbor. In the same way, when 25 processes are employed, each one computes 4 × 10^8 instructions and occupies 900 Kbytes in memory.
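To make the modeled workload concrete, the following sketch (in Python) estimates the duration of one superstep of this synthetic application from the figures given above; the function and parameter names, as well as the node speed and link bandwidth, are illustrative assumptions and are not part of MigBSP or of the simulator used in the experiments.

def superstep_time(instructions, node_mips, boundary_bytes, bandwidth_bps, barrier_cost=0.001):
    """Estimated time of one superstep: local computation, boundary transfer, barrier."""
    compute = instructions / (node_mips * 1e6)        # seconds of local computation
    communicate = boundary_bytes * 8 / bandwidth_bps  # seconds to ship the boundary data
    return compute + communicate + barrier_cost

if __name__ == "__main__":
    # 25-process configuration: 4 x 10^8 instructions and 100 Kbytes of boundary data.
    # A 1200-MIPS node and a 1 Gbit/s link are illustrative values, not testbed figures.
    t = superstep_time(instructions=4e8, node_mips=1200,
                       boundary_bytes=100 * 1024, bandwidth_bps=1e9)
    print(f"estimated superstep time: {t:.3f} s")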
5.1.2 Results and Discussions
Table 1 presents the times when testing 10 processes. Firstly, we can observe that MigBSP's intrusiveness on the application execution is low when comparing scenarios i and ii (overhead lower than 5%). The processes are balanced among themselves in this configuration, causing α to increase at each call for process rescheduling. This explains the low impact observed when comparing scenarios i and ii. Besides this, MigBSP decides that migrations are not viable at any moment, regardless of the number of executed supersteps. In this case, our model causes a loss of performance in application execution. We obtained negative values of PM when the rescheduling was tested, which resulted in an empty list of migration candidates.
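As a rough illustration of why an empty candidate list appears, the sketch below assumes only what the text suggests, namely that Computation and Communication act in favor of migration while Memory acts against it, and that a process is listed as a candidate only when its PM is positive; it is not MigBSP's actual formula, which is defined earlier in the chapter.

def potential_of_migration(computation, communication, memory_cost):
    # Assumption based on the text: forces in favor of migration minus the Memory cost.
    return computation + communication - memory_cost

def migration_candidates(processes):
    """Keep only processes whose PM is positive; when every PM is negative,
    the candidate list is empty, as in the balanced 10-process case above."""
    return [name for name, (comp, comm, mem) in processes.items()
            if potential_of_migration(comp, comm, mem) > 0]

if __name__ == "__main__":
    # Illustrative values only: both PMs are negative, so no candidates remain.
    procs = {"p1": (0.8, 0.1, 1.5), "p2": (0.7, 0.2, 1.4)}
    print(migration_candidates(procs))   # -> []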
Supersteps | Scenario i | Scen. ii (α=4) | Scen. iii (α=4) | Scen. ii (α=8) | Scen. iii (α=8) | Scen. ii (α=16) | Scen. iii (α=16)
2000 | 1344.09 | 1347.88 | 1347.88 | 1346.67 | 1346.67 | 1344.91 | 1344.91

Table 1. Evaluating 10 processes on three considered scenarios (time in seconds)
The results of the execution of 25 processes are presented in Table 2. In this context, the system remains stable and α grows at each rescheduling call. One migration occurred, {(p21,a1)}, when testing 10 supersteps and using α equal to 4. Our notation indicates that process p21 was reassigned to run on node a1. A second and a third migration happened when considering 50 supersteps: {(p22,a2), (p23,a3)}. They took place in the next two calls for process rescheduling (at supersteps 12 and 28). When evaluating 2000 supersteps and maintaining this value of α, eight migrations take place: {(p21,a1), (p22,a2), (p23,a3), (p24,a4), (p25,a5), (p18,a6), (p19,a7), (p20,a8)}. All migrations were directed to the fastest cluster (Aquario). The first five migrations moved processes from cluster Corisco to Aquario; after that, three processes from Labtec were chosen for migration. In conclusion, we obtained a gain of 14% after executing 2000 supersteps when α equal to 4 is used.
Table 2. Evaluating 25 processes on three considered scenarios (time in seconds)
Analyzing scenario iii with α equal to 16, we detected that the first migration is postponed, which results in a larger final time when compared with lower values of α. With α = 4, for instance, there are more calls for process rescheduling with migrations during the first supersteps. This causes a larger overhead to be paid during this period. These penalty costs are amortized as the number of executed supersteps increases. Thus, the configuration with α = 4 outperforms the other studied values of α when 2000 supersteps are evaluated. Figure 10 illustrates the frequency of process rescheduling calls when testing 25 processes and 2000 supersteps. We can observe that 6 calls are made with α = 16, while 8 are performed when the initial α is 4. Considering scenario ii, we conclude that the greater α is, the lower the model's impact when migrations are not applied (i.e., when migration viability is false).
Fig. 10. Number of rescheduling calls when 25 processes and 2000 supersteps are evaluated

Table 3 shows the results when the number of processes is increased to 50. The processes are considered balanced and α increases at each rescheduling call. In this manner, we have the same configuration of calls as when testing 25 processes (see Figure 10). We achieved 8 migrations when 2000 supersteps are evaluated: {(p38,a1), (p40,a2), (p42,a3), (p39,a4), (p41,a5), (p37,a6), (p22,a7), (p21,a8)}. MigBSP moves all processes from cluster Frontal to Aquario and transfers two processes from Corisco to the fastest cluster. Using α = 4, 430.95s and 408.25s were obtained for scenarios i and iii, respectively. Besides this 5% gain with α = 4, we also achieve a gain when α is equal to 8. However, the final result when changing the initial α to 16 in scenario iii is worse than that of scenario i, since the migrations are delayed and more supersteps are needed to achieve a gain in this situation. Table 4 presents the execution of 100 processes over the tested infrastructure. As in the situations with 25 and 50 processes, the environment with 100 processes is stable and the processes are balanced among the resources. Thus, α increases at each rescheduling call. The same migrations occurred when testing 50 and 100 processes, since the configuration with 100 processes just uses more nodes from cluster ICE. In general, the same percentage of gain was achieved with 50 and 100 processes.

The results of scenarios i, ii and iii with 200 processes are shown in Table 5. We have an unstable scenario in this situation, which explains the large overhead in scenario ii. Considering this scenario, α only begins to grow after ω calls for process rescheduling without migrations. Taking into account scenario iii and α equal to 4, 2 migrations are done when executing 10 supersteps: {(p195,a1), (p197,a2)}. Besides these, 10 migrations take place when 50 supersteps are tested: {(p196,a3), (p198,a4), (p199,a5), (p200,a6), (p38,a7), (p39,a8), (p37,a9), (p40,a10), (p41,a11), (p42,a12)}. Despite these migrations, the processes are still unbalanced with the adopted value of D and, therefore, α does not grow at each rescheduling call.
Table 3. Evaluating 50 processes on three considered scenarios (time in seconds)

Table 4. Evaluating 100 processes on three considered scenarios (time in seconds)

Table 5. Evaluating 200 processes on three considered scenarios (time in seconds)
After these migrations, MigBSP does not indicate the viability of any others. Thus, after ω calls without migrations, MigBSP enlarges the value of D and α begins to increase, following adaptation 2 (see Subsection 3.2 for details).
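A simplified reading of this frequency control is sketched below. The doubling of α is an assumption chosen only because it matches the call pattern reported in our experiments (supersteps 2, 6, 14, 30, 62, ...), and the growth factor applied to D is likewise illustrative; the precise rules are those of Subsection 3.2.

class ReschedulingControl:
    def __init__(self, alpha, omega, d):
        self.alpha = alpha                   # supersteps until the next rescheduling call
        self.omega = omega                   # tolerated calls without migrations
        self.d = d                           # balance threshold used by the model
        self.calls_without_migration = 0

    def after_call(self, processes_balanced, migrations_happened):
        if migrations_happened:
            self.calls_without_migration = 0
        else:
            self.calls_without_migration += 1
            if self.calls_without_migration == self.omega:
                self.d *= 2                  # adaptation 2: enlarge D (factor is illustrative)
                processes_balanced = True    # the relaxed threshold lets alpha grow again
        if processes_balanced:
            self.alpha *= 2                  # adaptation 1: space the calls out when stable
        return self.alpha

if __name__ == "__main__":
    ctrl = ReschedulingControl(alpha=2, omega=3, d=0.5)
    superstep, calls = 0, []
    for _ in range(5):
        superstep += ctrl.alpha
        calls.append(superstep)
        ctrl.after_call(processes_balanced=True, migrations_happened=True)
    print(calls)   # [2, 6, 14, 30, 62] -- the pattern observed in the experiments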
Processes | Scenario i (without process migration) | Scenario iii (with process migration)
Table 6. Barrier times in the two situations
Table 6 presents the barrier times captured when 2000 supersteps were tested. More specifically, the time is captured when the last superstep is executed. We implemented a centralized master-slave approach for the barrier, where process 1 receives a scheduling message from, and sends one to, every other BSP process. Thus, the barrier time is captured on process 1. The times shown in the third column of Table 6 include neither the scheduling messages nor the computation. Our idea is to demonstrate that the remapping of processes decreases the time to compute the BSP supersteps. Therefore, process 1 spends less time waiting at the barrier since the processes reach this point faster. Analyzing the table, we observed a gain of 22% in time when comparing the barrier times of scenarios i and iii with 50 processes. The gain was reduced when 100 processes were tested. This occurs because, compared with the execution of 50 processes, the 100-process configuration just includes more nodes from cluster ICE.
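The sketch below illustrates this centralized master-slave barrier and one way the waiting time could be measured on the master; it uses Python threads to stand in for BSP processes and illustrative sleep times, so it is a conceptual sketch rather than our actual implementation.

import threading, time, queue, random

N = 8
arrivals = queue.Queue()
release = threading.Event()

def worker(rank):
    time.sleep(random.uniform(0.01, 0.2))   # unequal superstep durations
    arrivals.put(rank)                       # tell the master we reached the barrier
    release.wait()                           # wait until the master releases the barrier

def master():
    start = time.time()
    for _ in range(N - 1):
        arrivals.get()                       # collect one arrival per worker
    print(f"barrier wait on the master: {time.time() - start:.3f} s")
    release.set()

threads = [threading.Thread(target=worker, args=(r,)) for r in range(1, N)]
for t in threads: t.start()
master()
for t in threads: t.join()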
5.2 Smith-Waterman Application
Our second application is based on dynamic programming (DP), which is a popular algorithm design technique for optimization problems (Low et al., 2007). DP algorithms can be classified according to the matrix size and the dependency relationship of each matrix cell: an algorithm for a problem of size n is called a tD/eD algorithm if its matrix size is O(n^t) and each matrix cell depends on O(n^e) other cells. 2D/1D algorithms are irregular, with the computation load density changing along the matrix cells. In particular, we considered the Smith-Waterman algorithm, a well-known 2D/1D algorithm for local sequence alignment (Smith, 1988).
5.2.1 Modeling the Problem
The Smith-Waterman algorithm proceeds in a series of wavefronts diagonally across the matrix. Figure 11(a) illustrates the concept of the algorithm for a 4×4 matrix with a column-based process allocation. The more intense the shading, the greater the computation load density of the cell. Each wavefront corresponds to a BSP superstep. For instance, Figure 11(b) shows a 4×4 matrix that presents 7 supersteps. The computation load is uniform inside a particular superstep and grows as the superstep number increases. The combination of diagonal-based superstep mapping and column-based process mapping leads to the following conclusions: (i) 2n − 1 supersteps are needed to compute a square matrix of order n; and (ii) each process is involved in n supersteps. Figures 11(b) and (c) show the communication actions among the processes. Considering that cell (x, y) (x denotes a matrix row, while y denotes a matrix column) needs data from cells (x, y − 1) and (x − 1, y), we have an interaction from process p_y to process p_{y+1}. There is no communication inside the same matrix column, since a column corresponds to a single process.
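The mapping just described can be summarized by the following sketch (Python; the 0-based cell indices and 1-based superstep and process identifiers are our own conventions here):

def superstep_of(x, y):
    """Cell (x, y), 0-based, is computed in wavefront/superstep x + y + 1."""
    return x + y + 1

def owner_of(y):
    """Column-based allocation: column y belongs to process p_{y+1}."""
    return y + 1

def cross_process_messages(n):
    """Messages that cross process boundaries: cell (x, y) feeds cell (x, y+1),
    i.e. process p_{y+1} sends to p_{y+2}; dependencies inside a column stay local."""
    return [((x, y), (x, y + 1)) for x in range(n) for y in range(n - 1)]

if __name__ == "__main__":
    n = 4
    print(max(superstep_of(x, y) for x in range(n) for y in range(n)))  # 2n - 1 = 7 supersteps
    print(len({owner_of(y) for y in range(n)}))                          # n processes involved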
The configuration of scenarios ii and iii depends on the Computation Pattern Pcomp(i) of each process i (see Subsection 3.3 for more details). Pcomp(i) increases or decreases depending on the prediction of the number of instructions performed at each superstep. We consider a specific process as regular if the forecast is within a margin of fluctuation δ around the number of instructions actually performed. In our experiments, we use 10^6 as the number of instructions for the first superstep and 10^9 for the last one. The increase of computation load density among the supersteps is uniform; in other words, we take the difference between 10^9 and 10^6 and divide it by the number of supersteps involved in a specific execution. Considering this, we applied δ equal to 0.01 (1%) and 0.50 (50%) to scenarios ii and iii, respectively. This last value was used because I2(1) is 565 × 10^5 and PI2(1) is 287 × 10^5 when a 10×10 matrix is tested (see details about the notation in Subsection 3.3). The value of 50% enforces instruction regularity in the system. Both values of δ influence the Computation metric and, consequently, the choice of candidates for migration. Scenario ii tends to obtain negative values for PM since the Computation metric will be close to 0. Consequently, no migrations happen in this scenario.
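The load growth and the regularity test can be sketched as follows; the interpolation mirrors the description above (10^6 instructions in the first superstep, 10^9 in the last, uniform increase), while the way MigBSP produces its predictions is described in Subsection 3.3 and is simply taken as an input here.

def load_of_superstep(s, total_supersteps, first=1e6, last=1e9):
    """Instructions executed in superstep s (1-based), growing uniformly."""
    step = (last - first) / (total_supersteps - 1)
    return first + (s - 1) * step

def is_regular(executed, predicted, delta):
    """A process is regular if the prediction stays within a delta margin of what ran."""
    return abs(executed - predicted) <= delta * executed

if __name__ == "__main__":
    # 10x10 matrix -> 2*10 - 1 = 19 supersteps; values below echo the ones in the text.
    print(load_of_superstep(1, 19), load_of_superstep(19, 19))   # 1e6 ... 1e9
    executed, predicted = 565e5, 287e5
    print(is_regular(executed, predicted, delta=0.01))   # False: scenario ii (1%)
    print(is_regular(executed, predicted, delta=0.50))   # True:  scenario iii (50%)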
Fig. 11. Different views of the Smith-Waterman irregular application

We tested the behavior of square matrices of order 10, 25, 50, 100 and 200. Each cell of a 10×10 matrix needs to communicate 500 Kbytes and each process occupies 1.2 Mbytes in memory (700 Kbytes comprise other application data). Each cell of a 25×25 matrix communicates 200 Kbytes and each process occupies 900 Kbytes in memory, and so on.
5.2.2 Results and Discussions
Table 7 presents the application evaluation. Nineteen supersteps were crossed when a 10×10 matrix was tested. Adopting this matrix size and α = 2, 13.34s and 14.15s were obtained for scenarios i and ii, which represents an overhead of 8%. The higher the value of α, the lower the MigBSP overhead on application execution. This occurs because the system is stable (processes are balanced) and α always increases at each rescheduling call. Three calls for process relocation were made when testing α = 2 (at supersteps 2, 6 and 14). The rescheduling call at superstep 2 does not produce migrations: at this step, the computation load density is not enough to outweigh the migration costs involved in transferring the processes. The same occurred in the next call, at superstep 6. The last call happened at superstep 14 and resulted in 6 migrations: {(p5,a1), (p6,a2), (p7,a3), (p8,a4), (p9,a5), (p10,a6)}. MigBSP indicated the migration of the processes that are responsible for computing the final supersteps. The execution with α equal to 4 implies a smaller overhead since only two calls were made (at supersteps 4 and 12). Observing scenario iii, there are no migrations in the first call, but eight occurred in the second one: processes 3 up to 10 migrated to cluster Aquario in this last call. α = 4 outperforms α = 2 for two reasons: (i) it makes fewer rescheduling calls; and (ii) the call that causes process migration happens at a superstep in which MigBSP takes better decisions.

The system stays stable when the 25×25 matrix is tested. α = 2 produces a gain of 11% in performance when considering the 25×25 matrix and scenario iii. This configuration presents four calls for process rescheduling, two of which produce migrations. No migrations are indicated at supersteps 2 and 6. Nevertheless, processes 1 up to 12 are migrated at superstep 14, while processes 21 up to 25 are transferred at superstep 30. These transfers target the fastest cluster. After this last call, the remaining execution presents 19 supersteps (from 31 to 49) to amortize the migration costs and to obtain better performance. The execution with α = 8 and scenario iii brings an overhead compared with scenario i. Two calls for migration were made, at supersteps 8 and 24.
Scenario | 10×10 | 25×25 | 50×50 | 100×100 | 200×200
Table 7. Evaluation of scenarios i, ii and iii when varying the matrix size
The first call causes the migration of just one process (number 1) to a1, and the second one produces three migrations: {(p21,a2), (p22,a3), (p23,a4)}. We observed that processes p24 and p25 stayed on cluster Corisco. Despite the performed migrations, these two processes compromise the supersteps that include them: both execute on a slower cluster and the barrier waits for the slowest process. Maintaining the matrix size and adopting α = 16, we have two calls: at supersteps 16 and 48. This last call migrates p24 and p25 to cluster Aquario. Although this movement is pertinent for performance, just one superstep is executed before the application ends.

Fifty processes were evaluated when the 50×50 matrix was considered. In this context, α also increases at each call for process rescheduling. We observed an overhead of 3% when scenarios i and ii were compared (using α = 2). In addition, we observed that all values of α achieved a performance gain in scenario iii. In particular, when α = 2 was used, five calls for process rescheduling were made (at supersteps 2, 6, 14, 30 and 62). No migrations are indicated in the first three calls. The greater the matrix size, the greater the number of supersteps needed to make migrations viable. This happens because our total load is fixed (independent of the matrix size) but the load partition increases uniformly along the supersteps (see Section 4 for details). Processes 21 up to 29 are migrated to cluster Aquario at superstep 30, while processes 37 up to 42 are migrated to this cluster at superstep 62. Using α equal to 4, 84.65s were obtained for scenario iii, which results in a gain of 9%. This gain is greater than that achieved with α = 2 because now the last rescheduling call is made at superstep 60. The same processes were migrated at this point; however, there are two more supersteps left to execute using α equal to 4. Three rescheduling calls were made with α = 8 (at supersteps 8, 24 and 56). Only the last two produce migrations. Three processes are migrated at superstep 24: {(p21,a1), (p22,a2), (p23,a3)}. Processes 37 up to 42 are migrated to cluster Aquario at superstep 56. This last call is efficient since it transfers all processes from cluster Frontal to Aquario.

The execution with a 100×100 matrix shows good results with process migration. Six rescheduling calls were made when using α = 2. Migrations did not occur at the first three calls (supersteps 2, 6 and 14). Processes 21 up to 29 are migrated to cluster Aquario after superstep 30. In addition, processes 37 to 42 are migrated to cluster Aquario at superstep 62. Finally, superstep 126 indicates 7 migrations (p30 up to p36), but just 5 actually occurred, all to cluster Aquario. These migrations complete one process per node on cluster Aquario. MigBSP selected for migration those processes that belonged to clusters Corisco and Frontal, which are the slowest clusters in our testbed infrastructure. α equal to 16 produced 3 attempts at migration when a 100×100 matrix is evaluated (at supersteps 16, 48 and 112). All of them triggered migrations.
In the first call, the first 11 processes are migrated to cluster Aquario. All processes from cluster Frontal are migrated to Aquario at superstep 48. Finally, 15 processes are selected as candidates for migration after crossing 112 supersteps: p21 to p36. This spectrum of candidates corresponds to the processes that are running on Frontal. Considering this, only 3 processes were actually migrated: {(p34,a18), (p35,a19), (p36,a20)}.

Fig. 12. Migration behavior when testing a 200×200 matrix with initial α equal to 2

Table 7 also shows the application performance when the 200×200 matrix was tested. Satisfactory results were obtained with process migration. The system stays stable during the whole application execution. Despite having more than one process mapped to each processor, sometimes just a portion of them is responsible for computation at a specific moment. This occurs because the processes are mapped to matrix columns, while the supersteps comprise the anti-diagonals of the matrix. Figure 12 illustrates the migration behavior along the execution with α = 2. Using α = 2 and considering scenario iii, 8 calls for process rescheduling were made. Migrations were not made at supersteps 2, 6 and 14. Processes 21 up to 31 are migrated to cluster Aquario at superstep 30. Moreover, all processes from cluster Frontal are migrated to Aquario at superstep 62. Six processes are candidates for migration at superstep 126: p30 to p36; however, only p31 up to p36 are migrated to cluster Aquario. These migrations happen because the processes initially mapped to cluster Aquario do not yet collaborate with the BSP computation. Migrations are not viable at superstep 254. Finally, 12 processes (p189 to p200) are migrated to cluster Aquario when superstep 388 is crossed. At this time, all the processes previously allocated to Aquario are inactive and the migrations are viable. However, just 10 remaining supersteps are executed to amortize the process migration costs.
5.3 LU Decomposition Application
Consider a system of linear equations Ax = b, where A is a given n × n non-singular matrix, b a given vector of length n, and x the unknown solution vector of length n. One method for solving this system is the LU decomposition technique. It comprises the decomposition of the matrix A into a lower triangular matrix L and an upper triangular matrix U such that A = LU. An n × n matrix L is called unit lower triangular if l_{i,i} = 1 for all i, 0 ≤ i < n, and l_{i,j} = 0 for all i, j with 0 ≤ i < j < n. An n × n matrix U is called upper triangular if u_{i,j} = 0 for all i, j with 0 ≤ j < i < n.

Fig. 13. L and U matrices stored in the same memory space as the original matrix A0
for k from 0 to n−1 do
    for j from k to n−1 do
        u_{k,j} = a^k_{k,j}
    endfor
    for i from k+1 to n−1 do
        l_{i,k} = a^k_{i,k} / a^k_{k,k}
    endfor
    for i from k+1 to n−1 do
        for j from k+1 to n−1 do
            a^{k+1}_{i,j} = a^k_{i,j} − l_{i,k} · u_{k,j}
        endfor
    endfor
endfor

Fig. 14. Two algorithms to solve the LU decomposition problem (the right-hand variant performs the same updates in place, using only the elements of matrix A)
On input, A contains the original matrix A0, whereas on output it contains the values of L below the diagonal and the values of U above and on the diagonal, such that LU = A0. Figure 13 illustrates the organization of the LU computation. The values of L and U computed so far and the computed sub-matrix A^k may be stored in the same memory space as A0. Figure 14 presents the sequential algorithm for producing L and U in stages. Stage k first computes the elements u_{k,j}, j ≥ k, of row k of U and the elements l_{i,k}, i > k, of column k of L. Then, it computes A^{k+1} in preparation for the next stage. Figure 14 also presents, on the right side, the functioning of the previous algorithm using just the elements of matrix A. Figure 13(b) presents the data that are necessary to compute a_{i,j}: besides its own value, a_{i,j} is updated using a value from the same row and another from the same column.
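For reference, a direct transcription of the in-place variant (the right-hand side of Figure 14) in Python could read as follows; it assumes, as the algorithm above does, that no pivoting is required.

def lu_in_place(a):
    """In-place LU decomposition without pivoting: after the call, a[i][j] holds
    l_{i,j} below the diagonal and u_{i,j} on and above the diagonal."""
    n = len(a)
    for k in range(n):                        # stage k
        for i in range(k + 1, n):
            a[i][k] = a[i][k] / a[k][k]       # column k of L (multipliers)
        for i in range(k + 1, n):
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]  # update the remaining sub-matrix
    return a

if __name__ == "__main__":
    A = [[4.0, 3.0], [6.0, 3.0]]
    print(lu_in_place([row[:] for row in A]))  # [[4.0, 3.0], [1.5, -1.5]]: L and U share A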
5.3.1 Modeling the Problem
This section explains how we modeled the sequential LU application as a BSP-based parallel one. Firstly, the bulk of the computational work in stage k of the sequential algorithm is the
modification of the matrix elements a_{i,j} with i, j ≥ k+1. Aiming to prevent communication of large amounts of data, the update a_{i,j} = a_{i,j} − a_{i,k}·a_{k,j} must be performed by the process that contains a_{i,j}. This implies that only elements of column k and row k of A need to be communicated in stage k in order to compute the new sub-matrix A^{k+1}. An important observation is that the modification of the elements in row A(i, k+1 : n−1) uses only one value of column k of A, namely a_{i,k}. The notation A(i, k+1 : n−1) denotes the cells of row i varying from column k+1 to n−1. If we distribute each matrix row over a limited set of N processes, then the communication of an element from column k can be restricted to a multicast to N processes. Similarly, the change of the elements in A(k+1 : n−1, j) uses only one value from row k of A, namely a_{k,j}. If we divide each column over a set of M processes, the communication of an element of row k can be restricted to a multicast to M processes.
We use a Cartesian scheme for the distribution of matrices. The square cyclic distribution is used since it is particularly suitable for matrix computations (Bisseling, 2004). Thus, it is natural to organize the processes by two-dimensional identifiers P(s, t) with 0 ≤ s < M and 0 ≤ t < N, where the number of processes is p = M·N. Figure 15 depicts a 6×6 matrix mapped to 6 processes, where M = 2 and N = 3. Assuming that M and N are factors of n, each process will store nc (number of cells) cells in memory (see Equation 10).
nc = (n/M) · (n/N) = n^2/(M·N)                                (10)

Fig. 15. Cartesian distribution of a matrix over 2×3 (M × N) processes
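The square cyclic distribution can be sketched as below; the owner mapping P(i mod M, j mod N) is the standard cyclic scheme of Bisseling (2004), and the helper names are ours.

def owner(i, j, M, N):
    """Cell (i, j) is owned by process P(i mod M, j mod N)."""
    return (i % M, j % N)

def row_multicast_targets(i, M, N):
    """Processes that may need an element of row i (at most N of them)."""
    return [(i % M, t) for t in range(N)]

def column_multicast_targets(j, M, N):
    """Processes that may need an element of column j (at most M of them)."""
    return [(s, j % N) for s in range(M)]

if __name__ == "__main__":
    M, N, n = 2, 3, 6                       # the 6x6 example of Figure 15
    cells_per_proc = (n // M) * (n // N)    # Equation 10: nc = (n/M)(n/N)
    print(cells_per_proc)                   # 6 cells per process
    print(owner(4, 5, M, N))                # cell (4, 5) lives on P(0, 2)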
A parallel algorithm uses data parallelism for the computations and the need-to-know principle to design the communication phase of each superstep. Following the concepts of BSP, all communication performed during a superstep is completed when the superstep finishes, and the data are available at the beginning of the next superstep (Bonorden, 2007). Considering this, we modeled our algorithm using three kinds of supersteps, which are explained in Table 8. In the first kind of superstep, the element a_{k,k} is passed to the processes that compute a_{i,k}. The computation of a_{i,k} is performed at the beginning of the second kind of superstep. This superstep is also responsible for sending the elements a_{i,k} and a_{k,j} to a_{i,j}: first, we pass the element a_{i,k}, k+1 ≤ i < n, to the N − 1 processes that execute on the respective row i; this kind of superstep also comprises the passing of a_{k,j}, k+1 ≤ j < n, to the M − 1 processes that execute on the respective column j. The third kind of superstep comprises the computation of a_{i,j}, the increment of k (next stage of the algorithm) and the transmission of a_{k,k} to the a_{i,k} elements (k+1 ≤ i < n). The application executes one superstep of type 1 and then interleaves supersteps of types 2 and 3. Thus, an n × n matrix triggers 2n+1 supersteps in our LU modeling.
Type of superstep and corresponding steps:

First:
    Step 1.1: k = 0
    Step 1.2: Pass the element a_{k,k} to the cells that will compute a_{i,k} (k+1 ≤ i < n)

Second:
    Step 2.1: Computation of a_{i,k} (k+1 ≤ i < n) by the cells that own them
    Step 2.2: For each i (k+1 ≤ i < n), pass the element a_{i,k} to the other a_{i,j} elements in the same row (k+1 ≤ j < n)
    Step 2.3: For each j (k+1 ≤ j < n), pass the element a_{k,j} to the other a_{i,j} elements in the same column (k+1 ≤ i < n)

Third:
    Step 3.1: For each i and j (k+1 ≤ i, j < n), calculate a_{i,j} as a_{i,j} − a_{i,k} · a_{k,j}
    Step 3.2: k = k + 1
    Step 3.3: Pass the element a_{k,k} to the cells that will compute a_{i,k} (k+1 ≤ i < n)

Table 8. Modeling three types of supersteps for the LU computation
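The resulting superstep structure can be sketched as follows; the bodies of the supersteps are only annotated, since the actual work and the multicasts are those described in Table 8 and in the text above.

def lu_superstep_schedule(n):
    """One superstep of type 1, then an alternation of types 2 and 3 for each stage k,
    giving 2n + 1 supersteps for an n x n matrix."""
    schedule = [("type 1", "k = 0; broadcast a_{k,k} to the owners of column k")]
    for k in range(n):
        schedule.append(("type 2", f"k={k}: compute column {k} of L, multicast row/column {k}"))
        schedule.append(("type 3", f"k={k}: update a_{{i,j}}, k <- k + 1, send new a_{{k,k}}"))
    return schedule

if __name__ == "__main__":
    sched = lu_superstep_schedule(4)
    print(len(sched))            # 2*4 + 1 = 9 supersteps
    for step in sched[:3]:
        print(step)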
We modeled the Cartesian distribution M × N in the following manner: 5×5, 10×5, 10×10 and 20×10 for 25, 50, 100 and 200 processes, respectively. Moreover, we applied the simulation over square matrices of orders 500, 1000, 2000 and 5000. Lastly, the tests were executed using α = 4, ω = 3, D = 0.5 and x = 80%.
5.3.2 Results and Discussions
Table 9 presents the results when evaluating the LU application. The tests with the first matrix size show the worst results. Firstly, the higher the number of processes, the worse the performance, as we can observe in scenario i. The reasons for the observed times are the overheads related to communication and synchronization. Secondly, MigBSP indicated that all migration attempts were not viable, owing to the low computation and communication loads when compared to the migration costs. Considering this, scenarios ii and iii have the same results.

Processes | 500×500 matrix: i, ii, iii | 1000×1000 matrix: i, ii, iii | 2000×2000 matrix: i, ii, iii
Table 9. First results when executing LU linked to MigBSP (time in seconds)

When testing a 1000×1000 matrix with 25 processes, the first rescheduling call does not cause migrations. After this call at superstep 4, the next one at superstep 11 indicates the migration of 5 processes from cluster Corisco. They were all transferred to cluster Aquario, which has the highest computation power. MigBSP does not indicate migrations in the subsequent calls. α always increases its value at each rescheduling call since the processes are balanced after the mentioned relocations. MigBSP obtained a performance gain of 12% with 25 processes when comparing scenarios i and iii. With the same matrix size and 50 processes, 6 processes from Frontal were migrated to Aquario at superstep 9. Although these migrations are profitable,
they do not provide stability to the system, and the processes remain unbalanced among the resources. Migrations are not viable in the next 3 calls, at supersteps 15, 21 and 27. After that, MigBSP launches our second adaptation of the rescheduling frequency in order to alleviate its impact, and α begins to grow until the end of the application. The tests with 50 processes obtained gains of just 5% with process migration. This is explained by the fact that the computational load per process is smaller in this configuration than in the one with 25 processes. In addition, the higher the superstep number, the smaller the computational load it requires. Therefore, the more advanced the execution, the smaller the gain from migrations. The tests with 100 and 200 processes do not present migrations because the forces that act in favor of migration are weaker than the Memory metric in all rescheduling calls.
The execution with a 2000×2000 matrix presents good results because the computational load is increased. We observed a gain of 15% with process relocation when testing 25 processes. All processes from cluster Corisco were migrated to Aquario in the first rescheduling call (at superstep 4). Thus, the application can profit from this relocation at its beginning, when it demands more computation. The time for concluding the LU application is reduced when passing from 25 to 50 processes, as we can see in scenario i. However, the use of MigBSP resulted in lower gains: scenario i presented 60.23s while scenario iii achieved 56.18s (a 9% profit). When considering 50 processes, 6 processes were transferred from cluster Frontal to Aquario at superstep 4. The next call occurs at superstep 9, where 16 processes from cluster Corisco were elected as migration candidates to Aquario. However, MigBSP indicated the migration of only 14 processes, since there were only 14 unoccupied processors in the target cluster.
Fig. 16. Performance graph with our three scenarios for a 5000×5000 matrix

We observed that the higher the matrix order, the better the results with process migration. Considering this, the evaluation of a 5000×5000 matrix can be seen in Figure 16. The simple movement of all processes from cluster Corisco to Aquario represented a gain of 19% when executing 25 processes. The tests with 50 processes obtained 852.31s and 723.64s for scenarios i and iii, respectively. The same migration behavior found in the tests with the 2000×2000 matrix was observed in scenario iii. However, the increase of matrix order turned this into a gain of 15% (order 5000) instead of 10% (order 2000). This analysis helps us to verify our previous hypothesis about performance gains when enlarging the matrix. Finally, the tests with 200 processes indicated the migration of 6 processes (p195 up to p200) from cluster Corisco to Aquario at superstep 4. Thus, the nodes that belong to Corisco execute just one BSP process each, while the nodes from Aquario begin to handle 2 processes. The remaining rescheduling calls indicate the processes from Labtec as those with the highest values of PM; however, their migrations are not considered profitable. The final execution with 200 processes achieved 460.85s and 450.33s for scenarios i and iii, respectively.
6 Conclusion
Scheduling schemes for multi-programmed parallel systems can be viewed at two levels (Frachtenberg & Schwiegelshohn, 2008). At the first level, processors are allocated to a job. At the second level, processes from a job are (re)scheduled using this pool of processors. MigBSP can be included in this last scheme, offering algorithms for rebalancing the load (BSP processes) among the resources during the application runtime. To the best of our knowledge, MigBSP is the first model to treat BSP process rescheduling with three metrics and with adaptations of the remapping frequency. These features are enabled by MigBSP at the middleware level, without changing the application code.
Considering the spectrum of the three tested applications, we can draw the following conclusions in a nutshell: (i) the larger the computing grain, the better the gain with process migration; (ii) MigBSP does not indicate the migration of those processes that have high migration costs when compared to their computation and communication loads; (iii) MigBSP presented a low overhead on application execution when migrations are not applied; (iv) our tests prioritize migrations to cluster Aquario, since it is the fastest one among the considered clusters and the tested applications are CPU-bound; and (v) MigBSP does not work with previous knowledge about the application. Considering this last topic, MigBSP indicates migrations even when the application is close to finishing. In this situation, these migrations bring an overhead, since the remaining time until application conclusion is too short to amortize their costs.
The results showed that MigBSP presented a low overhead on application execution. The computation of PM (Potential of Migration) as well as our efficient adaptations were responsible for this feature. PM considers processes and Sets (different sites), not performing all process-resource tests at the rescheduling moment. Meanwhile, our adaptations were crucial to make MigBSP a viable scheduler. Instead of performing the rescheduling call at fixed intervals, they manage a flexible interval between calls based on the behavior of the processes. The concepts of the adaptations are: (i) to postpone the rescheduling call if the system is stable (processes are balanced), or to make it more frequent otherwise; and (ii) to delay this call if a pattern of ω calls without migrations is observed.
7 References
Bhandarkar, M. A., Brunner, R. & Kale, L. V. (2000). Run-time support for adaptive load balancing, IPDPS '00: Proceedings of the 15 IPDPS 2000 Workshops on Parallel and Distributed Processing, Springer-Verlag, London, UK, pp. 1152–1159.
Bisseling, R. H. (2004). Parallel Scientific Computation: A Structured Approach Using BSP and MPI, Oxford University Press.
Bonorden, O. (2007). Load balancing in the bulk-synchronous-parallel setting using process migrations, 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), IEEE, pp. 1–9.
Bonorden, O., Gehweiler, J. & auf der Heide, F. M. (2005). Load balancing strategies in a web computing environment, Proceedings of the International Conference on Parallel Processing and Applied Mathematics (PPAM), Poznan, Poland, pp. 839–846.
Casanova, H., Legrand, A. & Quinson, M. (2008). SimGrid: A generic framework for large-scale distributed experiments, Tenth International Conference on Computer Modeling and Simulation (UKSIM), IEEE Computer Society, Los Alamitos, CA, USA, pp. 126–131.
Casavant, T. L. & Kuhl, J. G. (1988). A taxonomy of scheduling in general-purpose distributed computing systems, IEEE Trans. Softw. Eng. 14(2): 141–154.