
RESEARCH ARTICLE (Open Access)

Tuning iteration space slicing based tiled multi-core code implementing Nussinov's RNA folding

Abstract

Background: RNA folding is an ongoing compute-intensive task of bioinformatics. Parallelization and improving code locality for this kind of algorithms is one of the most relevant areas in computational biology. Fortunately, RNA secondary structure approaches, such as Nussinov's recurrence, involve mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model. This allows us to apply powerful polyhedral compilation techniques based on the transitive closure of dependence graphs to generate parallel tiled code implementing Nussinov's RNA folding. Such techniques fall within the iteration space slicing framework: the transitive dependences are applied to the statement instances of interest to produce valid tiles. The main problem in generating parallel tiled code is defining a proper tile size and tile dimension, which impact the degree of parallelism and code locality.

Results: To choose the best tile size and tile dimension, we first construct parallel parametric tiled code (the parameters are variables defining tile size). For this purpose, we first generate two nonparametric tiled codes with different fixed tile sizes but with the same code structure, and then derive a general affine model which describes all integer factors appearing in the expressions of those codes. Using this model and the known integer factors present in the mentioned expressions (they define the left-hand side of the model), we find the unknown integers of this model for each integer factor appearing at the same position in the fixed tiled code, and replace in this code the expressions including integer factors with expressions including parameters. Then we use this parallel parametric tiled code to implement the well-known tile size selection (TSS) technique, which allows us to discover, in a given search space, the best tile size and tile dimension maximizing target code performance.

Conclusions: For a given search space, the presented approach allows us to choose the best tile size and tile dimension in parallel tiled code implementing Nussinov's RNA folding. Experimental results, obtained on modern Intel multi-core processors, demonstrate that this code outperforms known closely related implementations when the length of RNA strands is greater than 2500.

Keywords: RNA folding, Parametric loop tiling, Computational biology, Nussinov's algorithm, Parallel computing, Tile size selection

Background

RNA structure prediction, or folding, is an important ongoing problem that lies at the core of several search applications in computational biology. Algorithms to predict the structure of single RNA molecules find a structure of minimum free energy for a given RNA using dynamic programming. Nussinov's folding algorithm [1] uses the number of base pairs as a proxy for free energy, preferring the structure with the most base pairs.

Nussinov's algorithm is compute intensive due to its cubic time complexity. Fortunately, it involves mathematical operations over affine control loops whose iteration space can be represented by the polyhedral model [2]. Thanks to the simple pattern of dependences, loop tiling techniques can be used to accelerate Nussinov's folding.

*Correspondence: mpalkowski@wi.zut.edu.pl
West Pomeranian University of Technology, Faculty of Computer Science, Zolnierska 49, 71-210 Szczecin, Poland

Let S be an N × N Nussinov matrix and σ(i, j) be a function which returns 1 if (x_i, x_j) match and i < j − 1, and 0 otherwise.


Then the following recursion S(i, j) (the maximum number of base-pair matches of x_i, …, x_j) is defined over the region 1 ≤ i < j ≤ N as

S(i, j) = max { S(i + 1, j − 1) + σ(i, j),
                max_{i ≤ k < j} (S(i, k) + S(k + 1, j)) },

and S(i, j) is zero beyond that region.

Listing 1 represents the loop nest implementing Nussinov's algorithm. It consists of triply nested affine loops with two statements accessing the two-dimensional array S.
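Listing 1 itself is not reproduced in this extraction; the following minimal C sketch is our reconstruction, based on the iteration domain given in the Methods section, with the pairing helper sigma() and the sequence argument x as assumptions:

```c
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Assumed pairing helper: 1 if x[i] and x[j] are complementary
   and i < j - 1, else 0. */
static int sigma(const char *x, int i, int j) {
    if (i >= j - 1) return 0;
    char a = x[i], b = x[j];
    return (a == 'A' && b == 'U') || (a == 'U' && b == 'A') ||
           (a == 'G' && b == 'C') || (a == 'C' && b == 'G');
}

/* Triply nested affine loop nest with statements s1 and s2 updating S;
   S is assumed zero-initialized. */
static void nussinov(int N, const char *x, int S[N][N]) {
    for (int i = N - 1; i >= 0; i--)
        for (int j = i + 1; j <= N - 1; j++) {
            for (int k = 0; k < j - i; k++)                       /* s1 */
                S[i][j] = MAX(S[i][k + i] + S[k + i + 1][j], S[i][j]);
            S[i][j] = MAX(S[i][j], S[i + 1][j - 1] + sigma(x, i, j)); /* s2 */
        }
}
```

The loop bounds mirror the iteration domain defined below: 0 ≤ i ≤ N − 1, i + 1 ≤ j ≤ N − 1, with 0 ≤ k ≤ j − i − 1 for s1 and a single instance (k = 0) of s2 per (i, j).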

Loop tiling, or blocking, is a crucial program transformation which offers a number of benefits. It is used to improve code locality, expose parallelism, and allow for adjusting parallel code granularity and balance. All those factors impact parallel code performance [3].

In paper [4], we presented loop tiling based on the transitive closure of a dependence graph for Nussinov's algorithm. It is within the iteration space slicing (ISS) framework [5]. The key step in calculating an iteration space slice is to calculate the transitive closure of the data dependence graph of the program; then transitive dependences are applied to the statement instances of interest to produce valid tiles. The idea of the tiling presented in paper [4] is to transform (correct) original rectangular fixed tiles so that all target tiles are valid under lexicographic order. We demonstrated higher speed-up of the generated tiled code (for a properly chosen size of original tiles) than that of code produced with state-of-the-art source-to-source optimizing compilers. But that paper does not answer what is the best size of original tiles allowing for generation of tiled code demonstrating the maximal speed-up.

In general, the number of combinations of possible tile sizes can be very large. For each tile size, it is necessary to generate tiled code, compile and spawn it, and finally carry out code profiling. This can result in very high expenses, not allowing for discovering the best tile size in practice.

The goal of this paper is to present an approach allowing us to determine the best tile size, maximizing tiled code performance, to be applied in practice. This approach is based on parametric tiling.

Parametric tiling is more general: it allows for defining tile size with parameters instead of constants [3]. With fixed size tiling, a separate program must be generated and compiled each time tile size is changed. In general, this can be very expensive. Thereby, parametric tiling is more flexible as well as time and cost saving when we deal with code locality analysis and tuning code for target architectures. However, most state-of-the-art compilation tools do not provide parametric tiling; they are able to generate tiled code only for fixed tile size. Parametric tiling is generally known to be non-linear, breaking the mathematical closure properties of the polyhedral model.

To the best of our knowledge, well-known tiling techniques and optimizing compilers are based on linear or affine transformations [6–8]; for example, the state-of-the-art PluTo compiler [6] generates tiled code applying derived affine transformations. However, PluTo can only generate tiled code when tile size is fixed.

PrimeTile [9] is the first system to generate parametrically tiled code for affine imperfectly nested loops. It uses a level-by-level approach to generate tiled code, with a prolog, epilog, and a full-tiles loop nest corresponding to each nesting level of the original code, but loop tiling is generated seamlessly in the affine transformation framework.

DynTile [10] utilizes wavefront parallelism in the tiled iteration space corresponding to the convex hull of all the statement domains of the input untiled code. Tiles are scheduled dynamically, i.e., at run time.

PTile [11] is an approach to compile-time generation of code for wavefront parallel tiled execution.

Although DynTile, PTile, and PrimeTile present very effective tiling for stencils using affine loop transformations, they do not allow us to tile dynamic programming kernels efficiently; in particular, they fail to tile the innermost loop in the code implementing Nussinov's algorithm [2]. We show in this paper that tiling of that loop is crucial to achieve high performance. Furthermore, known techniques of mono-parametric tiling [3] (tile sizes are multiples of the same block parameter) do not guarantee notable locality improvements for Nussinov's algorithm. To the best of our knowledge, there does not exist any parametric loop tiling scheme for the loop nest implementing Nussinov's algorithm.

Mullapudi and Bondhugula presented dynamic tiling for Zuker's optimal RNA secondary structure prediction [2] to overcome limitations of affine transformations. 3-D iterative tiling for dynamic scheduling is calculated by means of reduction chains. Operations along each chain find a maximum and can be reordered to eliminate cycles. Their approach involves dynamic scheduling of tiles, rather than the generation of a static schedule.

Wonnacott et al. introduced serial 3-D tiling of "mostly-tileable" loop nests of Nussinov's RNA secondary structure prediction in paper [12]. This approach tiles non-problematic iterations (iterations of loops i and j) with classic tiling strategies, while problematic iterations of loop k are peeled off and executed later. Unfortunately, the paper does not consider any parallel code; tiling is represented with serial code.

In this paper, we present an approach allowing for deriving the best size of original tiles to be used for generation of ISS based tiled code implementing Nussinov's RNA folding.

Methods

Brief introduction

The polyhedral model is a mathematical formalism for analyzing, parallelizing, and transforming an important class of compute- and data-intensive programs, or program fragments, consisting of (sequences of) arbitrarily nested loops. Loop bounds, statement conditions, and array accesses are affine functions in the program.

Within the polyhedral model for analysis and transformation of affine programs, we deal with sets and relations whose constraints need to be affine, i.e., presented with linear expressions and constant terms. Affine constraints may be combined through the conjunction (and), disjunction (or), projection (exists), and negation (not) operators.

An access relation connects iterations of a statement to the array elements accessed by those iterations. Relations are defined in a similar way as sets, except that the single space is replaced by a pair of spaces separated by the arrow sign →. We use the exact dependence analysis proposed by Pugh and Wonnacott [13], where loop dependences are represented with relations.

Standard operations on relations and sets are used, such as intersection (∩), union (∪), difference (−), domain (dom R), range (ran R), and relation application (S′ = R(S): e′ ∈ S′ iff there exists e such that e → e′ ∈ R and e ∈ S). The detailed description of these operations is presented in [13].

The positive transitive closure of a given lexicographically forward dependence relation R, denoted R+, is defined as follows [5]:

R+ = {e → e′ : e → e′ ∈ R ∨ ∃ e″ s.t. e → e″ ∈ R ∧ e″ → e′ ∈ R+}.

It describes which vertices e′ in a dependence graph (represented by relation R) are connected directly or transitively with vertex e.

In sequential loop nests, iteration i executes before iteration j if i is lexicographically less than j, denoted i ≺ j, i.e., i1 < j1 ∨ ∃ k > 1 : ik < jk ∧ it = jt, for t < k.
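For intuition, lexicographic comparison of two iteration vectors can be sketched in C as follows (a hypothetical helper, not part of the paper's toolchain):

```c
#include <stdbool.h>

/* Returns true iff iteration vector a precedes b lexicographically,
   i.e., they agree on a (possibly empty) prefix and a is smaller at
   the first position where they differ. */
static bool lex_less(const int *a, const int *b, int dims) {
    for (int t = 0; t < dims; t++) {
        if (a[t] < b[t]) return true;   /* a executes earlier */
        if (a[t] > b[t]) return false;  /* b executes earlier */
    }
    return false;                       /* equal vectors */
}
```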

Generation of tiles for the Nussinov loop nest

Let us recap tiled code generation for Nussinov's algorithm presented in [4]. To generate valid 3-D tiled code for the Nussinov loop nest, we adopt the approach presented in paper [14], which is based on the transitive closure of dependence graphs.

The iteration domain of the Nussinov loop nest (see Listing 1) is represented with the following set:

Iteration Domain = { (i, j, k) : 0 ≤ i ≤ N − 1 ∧ i + 1 ≤ j ≤ N − 1 ∧ (for s1: 0 ≤ k ≤ j − i − 1; for s2: k = 0) }.

Let vector I = (i, j, k)^T define the indices of the Nussinov loop nest, let diagonal matrix B = [b1, b2, b3] define tile sizes, and let vectors II = (ii, jj, kk)^T and II′ = (iip, jjp, kkp)^T specify tile identifiers. Each tile identifier is represented with a non-negative integer, i.e., the constraint II ≥ 0 has to be satisfied.

First, we form a parametric set, TILE(II, B), including the statement instances belonging to a parametric rectangular tile (the parameters are tile identifiers), as follows:

TILE(II, B) = { (i, j, k) :
  i : N − 1 − b1*ii ≥ i ≥ max(N − b1*(ii + 1), 0) ∧ ii ≥ 0,
  j : b2*jj + i + 1 ≤ j ≤ min(b2*(jj + 1) + i, N − 1) ∧ jj ≥ 0,
  k : for s1: b3*kk ≤ k ≤ min(b3*(kk + 1) − 1, j − i − 1) ∧ kk ≥ 0; for s2: k = 0 }.

TILE_LT (TILE_GT) is the union of all the tiles whose identifiers are lexicographically less (greater) than that of TILE(II, B):

TILE_LT (GT) = { [I] : ∃ II′ s.t. II′ ≺ (≻) II ∧ II ≥ 0 ∧ B*II + LB ≤ UB ∧ II′ ≥ 0 ∧ B*II′ + LB ≤ UB ∧ I ∈ TILE(II′, B) }.

To calculate the exact relation R+, where R is the union of all dependence relations extracted for the Nussinov loop nest, we apply the algorithm presented in paper [15]. Next, we calculate the following set:

TILE_ITR = TILE − R+(TILE_GT),

which does not include any invalid dependence target, i.e., it does not include any dependence target whose source is within set TILE_GT.

The following set

TVLD_LT = (R+(TILE_ITR) ∩ TILE_LT) − R+(TILE_GT)

includes all the iterations that i) belong to the tiles whose identifiers are lexicographically less than that of set TILE_ITR, ii) are the targets of the dependences whose sources are contained in set TILE_ITR, and iii) are not any target of a dependence whose source belongs to set TILE_GT. Target tiles are defined by the following set:

TILE_VLD = TILE_ITR ∪ TVLD_LT.

Next, we form set TILE_VLD_EXT by means of inserting into the first positions of the tuple of set TILE_VLD the elements of vector II: ii1, ii2, …, iid. Nonparametric tiled code is generated by means of applying any code generator allowing for scanning the elements of set TILE_VLD_EXT in lexicographic order, for example, the isl AST generator [16].

In paper [4], we discuss parallelization of ISS based fixed tiled code by means of loop skewing, which honors all dependences among generated tiles.
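As an illustration only (a schematic sketch under our own assumptions, not the code generated in [4]), loop skewing turns a 2-D grid of tile identifiers into wavefronts of independent tiles that can be dispatched with OpenMP:

```c
/* Schematic wavefront execution of a 2-D grid of tiles: assuming tile
   (ii, jj) depends only on tiles (ii', jj') with ii' <= ii and jj' <= jj,
   all tiles on a wavefront w = ii + jj are mutually independent. */
void run_tiles_skewed(int n_ii, int n_jj,
                      void (*run_tile)(int ii, int jj)) {
    for (int w = 0; w <= n_ii + n_jj - 2; w++) {
        #pragma omp parallel for schedule(dynamic)
        for (int ii = 0; ii < n_ii; ii++) {
            int jj = w - ii;
            if (jj >= 0 && jj < n_jj)
                run_tile(ii, jj);   /* all tiles of wavefront w in parallel */
        }
    }
}
```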

Assumption about good original tile size and tile dimension

The most important step in generating target ISS based tiled code is defining an original tile size and dimension to form set TILE according to the approach presented in paper [4]. They impact serial and parallel code locality and performance. It is worth noting that, in general, the target tiles represented with set TILE_VLD are different from the original rectangular ones defined with set TILE. Target tiles can be parametric non-rectangular ones, i.e., the number of statement instances within such tiles depends on parametric upper loop index bounds.

For parametric tiles, there is no guarantee that the data size per tile is smaller than the capacity of the cache; this leads to decreasing code locality. The number of target parametric tiles and the percentage of the iteration space occupied by them depend on an original tile size. So, we strive to choose such an original tile size which minimizes the percentage of the iteration sub-space occupied with target parametric tiles. Let us note that if, for a given loop nest statement, the set R+(TILE_GT) ∩ TILE is empty, then for this statement every target tile is the same as the corresponding original one, i.e., target parametric tiles are absent, so we have a good tiling scheme.

An additional file presents the sets R+(TILE_GT) ∩ TILE for s1 when B = [7, 79, 133] and B = [1, 79, 133], respectively [see Additional file 1]. The set R+(TILE_GT) ∩ TILE for statement s2 is empty.

Scrutinizing the constraints of the set R+(TILE_GT) ∩ TILE for statement s1 when B = [7, 79, 133] allows us to conclude that most target tiles are different from the original ones and are non-rectangular. For many target tiles, the data size per target tile can be greater than the cache capacity of the multi-core platform used by us for carrying out experiments (for details, see the next section). So, the 3-D tiling scheme for ISS based tiled code is not desired.

When we tile only the two inner loops, i.e., B = [1, 79, 133], we can derive the following conclusions. The value of parameter b2 has the most impact on the percentage of statement instances within non-corrected (rectangular) target tiles because it influences two loop indexes: j and k. For example, if the constraint N − ii + b2*jj ≤ j ≤ 78 + N − ii + b2*jj or k > b2*jj is not satisfied, the statement instances defined with vector (i, j, k)^T, where j, k do not satisfy the above constraints, are all within rectangular tiles. Analyzing the constraints above, we may conclude that increasing the value of b2 increases the percentage of instances of statement s1 included in non-corrected target tiles. On the other hand, increasing this value leads to increasing the data per target tile and reducing the parallelism degree. So, there exists a "golden mean" of b2 which maximizes target ISS based tiled code performance.

The value of b3 influences only one loop index, k, in the following constraints of the set R+(TILE_GT) ∩ TILE: k ≥ b3*kk and k ≤ b3 − 1 + b3*kk. Increasing the value of b3 increases the percentage of instances of statement s1 included in non-corrected target tiles. On the other hand, increasing this value leads to increasing the stride between cache lines which are referenced at each loop nest iteration (see Listing 1); this can dramatically reduce data reuse. So, the value of b3 cannot be large.

Summing up, we may expect that good original tiles are formed with the matrix B = [1, b2, b3] with b2 > b3, i.e., when we tile only the two inner loops. This assumption is confirmed by the results of our experimental study presented in the next section.


ISS based parametric tiled code construction

To improve the locality of tiled code, we use a model known as tile size selection (TSS), which can be classified as model-driven empirical search. It is used to characterize and prune the space of good tile sizes. For each tile size in the pruned search space, a version of the program is generated and run on the target architecture, and the tile size with the least execution time is selected [17].

To apply TSS, we first form parametric 3-D tiled code to avoid the generation and compilation of separate code each time the tile size is changed.

For this purpose, applying our source-to-source optimizing compiler TRACO [18], we generate two nonparametric tiled codes for different values of the elements of matrix B = [b1, b2, b3] according to the technique presented in our paper [4]. We choose those values to be prime numbers to avoid the generation of simplified nonparametric tiled code. We strive to generate tiled code whose structure is the same regardless of the values of the elements of matrix B = [b1, b2, b3]. Next, using those codes, we construct parametric tiled code.

An additional file presents the generated tiled codes, where the for loops shown in violet correspond to B1 = [23, 47, 113], while the for loops shown in red correspond to B2 = [37, 79, 167] [see Additional file 2]. Applying the way presented in our paper [4], we prove that those codes are valid.

Analyzing those generated codes, we may conclude that i) their structures are the same, only the integer factors present in the same code positions are different; ii) there exist the following linear expressions defining the init-statement, condition, and iteration expressions of the for loops: b1 + b2, b2, b2 − 1, b2 + 1, b3, b3 − 1; iii) there exists the non-linear expression of the form b1*b2.

Taking into account the above conclusions, we form the following general linear model, which is valid for each integer factor, say y, present in the expressions of the tiled loop nest:

y = a0*b0 + a1*b1 + a2*b2 + a3*b3 + a4,

where ai, i = 0, …, 4, are unknown integer coefficients and b0 = b1*b2. Let us note that we replaced the non-linear expression b1*b2 with the linear one b0.

We use the iscc calculator [19] to find the unknown coefficients ai, i = 0, …, 4, in the above model as follows.

For each pair of values y1, y2 which appear in the same code position of the two generated nonparametric codes, we form the following system of equations:

y1 = a0*b01 + a1*b11 + a2*b21 + a3*b31 + a4,
y2 = a0*b02 + a1*b12 + a2*b22 + a3*b32 + a4,

where bij, j = 1, 2, are the particular values of bi, i = 1, 2, 3, for the first (j = 1) and second (j = 2) nonparametric codes (with b0j = b1j*b2j), and apply the iscc calculator to resolve that system. It is worth noting that for each pair y1, y2, the general model is simplified so that the resulting system includes at most two unknowns; the remaining ones are absent.

For example, in the codes presented in [see Additional file 2], in line 6 we have at the same code position the integers 93 and 153. We build the following set according to the iscc calculator syntax [19]: {[a1, a2] : 23*a1 + 47*a2 = 93 ∧ 37*a1 + 79*a2 = 153}. The constraints of this set are the two linear equations with the two unknowns (a1, a2) obtained from the general model. The iscc calculator returns the solution {[2, 1]}, i.e., a1 = 2, a2 = 1, and a0 = a3 = a4 = 0 (indeed, 23*2 + 47*1 = 93 and 37*2 + 79*1 = 153). Hence, in the parametric code, in line 6 we insert the expression 2*b1 + b2.
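The same coefficient recovery can be reproduced without iscc; the following hypothetical C helper (our names and structure, not the paper's) solves a 2×2 integer system by Cramer's rule and reports whether an integer solution exists:

```c
#include <stdio.h>
#include <stdbool.h>

/* Solve c11*a1 + c12*a2 = y1, c21*a1 + c22*a2 = y2 over the integers
   by Cramer's rule; returns false if the system is singular or the
   solution is not integral. */
static bool solve_2x2_int(long c11, long c12, long y1,
                          long c21, long c22, long y2,
                          long *a1, long *a2) {
    long det = c11 * c22 - c12 * c21;
    if (det == 0) return false;
    long n1 = y1 * c22 - c12 * y2;    /* determinant, column 1 replaced */
    long n2 = c11 * y2 - y1 * c21;    /* determinant, column 2 replaced */
    if (n1 % det != 0 || n2 % det != 0) return false;
    *a1 = n1 / det;
    *a2 = n2 / det;
    return true;
}

int main(void) {
    long a1, a2;
    /* The paper's example: 23*a1 + 47*a2 = 93 and 37*a1 + 79*a2 = 153. */
    if (solve_2x2_int(23, 47, 93, 37, 79, 153, &a1, &a2))
        printf("a1 = %ld, a2 = %ld\n", a1, a2);  /* prints a1 = 2, a2 = 1 */
    return 0;
}
```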

Table 1 presents all the solutions for the integers available in the examined nonparametric codes, i.e., the integer coefficients ai, i = 0, …, 4, of the model y = a0*b0 + a1*b1 + a2*b2 + a3*b3 + a4, where b0 = b1*b2. Using those solutions, we form the parametric code presented in Listing 2. In [see Additional file 2], that code is presented with dark lines.

For the parametric code obtained in this way, inter-tile dependences are described with non-affine expressions, so we cannot prove its validity applying the way presented in paper [4]. However, we seek the best tile size using the previously mentioned TSS technique, which envisages running tiled code for particular fixed values of bi, i = 1, 2, 3. So, before running each fixed tiled code, we are able to check its validity applying the way presented in paper [4], because all inter-tile dependences for such a code are affine.

Results and discussion

To carry out experiments, we used a machine with an Intel Xeon E5-2699 v3 processor (2.3 GHz base and 3.6 GHz turbo frequency, 18 cores/36 threads, 576 KB L1 cache for code and data separately, 4.5 MB L2 cache, and 45 MB L3 cache) and 128 GB RAM. All programs were compiled by means of the Intel C++ Compiler (icc 15.0.2) with the -O3 optimization flag. To implement multi-threaded parallel processing, the OpenMP programming interface [20] was chosen.

We experimented with randomly generated RNA strands1 of length 2200 and 5000, the sizes of the average and the longest human mRNA, respectively. We also examined longer strands (up to 10000) to illustrate the benefits of tiling the innermost loop.

Table 2 Execution time (in seconds) of serial ISS based tiled code for some tile sizes, N = 2200

We considered 20 possible tile sizes along each dimension, from the set {1, 2, 4, 6, 8, 12, 16, 24, 32, 40, 48, 64, 96, 128, 150, 200, 256, 300, 400, 512}. This leads to a search space including 20^3 = 8000 possible tile sizes.

To carry out experiments, we wrote a script which automatically fulfills the following tasks: i) chooses one tile size from the search space (values of bi, i = 1, 2, 3), ii) checks the validity of the tiled code with the chosen tile size according to the way presented in paper [4], iii) spawns the tiled code with the chosen tile size, iv) measures execution time, v) repeats steps i)–iv) for each tile size within the search space and collects all execution times. It is worth noting that the parametric code is compiled only once, which greatly reduces search time.
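A driver for such a search might look as follows; this is a hypothetical sketch in which the binary name ./nussinov_tiled, its argument convention, and the log file are our assumptions, not the authors' script, and the validity check of [4] is left as a comment:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical TSS driver: enumerate the 20^3 tile-size search space and
   run the parametric tiled binary once per tile size. */
int main(void) {
    const int sizes[] = {1, 2, 4, 6, 8, 12, 16, 24, 32, 40, 48, 64,
                         96, 128, 150, 200, 256, 300, 400, 512};
    const int n = sizeof sizes / sizeof sizes[0];
    char cmd[256];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                int b1 = sizes[i], b2 = sizes[j], b3 = sizes[k];
                /* The validity check of [4] would go here; we assume the
                   binary prints its own execution time to the log. */
                snprintf(cmd, sizeof cmd,
                         "./nussinov_tiled %d %d %d >> tss.log", b1, b2, b3);
                if (system(cmd) != 0)
                    fprintf(stderr, "run failed for [%d,%d,%d]\n",
                            b1, b2, b3);
            }
    return 0;
}
```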

Table 2 presents the execution time of serial ISS based tiled code for some tile sizes. The execution time of the original (untiled) loop nest is 12.28 seconds. The results show that tiling the two innermost loops allows for reaching the minimal execution time of 3.276 seconds, which results in the maximal speed-up of 3.7. Under speed-up we mean the ratio of the original program execution time to that of the tiled one. Tiling the outermost loop allows us to reduce execution time to only 6.65 seconds. It is worth noting that only 15 tile sizes in the examined search space lead to greater execution time than that of the original program (see the last lines in Table 2).

Figure 1 depicts the execution times of serial ISS based tiled code for four tile sizes of the outermost loop. As we can see, choosing b1 = 1 leads to the maximal tiled code performance. The explanation of this fact is presented in the previous sub-section. For this code, the best tile size within the examined search space is [1, 128, 16].

We carried out a search for the best tile size in the same search space for multi-core tiled code with a bigger problem size, N = 5000, and present the execution times in Table 3. For 32 threads, we observed a super-linear tiled code speed-up of 112.9 for the tile size [1 × 96 × 8]. The reason for the super-linear speed-up is the cache effect resulting from the different memory hierarchies of the modern parallel computer used for carrying out the experiments.

Fig. 1 Execution time (in seconds) of serial ISS based tiled code, N = 2200, run on Intel Xeon E5-2699 v3. Results show that the maximal performance of serial ISS based tiled code is achieved when the outermost loop remains untiled (b1 = 1)

Table 3 Execution time (in seconds) of parallel ISS based tiled code for some tile sizes, N = 5000, 32 threads used

Increasing the number of processors leads to increasing the size of the caches accumulated from different processors. With the larger accumulated cache size, more or even all of the working data can fit into the caches, and memory access time reduces dramatically, which considerably improves code locality.

The obtained results show how important tiling the innermost loop is. To the best of our knowledge, such tiling is not possible by means of optimizing compilers based on affine transformations. For example, the state-of-the-art PluTo compiler (version 0.11.4) fails to tile the innermost loop of the examined program. The interesting fact is that the best code performance is achieved when the outermost loop remains untiled; tiling only the two innermost loops allows us to achieve better tiled code locality for the platform chosen for carrying out the experiments. It is also worth noting that for the best tile size, the value of b2 has to be roughly tenfold bigger than that of b3. The explanations of those facts are given in the previous section.

The results in Table 4, graphically presented in Fig. 2, demonstrate that our generated tiled code is scalable, i.e., increasing the number of threads increases code speed-up.

We compared the performance of ISS based tiled code with that of the manual parallel and cache-efficient implementations [21, 22] of Nussinov's RNA folding presented in Listing 3.

Chang et al. [21] modified Nussinov's recurrence equations to simplify parallelization for multi-core architectures. RNA folding starts with initializing the elements S(i, i) of the main diagonal of Nussinov's matrix S and the elements S(i, i + 1) of the diagonal just above the main one; then the elements of the remaining diagonals are calculated in the order S(i, i + 2), …, S(i, i + N − 1). All parallel threads synchronize before moving to the next diagonal.
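A schematic C/OpenMP sketch of this diagonal-by-diagonal scheme follows; it is our illustration of the idea, not Listing 3 or the code of [21], and the pairing score sigma() is a placeholder:

```c
#define MAX(a, b) ((a) > (b) ? (a) : (b))

/* Placeholder pairing score: a real implementation returns 1 when the
   bases at positions i and j are complementary and i < j - 1. */
static int sigma(int i, int j) { (void)i; (void)j; return 0; }

/* Diagonal-by-diagonal parallel Nussinov folding: every cell on diagonal
   d = j - i depends only on cells of smaller diagonals, so each diagonal
   is one parallel loop and its implicit barrier synchronizes the threads. */
static void nussinov_diagonal(int N, int S[N][N]) {
    for (int d = 2; d < N; d++) {          /* S(i,i) and S(i,i+1) preset */
        #pragma omp parallel for
        for (int i = 0; i < N - d; i++) {
            int j = i + d;
            int best = S[i + 1][j - 1] + sigma(i, j);
            for (int k = i; k < j; k++)
                best = MAX(best, S[i][k] + S[k + 1][j]);
            S[i][j] = best;
        }
    }
}
```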

Table 4 Execution time (in seconds) of the Nussinov RNA folding codes for N = 5000 and different numbers of threads used

Threads Original Chang Li PluTo [8 × 8 × 1] ISS [2 × 6 × 300] ISS [1 × 96 × 8]


Fig. 2 Speed-up of parallel codes for Nussinov's matrix size of 5000, run on Intel Xeon E5-2699 v3. The horizontal coordinate represents the number of threads; the vertical one shows the speed-up of the examined codes

Li et al. [22] suggested a cache-efficient version of Chang's code by using the lower triangle of matrix S to store the transpose of the values computed in the upper triangle of S [22]. They store S[row][k] + S[col][k + 1] to variable t (line 19, instead of Chang's line 16) and additionally store the value of _max to S[col][row] at the end of the loop body (line 25). Values of S[k + 1][col] are located in the same column, but values of S[col][k + 1] are located in the same row, for row ≤ k < col. Li's modifications greatly accelerate code execution because reading values along a row is more cache efficient than reading values along a column [22].
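A minimal sketch of this idea (ours, not Listing 3 itself; the helper name update_cell and the pair argument are assumptions) shows how the transposed copy turns the column walk into a row walk:

```c
/* Schematic Li-style update of cell (row, col), row < col: the lower
   triangle mirrors the upper one, so Chang's column walk S[k+1][col]
   becomes the row walk S[col][k+1]; pair stands for sigma(row, col). */
static void update_cell(int N, int S[N][N], int row, int col, int pair) {
    int _max = S[row + 1][col - 1] + pair;
    for (int k = row; k < col; k++) {
        int t = S[row][k] + S[col][k + 1];   /* both reads walk along rows */
        if (t > _max) _max = t;
    }
    S[row][col] = _max;                      /* upper-triangle result */
    S[col][row] = _max;                      /* keep the transpose in sync */
}
```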

The results in Table 4 show that our tiled code implementing Nussinov's algorithm with the tile size [1 × 96 × 8] outperforms the implementations of Chang and Li (see Listing 3) for each examined number of threads (from 1 to 32) when N = 5000.

This table also includes the execution times of tiled code generated with the PluTo compiler, which tiles the two outermost loops2. The tile size [8 × 8 × 1] was chosen from the many different tile sizes examined by us as the one exposing the highest code performance. Those times are smaller than those achieved with Chang's code. The cache-efficient code proposed by Li et al. outperforms the PluTo code and our 3-D tiled code. Only tiling the two innermost loops allows us to achieve higher speed-up than that of Li's implementation. The speed-up of the examined programs is depicted in Fig. 2.

Furthermore, we studied code performance for different problem sizes defined as the RNA strand length, which is an important characteristic of Nussinov's folding. We examined eight mRNAs of Homo sapiens mitogen-activated protein kinase (MAPK) from the NCBI database3. Code execution times are presented in Table 5, while the corresponding speed-up is depicted in Fig. 3. We observe that our code demonstrates higher speed-up than that of the remaining examined codes when the length of RNA strands is greater than 2500. For short sequences (less than 2500) and 32 threads, the related codes are faster (from 0.1 to 0.3 seconds per strand) than ours. However, for short sequences, the computation time is less than one second per strand. The power of the presented approach is noticeable for longer strands; for example, our code for MAP2K6 variant 2 demonstrates a 16-second time benefit per strand against Li's cache-efficient code.

Table 5 Execution time (in seconds) of the Nussinov RNA folding codes for 32 threads and different lengths of RNA strands; mRNAs acquired from the NCBI database

mRNA definition Length Serial time Chang Li PluTo [8 × 8 × 1] ISS [2 × 6 × 300] ISS [1 × 96 × 8]

Fig. 3 Speed-up of parallel codes run on Intel Xeon E5-2699 v3, 32 threads used. The horizontal coordinate represents Nussinov's matrix size; the vertical one shows the speed-up of the studied codes. mRNAs acquired from the NCBI database

The performance improvement of the code generated with the presented technique over Li's code for longer sequences is reached due to i) the application of a tiling technique which allows for increasing parallel code coarseness and locality, and ii) the choice of the optimal original tile size in the defined search space. All those factors together lead to a significant improvement in code performance.

Summing up, we may conclude that the efficiency of cache reuse provided by ISS based tiled code becomes a dominant factor in achieving high code performance despite code complexity. Although our tiled code is more complex than the examined ones, choosing the best original tile size allows for achieving higher performance in comparison with the related examined codes on the multi-core machine used for the experiments.

Conclusion

In this paper, we presented an approach which allows us to choose, in a given search space, the best original tile size and tile dimension for the generation of serial and parallel ISS based tiled codes implementing Nussinov's RNA folding. Those codes are generated using the transitive closure of dependence graphs: the transitive dependences are applied to the statement instances of interest to produce valid tiles. Such a technique is within the well-known iteration space slicing framework.

Analyzing the constraints of a set representing valid target tiles, we make an assumption about good original tile size and tile dimension and confirm this assumption by carrying out experiments. The key step of this approach is constructing parallel parametric code, where the variables defining tile size are parameters. The usage of parametric code allows us to compile target code only once, which significantly reduces search time.

The experimental study allows us to conclude that i) tiling the two innermost loops is the best tiling scheme for ISS based tiled code, i.e., the outermost loop has to remain untiled; ii) the size of the second dimension of an original tile must be roughly tenfold bigger than the size of the third one.

Our implementation of Nussinov's algorithm improves code locality and outperforms the serial original code by a factor of 3.7. We demonstrated a super-linear speed-up of 112.9 for parallel code run with 32 threads. The tuned tiled code is more cache efficient than the closely related implementations of Li and Chang when the length of RNA strands is greater than 2500, for the studied multi-core machine.

Under Nussinov's algorithm conditions, the problem of folding a nucleotide sequence into a structure with minimal free energy becomes the simpler problem of finding a structure with the maximum number of base pairs [1]. Zuker et al. [23] refined Nussinov's algorithm by using a thermodynamic energy minimization model, which produces more accurate results at the expense of greater computational complexity, but code implementing Zuker's algorithm is affine. This allows us to apply the approach presented in this paper to that algorithm.
