In the exhaustive approach, the solution space is composed of $(n-L+1)^t$ possible solutions, where $n$ is the length of each input sequence, $L$ is the motif length and $t$ is the number of sequences. As the length and the number of the input sequences increase, the number of candidates explodes and the problem becomes computationally intractable. In practice, this approach is not feasible with real-life data. We therefore need a more efficient search procedure, which requires a clever way to traverse the solution space and reduce the number of motif candidates that are evaluated.
Figure 10.2: Tree structure representation for motif candidates with length 2 and an alphabet with three symbols.
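To give a feeling for how quickly the exhaustive search space grows, the short sketch below evaluates $(n-L+1)^t$ for a few values of $t$. The function name count_candidates and the chosen values of n and L are illustrative assumptions, not part of the chapter's code.

```python
def count_candidates(n, L, t):
    """Number of start-position vectors for t sequences of length n
    and a motif of length L (hypothetical helper for illustration)."""
    return (n - L + 1) ** t


# Illustrative values only: sequences of length 100 and a motif of length 8
for t in (2, 5, 10, 20):
    print(t, count_candidates(n=100, L=8, t=t))
```

Each additional sequence multiplies the number of candidates by n - L + 1, which is why the count becomes unmanageable even for modest inputs.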
The branch and bound algorithm belongs to a group of combinatorial optimization algorithms. It enumerates the candidates, but uses an intelligent lookahead mechanism to avoid the explicit enumeration of some of the solutions. It is, therefore, well suited for this task.
The algorithm organizes the search space in a tree structure, which partitions the search space into sub-sets. Each leaf of the tree corresponds to a candidate, and the internal nodes are partial sub-spaces. Traversing all the leaves of the tree provides an exhaustive enumeration. More interestingly, the branch and bound algorithm seeks a more efficient enumeration of the candidate solutions by exploring tree branches. Each internal node represents a partial solution shared by all the leaves below it, so each branch carries information about the objective function values of those leaves. Thus, at each internal node or branch, an upper or a lower bound of the objective function score for the respective sub-space can be calculated. The algorithm expands the search within that branch of the tree only if it has the potential to yield a better solution than the current one. Otherwise, all the candidates within that sub-space are discarded (not enumerated). The elimination of the candidates below a branch is typically called pruning the search space. The algorithm takes its name from these two tasks: branching, which explores the sub-tree of a given node, and bounding, which estimates a bound on the candidates of the sub-tree rooted at that node.
For a motif of length $L$ and an alphabet $\Sigma$, there are $|\Sigma|^L$ possible candidates. Consider, for instance, a hypothetical alphabet of three letters $\Sigma = \{A, B, C\}$ and a motif of length 2. There are $3^2 = 9$ candidates: AA, AB, AC, BA, BB, .... Fig. 10.2 provides a representation of the search space organized as a tree for this specific case.
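As a quick check of this count, the candidates for the three-letter example can be listed explicitly. This is a minimal sketch using the standard library; the function name motif_candidates is an assumption made for illustration.

```python
from itertools import product


def motif_candidates(alphabet, L):
    """List every string of length L over the given alphabet
    (hypothetical helper, one string per leaf of the tree)."""
    return ["".join(p) for p in product(alphabet, repeat=L)]


cands = motif_candidates("ABC", 2)
print(len(cands))  # 9
print(cands)       # ['AA', 'AB', 'AC', 'BA', 'BB', 'BC', 'CA', 'CB', 'CC']
```

The order of the list matches a left-to-right reading of the leaves in a tree whose first level fixes the first symbol and whose second level fixes the second symbol.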
In order to adapt the branch and bound algorithm to our motif discovery problem, we need to implement functionalities that allow us to traverse the tree and to calculate the bounds of the objective function values of the candidates. In this case, the objective function will correspond to the Score function previously introduced.
Figure 10.3: Representation of the workflow for the next vertex enumeration and bypass operations on the tree structure arrangement of the motif search space.
For the traversal of the search space, we need an operation that enumerates the candidates by visiting the relevant leaves in the tree. A function that points to the next leaf to visit is required. If the current traversal point is an internal node, the function should point to the first leaf of its sub-space. If the current point is a leaf, the function should return the subsequent leaf of the sub-space. If this leaf is the last one within the sub-space, then it should point to the next internal node, which is the root of the next sub-space to visit. Fig. 10.3A shows a representation of the sub-space enumeration.
As we will see next, a bounding operation ignores the sub-space corresponding to the sub-tree rooted at the given node and jumps to the next node found at the same level as the current one. This is called a bypass operation. Fig. 10.3B shows an example of a branch bypass, where the pointer to the current node moves to the next equivalent node.
In order to decide whether a sub-space should be ignored, we need to calculate its upper bound. This value indicates whether the candidates within the sub-space may include a solution that improves on the current one.
For our motif discovery problem, we try to maximize the Score function by selecting the best initial positions of the motif, $s = (s_1, s_2, \ldots, s_t)$. Therefore, instead of working on a search space of motif representations, we will work on a space of positional representations. For $t$ input sequences of length $N$ and a motif of length $L$, we can search each starting position, from index 0 to index $M = N - L$, in each of the $t$ input sequences.
Figure 10.4: Representation of the enumeration space based on the vector of motif starting positions.
Fig. 10.4 shows all possible solutions represented as leaves of the tree. Each leaf is represented as a vector of length $t$. An internal node at level $i$ is represented by a vector of length $i$, as also shown in the figure.
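The positional representation can be made concrete by listing the vectors at each level of a very small tree. The sketch below assumes that all sequences share the same maximum start position M; the function name nodes_at_level is hypothetical.

```python
from itertools import product


def nodes_at_level(level, M):
    """All position vectors of the given length, each entry in 0..M.
    With level == t these are the leaves; with level < t, internal nodes."""
    return [list(p) for p in product(range(M + 1), repeat=level)]


t, M = 3, 1                      # three sequences, start positions 0 or 1
print(nodes_at_level(1, M))      # [[0], [1]]
print(nodes_at_level(2, M))      # [[0, 0], [0, 1], [1, 0], [1, 1]]
print(len(nodes_at_level(t, M))) # 8
```

The number of leaves is $(M+1)^t$, in agreement with the size of the exhaustive search space.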
The function next_vertex, provided below, implements the retrieval of the next vertex according to this representation. It tests whether the length of the current position vector is shorter than the number of sequences $t$. If it is, the vector represents an internal node. In this case, the function copies the current solution and goes down one level by setting the position at the next level to zero. If the length of the current position vector is equal to the number of sequences, then the search is at leaf level. Here, the function tests whether the position of the current sequence is already maximal, i.e. equal to $M = N - L$. If it is not, it copies the current positional solution and increments the position of the current sequence by one. Otherwise, it moves back to the closest previous sequence whose position is not yet maximal, increments that position and discards the positions after it. If no such sequence exists, the enumeration has finished and the function returns None.
    def next_vertex(self, s):
        res = []
        if len(s) < len(self.seqs):  # internal node -> down one level
            for i in range(len(s)):
                res.append(s[i])
            res.append(0)
        else:  # bypass
            pos = len(s) - 1
            while pos >= 0 and s[pos] == self.seq_size(pos) - self.motif_size:
                pos -= 1
            if pos < 0:
                res = None  # last solution
            else:
                for i in range(pos):
                    res.append(s[i])
                res.append(s[pos] + 1)
        return res
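To see the order in which vertices are visited, the same logic can be run as a standalone function on a tiny tree. This is only a sketch: it assumes that all sequences have the same maximum start position M, and it passes t and M as explicit arguments instead of reading them from the object, which the method above does not do.

```python
def next_vertex(s, t, M):
    """Standalone variant of the traversal step (assumes a common M)."""
    if len(s) < t:               # internal node: go down one level
        return s + [0]
    pos = len(s) - 1
    while pos >= 0 and s[pos] == M:
        pos -= 1                 # skip positions that are already maximal
    if pos < 0:
        return None              # last leaf reached
    return s[:pos] + [s[pos] + 1]


visited = []
s = [0, 0]                       # first leaf for t = 2
while s is not None:
    visited.append(s)
    s = next_vertex(s, t=2, M=1)
print(visited)  # [[0, 0], [0, 1], [1], [1, 0], [1, 1]]
```

Note that after the last leaf of a sub-space, [0, 1], the function returns the internal node [1] rather than a leaf. This is the point at which a bound can be tested before descending.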
The code for the bypass function is similar to that of the previous function, but in this case there is no test for internal nodes.
    def bypass(self, s):
        res = []
        pos = len(s) - 1
        while pos >= 0 and s[pos] == self.seq_size(pos) - self.motif_size:
            pos -= 1
        if pos < 0:
            res = None
        else:
            for i in range(pos):
                res.append(s[i])
            res.append(s[pos] + 1)
        return res
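A few calls on small vectors illustrate how the bypass skips a whole sub-tree. As before, this is a standalone sketch that assumes a common maximum start position M, given as an explicit argument rather than computed from the sequences.

```python
def bypass(s, M):
    """Standalone variant: jump to the next node at the same or a higher level."""
    pos = len(s) - 1
    while pos >= 0 and s[pos] == M:
        pos -= 1                 # this level is exhausted, move up
    if pos < 0:
        return None              # nothing left to visit
    return s[:pos] + [s[pos] + 1]


print(bypass([0], M=2))     # [1]     sibling at the same level
print(bypass([1, 0], M=2))  # [1, 1]  sibling at the same level
print(bypass([0, 2], M=2))  # [1]     last sibling, so move up one level
print(bypass([2, 2], M=2))  # None    no remaining sub-space
```

Applied to an internal node, the call never descends, so none of the leaves below the bypassed node are enumerated.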
We now need a way to estimate the bound for a sub-space and to test whether it is worth exploring the solutions within the corresponding branch. Suppose that we are traversing the tree from top to bottom and are currently at level $i$. For our $t$ sequences, the position vector can be divided into $(s_1, s_2, \ldots, s_i)$ and $(s_{i+1}, \ldots, s_t)$. Each sequence can contribute at most $L$ to the score. Considering the initial positions in the first $i$ sequences as fixed, in the best-case scenario the remaining $t-i$ sequences contribute $(t-i) \cdot L$ to the current score. The bound for level $i$ is thus given by $Score((s_1, s_2, \ldots, s_i), D) + (t-i) \cdot L$. If this sum is smaller than the best score seen so far, no leaf in this sub-space can reach the best score and the sub-space can therefore be skipped.
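The bound can be illustrated numerically. The sketch below assumes that Score is the consensus score, i.e. the sum over the motif columns of the count of the most frequent symbol; the chapter's own definition, introduced earlier, may differ in detail. The function names score and upper_bound and the example sequences are chosen for illustration.

```python
from collections import Counter


def score(seqs, s, L):
    """Assumed consensus score for a (possibly partial) position vector s."""
    if not s:
        return 0
    total = 0
    for col in range(L):
        # Symbols found in this motif column for the sequences covered by s
        column = [seqs[k][p + col] for k, p in enumerate(s)]
        total += max(Counter(column).values())
    return total


def upper_bound(seqs, s, L):
    """Score of the fixed part plus the best case for the remaining sequences."""
    return score(seqs, s, L) + (len(seqs) - len(s)) * L


seqs = ["ATAGAGCTGA", "ACGTAGATGA", "AAGATAGGGG"]
print(upper_bound(seqs, [1, 3], 3))  # 9: "TAG" and "TAG" score 6, plus 1 * 3
print(upper_bound(seqs, [0, 0], 3))  # 7: "ATA" and "ACG" score 4, plus 1 * 3
```

If the best score found so far were 8, the sub-space below [0, 0] could be skipped, since none of its leaves can score more than 7, whereas the sub-space below [1, 3] would still have to be explored.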
This bounding operation can be easily performed with the bypass function, which jumps from the current internal node to the subsequent internal node.
The branch_and_bound function iterates through the solution space of initial positions to find the best motif. The function next_vertex provides the next solution at each step. If the search is currently at an internal node, the bound is estimated and a bypass is performed if the conditions explained above are met. If it is positioned at a leaf, the current score is tested; if it exceeds the best score, then the best motif is updated to the current motif solution.
    def branch_and_bound(self):
        best_score = -1
        best_motif = None
        size = len(self.seqs)
        s = [0] * size
        while s is not None:
            if len(s) < size:
                # estimate the bound for current internal node
                # test if the best score can be reached
                optimum_score = self.score(s) + (size - len(s)) * self.motif_size
                if optimum_score < best_score:
                    s = self.bypass(s)
                else:
                    s = self.next_vertex(s)
            else:
                # test if current leaf is a better solution
                sc = self.score(s)
                if sc > best_score:
                    best_score = sc
                    best_motif = s
                s = self.next_vertex(s)
        return best_motif
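The pieces above can be assembled into a small runnable sketch and checked against an exhaustive search. The class name MotifSketch, its constructor, the exhaustive method and the consensus-style score are assumptions made so that the example is self-contained; the class used in the chapter may be organized differently.

```python
from collections import Counter
from itertools import product


class MotifSketch:
    """Hypothetical minimal container for the sequences and motif length."""

    def __init__(self, seqs, motif_size):
        self.seqs = seqs
        self.motif_size = motif_size

    def seq_size(self, i):
        return len(self.seqs[i])

    def score(self, s):
        # Assumed consensus score over the sequences covered by s
        total = 0
        for col in range(self.motif_size):
            column = [self.seqs[k][p + col] for k, p in enumerate(s)]
            total += max(Counter(column).values())
        return total

    def bypass(self, s):
        pos = len(s) - 1
        while pos >= 0 and s[pos] == self.seq_size(pos) - self.motif_size:
            pos -= 1
        return None if pos < 0 else s[:pos] + [s[pos] + 1]

    def next_vertex(self, s):
        if len(s) < len(self.seqs):
            return s + [0]           # internal node: down one level
        return self.bypass(s)        # leaf: same step as a bypass

    def branch_and_bound(self):
        best_score, best_motif = -1, None
        size = len(self.seqs)
        s = [0] * size
        while s is not None:
            if len(s) < size:
                bound = self.score(s) + (size - len(s)) * self.motif_size
                s = self.bypass(s) if bound < best_score else self.next_vertex(s)
            else:
                sc = self.score(s)
                if sc > best_score:
                    best_score, best_motif = sc, s
                s = self.next_vertex(s)
        return best_motif

    def exhaustive(self):
        # Best score over every leaf, used only to verify the pruned search
        ranges = [range(self.seq_size(i) - self.motif_size + 1)
                  for i in range(len(self.seqs))]
        return max(self.score(list(s)) for s in product(*ranges))


mf = MotifSketch(["ATAGAGCTGA", "ACGTAGATGA", "AAGATAGGGG"], 3)
best = mf.branch_and_bound()
print(best, mf.score(best), mf.exhaustive())
```

Because the bound never underestimates the score of any leaf below a node, the pruned search should return the same best score as the exhaustive one while visiting fewer leaves.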