Overlap Graphs and Hamiltonian Cycles

15.2.1 Problem Definition and Exhaustive Search

Before addressing the algorithms, let us put forward a useful definition: the composition of a sequence (string) s in segments of size k, denoted comp_k(s), is defined as the collection of all sub-sequences of s of size k (i.e. all its contiguous k-mers), possibly including repeated sub-sequences. The resulting fragments are normally ordered lexicographically.

It is trivial to implement this definition as a Python function as provided in the next code chunk.

    def composition(k, seq):
        res = []
        # collect every substring of size k, repeats included
        for i in range(len(seq) - k + 1):
            res.append(seq[i:i+k])
        res.sort()  # lexicographic order
        return res

    def test1():
        seq = "CAATCATGATG"
        k = 3
        print(composition(k, seq))

    test1()

When performing genome sequencing, we have the complementary problem, i.e. we do not know the original sequence, but we know its composition in fragments. Thus, a possible problem definition is to try to recover the original sequence, given as input its composition and the value of k.

A natural solution for this problem would be to start with one of the fragments and try to find another fragment that has an overlap of k-1 positions with it. This allows adding one nucleotide to the reconstructed sequence (the last character of the second fragment). Iterating this strategy could eventually cover all fragments. Notice that a fragment of size k has an overlap of k-1 with another of the same size when the suffix_{k-1} (last k-1 characters) of the first is equal to the prefix_{k-1} (first k-1 characters) of the latter.

The following code implements these concepts as Python functions:

    def suffix(seq):
        # last k-1 characters of a fragment of size k
        return seq[1:]

    def prefix(seq):
        # first k-1 characters of a fragment of size k
        return seq[:-1]

Figure 15.2: Example of the reconstruction of a sequence from overlapping fragments of size k=3. (A) An example where a simple algorithm selecting fragments with an overlap of 2 nucleotides returns the correct solution. (B) An example where there are repetitions of sub-sequences of size k-1 in the fragments, making the selection of the next fragment more difficult: three solutions are tried, but only the last one returns a correct solution covering all fragments.

Consider an example with k=3 and the following set of fragments: ACC, ATA, CAT, CCA, TAA. Starting with ACC, we can find CCA with an overlap of 2, which allows us to extend the sequence to ACCA; following the same strategy we add CAT, ATA, and, finally, TAA, to reach the final sequence ACCATAA (Fig. 15.2A).

So, we seem to have solved our problem! However, a more careful analysis will show otherwise. Notice that in this case, we never had a doubt about which fragment to select next, since there were no repeated (k-1)-mers. Also, try to run the same algorithm starting with a different fragment!
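The strategy just described can be sketched as a small greedy function. This is only an illustrative sketch: the name `greedy_reconstruct` and the rule of always taking the first matching unused fragment are assumptions made here, not part of the original code.

```python
def suffix(seq):
    return seq[1:]

def prefix(seq):
    return seq[:-1]

def greedy_reconstruct(frags, start):
    """Extend `start` one fragment at a time, always taking the first
    unused fragment whose prefix matches the current suffix.
    Returns the sequence built and the fragments left uncovered."""
    remaining = list(frags)
    remaining.remove(start)
    current, seq = start, start
    while True:
        candidates = [f for f in remaining if prefix(f) == suffix(current)]
        if not candidates:
            break  # no fragment overlaps the current one: stop
        current = candidates[0]
        remaining.remove(current)
        seq += current[-1]  # each new fragment contributes one character
    return seq, remaining

frags = ["ACC", "ATA", "CAT", "CCA", "TAA"]
print(greedy_reconstruct(frags, "ACC"))  # ('ACCATAA', [])
print(greedy_reconstruct(frags, "CAT"))  # ('CATAA', ['ACC', 'CCA'])
```

Starting from ACC covers every fragment, while starting from CAT stops early and leaves ACC and CCA unused, which shows how much the result depends on the starting fragment.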

Let us try a different example, adding a few fragments to the previous set: ACC, AGT, ATA, ATT, CAT, CCA, GTA, TAA, TAG, TAT, TTA. If we follow the same algorithm, we will add fragments ACC, CCA, and CAT. Here, we have the first decision to make, as there are two candidates: ATA and ATT. Let us choose ATA and proceed. Now, we have an even harder decision to make, since we can select TAA, TAG, or TAT. Selecting the first (TAA) as before, the algorithm terminates, as there are no fragments starting with AA (Fig. 15.2B). However, we have not covered all fragments and, therefore, this solution is incorrect. We can instead select TAG, which leads to selecting AGT, GTA, and TAA next; again, this is an incomplete solution, as the fragments TAT, ATT, and TTA are not covered. So, the right choice would be to select TAT, which leads to a solution covering all fragments and returning the sequence ACCATATTAGTAA, as shown in Fig. 15.2B, last column.
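The trial-and-error process above amounts to a search with backtracking. The sketch below is a hypothetical helper (the name `reconstructions` is not from the original code) that explores every choice from a given starting fragment and collects all sequences that use each fragment exactly once; it assumes the fragments are distinct.

```python
def suffix(seq):
    return seq[1:]

def prefix(seq):
    return seq[:-1]

def reconstructions(frags, start):
    """Return every sequence that starts at `start` and uses all
    fragments exactly once (exhaustive search with backtracking)."""
    results = []

    def extend(current, seq, remaining):
        if not remaining:  # every fragment has been used
            results.append(seq)
            return
        for i, f in enumerate(remaining):
            if prefix(f) == suffix(current):
                # try f next; returning from the call is the backtrack step
                extend(f, seq + f[-1], remaining[:i] + remaining[i + 1:])

    remaining = list(frags)
    remaining.remove(start)
    extend(start, start, remaining)
    return results

frags = ["ACC", "AGT", "ATA", "ATT", "CAT", "CCA",
         "GTA", "TAA", "TAG", "TAT", "TTA"]
solutions = reconstructions(frags, "ACC")
print(solutions)
print("ACCATATTAGTAA" in solutions)  # True
```

The sequence found by hand in the text is among the results. The search also reports other orderings that cover all fragments, since a fragment set does not always determine a unique sequence.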

Although we have managed to find a solution in this case, it is clear that the algorithm is not adequate. Indeed, this strategy forces us to backtrack over our previous choices in any situation where more than one option exists. In real scenarios, this is very inefficient, as there are many repetitions of (k-1)-mers. As in regular puzzles, where pieces with similar patterns are the most troublesome, here too repeats make our life more difficult.

Figure 15.3: Example of the overlap graph for the example in the previous section. (A) The overlap graph. (B) The solution for the reconstruction problem shown as a path over the graph, passing through all nodes; the edges included in this path are shown with wider arrows, while the ones not included are shown in light gray.

15.2.2 Overlap Graphs

One alternative to address these problems comes from graph theory, which we have covered in the last two chapters. Indeed, the fragments for this problem can be organized in a special type of graph, called an overlap graph, which can lead to a different problem formulation and alternative algorithms.

An overlap graph is defined as follows:

• Nodes (vertices) correspond to the input fragments, i.e. there will be one vertex per given fragment (read);

• Edges are created from node v to node w if suffix_{k-1}(v) = prefix_{k-1}(w), according to the definition above, i.e. the sequences of nodes v and w have an overlap of k-1 nucleotides (hence the name of the graph).

The graph in Fig. 15.3A represents the overlap graph for the example put forward in the previous section. In this case, there are 11 nodes, one for each fragment, and edges connect all pairs of nodes where the last two characters of the first are equal to the first two characters of the second.
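As a quick illustration of this definition, the overlap graph of the 11 fragments can be built as a plain dictionary of adjacency lists. This is a minimal sketch that does not use the book's graph class; the function name `overlap_graph` is an assumption made here, and the fragments are assumed to be distinct.

```python
def overlap_graph(frags):
    """Adjacency dict: fragment -> fragments it overlaps by k-1 characters."""
    graph = {f: [] for f in frags}
    for v in frags:
        for w in frags:
            if v[1:] == w[:-1]:  # suffix of v equals prefix of w
                graph[v].append(w)
    return graph

frags = ["ACC", "AGT", "ATA", "ATT", "CAT", "CCA",
         "GTA", "TAA", "TAG", "TAT", "TTA"]
g = overlap_graph(frags)
for node, successors in g.items():
    print(node, "->", successors)
print(sum(len(s) for s in g.values()))  # 18 edges in total
```

Nodes such as ATA, GTA, and TTA each have three outgoing edges (to TAA, TAG, and TAT), which are exactly the points where the reconstruction had to choose between alternatives.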

Note that a solution to the string reconstruction problem can be represented as a path over this graph. To be a valid solution, this path should pass exactly once over every node of the graph.

These paths have a special denomination and interesting properties, as we will explore in the next section.

Fig. 15.3B shows a path for the previous example, which represents the solution found in the previous section (see Fig. 15.2B). Note that, in some cases, there are nodes with multiple outgoing edges, which makes it more difficult to select the next node, since there are many choices, as we saw above.

Before proceeding to explore the properties of these graphs and paths, and related algorithms, let us define some Python code to create overlap graphs. To achieve this aim, we will define a Python class, OverlapGraph, which will be a sub-class of the class MyGraph, defined in Chapter 13.

Following the definition above, the next code chunk shows the definition of the class and its constructor, which uses an auxiliary method (create_overlap_graph) to build the graph, given the set of fragments. Notice that the code uses the functions suffix and prefix defined before, which need to be defined outside of the class. The function test2 provides an example, with the set of fragments from the first case discussed in the last section (Fig. 15.2A).

    from MyGraph import MyGraph

    class OverlapGraph(MyGraph):

        def __init__(self, frags):
            MyGraph.__init__(self, {})
            self.create_overlap_graph(frags)

        def create_overlap_graph(self, frags):
            # one vertex per fragment
            for seq in frags:
                self.add_vertex(seq)
            # edge seq -> seq2 whenever suffix(seq) == prefix(seq2)
            for seq in frags:
                suf = suffix(seq)
                for seq2 in frags:
                    if prefix(seq2) == suf:
                        self.add_edge(seq, seq2)

    def test2():
        frags = ["ACC", "ATA", "CAT", "CCA", "TAA"]
        # this first version of the constructor takes only the fragments
        ovgr = OverlapGraph(frags)
        ovgr.print_graph()

    test2()

Although for the selected example, this code works and builds the desired graph, problems arise when there are repeated fragments in the list. In this case, these repetitions would be ignored and a single node would be created for each unique sequence. However, this does not address the requirements of our problem, since repeated fragments should all be covered, and, therefore, different nodes should be assigned to each fragment.

A possible solution for this problem, while keeping our representation of graphs, which requires nodes to have unique identifiers (they are the keys of a dictionary), is to append a numerical identifier to the fragment sequence. Thus, all nodes will have a unique identifier and still preserve the information from the sequence.
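The labeling idea can be illustrated on its own, before it is wired into the graph class. The helper names `label_fragments` and `sequence_of` below are hypothetical, and the "sequence-number" format with a hyphen separator is an assumption chosen for this sketch.

```python
def label_fragments(frags):
    """Give each fragment a unique label of the form 'SEQ-n' (n starts at 1)."""
    return [f"{seq}-{i}" for i, seq in enumerate(frags, start=1)]

def sequence_of(label):
    """Recover the fragment sequence from a label."""
    return label.split("-")[0]

labels = label_fragments(["ATA", "CAT", "CAT", "CAT"])
print(labels)                            # ['ATA-1', 'CAT-2', 'CAT-3', 'CAT-4']
print([sequence_of(x) for x in labels])  # ['ATA', 'CAT', 'CAT', 'CAT']
```

The three copies of CAT now map to three distinct dictionary keys, while the sequence remains recoverable from each label.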

This solution is implemented in the next code block, where the constructor now accounts for both options (replicates or not), and a new method is defined for the case with replicates (create_overlap_graph_with_reps). The auxiliary method get_instances is used to search for all nodes containing a given sequence. The code is tested (function test3) with an example of a set of fragments with replicates.

    class OverlapGraph(MyGraph):

        def __init__(self, frags, reps = True):
            MyGraph.__init__(self, {})
            if reps:
                self.create_overlap_graph_with_reps(frags)
            else:
                self.create_overlap_graph(frags)  # method defined above
            self.reps = reps

        def create_overlap_graph_with_reps(self, frags):
            # one vertex per fragment occurrence, labeled "sequence-number"
            idnum = 1
            for seq in frags:
                self.add_vertex(seq + "-" + str(idnum))
                idnum = idnum + 1
            # edges from each occurrence to every instance of an overlapping fragment
            idnum = 1
            for seq in frags:
                suf = suffix(seq)
                for seq2 in frags:
                    if prefix(seq2) == suf:
                        for x in self.get_instances(seq2):
                            self.add_edge(seq + "-" + str(idnum), x)
                idnum = idnum + 1

        def get_instances(self, seq):
            # all node labels whose sequence part equals seq
            res = []
            for k in self.graph.keys():
                if k.split("-")[0] == seq:
                    res.append(k)
            return res

    def test3():
        frags = ["ATA", "ACC", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
                 "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
        ovgr = OverlapGraph(frags, True)
        ovgr.print_graph()

    test3()

15.2.3 Hamiltonian Circuits

In a directed graph, a Hamiltonian path is a path that goes through all nodes of the graph exactly once; a Hamiltonian circuit (or cycle) is such a path that also returns to its starting node. As we hinted above, the problem of reconstructing a string, given its composition, may be defined as the problem of finding a Hamiltonian path in the overlap graph. One example of such a path is shown in Fig. 15.3B.

Indeed, from the path set by the Hamiltonian path, we can easily get the “original” sequence. This is done by simply taking the sequence from the first node in the path, and concatenating the last character of the sequences in the remaining nodes of the path, following its order.
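As a small worked example of this rule, the sketch below spells the sequence for the path found in the first example (Fig. 15.2A). It works on plain fragment strings, and the function name `spell_path` is an assumption made for illustration; a non-empty path is assumed.

```python
def spell_path(path):
    """First fragment in full, then the last character of each following one."""
    seq = path[0]
    for node in path[1:]:
        seq += node[-1]
    return seq

print(spell_path(["ACC", "CCA", "CAT", "ATA", "TAA"]))  # ACCATAA
```

A path with n fragments of size k therefore spells a sequence of length k + n - 1 (here 3 + 5 - 1 = 7).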

This is shown in the code below, where we define two new methods to add to the OverlapGraph class above: the first gets the sequence represented by a node, and the second gets the full sequence spelled by a path in the graph. The example provided complements the code of function test3, defining a path over the graph, which in this case is a Hamiltonian path, and returning the original sequence that originated the fragments.

    class OverlapGraph(MyGraph):
        ## see code above (...)

        def get_seq(self, node):
            if node not in self.graph.keys():
                return None
            if self.reps:
                return node.split("-")[0]  # drop the numerical identifier
            else:
                return node

        def seq_from_path(self, path):
            if not self.check_if_hamiltonian_path(path):
                return None
            seq = self.get_seq(path[0])
            for i in range(1, len(path)):
                nxt = self.get_seq(path[i])
                seq += nxt[-1]  # each node adds its last character
            return seq

    def test3():
        # the order of the fragments fixes the node identifiers used in path
        frags = ["ATA", "ACC", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
                 "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
        ovgr = OverlapGraph(frags, True)
        ovgr.print_graph()
        path = ['ACC-2', 'CCA-8', 'CAT-5', 'ATG-3', 'TGG-13', 'GGC-10',
                'GCA-9', 'CAT-6', 'ATT-4', 'TTT-15', 'TTC-14', 'TCA-12',
                'CAT-7', 'ATA-1', 'TAA-11']
        print(ovgr.seq_from_path(path))

    test3()

In the previous example, we have shown a path that is Hamiltonian, but have not tested for this condition. It is easy to verify, given a path, whether it is a Hamiltonian path or not, by checking that it is a valid path, that it contains all nodes, and that there are no repeated nodes. Notice that, for a graph G = (V, E), a Hamiltonian path has exactly |V| nodes and |V| - 1 edges.
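These three conditions can be illustrated independently of the graph class, using a dictionary of adjacency lists. The sketch below is a hypothetical standalone version (the function names are assumptions), applied to the overlap graph of the five fragments of Fig. 15.2A.

```python
def is_valid_path(graph, path):
    """Every node exists and each step follows an edge of the graph."""
    if not path or path[0] not in graph:
        return False
    return all(b in graph and b in graph[a] for a, b in zip(path, path[1:]))

def is_hamiltonian_path(graph, path):
    """Valid path that contains every node exactly once."""
    return (is_valid_path(graph, path)
            and len(path) == len(graph)    # |V| nodes, hence |V| - 1 edges
            and set(path) == set(graph))   # all nodes present, none repeated

graph = {"ACC": ["CCA"], "ATA": ["TAA"], "CAT": ["ATA"],
         "CCA": ["CAT"], "TAA": []}
print(is_hamiltonian_path(graph, ["ACC", "CCA", "CAT", "ATA", "TAA"]))  # True
print(is_hamiltonian_path(graph, ["CAT", "ATA", "TAA"]))                # False
print(is_valid_path(graph, ["ACC", "CAT"]))                             # False
```

The second path is valid but leaves two nodes out; the third is not a path at all, since there is no edge from ACC to CAT.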

The following code block shows two methods that were added to the class MyGraph, defined in Chapter 13. It makes sense to add this code to that class, since these methods work for any directed graph and are not restricted to overlap graphs. Also, since the class OverlapGraph is a sub-class of MyGraph, it will inherit all methods defined there.

The two methods presented can be used, respectively, to check if a given path is valid, and to check if it is a Hamiltonian path. Both return a Boolean result (True or False).

    class MyGraph:
        ## see code in previous chapters (...)

        def check_if_valid_path(self, p):
            # an empty path, or one starting outside the graph, is not valid
            if not p or p[0] not in self.graph.keys():
                return False
            # each node must exist and be a successor of the previous one
            for i in range(1, len(p)):
                if p[i] not in self.graph.keys() or p[i] not in self.graph[p[i-1]]:
                    return False
            return True

        def check_if_hamiltonian_path(self, p):
            if not self.check_if_valid_path(p):
                return False
            to_visit = list(self.get_nodes())
            if len(p) != len(to_visit):
                return False
            # every node in the path must be a node not yet visited
            for i in range(len(p)):
                if p[i] in to_visit:
                    to_visit.remove(p[i])
                else:
                    return False
            return not to_visit

These methods may be tested with the OverlapGraph class, adding a couple of lines to the previous example.

    def test3():
        # the order of the fragments fixes the node identifiers used in path
        frags = ["ATA", "ACC", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
                 "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
        ovgr = OverlapGraph(frags, True)
        ovgr.print_graph()
        path = ['ACC-2', 'CCA-8', 'CAT-5', 'ATG-3', 'TGG-13', 'GGC-10',
                'GCA-9', 'CAT-6', 'ATT-4', 'TTT-15', 'TTC-14', 'TCA-12',
                'CAT-7', 'ATA-1', 'TAA-11']
        print(ovgr.seq_from_path(path))
        print(ovgr.check_if_valid_path(path))
        print(ovgr.check_if_hamiltonian_path(path))

    test3()

So, we checked that it is trivial to verify if a path is Hamiltonian or not. Unfortunately, however, it is not easy to search for Hamiltonian paths in a graph. One option is to perform an exhaustive search, which leads to a variant of a depth-first search over the graph (this was briefly discussed in Chapter 13, Section 13.4). This algorithm tries to build a path by moving from each node to one of its successors not previously visited. It backtracks when it reaches a node that has no successors left to visit.

A function implementing this process is provided below, added to the class MyGraph, where a Hamiltonian path is sought taking each possible node as the starting point. The method search_hamiltonian_path_from_node tries to build a path starting at that node that covers all other nodes. This method implements a search tree by maintaining the current node being processed (variable current), the current path being built (variable path), as well as the state of the nodes (variable visited). This last variable is a dictionary where the keys are nodes, and the values are the indexes of the next successor to explore for that node, thus seeking to minimize the memory used to keep this information.

    class MyGraph:
        ## (...)

        def search_hamiltonian_path(self):
            # try every node as the starting point
            for ke in self.graph.keys():
                p = self.search_hamiltonian_path_from_node(ke)
                if p is not None:
                    return p
            return None

        def search_hamiltonian_path_from_node(self, start):
            current = start
            visited = {start: 0}  # node -> index of next successor to try
            path = [start]
            while len(path) < len(self.get_nodes()):
                nxt_index = visited[current]
                if len(self.graph[current]) > nxt_index:
                    nxtnode = self.graph[current][nxt_index]
                    visited[current] += 1
                    if nxtnode not in path:  ## node added to path
                        path.append(nxtnode)
                        visited[nxtnode] = 0
                        current = nxtnode
                else:  ## backtrack
                    if len(path) > 1:
                        rmvnode = path.pop()
                        del visited[rmvnode]
                        current = path[-1]
                    else:
                        return None
            return path

The previous function can be tested in the context of the OverlapGraph class, with the previous examples, as shown below.

    def test4():
        frags = ["ATA", "ACC", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
                 "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
        ovgr = OverlapGraph(frags, True)
        ovgr.print_graph()
        path = ovgr.search_hamiltonian_path()
        print(path)
        print(ovgr.check_if_hamiltonian_path(path))
        print(ovgr.seq_from_path(path))

    test4()

We can now close the circle by defining a sequence, calculating its composition, and trying to recover it by finding a Hamiltonian path. This is done in the next example.

    def test5():
        orig_sequence = "CAATCATGATGATGATC"
        frags = composition(3, orig_sequence)
        print(frags)
        ovgr = OverlapGraph(frags, True)
        ovgr.print_graph()
        path = ovgr.search_hamiltonian_path()
        print(path)
        print(ovgr.seq_from_path(path))

    test5()

However, as the reader can check by running the example, the recovered sequence is not the same as the original. This happens because there might be many different sequences with the same composition. This becomes increasingly rare as the value of k becomes larger.
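This ambiguity can be seen with the 11-fragment example from Section 15.2.1. In the sketch below, the first sequence is the one reconstructed in the text; the second is another ordering of the same fragments, worked out for this illustration only.

```python
def composition(k, seq):
    """Sorted list of all substrings of size k, repeats included."""
    return sorted(seq[i:i + k] for i in range(len(seq) - k + 1))

a = "ACCATATTAGTAA"  # sequence reconstructed in the text
b = "ACCATAGTATTAA"  # a different sequence built from the same 3-mers

print(a != b)                                  # True
print(composition(3, a) == composition(3, b))  # True: same composition for k = 3
print(composition(4, a) == composition(4, b))  # False: k = 4 tells them apart
```

With k = 3 the two sequences are indistinguishable from their fragments alone, while with k = 4 their compositions already differ.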

Also, we invite the reader to try to increase the size of the sequence and check that the run time of the algorithm grows increasingly worse. In fact, this algorithm cannot be used for large sequences and their respective overlap graphs. Indeed, the problem of finding Hamiltonian paths is quite complex: deciding whether one exists is an NP-complete problem (and thus belongs to the class of NP-hard problems), which implies that no efficient (polynomial-time) algorithms are known when the problem scales, in this case, when the graph becomes larger.
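To get a rough feeling for this growth, note that a graph with n nodes has n! possible orderings of its nodes, which is a simple upper bound on the number of candidate paths an exhaustive search might have to consider. The sketch below only prints this bound for a few values of n; the helper name `max_orderings` is an assumption, and actual run times depend on the structure of the graph.

```python
import math

def max_orderings(n):
    """Number of distinct orderings of n nodes (upper bound on candidate paths)."""
    return math.factorial(n)

for n in (5, 10, 15, 20):
    print(n, max_orderings(n))
```

Even for 15 nodes, the size of the small examples used in this section, the bound already exceeds 10^12.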

This makes it very difficult, in practice, to use this approach for large-scale genome assembly.

Still, until a few years ago, these were the strategies used by the existing programs, which used heuristic methods to attain the best possible solutions. These were used, for instance, throughout the Human Genome Project.
