DeBruijn Graphs and Eulerian Paths

15.3.1 DeBruijn Graphs for Genome Assembly

Because the overlap-graph representation from the previous section did not lead to efficient algorithms, the research community sought other solutions. Working on similar problems that were, at the time, unrelated to genome assembly, the mathematician DeBruijn proposed a different representation, which also makes use of directed graphs.

The idea is to represent the fragments of the problem as edges, and not as vertices. Here, the solutions to the problem would, therefore, be paths over the graph containing all edges exactly once. We will explore this in further detail in the next section.

In the case of DeBruijn graphs, the nodes contain sequences that correspond to either a prefix of length k−1 or a suffix of length k−1 of one of the fragments. So, each edge, corresponding to a fragment, connects a node representing its prefix of length k−1 to a node representing its suffix of length k−1. In this case, unlike in the previous overlap graphs, prefixes and suffixes of length k−1 that share the same sequence are represented by a single node. So, repeated fragments are represented as multiple edges connecting the same pair of nodes. Fig. 15.4 shows the representation of the example from the previous section, the one implemented in the functions test3 and test4 above, as a DeBruijn graph. The set of 15 fragments is ACC, ATA, ATG, ATT, CAT, CAT, CAT, CCA, GCA, GGC, TAA, TCA, TGG, TTC, and TTT. Thus, the graph has 15 edges and 11 unique (k−1)-mers that are either prefixes or suffixes (or both) of these fragments.
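The counts above can be checked with a minimal, self-contained sketch. It does not use the MyGraph class; the function name build_debruijn and the plain dictionary of successor lists are illustrative assumptions, not part of the book's code.

```python
def build_debruijn(frags):
    """Map each (k-1)-mer node to the list of its successors (repeats kept)."""
    graph = {}
    for frag in frags:
        pref, suf = frag[:-1], frag[1:]          # prefix and suffix of length k-1
        graph.setdefault(pref, []).append(suf)   # repeated fragments give parallel edges
        graph.setdefault(suf, [])                # make sure sink nodes also appear
    return graph

frags = ["ACC", "ATA", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
         "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
graph = build_debruijn(frags)
num_edges = sum(len(sucs) for sucs in graph.values())
print(len(graph), "nodes,", num_edges, "edges")   # 11 nodes, 15 edges
print("CA ->", graph["CA"])                        # CA -> ['AT', 'AT', 'AT']
```

The three copies of CAT show up as three parallel edges from CA to AT, which is why successor lists rather than sets are used here.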

Figure 15.4: Example of the DeBruijn graph for the example in the previous section.

As before, we will implement DeBruijn graphs as a Python class, DeBruijnGraph, that extends the class MyGraph. This class needs to redefine the function add_edge to allow repeated edges between the same pair of nodes. The function create_deBruijn_graph implements the creation of the graph according to the explanation provided above. These functions are applied to the example represented in Fig. 15.4.

    from MyGraph import MyGraph

    # prefix() and suffix() are assumed to be the helper functions from the
    # previous section, returning seq[:-1] and seq[1:], respectively.

    class DeBruijnGraph(MyGraph):

        def __init__(self, frags):
            MyGraph.__init__(self, {})
            self.create_deBruijn_graph(frags)

        def add_edge(self, o, d):
            # repeated edges between o and d are allowed here
            if o not in self.graph.keys():
                self.add_vertex(o)
            if d not in self.graph.keys():
                self.add_vertex(d)
            self.graph[o].append(d)

        def create_deBruijn_graph(self, frags):
            for seq in frags:
                suf = suffix(seq)
                self.add_vertex(suf)
                pref = prefix(seq)
                self.add_vertex(pref)
                self.add_edge(pref, suf)

    def test6():
        frags = ["ACC", "ATA", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
                 "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
        dbgr = DeBruijnGraph(frags)
        dbgr.print_graph()

    test6()

Figure 15.5: Example of an Eulerian circuit over the DeBruijn graph for the previous example. Edges are numbered in the order they appear in the path.

15.3.2 Eulerian Paths

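In a directed graph, an Eulerian circuit is a path that goes through all edges in the graph exactly once. On the other hand, an Eulerian cycle is a cycle that goes through all edges; in other words, it is an Eulerian circuit that starts and terminates in the same node. Note that many references call the first notion an Eulerian path (or trail) and reserve "circuit" for the closed case; here we keep the terms as just defined.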

The problem of reconstructing a string from its composition in k-mers can be formulated as the search for an Eulerian circuit in the previously defined DeBruijn graph. An example of such a circuit, over the example in Fig. 15.4, is shown in Fig. 15.5. Note that in the example, the corresponding recovered sequence would be: ACCATGGCATTTCATAA.
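As a quick sanity check, the recovered sequence can be broken back into its 3-mers and compared with the 15 fragments. The composition function below is a stand-in written for this sketch (assumed to behave like the one from the previous section, returning the sorted k-mers with repeats).

```python
def composition(k, seq):
    """Return the sorted list of all k-mers of seq, keeping repeats."""
    return sorted(seq[i:i + k] for i in range(len(seq) - k + 1))

frags = ["ACC", "ATA", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
         "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
recovered = "ACCATGGCATTTCATAA"

same = composition(3, recovered) == sorted(frags)
print(same)   # True: every fragment is used exactly once
```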

We thus gain an alternative representation of the problem and its solutions. Does this mean that we have any advantage in terms of the efficiency of the algorithms? To answer this question, we need to understand how to search for Eulerian circuits in a graph.

This problem is one of the oldest in graph theory and was studied by Euler in the 18th century. Before looking at the algorithms to address this problem, we will see how Euler figured out whether a graph has Eulerian cycles or not. Let us start with an important definition: a directed graph is balanced if, for every vertex, the number of edges that reach it (in-degree) is equal to the number of edges that leave it (out-degree).

This leads us to Euler's theorem, which states that a directed and strongly connected graph has at least one Eulerian cycle if (and only if) it is balanced. Recall the definition of a strongly connected graph from Chapter 13: every node can be reached from every other node by at least one path. This theorem provides an easy way to check if the graph has an Eulerian cycle or not.

We have seen previously that a solution for our problem is represented by an Eulerian circuit, but the previous theorem states the conditions for the existence of an Eulerian cycle. The difference is subtle but important. Indeed, an extension of the previous theorem states that a connected directed graph has an Eulerian circuit if it is nearly balanced, which means it contains, at most, two semi-balanced vertices (one whose in-degree minus out-degree is −1, where the circuit starts, and one where this difference is 1, where it ends), while all other vertices are balanced. The connectivity requirement is weaker here: the start vertex may have no incoming edges at all, so the graph only needs to become strongly connected once an edge from the end vertex back to the start vertex is added.
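For the running example, the semi-balanced vertices can be found directly from the fragments, without building a graph object. This is only a sketch using collections.Counter; the variable names are illustrative.

```python
from collections import Counter

frags = ["ACC", "ATA", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
         "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]

out_deg = Counter(f[:-1] for f in frags)   # each fragment leaves its prefix node
in_deg = Counter(f[1:] for f in frags)     # and enters its suffix node
nodes = set(out_deg) | set(in_deg)

# keep only the vertices whose in-degree and out-degree differ
diff = {n: in_deg[n] - out_deg[n] for n in nodes if in_deg[n] != out_deg[n]}
print(diff)   # AC has difference -1 (start), AA has difference 1 (end)
```

Since exactly two vertices are semi-balanced and all others are balanced, the graph is nearly balanced, so an Eulerian circuit from AC to AA is possible but an Eulerian cycle is not.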

These definitions can be implemented in Python by creating functions, within the context of the class MyGraph, that check if graphs are balanced or nearly balanced. Note that a function to test if a graph is strongly connected was previously defined in Chapter 13. The first function checks if a node is balanced, the second if a graph is balanced (i.e. all nodes are balanced), and the third if the graph is nearly balanced, according to the definitions above.

    def check_balanced_node(self, node):
        return self.in_degree(node) == self.out_degree(node)

    def check_balanced_graph(self):
        for n in self.graph.keys():
            if not self.check_balanced_node(n):
                return False
        return True

    def check_nearly_balanced_graph(self):
        # res[0]: node with one extra outgoing edge (start of the circuit)
        # res[1]: node with one extra incoming edge (end of the circuit)
        res = None, None
        for n in self.graph.keys():
            indeg = self.in_degree(n)
            outdeg = self.out_degree(n)
            if indeg - outdeg == 1 and res[1] is None:
                res = res[0], n
            elif indeg - outdeg == -1 and res[0] is None:
                res = n, res[1]
            elif indeg == outdeg:
                pass
            else:
                return None, None
        return res

Note that to apply the previous functions to DeBruijn graphs, i.e. in the context of the Python class DeBruijnGraph, we need to redefine the way the in-degree is calculated, to account for possible multiple edges with the same origin and successor in these graphs.

    class DeBruijnGraph(MyGraph):

        (...)

        def in_degree(self, v):
            res = 0
            for k in self.graph.keys():
                if v in self.graph[k]:
                    # count every parallel edge from k to v
                    res += self.graph[k].count(v)
            return res

The next step is to design an algorithm to search for Eulerian cycles, for graphs that contain at least one. This algorithm is quite straightforward, and can be summarized in the following steps:

1. We start at any vertex v in the graph and choose one of its successors; we continue this process, always selecting from each node a successor that uses a non-visited edge. As the graph is balanced and connected, we will eventually return to v after a finite number of steps.

2. If the cycle in the previous step is not yet an Eulerian cycle, i.e. it still does not contain all edges, we need to select a vertex w from the path in step 1 that has non-visited edges. We repeat step 1 starting in w, until reaching w again.

3. The cycles from steps 1 and 2 can be combined by going from v to w (first part of the first cycle), doing the cycle from the second step and returning to w, and going from w to v (last part of the first cycle).

4. While an Eulerian cycle is not attained, go back to step 2, select a vertex w from the current cycle, and perform steps 2 and 3.

Fig. 15.6 shows an example of the application of the previous algorithm to a given graph. In this case, the first selected node is a, and step 1 gathers the cycle going through nodes a, d, g, h, a. Since this is not an Eulerian cycle, we need to select a node with non-visited edges, and we select g. In step 2, we thus build the cycle g, f, e, c, g, which in step 3 can be combined with the previous one to get the cycle a, d, g, f, e, c, g, h, a. Since this is still not an Eulerian cycle, we return to step 2 and select another node with non-visited edges, in this case c, getting the cycle c, a, b, c. We finally join this cycle with the previous one to get a, d, g, f, e, c, a, b, c, g, h, a, which is now an Eulerian cycle (Fig. 15.6B).
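The same idea can be tried on this graph with a compact, stack-based variant of the procedure (this style of algorithm is usually attributed to Hierholzer). The sketch below is independent of the MyGraph class: the function name hierholzer_cycle is hypothetical, and the edge list is reconstructed from the cycles named in the text, which is an assumption about the figure. Instead of splicing cycles together explicitly, it backtracks along a stack, which has the same effect.

```python
def hierholzer_cycle(graph, start):
    """Find an Eulerian cycle in a balanced, connected directed graph."""
    remaining = {v: list(sucs) for v, sucs in graph.items()}  # unused edges per node
    stack, cycle = [start], []
    while stack:
        v = stack[-1]
        if remaining[v]:
            stack.append(remaining[v].pop(0))   # follow an unused edge out of v
        else:
            cycle.append(stack.pop())           # no unused edges left: v is final
    return cycle[::-1]

# edges taken from the cycles a,d,g,h,a / g,f,e,c,g / c,a,b,c in the text
graph = {"a": ["d", "b"], "b": ["c"], "c": ["g", "a"], "d": ["g"],
         "e": ["c"], "f": ["e"], "g": ["h", "f"], "h": ["a"]}
cycle = hierholzer_cycle(graph, "a")
print(cycle)

# every edge of the graph must appear exactly once in the cycle
used = sorted(zip(cycle, cycle[1:]))
all_edges = sorted((v, s) for v in graph for s in graph[v])
print(used == all_edges)
```

The order in which edges are visited may differ from the one in Fig. 15.6B, since a graph generally has several Eulerian cycles.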

This algorithm can be implemented within the class MyGraph to find Eulerian cycles, as shown in the function below.

Figure 15.6: Example of the execution of the algorithm to discover Eulerian cycles. (A) shows the three cycles found in steps 1 and 2 of the algorithm with different dashed lines. (B) shows the final Eulerian cycle obtained that goes through all edges labeled with consecutive numbers marking the order of the edges in the circuit.

    def eulerian_cycle(self):
        if not self.is_connected() or not self.check_balanced_graph():
            return None
        edges_visit = list(self.get_edges())
        res = []
        while edges_visit:
            # pick a non-visited edge; after the first cycle, its origin
            # must already belong to the cycle built so far
            pair = edges_visit[0]
            i = 1
            if res != []:
                while pair[0] not in res:
                    pair = edges_visit[i]
                    i = i + 1
            edges_visit.remove(pair)
            start, nxt = pair
            cycle = [start, nxt]
            # follow non-visited edges until returning to the start node
            while nxt != start:
                for suc in self.graph[nxt]:
                    if (nxt, suc) in edges_visit:
                        pair = (nxt, suc)
                        nxt = suc
                        cycle.append(nxt)
                        edges_visit.remove(pair)
                        break  # move on from the new node
            if not res:
                res = cycle
            else:
                # splice the new cycle into the current result
                pos = res.index(cycle[0])
                for i in range(len(cycle) - 1):
                    res.insert(pos + i + 1, cycle[i + 1])
        return res

To be useful in our context, however, we need to compute Eulerian circuits and not cycles. As we saw above, the difference can be addressed with minor effort. Indeed, we can transform a nearly balanced graph into a balanced one by adding an edge connecting the two semi-balanced nodes to make them balanced. If we calculate an Eulerian cycle for this "extended" graph, removing this extra edge will lead to an Eulerian circuit of the original graph.

This idea is implemented in the next function, again added to the class MyGraph.

    def eulerian_path(self):
        unb = self.check_nearly_balanced_graph()
        if unb[0] is None or unb[1] is None:
            return None
        # add an edge from the end node back to the start node
        self.graph[unb[1]].append(unb[0])
        cycle = self.eulerian_cycle()
        # restore the original graph before returning
        self.graph[unb[1]].remove(unb[0])
        if cycle is None:
            return None
        # cut the cycle at the extra edge
        for i in range(len(cycle) - 1):
            if cycle[i] == unb[1] and cycle[i + 1] == unb[0]:
                break
        path = cycle[i + 1:] + cycle[1:i + 1]
        return path

These functions may be tested within the DeBruijnGraph class, considering the previous example, as shown below. Notice that the Eulerian path computed in function test6 is the one provided in Fig. 15.5.

    class DeBruijnGraph(MyGraph):

        (...)  # code above for this class

        def seq_from_path(self, path):
            seq = path[0]
            for i in range(1, len(path)):
                nxt = path[i]
                seq += nxt[-1]
            return seq

    def test6():
        frags = ["ATA", "ACC", "ATG", "ATT", "CAT", "CAT", "CAT", "CCA",
                 "GCA", "GGC", "TAA", "TCA", "TGG", "TTC", "TTT"]
        dbgr = DeBruijnGraph(frags)
        dbgr.print_graph()
        print(dbgr.is_connected())
        print(dbgr.check_nearly_balanced_graph())
        print(dbgr.eulerian_path())

    def test7():
        orig_sequence = "ATGCAATGGTCTG"
        frags = composition(3, orig_sequence)
        dbgr = DeBruijnGraph(frags)
        dbgr.print_graph()
        print(dbgr.check_nearly_balanced_graph())
        p = dbgr.eulerian_path()
        print(p)
        print(dbgr.seq_from_path(p))

    test6()
    test7()

The last example closes the loop, as in the previous section: we start with a sequence, calculate its composition, and try to recover the original sequence.
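The round trip can also be sketched without the MyGraph hierarchy. The code below is a simplified stand-in, not the book's implementation: the names composition and reconstruct are assumptions for this example, and the circuit is found with a stack-based search started at the node that has one extra outgoing edge. It assumes the fragments really come from a single sequence, so that an Eulerian circuit exists.

```python
from collections import defaultdict

def composition(k, seq):
    """Return the sorted list of all k-mers of seq, keeping repeats."""
    return sorted(seq[i:i + k] for i in range(len(seq) - k + 1))

def reconstruct(frags):
    """Spell a sequence along an Eulerian circuit of the DeBruijn graph."""
    graph = defaultdict(list)
    balance = defaultdict(int)             # out-degree minus in-degree
    for frag in frags:
        pref, suf = frag[:-1], frag[1:]
        graph[pref].append(suf)
        balance[pref] += 1
        balance[suf] -= 1
    # start at the node with one extra outgoing edge, if there is one
    start = next((n for n, b in balance.items() if b == 1), frags[0][:-1])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if graph[v]:
            stack.append(graph[v].pop())   # follow an unused edge
        else:
            path.append(stack.pop())       # dead end: node is final
    path.reverse()
    return path[0] + "".join(node[-1] for node in path[1:])

orig = "ATGCAATGGTCTG"
frags = composition(3, orig)
recovered = reconstruct(frags)
print(recovered, composition(3, recovered) == frags)
```

The recovered string is guaranteed to have the same 3-mer composition as the original, but it need not be identical to it: when a graph admits several Eulerian circuits, each one spells a different sequence with that composition.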

Fortunately, as you may infer from the above algorithms, this problem has a much lower complexity than the search for Hamiltonian cycles. Indeed, with suitable data structures, an Eulerian cycle can be found in time proportional to the number of edges, so the approach can be run on large graphs without a problem. Thus, most recent genome assembly programs implement this strategy at their core, as we will discuss in the next section.
