BioPython Functions for Phylogenetic Analysis

To conclude this chapter, we take a brief look at some of the BioPython features for phylogenetic analysis, implemented mainly in the module Bio.Phylo. This module mostly provides data structures to represent phylogenetic trees, together with methods to load, save, and explore these trees. The methods for tree inference need to be run externally; the module includes a few wrappers to facilitate their use, which we will not cover here.

To understand the main data structures of this module let us work with a simple example.

Save a text file with the name “simple.dnd” and the following content:

(((A,B),(C,D)),(E,F,G));

This single line represents a tree in the Newick format. The following code reads this file, using the read function, and prints the tree in two different ways: the first is a textual listing of the tree's content and the second is a simple graphical (ASCII) representation of the tree.

from Bio import Phylo

tree = Phylo.read("simple.dnd", "newick")
print(tree)
Phylo.draw_ascii(tree)
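To make the nesting of the Newick line explicit, the following is a minimal sketch of a parser written in plain Python. It is not how Bio.Phylo reads the format: the function names parse_newick and count_leaves are hypothetical, and the sketch assumes a Newick string that contains only leaf names (no branch lengths, no internal labels, no comments).

```python
def parse_newick(text):
    """Parse a names-only Newick string into nested Python lists.

    Illustrative sketch: branch lengths, internal node labels and
    comments are NOT supported.
    """
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1  # skip ")"
            return children
        # A leaf: read its name up to the next delimiter.
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    tree = parse_node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def count_leaves(node):
    """Count the leaves of a nested-list tree recursively."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child in node)


nested = parse_newick("(((A,B),(C,D)),(E,F,G));")
print(nested)                # [[['A', 'B'], ['C', 'D']], ['E', 'F', 'G']]
print(count_leaves(nested))  # 7
```

Each pair of parentheses becomes one list, so the group (E,F,G) appears as a node with three children.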

This module supports input/output of phylogenetic trees in a large number of formats. The functions read and parse can be used to read a single tree or several trees from a file, respectively, while the function write can be used to write trees to a file, and the function convert can directly convert between file formats.

The following example shows a simple use of these functions, where a file is read and converted to a new format. This new file is then read and the trees it contains are printed. The example file may be found on the book's website and in the BioPython tutorial.

tree2 = Phylo.read("int_node_labels.nwk", "newick")
Phylo.draw_ascii(tree2)

# Convert the Newick file to the phyloXML format
Phylo.convert("int_node_labels.nwk", "newick", "tree.xml", "phyloxml")

# parse returns an iterator over all the trees in the file
trees = Phylo.parse("tree.xml", "phyloxml")
for t in trees:
    print(t)
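If the example file is not at hand, the same read/convert/parse cycle can be tried on the small tree defined earlier. The sketch below assumes BioPython is installed; the temporary directory and the file name simple.xml are choices made only for this illustration.

```python
import os
import tempfile

from Bio import Phylo  # assumes BioPython is installed

with tempfile.TemporaryDirectory() as tmp:
    newick_path = os.path.join(tmp, "simple.dnd")
    xml_path = os.path.join(tmp, "simple.xml")  # illustrative file name

    # Write the example tree from the text to a Newick file.
    with open(newick_path, "w") as handle:
        handle.write("(((A,B),(C,D)),(E,F,G));\n")

    # convert parses the input file and writes its trees in the new format;
    # it returns the number of trees written.
    n_converted = Phylo.convert(newick_path, "newick", xml_path, "phyloxml")

    # parse yields trees lazily, so they are collected in a list
    # before the temporary files are deleted.
    trees = list(Phylo.parse(xml_path, "phyloxml"))

leaf_names = [leaf.name for leaf in trees[0].get_terminals()]
print(n_converted, leaf_names)
```

Since parse returns an iterator, it can only be traversed once; wrapping it in list is convenient when the trees are needed more than once.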

Looking at the result of the print method, we can understand the structure of the tree representation in this implementation. The Tree object is unique for each complete phylogenetic tree and contains global information about the tree, such as whether it is rooted or unrooted.

The recursive representation of the tree is provided by the Clade objects. The Tree object has one root Clade and, under that, it is composed of nested lists of clades all the way down to the leaves. Notice that, unlike our own implementation, trees in this module are n-ary, meaning that each node may have more than two sub-trees, which are represented by a list of clades.
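The Tree/Clade nesting can be inspected directly, as in the following sketch. It assumes BioPython is installed and reads the example tree from a string (through StringIO) rather than from a file; the helper function show is hypothetical and exists only for this illustration.

```python
from io import StringIO

from Bio import Phylo  # assumes BioPython is installed

tree = Phylo.read(StringIO("(((A,B),(C,D)),(E,F,G));"), "newick")


def show(clade, depth=0):
    """Print a clade and, recursively, its sub-clades with indentation."""
    label = clade.name if clade.name else "(internal)"
    print("  " * depth + label)
    for child in clade.clades:  # clades is a plain list of Clade objects
        show(child, depth + 1)


show(tree.root)

# Number of children of each child of the root: one binary node and
# one node with three children, so the tree is not strictly bifurcating.
child_counts = [len(child.clades) for child in tree.root.clades]
print(child_counts, tree.is_bifurcating())
```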

The module has a large number of methods to retrieve information from a tree, including: searching for an element by traversing the tree, listing all leaves and internal nodes, finding the common ancestor of a set of leaves, and calculating the distance between two leaves, among many others. It is also possible to modify the tree and its clades, for instance by pruning the tree, removing nodes, or splitting nodes into new branches. Details of these methods are provided in Section 13.4 of the BioPython tutorial.
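A few of these methods are illustrated in the sketch below, which assumes BioPython is installed. The tree used here, with branch lengths, is invented for the illustration so that the distances are not all zero; it is not one of the book's example files.

```python
import copy
from io import StringIO

from Bio import Phylo  # assumes BioPython is installed

# Illustrative tree with branch lengths (not taken from the book).
tree = Phylo.read(StringIO("((A:1,B:2):1,(C:3,D:1):2);"), "newick")

# Listing leaves and internal nodes.
leaves = [leaf.name for leaf in tree.get_terminals()]
n_internal = len(tree.get_nonterminals())

# Searching for an element by attribute.
found = tree.find_any(name="D")

# Common ancestor of a set of leaves.
mrca = tree.common_ancestor({"name": "C"}, {"name": "D"})
mrca_leaves = [leaf.name for leaf in mrca.get_terminals()]

# Distance between two leaves: sum of branch lengths on the path.
dist_ab = tree.distance({"name": "A"}, {"name": "B"})
dist_ac = tree.distance({"name": "A"}, {"name": "C"})

# Pruning modifies the tree in place, so we work on a copy.
pruned = copy.deepcopy(tree)
pruned.prune(name="A")
remaining = [leaf.name for leaf in pruned.get_terminals()]

print(leaves, n_internal, found.branch_length)
print(mrca_leaves, dist_ab, dist_ac, remaining)
```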

This module also allows coloring the tree branches, as shown in the next example, where we take the example tree defined above and color the branch including leaves E, F and G, and their common ancestor, in salmon, and the branch with leaves C and D in blue.

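from Bio.Phylo.PhyloXML import Phylogeny

# Convert the tree read above into a PhyloXML Phylogeny object
treep = Phylogeny.from_tree(tree)
Phylo.draw(treep)

treep.root.color = "gray"
# The common ancestor of E and F is the clade that also contains G
mrca = treep.common_ancestor({"name": "E"}, {"name": "F"})
mrca.color = "salmon"
# Index [0, 1]: second child of the first child of the root, i.e. (C,D)
treep.clade[0, 1].color = "blue"
Phylo.draw(treep)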
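Phylo.draw needs a graphical backend (matplotlib), so the color assignments can also be checked without drawing anything. The following sketch assumes BioPython is installed and rebuilds the example tree from a string; the variable names blue_clade, blue_leaves and blue_rgb are introduced only for this illustration.

```python
from io import StringIO

from Bio import Phylo  # assumes BioPython is installed
from Bio.Phylo.PhyloXML import Phylogeny

tree = Phylo.read(StringIO("(((A,B),(C,D)),(E,F,G));"), "newick")
treep = Phylogeny.from_tree(tree)

treep.root.color = "gray"
mrca = treep.common_ancestor({"name": "E"}, {"name": "F"})
mrca.color = "salmon"
treep.clade[0, 1].color = "blue"

# Which leaves sit under each colored clade?
salmon_leaves = [leaf.name for leaf in mrca.get_terminals()]
blue_clade = treep.clade[0, 1]
blue_leaves = [leaf.name for leaf in blue_clade.get_terminals()]

# The color name is stored as an object with red, green and blue components.
blue_rgb = (blue_clade.color.red, blue_clade.color.green, blue_clade.color.blue)
print(salmon_leaves, blue_leaves, blue_rgb)
```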

Bibliographical Notes and Further Reading

The agglomerative hierarchical clustering algorithm used by UPGMA is generally attributed to Sokal and Michener [142]. The Neighbor-Joining algorithm was proposed by Saitou and Nei [135]. The book by Felsenstein [62] contains a thorough explanation of phylogenetic inference algorithms, covering the three classes presented in this chapter.

The Bio.Phylo module included in BioPython has been described in an article by Talevich et al. [146].

The MEGA free software application (http://www.megasoftware.net) can be used to run the methods described in this chapter, applying them to generate evolutionary trees from sets of user-defined sequences. It includes representative methods from all classes described in Section 9.2.

Alternative software tools are PAML (Phylogenetic Analysis by Maximum Likelihood) (http://abacus.gene.ucl.ac.uk/software/paml.html) and PhyML (http://www.atgc-montpellier.fr/phyml/) [69]. Both these tools have wrappers available in BioPython.

Exercises and Programming Projects

Exercises

1. Consider the sequences of the first exercise of the previous chapter. Assume that the multiple sequence alignment obtained was the following:

S1: A-CATATC-AT-
S2: A-GATATT-AG-
S3: AACAGATC-T--
S4: G-CAT--CGATT

Figure 9.4: Example of a binary tree representing a phylogenetic tree for the exercise.

a. Assuming the distance metric to be the number of positions with differing characters in a pairwise alignment, and taking the pairwise alignments imposed by the multiple alignment above, calculate the distance matrix.

b. Apply the algorithm UPGMA to build the tree for these sequences.

c. Write a Python script that allows you to check your results.

2. a. Consider the phylogenetic tree represented in Fig. 9.4. Assume it was built by the UPGMA algorithm as implemented in our Python code, from 4 sequences (S1, S2, S3, S4). Using the notation Dij to represent the distance between sequences Si and Sj, which of the following expressions are true?

• D24=2,

• D12>4,

• D23+D34=12,

• D32>8.

b. Considering our Python implementation, write a script that creates and prints the tree in the figure.

3. Considering the class BinaryTree implemented in this chapter, add methods that:

a. Return the size of the tree, given as a tuple with two values: the number of internal nodes of the tree and the number of leaves.

b. Search if there is a leaf that contains a given value passed as a parameter of the method. The result should be a Boolean value (True if the value exists; False, otherwise).

c. Return the common ancestor of two sequences/taxa (identified as integer values), i.e. the simplest sub-tree (the one with the smallest height) that contains the leaves with those values.

d. Generalize the previous function to a set of sequences as input.

e. Return the distance between two leaves identified by their integer values.

f. Return the distance between the two leaves (identified by their integer values) that are nearest in the tree (i.e. have their common ancestor at the smallest height).

4. Implement the WPGMA variant of the UPGMA algorithm, changing the way the distance between clusters is calculated (as described above). Compare the results of both approaches.

5. Consider the last exercise of the previous chapter. Read the tree obtained from Clustal Omega. Draw the tree with the Bio.Phylo module. Explore the tree using the available functions.

Programming Projects

1. Implement the Neighbor-Joining method applied to rooted trees, as described above in Section 9.2.1.

2. Extend the class UPGMA, by considering other distance metrics for the sequences (e.g. edit distance). Take into account the sequence type (nucleotide vs protein) to define meaningful metrics.

3. Implement a simple maximum parsimony method, by first creating a multiple sequence alignment of the sequences. Add a function to identify informative columns (where there are mutations). Implement a cost function for a tree, taking those columns into account.

Motif Discovery Algorithms

In this chapter, we revisit and formalize the definition of deterministic motif. We introduce the concept of search and solution space and formally define the problem of deterministic motif search in a set of biologically related sequences. We start by describing brute-force algorithms based on an exhaustive search and then present more efficient approaches for motif discovery.
