COMPUTATIONAL GEOMETRY AND TOPOLOGY

Computational Geometry

Computational geometry is the study of efficient algorithms for solving geometric problems, such as: Given N points in a plane, what is the fastest way to find the nearest neighbor of a point? Given N straight lines, find the lines that intersect. Computational geometry emerged from the field of algorithm design and analysis in the late 1970s and has grown into a recognized discipline. The success of the field as a research discipline can be explained, on the one hand, by the beauty of the problems studied and the solutions obtained and, on the other hand, by the many application domains (computer graphics, geographic information systems, robotics, protein structure analysis, and others) in which geometric algorithms play a fundamental role.

The connections and interactions between molecular modeling and computational geometry have been growing recently. Many questions in molecular modeling can be understood geometrically in terms of arrangements of spheres in three dimensions. Problems include computing properties of such arrangements, such as their volume and topology; testing intersections and collisions between molecules; finding offset surfaces; designing data structures for computing interatomic forces and performing molecular dynamics simulations; and developing computer graphics algorithms for rendering molecular models accurately and efficiently. Computational geometry can also be used as a tool for studying the topology and architecture of macromolecules and macromolecular complexes. Here we briefly introduce the common terms and algorithmic problems in computational geometry. Detailed descriptions may be found in Skiena (2008).

Polygon A polygon is a collection of line segments that form a cycle and do not cross each other. A polygon can be represented as a sequence of points.

For example, the points

(0, 0), (0, 1), (1, 1), (1, 0)

form a square. The line segments of the polygon connect adjacent points in the list, together with one additional segment connecting the last point back to the first. A simple polygon is one in which no two segments cross. A convex polygon is one in which any two points inside the polygon can be connected by a line segment that does not cross the polygon's boundary. The smallest convex polygon containing a collection of points is known as the convex hull of those points.
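The point-sequence representation described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the helper names `edges`, `cross`, and `is_convex` are hypothetical, and the convexity test assumes the input is already a simple polygon.

```python
def edges(polygon):
    """Return the segments joining adjacent points, plus the closing
    segment from the last point back to the first."""
    n = len(polygon)
    return [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b.
    Positive for a left turn, negative for a right turn, zero if collinear."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def is_convex(polygon):
    """Assuming a simple polygon, report whether every turn along the
    boundary has the same orientation (collinear triples are ignored)."""
    n = len(polygon)
    sign = 0
    for i in range(n):
        c = cross(polygon[i], polygon[(i + 1) % n], polygon[(i + 2) % n])
        if c == 0:
            continue
        if sign == 0:
            sign = 1 if c > 0 else -1
        elif (c > 0) != (sign > 0):
            return False
    return True


square = [(0, 0), (0, 1), (1, 1), (1, 0)]
notched = [(0, 0), (2, 0), (1, 0.5), (2, 2), (0, 2)]  # has a reflex vertex
```

Here `edges(square)` yields four segments, the last of which closes the cycle, and `is_convex` distinguishes the square from the notched polygon.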

Convex Hull The convex hull of a set of points S in n dimensions is the intersection of all convex sets containing S. Finding the convex hull of a set of points is the most elementary interesting problem in computational geometry, just as finding the minimum spanning tree is the most elementary interesting problem in graph algorithms. It arises so often because the hull quickly captures a rough idea of the shape or extent of a data set. The convex hull also serves as a first preprocessing step in many, if not most, geometric algorithms. For example, consider the problem of finding the diameter of a set of points, which is defined by the pair of points that are a maximum distance apart. The diameter is always the distance between two points on the convex hull, so only hull points need to be examined. The convex hull representation has recently been used for supervised classification of protein structures (Wang et al., 2006a,b, 2008). Specifically, novel patterns based on the convex hull representation are first extracted from a protein structure; then a classification system is constructed using machine learning methods such as neural networks and hidden Markov models (Wang et al., 2008).
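As an illustration of the hull as a preprocessing step, the following sketch computes a planar convex hull and then searches for the diameter among hull vertices only. The text does not prescribe a hull algorithm; Andrew's monotone chain method is used here as one common choice, and the function names are hypothetical.

```python
from itertools import combinations
from math import dist


def cross(o, a, b):
    """Positive if o -> a -> b makes a left (counterclockwise) turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would create a right turn or a collinear run.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def diameter(points):
    """Largest pairwise distance, searched over hull vertices only."""
    hull = convex_hull(points)
    if len(hull) < 2:
        return 0.0
    return max(dist(p, q) for p, q in combinations(hull, 2))


points = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1), (1, 2)]
hull = convex_hull(points)   # the two interior points are discarded
width = diameter(points)     # distance between opposite rectangle corners
```

The pairwise search over hull vertices is quadratic in the hull size; it is kept simple here because the point is only that interior points never need to be considered.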

Triangulation Triangulation is the division of a surface or plane polygon into a set of triangles, usually with the restriction that each triangle side is shared entirely by two adjacent triangles. Triangulation is a fundamental problem in computational geometry because the first step in working with complicated geometric objects is to break them into simple geometric objects. The simplest geometric objects are triangles in two dimensions and tetrahedra in three.

Classical applications of triangulation include finite-element analysis and computer graphics. Recently, triangulation has been applied to the computation of molecular surfaces (Ryu et al., 2007a,b, 2009). A molecular surface is used for both the visualization of a molecule and the computation of various molecular properties, such as the area and volume of a protein, which are important for studying problems such as protein docking and folding.
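The simplest case of triangulation can be sketched as follows, assuming the input is a convex polygon given as a vertex list: a "fan" of triangles from the first vertex. General simple polygons require other methods (ear clipping, for instance), which are not shown. The function names are illustrative only.

```python
def fan_triangulate(polygon):
    """Split a convex polygon with n vertices into n - 2 triangles
    that all share the first vertex."""
    return [(polygon[0], polygon[i], polygon[i + 1])
            for i in range(1, len(polygon) - 1)]


def triangle_area(a, b, c):
    """Area of a triangle from half the absolute cross product."""
    return abs((b[0] - a[0]) * (c[1] - a[1])
               - (b[1] - a[1]) * (c[0] - a[0])) / 2


def polygon_area(polygon):
    """Shoelace formula for the area of a simple polygon."""
    n = len(polygon)
    twice_area = sum(polygon[i][0] * polygon[(i + 1) % n][1]
                     - polygon[(i + 1) % n][0] * polygon[i][1]
                     for i in range(n))
    return abs(twice_area) / 2


pentagon = [(0, 0), (2, 0), (3, 1), (1, 3), (-1, 1)]  # convex, counterclockwise
triangles = fan_triangulate(pentagon)
total = sum(triangle_area(*t) for t in triangles)
```

Because the triangles tile the polygon without overlap, their areas sum to the polygon's area, which is one way a property such as surface area can be computed from a triangulation.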

Voronoi Diagram Voronoi diagrams represent the region of influence around each of a given set of sites. Given a set S of points p_1, …, p_n, a Voronoi diagram decomposes the space into regions around each point such that all the points in the region around p_i are closer to p_i than to any other point in S. In the plane, this amounts to partitioning the plane into convex polygons such that each polygon contains exactly one generating point, and every point in a given polygon is closer to its generating point than to any other. A Voronoi diagram is sometimes known as a Dirichlet tessellation. The cells are called Dirichlet regions, Thiessen polytopes, or Voronoi polygons. Voronoi diagrams have been used to compute molecular surfaces on proteins (Ryu et al., 2007a,b).
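The defining property of a Voronoi region can be illustrated without constructing the polygons themselves. The sketch below labels the center of each cell of a small grid with the index of its closest site, which gives a rasterized picture of the diagram. It is a brute-force illustration under that assumption, not an efficient construction algorithm, and the names `nearest_site` and `label_grid` are hypothetical.

```python
from math import dist


def nearest_site(q, sites):
    """Index of the site whose Voronoi region contains the point q."""
    return min(range(len(sites)), key=lambda i: dist(q, sites[i]))


def label_grid(sites, width, height):
    """Label the center of each unit grid cell with its nearest site."""
    return [[nearest_site((x + 0.5, y + 0.5), sites) for x in range(width)]
            for y in range(height)]


sites = [(1, 1), (5, 1)]
labels = label_grid(sites, width=6, height=2)
```

With two sites, the boundary between the regions is the perpendicular bisector of the segment joining them (here the line x = 3), so grid cells to the left receive label 0 and cells to the right receive label 1.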

Nearest-Neighbor Search Nearest-neighbor search (or similarity search) is the task of quickly finding the nearest neighbor of a query point; that is, given a set S of n points in d dimensions and a query point q, which point in S is closest to q? Nearest-neighbor search is important in classification, where a query is assigned the label of the closest labeled example. Such nearest-neighbor classifiers are widely used, often in high-dimensional spaces. For example, the vector-quantization method of image compression partitions an image into 8 × 8 pixel regions. This method uses a predetermined library of several thousand 8 × 8 pixel tiles and replaces each image region by the most similar library tile. The most similar tile is the point in 64-dimensional space that is closest to the image region in question. Compression is achieved by reporting the identifier of the closest library tile instead of the 64 pixels, at some loss of image fidelity. Nearest-neighbor search has also been used to approximate protein structure (Lotan and Schwarzer, 2004).
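The vector-quantization idea can be sketched with a brute-force nearest-neighbor search. To keep the example short, it assumes 2 × 2 tiles (points in 4-dimensional space) rather than the 8 × 8 tiles described above, and a toy library of three tiles; the function names and data are illustrative only.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def nearest_index(query, library):
    """Brute-force nearest neighbor: index of the closest library tile."""
    return min(range(len(library)),
               key=lambda i: squared_distance(query, library[i]))


def quantize(blocks, library):
    """Compress: replace each image block by the identifier of its nearest tile."""
    return [nearest_index(block, library) for block in blocks]


def reconstruct(codes, library):
    """Decompress: look each identifier up in the library (lossy)."""
    return [library[code] for code in codes]


# Toy library of flattened 2 x 2 grayscale tiles.
library = [(0, 0, 0, 0), (255, 255, 255, 255), (0, 255, 0, 255)]
blocks = [(10, 5, 0, 3), (250, 240, 255, 251), (20, 230, 5, 250)]
codes = quantize(blocks, library)
approximation = reconstruct(codes, library)
```

Each block is stored as a single identifier instead of its pixel values, and the reconstruction returns the library tile rather than the original block, which is where the loss of fidelity comes from. A linear scan costs time proportional to the library size per query; spatial data structures reduce this but are beyond this sketch.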

Polygon Partitioning Polygon partitioning is an important preprocessing step for many geometric algorithms, because most geometric problems are simpler and faster to solve on convex objects than on nonconvex objects. Given a polygon or polyhedron P, how can P be partitioned into a small number of simple (typically convex) pieces? It is easier to work with the pieces independently than with the original object.

Shape Similarity Shape similarity is a problem that underlies much of pattern recognition. Given two polygonal shapes, P_1 and P_2, how similar are P_1 and P_2? The definition of similarity is application dependent, and no single algorithmic approach can solve all shape-matching problems. Consider a system for optical character recognition (OCR). We have a known library of shape models representing letters and the unknown shapes we obtain by scanning a page. We seek to identify an unknown shape by matching it to the most similar shape model. Shape similarity measures are widely used in protein structure comparison and prediction (Lotan and Schwarzer, 2004; Sael et al., 2008).
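Since the choice of similarity measure is application dependent, the following sketch fixes one measure purely for illustration: the Hausdorff distance between the vertex sets of two polygons. This choice is an assumption of the example and is not the measure used in the cited work. The helper `best_match` and the toy shape library are likewise hypothetical.

```python
from math import dist


def directed_hausdorff(A, B):
    """Largest distance from a point of A to its nearest point of B."""
    return max(min(dist(a, b) for b in B) for a in A)


def hausdorff(A, B):
    """Symmetric Hausdorff distance between two finite point sets."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))


def best_match(unknown, models):
    """Name of the library model closest to the unknown shape."""
    return min(models, key=lambda name: hausdorff(unknown, models[name]))


models = {
    "square": [(0, 0), (0, 1), (1, 1), (1, 0)],
    "triangle": [(0, 0), (2, 0), (1, 3)],
}
scanned = [(0.1, 0.0), (0.0, 0.9), (1.1, 1.0), (1.0, 0.1)]  # noisy square
label = best_match(scanned, models)
```

A smaller distance means more similar shapes, and identical vertex sets have distance zero. This measure is sensitive to translation, rotation, and scale, so a practical matcher would normalize shapes first or use a different measure.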

Topology

Topology is a branch of mathematics that can be defined as the study of qualitative properties of certain objects (called topological spaces) that are invariant under certain types of transformations (called continuous maps), especially those properties that are invariant under a certain type of equivalence (called homeomorphism). The mathematical definition of a topology is described briefly here.

Let X be any set and let T be a family of subsets of X . Then T is a topology on X if:

• Both the empty set and X are elements of T.

• Any union of arbitrarily many elements of T is an element of T.

• Any intersection of finitely many elements of T is an element of T.

If T is a topology on X, then X together with T is called a topological space.

All sets in T are called open; note that, in general, not all subsets of X need be in T. A subset of X is said to be closed if its complement is in T (i.e., its complement is open). A subset of X may be open, closed, both, or neither.
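For a finite set X, the axioms can be checked mechanically. The sketch below assumes X is finite, in which case closure under pairwise unions and pairwise intersections implies closure under all unions and all finite intersections. The function names are hypothetical.

```python
from itertools import combinations


def is_topology(X, T):
    """Check the topology axioms for a family T of subsets of a finite set X."""
    X = frozenset(X)
    family = {frozenset(s) for s in T}
    # The empty set and X itself must be open.
    if frozenset() not in family or X not in family:
        return False
    # Every member of T must be a subset of X.
    if any(not s <= X for s in family):
        return False
    # For finite families, pairwise closure is sufficient.
    for a, b in combinations(family, 2):
        if a | b not in family or a & b not in family:
            return False
    return True


def is_open(subset, T):
    return frozenset(subset) in {frozenset(s) for s in T}


def is_closed(subset, X, T):
    """A set is closed when its complement in X is open."""
    return is_open(frozenset(X) - frozenset(subset), T)


X = {1, 2, 3}
T = [set(), {1}, {1, 2}, {1, 2, 3}]
not_T = [set(), {1}, {2}, {1, 2, 3}]  # the union {1, 2} is missing
```

In this example, `{1}` is open but not closed, `{3}` is closed but not open, the empty set is both, and `{2}` is neither.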

A function or map from one topological space to another is called continuous if the inverse image of any open set is open. If the function maps the real numbers to the real numbers (both spaces with the standard topology), this definition of continuous is equivalent to the definition of continuous in calculus. If a continuous function is one-to-one and onto and if the inverse of the function is also continuous, the function is called a homeomorphism and the domain of the function is said to be homeomorphic to the range. Another way of saying this is that the function extends naturally to the topology: it puts the open sets of the domain in one-to-one correspondence with the open sets of the range.
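On finite spaces, the definitions of continuity and homeomorphism can be tested directly. In this sketch a map is represented as a Python dictionary from points of the domain to points of the range, and a topology as a list of sets; both representations and all names are assumptions made for illustration.

```python
def preimage(f, subset, domain):
    """Inverse image of a subset of the range under the map f."""
    return frozenset(x for x in domain if f[x] in subset)


def is_continuous(f, X, TX, TY):
    """f is continuous if the inverse image of every open set is open."""
    open_in_X = {frozenset(s) for s in TX}
    return all(preimage(f, frozenset(v), X) in open_in_X for v in TY)


def is_homeomorphism(f, X, TX, Y, TY):
    """A continuous bijection whose inverse is also continuous."""
    images = set(f.values())
    if images != set(Y) or len(images) != len(X):
        return False  # not onto, or not one-to-one
    inverse = {y: x for x, y in f.items()}
    return is_continuous(f, X, TX, TY) and is_continuous(inverse, Y, TY, TX)


X = {1, 2}
coarse = [set(), {1}, {1, 2}]         # a two-point space with one proper open set
discrete = [set(), {1}, {2}, {1, 2}]  # every subset is open
Y = {"a", "b"}
TY = [set(), {"a"}, {"a", "b"}]

relabel = {1: "a", 2: "b"}
swap = {1: "b", 2: "a"}
identity = {1: 1, 2: 2}
```

Here `relabel` is a homeomorphism between `(X, coarse)` and `(Y, TY)`, while `swap` is not even continuous. The identity map from `(X, discrete)` to `(X, coarse)` is a continuous bijection whose inverse is not continuous, which shows why the condition on the inverse is needed.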

If two spaces are homeomorphic, they have identical topological properties and are considered to be topologically the same. A cube and a sphere are homeomorphic, as are a coffee cup and a doughnut. But the circle is not homeomorphic to the doughnut. DNA topology and protein topology are active research areas.

Mathematical Space Mathematical space is an informal term for any of many different types of sets with added structure. Mathematical spaces often form a hierarchy (i.e., one space may inherit all the characteristics of a parent space).

For example, all inner product spaces are also normed vector spaces, all normed vector spaces are also metric spaces, and all metric spaces are topological spaces, because the inner product induces a norm on the inner product space such that

$$\|x\| = \sqrt{\langle x, x \rangle}$$

and, in turn, the norm induces a metric $d(x, y) = \|x - y\|$, whose open balls generate a topology.
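The first steps of this hierarchy can be made concrete for vectors in R^n with the standard dot product. This is a small sketch assuming that particular inner product; the function names are illustrative.

```python
from math import sqrt


def inner(x, y):
    """Standard inner (dot) product on R^n."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Norm induced by the inner product."""
    return sqrt(inner(x, x))


def metric(x, y):
    """Metric induced by the norm: the length of the difference vector."""
    return norm([a - b for a, b in zip(x, y)])


u, v, w = (0.0, 0.0), (3.0, 4.0), (6.0, 0.0)
length = norm(v)         # length of v
distance = metric(u, w)  # distance between u and w
```

Each structure is defined purely in terms of the one before it, which is the sense in which an inner product space inherits the structure of a normed space and of a metric space.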

Mathematical Optimization In mathematical programming, an optimization problem is the problem of finding the best solution from all feasible solutions.

More formally, an optimization problem has the general form

$$\min_{x \in S} f(x) \quad \text{or} \quad \max_{x \in S} f(x) \tag{5.1}$$

where:

• f(x) is a real-valued function defined on the space R^n, called the objective function.

• S is a subset of the space R^n, called the feasible set.

• The points x in S are called feasible.

A point x* in S is said to be a local minimum of f(x) if there exists δ > 0 such that

$$f(x^*) \le f(x) \quad \forall x \in S \cap \{x : \|x - x^*\| < \delta\} \tag{5.2}$$

A point x* in S is said to be a global minimum of the function f(x) if

$$f(x^*) \le f(x) \quad \forall x \in S \tag{5.3}$$

Local and global maximum points can be defined similarly. Maximization and minimization are related by the relation

$$\max_{x \in S} f(x) = -\min_{x \in S} \bigl(-f(x)\bigr) \tag{5.4}$$

Therefore, any maximization problem can be converted into an equivalent minimization problem, and vice versa.
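These definitions can be illustrated on a finite feasible set. The sketch below assumes a one-dimensional objective function and a feasible set S made of grid points on the interval from −2 to 2; the function f, the grid, and the helper names are all chosen for illustration and do not come from the text.

```python
def f(x):
    """Example objective with two separate minima on the interval."""
    return x ** 4 - 3 * x ** 2 + x


def global_min(f, S):
    """Feasible point with the smallest objective value, as in (5.3)."""
    return min(S, key=f)


def global_max_value(f, S):
    """Maximum value obtained by minimizing the negated objective, as in (5.4)."""
    return -min(-f(x) for x in S)


def is_local_min(f, x_star, S, delta):
    """Condition (5.2) for one given delta: no feasible point within
    distance delta of x_star has a smaller objective value."""
    return all(f(x_star) <= f(x) for x in S if abs(x - x_star) < delta)


S = [i / 100 for i in range(-200, 201)]  # grid points from -2.00 to 2.00
x_global = global_min(f, S)
x_local = min((x for x in S if x > 0.5), key=f)  # best point on the right side
```

On this grid, the point found on the right side satisfies the local condition for delta = 0.5 but has a larger objective value than the global minimum on the left, so it is a local minimum that is not global. The value returned by `global_max_value` agrees with a direct maximization, as relation (5.4) requires.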

GENETIC MATRICES, HYDROGEN BONDS, AND

SEQUENCE ANALYSIS AND FURTHER DISCUSSION