Bioinformatics of macromolecular sequence and structure
Nikolaos Darzentas, Nicola Gold, Vidhya Krishnan, Andrew Nightingale, Howard Parish, Steven Pickering, Michael Sadowski, Amy Williams, David Westhead
Introduction
The group works predominantly in the area of biological sequence and structure analysis. The principal themes of our research are:
protein structure analysis,
prediction of protein structure and function, and
analysis of genome sequence data, including single nucleotide polymorphisms and conserved parts of UTRs (UnTranslated Regions).
High-throughput genome sequencing has resulted in complete genome sequences of many organisms including eubacteria, archaebacteria, simple eukaryotes and man. The focus of post-genome research is to understand these sequences in terms of biological function. Bioinformatics methods to predict function from sequence are now making key contributions to this effort, but the need for improved methods remains. At the same time, data mining tools can play a key role in the discovery of new elements of functional significance in this huge data set.
Within the protein structure theme we are interested in methods to predict protein structure from sequence, and protein function from sequence and structure. Recently, interest in high-throughput protein structure determination and structural genomics (experimental initiatives with the broad aim of determining structures for large numbers of the proteins encoded in the sequenced genomes) has brought the problem of predicting function from structure to the fore. Our BALSAMIC project aims to create methods and software tools for this purpose. At its centre is a database of protein active sites and ligand binding sites, created by automatic methods from the database of known protein structures. The research focuses on the creation of algorithmic tools to search and cluster this database using site similarity as the main criterion. Thus prediction of function for a candidate site in a newly determined protein structure might be informed by a search of our database for a similar site whose function and/or ligand binding properties are already known. Some examples are shown in figures 1 and 2.

Figure 1. By using methods from graph theory, similarity in the residue arrangements of active sites and ligand binding sites in distantly related proteins can be detected. Pairs of residues are matched by treating each site as a graph with edges labelled with the inter-atomic distances. The maximal common sub-graph can be found by forming the product graph and then using the algorithm of Bron and Kerbosch to enumerate its maximal cliques. The example shows a site match: the first site is coloured red, and the matched site is blue.

Figure 2. By modifying the graph theory algorithm slightly, common surface regions described by their geometrical and physico-chemical properties can be matched. This is important because many protein functions, such as catalysis and molecular recognition, occur on protein surfaces.
Results (figures 1 and 2) indicate that we can successfully match a family of related ligand binding sites, and identify their common residues and surface regions. It is possible to detect similarity in sites, to describe the subtle differences in specificity that exist between proteins from the same family, and to detect site similarity when there is no similarity in overall structure or fold. These results indicate that our methods will be a useful complement to function prediction methods based on similarity in sequence or overall structure.
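The matching procedure described in the caption to figure 1 can be illustrated with a minimal Python sketch. It is not the BALSAMIC implementation: the function names, the use of a single coordinate per residue, the requirement that matched residues share a type, and the 1.0 distance tolerance are all assumptions made for illustration. Nodes of the product graph are pairs of residues, one from each site; two nodes are joined when the corresponding inter-residue distances agree within the tolerance; the Bron and Kerbosch algorithm (here in its pivoting variant) then enumerates the maximal cliques, each of which is a candidate site match.

```python
from itertools import combinations
from math import dist


def product_graph(site_a, site_b, tol=1.0):
    """Build the product (correspondence) graph of two sites.

    Each site is a list of (residue_type, (x, y, z)) tuples; one
    representative coordinate per residue is an assumption of this sketch.
    A node (i, j) pairs residue i of site_a with residue j of site_b.
    """
    nodes = [(i, j)
             for i, (type_a, _) in enumerate(site_a)
             for j, (type_b, _) in enumerate(site_b)
             if type_a == type_b]
    adj = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue  # a residue may be matched only once
        d_a = dist(site_a[i][1], site_a[k][1])
        d_b = dist(site_b[j][1], site_b[l][1])
        if abs(d_a - d_b) <= tol:  # edge labels (distances) agree
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj


def bron_kerbosch(adj, r=frozenset(), p=None, x=None):
    """Yield every maximal clique of the graph (pivoting variant)."""
    if p is None:
        p, x = set(adj), set()
    if not p and not x:
        yield r
        return
    # Choose the pivot with most neighbours in p to prune branches.
    pivot = max(p | x, key=lambda v: len(adj[v] & p))
    for v in list(p - adj[pivot]):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)


def best_match(site_a, site_b, tol=1.0):
    """Return the largest set of mutually consistent residue pairs."""
    adj = product_graph(site_a, site_b, tol)
    return max(bron_kerbosch(adj), key=len)


# Hypothetical coordinates: site_b contains the same triad as site_a
# in a different position and orientation, plus one unrelated residue.
site_a = [("HIS", (0, 0, 0)), ("ASP", (3, 0, 0)), ("SER", (0, 4, 0))]
site_b = [("GLY", (9, 9, 9)), ("SER", (10, 14, 10)),
          ("HIS", (10, 10, 10)), ("ASP", (10, 10, 13))]
match = best_match(site_a, site_b)
```

Because only inter-residue distances are compared, the match does not depend on how the two structures are positioned or oriented, so no prior superposition is needed. A distance-only comparison cannot distinguish a site from its mirror image, which a practical tool would need to check separately.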
Data mining aims to discover previously unknown patterns in data that might have functional significance. We apply these methods to protein structures using a simplified topological description of the structure (figure 3) that enables the construction of fast algorithms, and we are also applying them to UTR sequence data with the aim of discovering new functional elements in nucleic acid sequences.

Figure 3. A simplified topological description of the variable domain taken from an antibody light chain. The triangles represent β strands and the circles represent helices. Strands directed out of the page are drawn as "up" triangles and strands directed into the page are drawn as "down" triangles. The fold is a β sandwich made from two anti-parallel sheets.
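To show why such a simplified description lends itself to fast algorithms, the following Python sketch encodes a chain of secondary structure elements as a short string and scans it in a single pass. The class, the one-letter encoding and the example element order are hypothetical and chosen only for illustration; the sketch records the element type and direction alone and ignores any further relationships between elements that a full topological description would carry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """One secondary structure element in a simplified topology."""
    kind: str       # "strand" or "helix"
    direction: str  # "up" (out of the page) or "down" (into the page)

    @property
    def symbol(self) -> str:
        # Hypothetical encoding: E/e for strands, H/h for helices;
        # upper case means "up", lower case means "down".
        letter = "E" if self.kind == "strand" else "H"
        return letter if self.direction == "up" else letter.lower()


def encode(elements):
    """Encode a chain of elements as a compact string."""
    return "".join(element.symbol for element in elements)


def count_direction_reversals(code):
    """Count strands adjacent along the chain that point in opposite
    directions. This is a sequence-level property only; it does not
    establish that the two strands are neighbours in the same sheet."""
    return sum(1 for a, b in zip(code, code[1:]) if {a, b} == {"E", "e"})


# Hypothetical chain, not the antibody domain of figure 3.
chain = [
    Element("strand", "up"),
    Element("strand", "down"),
    Element("strand", "up"),
    Element("helix", "up"),
    Element("strand", "down"),
]
code = encode(chain)
reversals = count_direction_reversals(code)
```

Once structures are reduced to strings of this kind, standard string and pattern matching techniques can be applied to a whole database of structures at low cost.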
Collaborators
Drs. N.D. Efford, A.J. Bulpitt and S. Bullock, School of Computing, Informatics Research Institute, University of Leeds.
Dr. D. Gilbert, Department of Computer Science, City University.
References
Westhead, D.R., Slidel, T.W.F., Flores, T.P.J. and Thornton, J.M. (1999). Protein structural topology: automated analysis and diagrammatic representation. Protein Science. 8: 897-904.
Gilbert, D., Westhead, D., Nagano, N. and Thornton, J. (1999). Motif-based searching in TOPS protein topology databases. Bioinformatics. 15: 317-326.
Funding
We wish to acknowledge the support of the BBSRC, The Royal Society, AstraZeneca and GlaxoSmithKline for this work.