Basel Computational Biology Conference 2006


Abstracts

Keynote Lecture: KEGG BRITE for linking genomes to biological systems

Minoru Kanehisa

Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto & Human Genome Center, Institute of Medical Science, University of Tokyo, Japan

The KEGG resource (http://www.genome.jp/kegg/) provides a reference knowledge base for linking genomes to biological systems, categorized as building blocks in the genomic space (KEGG GENES) and the chemical space (KEGG LIGAND), and wiring diagrams of interaction networks and reaction networks (KEGG PATHWAY). A fourth component, KEGG BRITE, has been formally added to the KEGG suite of databases. It is a collection of hierarchically structured vocabularies representing our knowledge of various aspects of biological systems. In contrast to KEGG PATHWAY, which is limited to molecular interactions and reactions, KEGG BRITE incorporates many different types of relationships involving, for example, cells, tissues, organs, and diseases. Thus, the mapping of genomic data to KEGG BRITE will supplement the current KEGG PATHWAY mapping. The KO (KEGG Orthology) system, which is a pathway-based classification of orthologs and protein families, is being improved to facilitate this mapping and to automate higher-order functional interpretations from genomic and molecular information.

References

http://www.genome.jp/kegg/

Keynote Lecture: Computational Methods in Regulatory Genomics.

Martin Vingron

MPI für Molekulare Genetik, Berlin.

The availability of complete genome sequences and of functional genomics data such as large-scale gene-expression data has revived interest in the computational prediction of cis-regulatory elements. This talk will introduce computational methods for visualizing associations between genes and conditions in DNA-microarray data. These techniques will also be applied to establish associations between gene expression data and transcription factor binding sites. While for yeast this can be done based on published transcription factor binding data, for human data we draw on a comparative analysis with mouse data in search of binding sites.

References

Dieterich, C., Rahmann, S., Vingron, M. (2004) Functional inference from nonrandom distributions of conserved predicted transcription factor binding sites. Bioinformatics 20 (Suppl. 1): i109-i115.
Dieterich, C., Grossmann, S., Tanzer, A., Röpcke, S., Arndt, P.F., Stadler, P.F., Vingron, M. (2005) Comparative promoter region analysis powered by CORG. BMC Genomics 6:24.
Manke, T., Bringas, R., Vingron, M. (2003) Correlating Protein-DNA and Protein-Protein Interaction Networks. J Mol Biol 333:75-85.

How comparative genomics transforms industrial biotechnology

Markus Wyss (DSM Nutritional Products)

Exponential growth of sequence information in public databases and continuously decreasing costs for genome sequencing contribute to an increasingly diverse and powerful comparative genomics toolbox. The proven and perceived opportunities are reflected in an increasing adoption of comparative genomics approaches by industrial biotechnology.

Several examples will be presented that demonstrate the successful use of sequence comparisons for the design of improved products or biotechnological production processes. However, it will be equally relevant to consider the current limitations of comparative genomics. Finally, comparative genomics will be placed in broader context to evaluate its most productive use for advancing the field of systems biology and, thereby, also industrial biotechnology.

Beyond comparative genomics: Using cross-species comparisons to elucidate pathways and functional networks

Hans-Peter Fischer (Genedata AG, Basel)

The ongoing and accelerating sequencing of genomic DNA has produced hundreds of complete genome sequences. Ten years ago, the first available genome sequences caused tremendous excitement throughout the scientific community, as the availability of multiple genomes allowed a comprehensive catalogue of all building blocks of life to be established for the first time. Today, the focus of biological research has shifted towards understanding higher-level wiring schemes encoded by genome sequences.

Here, we demonstrate the importance of genome comparisons for understanding the physical interactions and causal interplay of individual gene products. We present methodologies based on genome comparisons for the ab initio reconstruction of signaling, regulatory and metabolic pathways. Additionally, we show how the incorporation of complementary experimental data such as protein interaction and mRNA profiling data can be used to further characterize functional networks. We show that the integration and analysis of cross-species expression data can be used to put previously uncharacterized genes in a meaningful functional context. Such analysis strategies can be used to evaluate the suitability of model organisms for investigating specific biological effects, a critical prerequisite for model system studies aiming at understanding a therapeutic target’s contribution to a disease phenotype, or a drug’s potential undesired adverse side effects.

Systems biology applications benefit from our results, as quantitative models of pathway dynamics require a thorough understanding of the wiring scheme of the cell and potential pathway cross-talk effects. Our findings are also relevant for drug discovery and development, as will be demonstrated with examples including target validation in oncology and the in silico characterization of toxicity mechanisms in drug safety assessments.

Structural genomics and protein evolution

Marc Robinson-Rechavi (University of Lausanne)

As the number of protein structures from high-throughput centers (structural genomics) increases, so does the coverage of protein diversity, as well as the coverage of the proteomes of model species. This opens new possibilities for evolutionary bioinformatics to analyse a level of organisation which has traditionally been underrepresented in evolutionary studies. Conversely, evolution provides keys for making sense of data which were often generated without a specific biological aim. I will present a study from T. maritima structural genomics and discuss some perspectives.

Pathway-centric approaches for gene-expression analysis

Mischa Reinhardt (Novartis Institutes of Biomedical Research)

Gene expression analysis using diverse microarray platforms has become a well-established technique used throughout all phases of drug discovery and development. While the sensitivity of today's microarrays allows us to reliably detect gene expression changes in the range of 1.5-fold, smaller but biologically meaningful events are harder to detect. A possible solution is a shift from a gene-centric to a pathway-centric paradigm. Rather than comparing the relative expression of a number of genes, a complete pathway or an otherwise biologically related group of genes is observed as a whole. By assuming that the dysregulation of a pathway leads to a coordinated change in the expression of a large group of related genes, we first add statistical strength to our analysis, which allows us to reliably detect significant gene expression changes of ± 20%. Second, rather than supplying biologists with lengthy lists of dysregulated genes, we directly identify the key processes that are affected.
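The statistical gain from pooling related genes can be illustrated with a minimal sketch. It is not the method presented in the talk: it assumes a simple one-sample t-statistic on the mean log2 fold change of a gene set, and the fold changes below are invented for illustration (16 hypothetical genes, each shifted by about +20% plus noise).

```python
import math
import statistics

def set_t_statistic(log2_fold_changes):
    """One-sample t-statistic for the mean log2 fold change of a gene set vs. 0."""
    n = len(log2_fold_changes)
    mean = statistics.mean(log2_fold_changes)
    sd = statistics.stdev(log2_fold_changes)
    return mean / (sd / math.sqrt(n))

# Hypothetical pathway: every gene is up by ~20% (log2(1.2) is about 0.26),
# but each single measurement carries noise of similar magnitude.
shift = math.log2(1.2)
noise = [0.3, -0.3, 0.2, -0.2, 0.1, -0.1, 0.25, -0.25,
         0.15, -0.15, 0.05, -0.05, 0.35, -0.35, 0.4, -0.4]
pathway = [shift + e for e in noise]

# Signal-to-noise of a single gene versus the pooled set statistic.
per_gene_snr = statistics.mean(pathway) / statistics.stdev(pathway)
t_set = set_t_statistic(pathway)

print(f"per-gene signal/noise: {per_gene_snr:.2f}")
print(f"gene-set t-statistic:  {t_set:.2f}")
```

Under these assumptions the set statistic is larger than the single-gene signal-to-noise ratio by a factor of sqrt(n), which is why a coordinated 20% shift that is invisible gene by gene can still stand out at the pathway level.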

Genome-wide annotation of regulatory motifs using comparative genomics

Erik van Nimwegen (Biozentrum University Basel and Swiss Institute of Bioinformatics)

Computational discovery of regulatory sites in intergenic DNA is one of the central problems in bioinformatics. Until recently, motif finders would typically take one of two general approaches. In the first approach, given a known set of co-regulated genes, one searches their promoter regions for significantly overrepresented sequence motifs. Alternatively, in a "phylogenetic footprinting" approach, one searches multiple alignments of orthologous intergenic regions for short segments that are significantly more conserved than expected based on the phylogeny of the species.
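The first of these approaches, overrepresentation in a co-regulated set, can be sketched as follows. This is a deliberately simplified illustration and not the Bayesian method of the talk: it assumes an exact-match motif and a hypergeometric test, and the sequences and the motif are toy data.

```python
from math import comb

def count_with_motif(promoters, motif):
    """Number of promoter sequences containing the motif at least once."""
    return sum(motif in seq for seq in promoters)

def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance that at least k of n sampled promoters contain the
    motif when K of all N promoters contain it (sampling without replacement)."""
    upper = min(n, K)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

MOTIF = "TGACTC"  # toy motif, chosen only for illustration

# Hypothetical promoter sequences: three co-regulated genes and seven others.
co_regulated = ["AATGACTCAA", "TGACTCGGTA", "CCTGACTCTT"]
other = ["AAAAAAAAAA", "CCCCGGGGTT", "ACGTACGTAC", "TTTTAAAACC",
         "GGGGCCCCAA", "ATATATATAT", "CGCGCGCGCG"]
all_promoters = co_regulated + other

k = count_with_motif(co_regulated, MOTIF)
K = count_with_motif(all_promoters, MOTIF)
p_value = hypergeom_tail(k, len(co_regulated), K, len(all_promoters))

print(f"{k} of {len(co_regulated)} co-regulated promoters contain {MOTIF}; p = {p_value:.4f}")
```

Real motif finders score degenerate motifs with weight matrices rather than exact strings; the sketch only shows what "significantly overrepresented" means in the simplest setting.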

In this lecture I will present a new method that combines these two approaches into one integrated Bayesian framework. Our method uses a Markov chain Monte Carlo strategy to search over all ways in which an arbitrary number of binding sites for an arbitrary number of transcription factors can be assigned to arbitrary collections of multiple sequence alignments while taking into account the phylogenetic relations between the sequences.

As an application, I will show how we use our method to obtain genome-wide annotation of transcription factor binding sites in Saccharomyces cerevisiae using the genomes of five Saccharomyces species in combination with ChIP-on-chip data.

The Roche Comparative Genomics Database

Martin Ebeling (F. Hoffmann-La Roche AG)

A growing number of vertebrate genomes is currently being sequenced and analyzed - at very different levels of sophistication. Available data range from fully annotated genomic sequences to collections of low-quality sequence contigs. For the equation "more genomes = more insight" to come true, these differences have to be taken into account. The presentation will introduce the Roche Comparative Genomics project and some of the results obtained, pointing out some key advantages and problems as well as plans for future developments.

Defining diagnostic and prognostic biomarkers for kidney allograft rejection by gene expression profiling analysis

Pierre Saint-Mezard and Hai Zhang (Novartis Institutes of Biomedical Research)

Early diagnosis of renal allograft rejection and new prognostic markers are gaining importance in the current trend to minimize and personalize immunosuppression. In addition to histopathological differential diagnosis, gene expression profiling could significantly improve disease classification by defining “molecular Banff” signatures of kidney allograft rejection. Therefore, a large clinical sample collection, including normal renal biopsies and biopsies with various grades of acute and chronic rejection, was analyzed on Affymetrix GeneChip™ arrays.

Classical methods identify panels of differentially expressed genes able to distinguish the various sample groups characterized by different histopathological readings. The respective genes support biological changes known to be involved in the pathophysiology of renal allograft rejection.

Several complementary computational approaches were applied to extract key features of acute and chronic rejection. Analysis by the Nearest Shrunken Centroid method, Gene Set Enrichment Analysis (GSEA) and Relevance Networks confirms established biomarkers/pathways and shows some novel genes with promising prognostic properties.

To obtain consistent and robust diagnostic and prognostic biomarkers, we extended the analysis with additional microarray datasets of kidney allograft rejection. A comparative meta-analysis was performed in 3 published and 2 internal datasets, identifying a common transcriptional profile of genes mainly involved in the ongoing immune response against transplants.

Our results provide a strong basis for the validation of an unbiased “molecular Banff” classification for kidney biopsies and, more importantly, identify new combinatorial biomarkers that could be applied to peripheral blood samples.

References

Sarwal M, Chua MS, Kambham N, Hsieh SC, Satterwhite T, Masek M, Salvatierra O Jr. Molecular heterogeneity in acute renal allograft rejection identified by DNA microarray profiling. N Engl J Med. 2003; 349:125-38.
Scherer A, Krause A, Walker JR, Korn A, Niese D, Raulf F. Early prognosis of the development of renal chronic allograft rejection by gene expression profiling of human protocol biopsies. Transplantation. 2003; 75:1323-30.
Raulf F. Novel biomarkers of allograft rejection: 'omics' approaches start to deliver. Curr Opin Organ Transplant. 2005; 10:295-300.
Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci U S A. 2002; 99:6567-72.
Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci U S A. 2000; 97:12182-6.
Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1 alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003; 34:267-73.

From functional sites to domain architectures

Jörg Schultz (Biozentrum Universität Würzburg)

Domains are evolutionary and functional building blocks of proteins. Their detection within proteins and their enumeration within genomes is, thanks to different domain databases, a straightforward task. But one of the original expectations of domain analyses, the prediction of an unknown protein's function, is still not fulfilled. Among others, there are two challenges. First, one type of domain can perform widely differing functions; second, the presence of multiple domains within one protein and their interplay has to be taken into account. To address the first problem, we have analysed the position and the type of functional sites within domain families, relying on structurally characterised hetero-complexes. We found that, depending on the domain family, not only the type of amino acid but also the position of functional sites can vary substantially within the family. This heterogeneity of functional sites implies that standard alignment-based methods for the prediction of interaction sites will be error-prone, since these mostly mark the position of a functional site within the alignment and transfer this information to novel sequences added to the alignment. We have developed an extension of profile HMMs which allows the probabilistic prediction of functional sites.

One of the exciting features of protein domains is their evolutionary independence, that is, they can be found in proteins which are, apart from the domain, non-homologous. To understand how multi-domain proteins arise and how these genomic inventions might interplay with physiological features, we analysed the origin of domain architectures considering the taxonomic classification of the organisms encoding them. Not unexpectedly, we found distinct taxonomic nodes with a high number of novel domain architectures. The functional characterisation of the respective proteins revealed significant differences between taxonomic nodes. Furthermore, the approach allowed us to determine the taxonomic node where a domain architecture first arose, leading to an evolutionary classification of proteins. Integration of these data with large-scale protein interaction sets revealed that evolutionary modules exist within protein interaction networks.

The Swiss Vitis Microsatellite Database

Claire Arnold (University of Neuchatel)

Claire Arnold and José Vouillamoz

Since their first application to grapevine in 1993, microsatellites have quickly become the molecular markers of choice for the identification of grapevine varieties. Microsatellite data are expressed as the size of the DNA fragments in base pairs and thus allow a quick exchange of data between laboratories around the world.

The purpose of the Swiss Vitis Microsatellite Database (SVMD) project is to set up a harmonized database containing the microsatellite genotypes of all grapevine varieties, rootstocks and wild grapevines growing in Switzerland. To our knowledge, there is no official national record of Swiss cultivars; however, we have recorded about one hundred varieties of cultivated grapevines in Switzerland, of which dozens are unique indigenous varieties. All these samples are currently being genotyped with the six multiply-confirmed and universally defined OIV-SSR-markers (VVMD5, VVMD7, VVMD27, VVS2, VrZAG62, VrZAG79). These primers allow a guaranteed identification.
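How a genotype at these six markers can serve as an identifier is sketched below. The marker names come from the abstract; everything else is an assumption made for illustration: the variety names, the allele sizes (in base pairs) and the exact-match rule, which presupposes that allele sizes have already been harmonized between laboratories.

```python
OIV_MARKERS = ["VVMD5", "VVMD7", "VVMD27", "VVS2", "VrZAG62", "VrZAG79"]

def genotype_key(genotype):
    """Normalise a genotype so that the order of the two alleles per marker
    does not matter: (226, 238) and (238, 226) give the same key."""
    return tuple(tuple(sorted(genotype[marker])) for marker in OIV_MARKERS)

def identify(sample, database):
    """Names of all database entries matching the sample at all six markers."""
    key = genotype_key(sample)
    return sorted(name for name, genotype in database.items()
                  if genotype_key(genotype) == key)

# Hypothetical reference entries; allele sizes are invented.
database = {
    "variety_A": {"VVMD5": (226, 238), "VVMD7": (239, 249), "VVMD27": (179, 189),
                  "VVS2": (133, 143), "VrZAG62": (188, 194), "VrZAG79": (245, 259)},
    "variety_B": {"VVMD5": (226, 238), "VVMD7": (239, 249), "VVMD27": (179, 189),
                  "VVS2": (135, 151), "VrZAG62": (188, 194), "VrZAG79": (245, 259)},
}

# Unknown sample: same alleles as variety_A, some reported in reverse order.
sample = {"VVMD5": (238, 226), "VVMD7": (239, 249), "VVMD27": (189, 179),
          "VVS2": (133, 143), "VrZAG62": (194, 188), "VrZAG79": (245, 259)}

matches = identify(sample, database)
print(matches)
```

In this toy data the two reference varieties differ at a single marker (VVS2), which is enough to tell them apart.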

The Swiss Vitis Microsatellite Database will help scientists working in research on pathogens or other biotic or abiotic stresses to better identify and select their research material. It will also offer agronomists a reliable identification service for Swiss grape varieties and rootstocks when ampelography reaches its limits. A better knowledge of the genetic distance between varieties will enable grape breeders to suggest suitable parents for new crosses. Because of the harmonisation of its data, the Swiss Vitis Microsatellite Database can easily be integrated into the European Vitis Database.

Evolutionary fate of retroposed gene copies in the human genome

Henrik Kaessmann (University of Lausanne)

We conducted a systematic survey to gauge to what extent the high rate of retroposition in primates has generated young functional retrogenes in humans. Extensive comparative sequencing and expression analyses as well as evolutionary simulations suggest that a significant proportion of retrocopies represent recent genes with potentially diverse functions in testis, brain, and other organs. Evolutionary analyses reveal that, following duplication, retrogenes obtain new functions as a consequence of adaptive protein change driven by positive selection and/or the evolution of new spatial or temporal expression patterns. Our study points to a significant role of retroduplication in the origin of young human genes and therefore of recently emerged phenotypes in human evolution.

Comparative insect genomics

Evgeny Zdobnov (University of Geneva)

Insects are the largest and most diverse group of animals on Earth. They greatly affect human agriculture and health, which has provided strong justification for several whole-genome sequencing projects. The considerable number of available genomes and their diversity, not observed among comparable vertebrate species, make this group unique for the quantification of evolutionary processes shaping animal genomes.

I will present the first comparative overview of these insect genomes, focusing on the initial genome analysis of a highly social animal, the honeybee Apis mellifera.

Phyloinformatics in the genomic era: examples from the plant family Poaceae

Nicolas Salamin (University of Lausanne)

Computational approaches making the most efficient use of the large amount of genomic data now available are becoming increasingly important. However, such data can serve many different purposes, and three different applications related to this field of research are presented here. First, we focus on the large amount of genomic data present in public databases in the form of DNA sequences and their utility for building part of the Tree of Life. The computational part of this task requires efficiently combining the available DNA sequences for a set of species in order to maximise both the number of species and the number of gene regions available for analysis. An economically important plant family, the grasses, is used to highlight the advantages and shortcomings of different approaches. Second, the evolution of a gene family encoding an essential step of the photosynthetic pathway is described. Among the multiple plant families using C4 photosynthesis, grasses are the oldest and contain the largest number of C4 species, including species showing intermediate photosynthetic pathways. The evolution of this photosynthetic system is analysed using a broad sampling of grass species diversity, instead of the typical model grass species. Methods to detect adaptive protein evolution are illustrated with this gene family, and the effect of convergent evolution is detected using simulations. Third, phylogenetic trees are now an important tool in any genomic research, but it is essential to keep in mind that any tree used is an estimate of the true evolutionary history of the taxa at hand. Errors surrounding the topology and the branch lengths should therefore be taken into account in any analysis using phylogenetic trees. We present here an approach to estimate the rate of duplication and extinction of genes within a gene family by averaging over all the plausible trees for a set of DNA sequences. To avoid specifying prior distributions on parameters, we use a fully frequentist approach based on an importance sampling scheme.

The Orthologous Matrix (OMA) Project: Massive Cross-Comparison of Complete Genomes

Gaston H. Gonnet (ETH Zurich)

The OMA project is a large-scale effort to identify groups of orthologs from complete genome data, currently 280 species. The ortholog detection relies solely on protein sequence information and does not require any human supervision. It has several original features, in particular a verification step that detects paralogs and prevents them from being clustered together. The paralogy detection algorithm is provably correct and includes an interesting application of max edge-weight cliques.

The resulting groups, whenever a comparison could be made, are highly consistent both with EC assignments, and with assignments from the manually curated database HAMAP. A highly accurate set of orthologous sequences constitutes the basis for several other investigations, including phylogenetic analysis and protein classification.

A complete set of orthologues also allows the assignment of orthologous genes and large scale gene mapping between relatively close species. With these gene maps we can reconstruct the synteny distance between species. The synteny distance between species appears to be a remarkably accurate measure of distance.

The complex genetic ancestry of Humans.

Arndt von Haeseler (Center for Integrative Bioinformatics Vienna)

I. Ebersberger, Arndt von Haeseler (CIBIV-MFPL, Vienna, Austria) and P. Galgoczy, S. Taudien, S. Taenzer, R. Lehmann, M. Platzer (FLI, Jena, Germany)

The split of humans and chimpanzees approximately 5-6 million years ago is generally taken as the starting point for the distinct evolutionary histories of both species. Consequently, it is the genetic changes that have accumulated since then in the genomes of either species that are held responsible for the remarkably different phenotypes of the contemporary species. However, for some regions of our genome we are genetically more closely related to gorillas than to chimpanzees. Vice versa, genomic regions exist where chimpanzees and gorillas are each other's closest relatives. This suggests that the processes that formed humans and chimpanzees are more complex than usually considered.

Here, we report a whole-genome sample sequencing approach on the genomes of gorilla, orang-utan and rhesus to shed light on the intertwined genetic relationships of humans and the great apes. Together with the genome sequences of humans and chimpanzees, we analyze a total of 4.3 million base pairs from randomly chosen regions of the human genome, corresponding to 7,600 sequence trees with three species each. We estimate that about one third of our genetic material, encompassing ~25% of our genes, is phylogenetically old. That is, its ancestry predates the speciation of humans and traces back to the ancestral species we shared with chimpanzees and gorillas. Consequently, the "human-specific" evolution of these genetic lineages and their associated phenotypes started long before humans emerged as a species. This may help to explain recurrent findings of very old human-specific morphological traits in the fossil record, which predate the recent emergence of the human species about 5 million years ago. Only a fraction of these ancient lineages identifies chimpanzees as our closest genetic relatives, explaining why evolutionary novelties can be exclusively shared among species that are not each other's closest relatives. Our findings show that a deeper understanding of human and chimpanzee evolution depends essentially on insights into our genetic ancestry.

Medical laboratory data analysis: An application of machine learning techniques to analyze the trends of biomarkers over time.

Andre Elisseeff (IBM Zurich Research Laboratory)

Most modern medical laboratories store patients' lab tests over time (such as glucose levels, triglycerides, etc.) in databases. The hospital of Desio in Italy, for instance, has about 2.5 million patient records corresponding to several million tests performed in the last ten years. Physicians have access to this database and can monitor the evolution of a patient from a workstation. To detect whether an observation is normal, they can check population-based statistics and see how much it deviates from the average value: an observation within the 95% confidence interval computed from a healthy population of the same gender and age as the patient is usually considered normal. Unfortunately, such an approach might overlook the case that a patient has a medical problem. Consider the case of glucose and assume that a patient has a normal glucose level (in the 20-30 percentile around the mean) and moves suddenly to another glucose level (at the border of the 70-80 percentile around the mean) within a year. From the population-based statistics perspective, she/he will be considered normal since she/he does not leave the 5-95 percentile range. From a patient-based statistics perspective, on the other hand, she/he should be watched carefully because her/his glucose level shows an unexpected trend.
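The contrast between the two views can be made concrete with a minimal sketch. It is not the method being developed with the hospital: the patient-based check below is a plain z-score against the patient's own history, and all numbers (glucose values, population range, threshold) are invented for illustration and are not clinical reference values.

```python
import statistics

def within_population_range(value, low, high):
    """Population-based check: is the value inside the reference range?"""
    return low <= value <= high

def patient_z_score(history, new_value):
    """Patient-based check: distance of the new value from the patient's own
    mean, in units of the patient's own standard deviation."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return (new_value - mean) / sd

# Hypothetical glucose measurements (mg/dL) of one patient over time.
history = [82, 84, 83, 85, 83, 84]
new_value = 98

# Assumed 5-95 percentile range of a matched healthy population (illustrative).
pop_low, pop_high = 70, 110

population_flag = not within_population_range(new_value, pop_low, pop_high)
z = patient_z_score(history, new_value)
patient_flag = abs(z) > 3  # assumed alert threshold

print(f"population-based alert: {population_flag}")
print(f"patient-based alert:    {patient_flag} (z = {z:.1f})")
```

With these numbers the new value stays well inside the population range, yet it lies far outside the patient's own narrow band of past values, which is exactly the situation described above.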

In this talk we will describe and motivate some statistical (machine learning) methods we are currently developing with the hospital of Desio in Italy to analyze and discover biomarker trends over time with the end-goal to return more patient specific information to the physician. We will see how machine learning naturally comes in and discuss the practice of data analysis in a medical setting.