[Basel Computational Biology Conference
2006] |
|
Abstracts |
|
Keynote Lecture: KEGG BRITE for linking genomes to biological systems
Minoru Kanehisa |
Bioinformatics Center, Institute
for Chemical Research, Kyoto University, Uji, Kyoto & Human
Genome Center, Institute of Medical Science, University of
Tokyo, Japan
|
The KEGG resource (http://www.genome.jp/kegg/)
provides a reference knowledge base for linking genomes to biological
systems, categorized as building blocks in the genomic space (KEGG
GENES) and the chemical space (KEGG LIGAND), and wiring diagrams
of interaction networks and reaction networks (KEGG PATHWAY). A
fourth component, KEGG BRITE, has been formally added to the KEGG
suite of databases. It is a collection of hierarchically structured
vocabularies representing our knowledge on various aspects of biological
systems. In contrast to KEGG PATHWAY, which is limited to molecular
interactions and reactions, KEGG BRITE incorporates many different
types of relationships involving, for example, cells, tissues,
organs, and diseases. Thus, the mapping of genomic data to KEGG
BRITE will supplement the current KEGG PATHWAY mapping. The KO
(KEGG Orthology) system, which is a pathway-based classification
of orthologs and protein families, is being improved to facilitate
this mapping and to automate higher-order functional interpretations
from genomic and molecular information.
|
Keynote Lecture: Computational Methods in Regulatory Genomics.
Martin Vingron |
MPI für Molekulare Genetik,
Berlin. |
|
The availability of complete genome sequences, together with functional
genomics data such as large-scale gene-expression data, has revived
interest in the computational prediction of cis-regulatory elements.
This talk will introduce computational methods for visualizing
associations between genes and conditions in DNA-microarray data.
These techniques will also be applied to establish associations
between gene expression data and transcription factor binding sites.
While for yeast this can be done based on published transcription
factor binding data, for human data we draw on a comparative analysis
with mouse data in search of binding sites.
References
- Dieterich C, Rahmann S, Vingron M (2004) Functional inference from
nonrandom distributions of conserved predicted transcription factor
binding sites. Bioinformatics 20 (Suppl. 1): i109-i115.
- Dieterich C, Grossmann S, Tanzer A, Röpcke S, Arndt PF, Stadler PF,
Vingron M (2005) Comparative promoter region analysis powered by CORG.
BMC Genomics 6:24.
- Manke T, Bringas R, Vingron M (2003) Correlating Protein-DNA and
Protein-Protein Interaction Networks. J Mol Biol 333:75-85.
|
|
How comparative genomics transforms industrial biotechnology
Markus Wyss (DSM Nutritional Products)
Exponential growth of sequence information in public databases and
continuously decreasing costs for genome sequencing contribute to an
increasingly diverse and powerful comparative genomics toolbox. The
proven and perceived opportunities are reflected in an increasing
adoption of comparative genomics approaches by industrial
biotechnology. Several examples will be presented that demonstrate the
successful use of sequence comparisons for the design of improved
products or biotechnological production processes. However, it will be
equally relevant to consider the current limitations of comparative
genomics. Finally, comparative genomics will be placed in broader
context to evaluate its most productive use for advancing the field of
systems biology and, thereby, also industrial biotechnology.

|
|
Beyond comparative genomics: Using cross-species
comparisons to elucidate pathways and functional networks
Hans-Peter Fischer (Genedata AG, Basel)
The ongoing and accelerating sequencing of genomic DNA has produced
hundreds of complete genome sequences. Ten years ago, the first
available genome sequences caused tremendous excitement throughout
the scientific community, as the availability of multiple genomes
allowed a comprehensive catalogue of all building blocks of life
to be established for the first time. Today, the focus of biological
research has shifted towards understanding higher-level wiring
schemes encoded by genome sequences.
Here, we demonstrate the importance of genome comparisons for
understanding the physical interactions and causal interplay of
individual gene products. We present methodologies based on genome
comparisons for the ab initio reconstruction of signaling,
regulatory and metabolic pathways. Additionally, we show how the
incorporation of complementary experimental data such as protein
interaction and mRNA profiling data can be used to further characterize
functional networks. We show that the integration and analysis
of cross-species expression data can be used to put previously
uncharacterized genes in a meaningful functional context. Such
analysis strategies can be used to evaluate the suitability of
model organisms for investigating specific biological effects,
a critical prerequisite for model system studies aiming at understanding
a therapeutic target’s contribution to a disease phenotype,
or a drug’s potential undesired adverse side effects.
Systems biology applications benefit from our results, as quantitative
models of pathway dynamics require a thorough understanding of
the wiring scheme of the cell and potential pathway cross-talk
effects. Our findings are also relevant for drug discovery and
development, as will be demonstrated with examples including
target validation in oncology and the in silico characterization
of toxicity mechanisms in drug safety assessments.

|
|
Structural genomics and protein evolution
Marc Robinson-Rechavi (University of Lausanne)
As the number of protein structures from high-throughput centers
(structural genomics) increases, so does the coverage of protein
diversity and of the proteomes of model species. This opens new
possibilities for evolutionary bioinformatics to analyse a level of
organisation that has traditionally been underrepresented in
evolutionary studies. Conversely, evolution provides keys for making
sense of data that were often generated without a specific biological
aim. I will present a study from T. maritima structural genomics and
discuss some perspectives.

|
Pathway-centric approaches for gene-expression analysis
Mischa Reinhardt (Novartis Institutes of Biomedical Research)
Gene expression analysis using diverse microarray platforms has become
a well-established technique used throughout all phases of drug
discovery and development. While the sensitivity of today's microarrays
allows us to reliably detect gene expression changes in the range of
1.5-fold, smaller but biologically meaningful events are harder to
detect. A possible solution is a shift from a gene-centric to a
pathway-centric paradigm. Rather than comparing the relative expression
of a number of genes, a complete pathway or an otherwise biologically
related group of genes is observed as a whole. By assuming that the
dysregulation of a pathway leads to a co-ordinated change in the
expression of a large group of related genes, we first add statistical
strength to our analysis, which allows us to reliably detect
significant gene expression changes of ± 20%. Second, rather than
supplying biologists with lengthy lists of dysregulated genes, we
directly identify the key processes that are affected.
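The gain in statistical strength from looking at a gene group as a whole can be illustrated with a minimal sketch. It is not the method used in the talk: it assumes that each gene already has a standardized score (a z-score) and that genes are independent under the null hypothesis, and it combines them with Stouffer's method. The function names and the 16-gene pathway are hypothetical.

```python
import math
from statistics import NormalDist


def pathway_z(gene_z):
    """Combine per-gene z-scores into one pathway-level z-score.

    Assumes the genes are independent under the null hypothesis, so the
    sum of n standard-normal scores has variance n (Stouffer's method).
    """
    n = len(gene_z)
    if n == 0:
        raise ValueError("gene set must not be empty")
    return sum(gene_z) / math.sqrt(n)


def two_sided_p(z):
    """Two-sided p-value of a z-score under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical pathway: 16 genes, each shifted by only one standard error.
gene_z = [1.0] * 16

single_gene_p = two_sided_p(gene_z[0])        # one gene taken alone
pathway_p = two_sided_p(pathway_z(gene_z))    # the 16 genes taken together

print(f"single gene: p = {single_gene_p:.3f}")
print(f"pathway:     p = {pathway_p:.2e}")
```

In real expression data, genes of one pathway are usually correlated, so the independence assumption overstates significance; a permutation-based null distribution would be a more careful choice.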

|
Genome-wide annotation of regulatory motifs
using comparative genomics
Erik van Nimwegen (Biozentrum
University Basel and Swiss Institute of Bioinformatics)
Computational discovery
of regulatory sites in intergenic
DNA is one of the central problems in bioinformatics. Up until
recently motif finders would typically take one of two general
approaches. In the first approach, given a known set of co-regulated
genes, one searches their promoter regions for significantly overrepresented
sequence motifs. Alternatively, in a "phylogenetic
footprinting" approach one searches multiple alignments of
orthologous
intergenic regions for short segments that are significantly more
conserved than expected based on the phylogeny of the species.
In this lecture I will present a new method
that combines these two
approaches into one integrated Bayesian framework. Our method uses
a
Monte-Carlo Markov chain strategy to search over all ways in which
an
arbitrary number of binding sites for an arbitrary number of
transcription factors can be assigned to arbitrary collections
of
multiple sequence alignments while taking into account the
phylogenetic relations between the sequences.
As an application, I will show how we
use our method to obtain genome-wide annotation of transcription
factor binding sites in Saccharomyces cerevisiae using the genomes
of five Saccharomyces species in combination with ChIP-on-chip
data.

|
The Roche Comparative Genomics Database
Martin Ebeling (F. Hoffmann-La Roche AG)
A growing number of vertebrate genomes is currently being sequenced
and analyzed - at very different levels of sophistication. Available
data range from fully annotated genomic sequences to collections of
low-quality sequence contigs. For the equation "more genomes = more
insight" to come true, these differences have to be taken into
account. The presentation will introduce the Roche Comparative
Genomics project and some of the results obtained, pointing out some
key advantages and problems as well as plans for future developments.

|
Defining diagnostic and prognostic biomarkers
for kidney allograft rejection by gene expression profiling analysis
Pierre Saint-Mezard and Hai Zhang
(Novartis Institutes of Biomedical Research)
Early diagnosis of renal allograft rejection and new prognostic
markers are gaining importance in the current trend to minimize
and personalize immunosuppression. In addition to histopathological
differential diagnosis, gene expression profiling could significantly
improve disease classification by defining “molecular Banff” signatures
of kidney allograft rejection. Therefore, a large clinical sample
collection, including normal renal biopsies and biopsies with various
grades of acute and chronic rejection, was analyzed on Affymetrix
GeneChip™ arrays.
Classical methods identify panels of differentially expressed
genes able to distinguish the various sample groups characterized
by different histopathological readings. The respective genes support
biological changes known to be involved in the pathophysiology
of renal allograft rejection.
Several complementary computational approaches were applied to
extract key features of acute and chronic rejection. Analysis by
the Nearest Shrunken Centroid method, Gene Set Enrichment Analysis
(GSEA) and Relevance Networks confirms established biomarkers/pathways
and shows some novel genes with promising prognostic properties.
To obtain consistent and robust diagnostic and prognostic biomarkers,
we extended the analysis with additional microarray datasets of
kidney allograft rejection. A comparative meta-analysis was performed
in 3 published and 2 internal datasets, identifying a common transcriptional
profile of genes mainly involved in the ongoing immune response
against transplants.
Our results provide a strong basis for the validation of
an unbiased “molecular Banff” classification for kidney
biopsies and, more importantly, identify new combinatorial biomarkers
that could be applied to peripheral blood samples.
References:
- Sarwal M, Chua MS, Kambham N, Hsieh SC, Satterwhite T, Masek
M, Salvatierra O Jr. Molecular heterogeneity in acute renal allograft
rejection identified by DNA microarray profiling. N Engl J Med.
2003; 349:125-38.
- Scherer A, Krause A, Walker JR, Korn A, Niese D, Raulf F.
Early prognosis of the development of renal chronic allograft
rejection by gene expression profiling of human protocol biopsies.
Transplantation. 2003; 75:1323-30.
- Raulf F. Novel biomarkers of allograft rejection: 'omics'
approaches start to deliver. Curr Opin Organ Transplant. 2005;
10:295-300.
- Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of
multiple cancer types by shrunken centroids of gene expression.
Proc Natl Acad Sci U S A. 2002; 99:6567-72.
- Butte AJ, Tamayo P, Slonim D, Golub TR, Kohane IS. Discovering
functional relationships between RNA expression and chemotherapeutic
susceptibility using relevance networks. Proc Natl Acad Sci U
S A. 2000; 97:12182-6.
- Mootha VK, Lindgren CM, Eriksson KF, et al. PGC-1 alpha-responsive
genes involved in oxidative phosphorylation are coordinately
downregulated in human diabetes. Nat Genet. 2003; 34:267-73.

|
From functional sites to domain architectures
Jörg Schultz (Biozentrum Universität Würzburg)
Domains are evolutionary and functional building blocks of proteins.
Their detection within proteins and their enumeration within genomes
is, thanks to different domain databases, a straightforward task.
But one of the original expectations of domain analyses, the prediction
of an unknown protein's function, is still not fulfilled. Among
others, there are two challenges. First, one type of domain can
perform widely differing functions; second, the presence of multiple
domains within one protein and their interplay have to be taken
into account. To address the first problem, we have analysed the
position and the type of functional sites within domain families
relying on structurally characterised hetero-complexes. We found
that, depending on the domain family, the type of amino acid, but
also the position of functional sites can vary substantially within
the family. This heterogeneity of functional sites implies that
standard alignment based methods for the prediction of interaction
sites will be error-prone. These mostly mark the position of a
functional site within the alignment and transfer this information
to novel sequences added to the alignment. We have developed an
extension of profile HMMs which allows the probabilistic prediction
of functional sites.
One of the exciting features of protein domains is their evolutionary
independence, that is, they can be found in proteins which are,
apart from the domain, non-homologous. To understand how multi-domain
proteins arise and how these genomic inventions might interplay
with physiological features, we analysed the origin of domain architectures
considering the taxonomic classification of the organisms encoding
them. Not unexpectedly, we found distinct taxonomic nodes with
a high number of novel domain architectures. The functional characterisation
of the respective proteins revealed significant differences between
taxonomic nodes. Furthermore, the approach allowed us to determine
the taxonomic node where a domain architecture first arose, leading
to an evolutionary classification of proteins. Integration of these
data with large-scale protein interaction sets revealed that there
exist evolutionary modules within protein interaction networks.

|
The Swiss Vitis Microsatellite Database
Claire Arnold (University of Neuchatel)
Arnold Claire and Vouillamoz José
Since their first application to grapevine in 1993, microsatellites
have quickly become the molecular markers of choice for the
identification of grapevine varieties. Microsatellite data are
expressed as the size of the DNA fragments in base pairs and thus
allow a quick exchange of data between laboratories around the world.
The purpose of the Swiss Vitis Microsatellite Database (SVMD) project
is to set up a harmonized database containing the microsatellite
genotypes of all grapevine varieties, rootstocks and wild grapevines
growing in Switzerland. To our knowledge, there is no official
national record of Swiss cultivars; however, we have recorded about
one hundred varieties of cultivated grapevines in Switzerland, of
which dozens are unique indigenous varieties.
All these samples are currently being genotyped with the six
multiply confirmed and universally defined OIV SSR markers (VVMD5,
VVMD7, VVMD27, VVS2, VrZAG62, VrZAG79). These primers allow guaranteed
identification.
The Swiss Vitis Microsatellite Database will help scientists working
in research against pathogens or other biotic or abiotic stress to
better identify and select their research material. It will also
offer agronomists a reliable identification service for Swiss grape
varieties and rootstocks when ampelography reaches its limits. A
better knowledge of the genetic distance between varieties will enable
grape breeders to suggest suitable parents for new crosses. Because of
the harmonisation of its data, the Swiss Vitis Microsatellite Database
can easily be integrated into the European Vitis Database.
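As an illustration of how identification by microsatellite genotype can work, here is a minimal sketch that compares a sample with reference genotypes at the six markers named above. It assumes allele sizes are already harmonised, so that exact matching is meaningful. The variety names and all allele sizes are invented for the example and are not taken from the SVMD; the function names are hypothetical as well.

```python
# The six SSR markers listed in the abstract.
MARKERS = ("VVMD5", "VVMD7", "VVMD27", "VVS2", "VrZAG62", "VrZAG79")


def normalize(genotype):
    """Sort each marker's allele pair so (a, b) and (b, a) compare equal."""
    return {m: tuple(sorted(genotype[m])) for m in MARKERS}


def matching_markers(sample, reference):
    """Count the markers at which both allele sizes (in bp) are identical."""
    s, r = normalize(sample), normalize(reference)
    return sum(1 for m in MARKERS if s[m] == r[m])


def identify(sample, database):
    """Return the reference varieties identical to the sample at all markers."""
    return [name for name, ref in database.items()
            if matching_markers(sample, ref) == len(MARKERS)]


# Invented reference genotypes: allele sizes in base pairs per marker.
database = {
    "variety_A": {"VVMD5": (226, 238), "VVMD7": (239, 247),
                  "VVMD27": (179, 185), "VVS2": (133, 143),
                  "VrZAG62": (188, 194), "VrZAG79": (245, 251)},
    "variety_B": {"VVMD5": (226, 238), "VVMD7": (239, 247),
                  "VVMD27": (179, 185), "VVS2": (133, 151),
                  "VrZAG62": (188, 194), "VrZAG79": (245, 251)},
}

# Invented sample; the allele order within a marker does not matter.
sample = {"VVMD5": (238, 226), "VVMD7": (239, 247),
          "VVMD27": (185, 179), "VVS2": (143, 133),
          "VrZAG62": (188, 194), "VrZAG79": (251, 245)}

matches = identify(sample, database)
print(matches)
```

Counting matching markers rather than returning only exact hits also gives a crude similarity measure between two genotypes, which is a much simpler quantity than a proper genetic distance.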

|
Evolutionary fate of retroposed gene copies in the human genome
Henrik Kaessmann (University of Lausanne)
We conducted a systematic survey to gauge to what extent the high rate
of retroposition in primates has generated young functional retrogenes
in humans. Extensive comparative sequencing and expression analyses as
well as evolutionary simulations suggest that a significant proportion
of retrocopies represent recent genes with potentially diverse
functions in testis, brain, and other organs. Evolutionary analyses
reveal that, following duplication, retrogenes obtain new functions as
a consequence of adaptive protein change driven by positive selection
and/or the evolution of new spatial or temporal expression patterns.
Our study points to a significant role of retroduplication in the
origin of young human genes and therefore of recently emerged
phenotypes in human evolution.

|
Comparative insect genomics
Evgeny Zdobnov (University of Geneva)
Insects are the largest and most diverse group of animals on Earth.
They greatly affect human agriculture and health, which has provided
strong justification for several whole-genome sequencing projects.
The considerable number of available genomes and their diversity, not
observed among comparable vertebrate species, make this group unique
for the quantification of evolutionary processes shaping animal
genomes. I will present the first comparative overview of these insect
genomes, focusing on the initial genome analysis of a highly social
animal, the honeybee Apis mellifera.

|
Phyloinformatics
in the genomic era: examples from the plant family Poaceae
Nicolas Salamin ( University of Lausanne )
Computational approaches making the most efficient
use of the large amount of genomic data now available are becoming
increasingly important. However, such data can serve many different
purposes, and three different applications related to this field
of research are presented here. First, we focus on the large amount
of genomic data present in public databases in the form of DNA
sequences and their utility to build part of the Tree of Life.
The computational part of this task requires efficiently combining
available DNA sequences for a set of species in order to maximise
both the number of species and gene regions available for analysis.
An economically important plant family, the grasses, is used to
highlight the advantages and shortcomings of different approaches.
Second, the evolution of a gene family encoding an essential step
of the photosynthetic pathway is described. Among the multiple
plant families using C4 photosynthesis, the grasses include the oldest
C4 species and contain the largest number of C4 species, including
species showing intermediate photosynthetic pathways. The evolution
of this photosynthetic system is analysed using a broad sampling
of grass species diversity, instead of the typical model grass
species. Methods to detect adaptive protein evolution are illustrated
with this gene family, and the effect of convergent evolution is
detected using simulations. Third, phylogenetic trees are now an
important tool in any genomic research, but it is essential to
keep in mind that any tree used is an estimate of the true evolutionary
history of the taxa at hand. Errors surrounding the topology
and the branch lengths should therefore be taken into account in any
analysis using phylogenetic trees. We present here an approach to estimate
the rate of duplication and extinction of genes within a gene family
by averaging over all the plausible trees for a set of DNA sequences.
To avoid specifying prior distributions on parameters, we use a
full frequentist approach based on an importance sampling scheme.
 |
The Orthologous Matrix (OMA) Project: Massive Cross-Comparison of Complete Genomes
Gaston H. Gonnet (ETH Zurich)
The OMA project is a large-scale effort to identify
groups of orthologs from complete genome data, currently covering 280 species.
Ortholog detection relies solely on protein sequence information
and does not require any human supervision. It has several original
features, in particular a verification step that detects paralogs
and prevents them from being clustered together. The paralogy detection
algorithm is provably correct and includes an interesting application
of maximum edge-weight cliques.
The resulting groups, whenever a comparison could be made, are highly consistent
both with EC assignments, and with assignments from the manually curated
database HAMAP. A highly accurate set of orthologous sequences constitutes
the basis for several other investigations, including phylogenetic analysis
and protein classification.
A complete set of orthologues also allows the assignment of orthologous genes
and large scale gene mapping between relatively close species. With these
gene maps we can reconstruct the synteny distance between species. The synteny
distance between species appears to be a remarkably accurate measure of distance.

|
The complex genetic ancestry of Humans.
Arndt von Haeseler (Center for Integrative Bioinformatics Vienna)
I. Ebersberger, Arndt von Haeseler (CIBIV-MFPL, Vienna, Austria) and
P. Galgoczy, S. Taudien, S. Taenzer, R. Lehmann, M. Platzer (FLI,
Jena, Germany)
The split of humans and chimpanzees approximately 5-6 million years
ago is generally taken as the starting point for the distinct
evolutionary histories of both species. Consequently, it is the
genetic changes that have accumulated since then in the genomes of
either species that are held responsible for the remarkably different
phenotypes of the contemporary species. However, for some regions of
our genome we are genetically more closely related to gorillas than to
chimpanzees. Vice versa, genomic regions exist where chimpanzees and
gorillas are each other's closest relatives. This suggests that the
processes that formed humans and chimpanzees are more complex than
usually considered.
Here, we report a whole-genome sample sequencing approach on the
genomes of gorilla, orang-utan and rhesus to shed light on the
intertwined genetic relationships of humans and the great apes.
Together with the genome sequences of humans and chimpanzees, we
analyze a total of 4.3 million base pairs from randomly chosen regions
of the human genome, corresponding to 7,600 sequence trees with three
species each. We estimate that about one third of our genetic
material, encompassing ~25% of our genes, is phylogenetically old.
That is, its ancestry predates the speciation of humans and traces
back to the ancient species we jointly shared with chimpanzees and
gorillas. Consequently, the "human-specific" evolution of these
genetic lineages and their associated phenotypes started long before
humans emerged as a species. This may lead to an explanation of
recurrent findings of very old human-specific morphological traits in
the fossil record, which predate the recent emergence of the human
species about 5 million years ago. Only a fraction of these ancient
lineages identifies chimpanzees as our closest genetic relatives,
explaining why evolutionary novelties can be exclusively shared among
species that are not each other's closest relatives. Our findings show
that a deeper understanding of human and chimpanzee evolution is
essentially dependent on insights into our genetic ancestry.

|
Medical laboratory data analysis: An application of machine
learning techniques to analyze the trends of biomarkers over time.
Andre Elisseeff (IBM Zurich Research Laboratory)
Most modern medical laboratories store patients' lab tests over time
(such as glucose levels, triglycerides, etc.) in databases. The
hospital of Desio in Italy, for instance, has about 2.5 million
patient records corresponding to several million tests performed in
the last ten years. Physicians have access to this database and can
monitor the evolution of a patient from a workstation. To detect
whether an observation is normal, they can check population-based
statistics and see how much it deviates from the average value: an
observation within the 95% confidence interval computed from a healthy
population of the same gender and age as the patient is usually
considered normal. Unfortunately, such an approach might overlook the
case that a patient has a medical problem. Consider the case of
glucose and assume that a patient has a normal glucose level (in the
20-30 percentile around the mean) and moves suddenly to another
glucose level (at the border of the 70-80 percentile around the mean)
within a year. From the population-based statistical perspective,
she/he will be considered normal since she/he does not leave the 5-95
percentile range. From a patient-based statistical perspective, on the
other hand, she/he should be watched carefully because her/his glucose
level shows an unexpected trend.
In this talk we will describe and motivate some statistical (machine
learning) methods we are currently developing with the hospital of
Desio in Italy to analyze and discover biomarker trends over time,
with the end goal of returning more patient-specific information to
the physician. We will see how machine learning naturally comes in and
discuss the practice of data analysis in a medical setting.
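The contrast between population-based and patient-based checks in the glucose example can be made concrete with a small sketch. It is not the method developed with the hospital: it assumes a normally distributed healthy reference population, uses a simple deviation from the patient's own history as the patient-based rule, and all numbers (population mean and spread, the patient's values, the threshold of 3) are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def population_flag(value, pop_mean, pop_sd, low=0.05, high=0.95):
    """Flag a value that falls outside the 5-95 percentile range of an
    (assumed normally distributed) healthy reference population."""
    percentile = NormalDist(pop_mean, pop_sd).cdf(value)
    return percentile < low or percentile > high


def patient_flag(history, value, threshold=3.0):
    """Flag a value that deviates from the patient's own earlier values
    by more than `threshold` of the patient's own standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two earlier observations")
    spread = stdev(history)
    if spread == 0:
        return value != history[0]
    return abs(value - mean(history)) / spread > threshold


# Hypothetical reference population and patient (glucose in mg/dL).
POP_MEAN, POP_SD = 90.0, 10.0
history = [83, 84, 82, 85, 83]   # stable, around the 24th percentile
new_value = 97                   # jumps to around the 76th percentile

flag_population = population_flag(new_value, POP_MEAN, POP_SD)
flag_patient = patient_flag(history, new_value)

print("population-based check flags the value:", flag_population)
print("patient-based check flags the value:   ", flag_patient)
```

With these numbers the new value stays inside the population range but lies far outside the patient's own variability, which is the situation the abstract describes.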

|
|
|