A 01
A1 |
Next generation sequencing (NGS) is a mixed blessing. On one hand, it provides researchers with unbiased, genome-wide access to potentially causative variants and gene expression patterns that may explain their pet disease/syndrome. On the other, it brings very high data volumes, challenging analysis scenarios, and access requirements for expensive IT infrastructure, biocurators and bioinformaticians. Recognising this need, we have developed targeted solutions ("pipelines") that enable researchers, basic and clinical alike, to extract a maximum of information from their experiments. In the growing number of successful cases, this has involved iterative collaboration to efficiently exploit in silico and in vivo mind-sets, one-on-one instruction sessions, and structured courses/workshops to disseminate best practice in this new art. Although challenging for the average researcher, it is becoming clear that the first part of the NGS data-crunching pipeline, providing lists of variants or gene expression levels, is now a commodity. The really challenging and time-consuming part is what happens next: information integration to solve biological problems and support clinical interpretation. A solution for both parts is a bioinformatics platform capable of translating data into knowledge, with access to an HPC, maintained software, and staffed by bioinformatics researchers capable of productively interacting with the broad customer base of paediatricians, oncologists, microbiologists, etc. The generalizability of data processing pipelines for variant calling and read quantitation implies that these should be standardized in a flexible manner and shared between platforms, freeing the bioinformatics and biocurator experts to concentrate on enabling the researchers to solve their biological problems. |
Stevenson B*, Kutalik Z, Pradervand S, Beckmann J, Famiglietti L, Xenarios I
*SIB Swiss Institute of Bioinformatics, Switzerland |
A - Applied and Translational Bioinformatics |
|
C 01
C1 |
Most psychiatric disorders are assumed to be highly complex and heterogeneous traits. Up to now, very little is known about the genetic architecture of such disorders.
We analysed 1678 individuals assessed by the psychiatric arm of the CoLaus study (Firmann et al., 2008) (PsyCoLaus). These cohort participants have been assessed for 105 common psychiatric disorders and scores, as well as other clinical characteristics. All individuals have been sequenced for 202 drug target genes (Nelson et al., 2012).
We started with simple trait versus single-SNP associations, using multiple regression models to discover potentially important regions. Sequencing data quality was assessed by comparing it with genotyping chip (Affy 500K) data for overlapping probes.
First results show that Bipolar I disorder, Cocaine dependence and the “dyadic adjustment” scale show genetic predisposition of rare variants. Most of the emerging genes have not yet been reported to play a key role in the corresponding disorders. However, we found rare variants associated with bipolar disease in the DYRK3 and SCN9A genes. The former has been shown to be significantly down-regulated in the brain of bipolar patients (ExpressionAtlas) and mutations in genes encoding voltage-gated sodium channels (incl. SCN9A) are significant factors in the etiology of neurological diseases and psychiatric disorders (Imbrici et al., 2013).
As a next step, we will apply rare variant burden tests to assess whether rare variants aggregate among individuals with psychiatric symptoms.
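A rare variant burden test of the kind planned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis of the study: rare variants in a gene are collapsed into a per-individual carrier indicator (a simple collapsing scheme chosen here for brevity), and carrier counts in cases and controls are compared with a one-sided Fisher exact test. The 1% MAF cutoff, the function names and the data layout are hypothetical.

```python
from math import comb


def carrier_status(genotypes, mafs, maf_cutoff=0.01):
    """Return 1 if the individual carries at least one rare allele, else 0.

    genotypes:  minor-allele counts (0, 1 or 2) per variant for one individual
    mafs:       minor allele frequency per variant, same order as genotypes
    maf_cutoff: hypothetical rarity threshold (an assumption, not from the study)
    """
    return int(any(g > 0 and maf < maf_cutoff for g, maf in zip(genotypes, mafs)))


def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the 2x2 table
    [[case carriers, case non-carriers], [control carriers, control non-carriers]]."""
    n_cases, n_controls = a + b, c + d
    n_carriers = a + c
    total = comb(n_cases + n_controls, n_carriers)
    upper = min(n_cases, n_carriers)
    return sum(
        comb(n_cases, k) * comb(n_controls, n_carriers - k)
        for k in range(a, upper + 1)
    ) / total


def burden_test(case_genotypes, control_genotypes, mafs, maf_cutoff=0.01):
    """Collapse rare variants per individual and test for an excess of carriers in cases."""
    a = sum(carrier_status(g, mafs, maf_cutoff) for g in case_genotypes)
    c = sum(carrier_status(g, mafs, maf_cutoff) for g in control_genotypes)
    b = len(case_genotypes) - a
    d = len(control_genotypes) - c
    return (a, b, c, d), fisher_one_sided(a, b, c, d)
```

Collapsing to a carrier indicator ignores effect direction and covariates; regression-based burden or variance-component tests would be needed to adjust for them.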
References:
Firmann, M., Mayor, V., Vidal, P.M., Bochud, M., Pecoud, A., Hayoz, D., Paccaud, F., Preisig, M., Song, K.S., Yuan, X., et al. (2008). The CoLaus study: a population-based study to investigate the epidemiology and genetic determinants of cardiovascular risk factors and metabolic syndrome. BMC Cardiovasc Disord 8, 6.
Imbrici, P., Camerino, D.C., and Tricarico, D. (2013). Major channels involved in neuropsychiatric |
Rüeger S*, Kutalik Z, Preisig M, Mooser V, Ehm MG
*DGM, UNIL, Switzerland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C 02
C2 |
Breast and ovarian cancers pose huge and unsolved challenges to the medical profession. Breast cancer is the most common cancer in women in the EU: a woman dies every 6 minutes from this disease. Ovarian cancer, whilst far less common, is often diagnosed at an advanced stage and has a 60% mortality rate. The EU FP7 consortium EpiFemCare aims to reduce the number of women diagnosed with late stage breast or ovarian cancer by 50%, reduce the number of women receiving unnecessary long-term chemotherapy by 50%, and reduce the number of women dying from these cancers by 20%.
EpiFemCare will develop blood tests based upon DNA methylation technology to facilitate early detection and prediction of therapeutic outcome. The project consists of three phases: (1) Epigenome-wide discovery of ovarian/breast cancer specific DNA methylation markers. (2) Development of serum based assays for cancer specific markers. (3) Validation of the test in thousands of serial samples from prospective clinical trials.
In phase 1 Infinium® HumanMethylation450 BeadChip technology is used to assess the methylation status of ~485’000 sites in cancer and control tissues. In parallel, Reduced Representation Bisulfite Sequencing (RRBS) is used to identify and confirm cancer specific methylated circulating DNA in matching serum samples. Using Genedata Expressionist® for Genomic Profiling, we have established an automated bioinformatics pipeline for the detection of cancer specific differentially methylated regions (DMRs) that most likely fulfill the strict specificity criteria of a serum based test. The most promising DMRs are taken forward to clinical assay development and validation.
|
Lempiäinen H*, Mertens D, Brandenburg A, Biegert A, Remmert M, Hayward J, Jones A, Anjum S, Widschwendter M, Flesch M, Hoefkens J, Rujan T, Wittenberger T
*Genedata AG, Switzerland |
C - Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
D 01
D1 |
The elucidation of the complex relationships linking genotypic and phenotypic variations to protein structure is a major challenge in the post-genomic era. We present MSV3d (Database of human MisSense Variants mapped to 3D protein structure), a new database that contains detailed annotation of missense variants of all human proteins (20199 proteins). The multi-level characterization includes details of the physico-chemical changes induced by amino acid modification, as well as information related to the conservation of the mutated residue and its position relative to functional features in the available or predicted 3D model. Major releases of the database are automatically generated and updated regularly in line with the dbSNP (database of Single Nucleotide Polymorphism) and SwissVar releases. The database (http://decrypthon.igbmc.fr/msv3d) is easily accessible through a simple web interface coupled to a powerful query engine and a standard web service. The content is completely or partially downloadable in XML or flat file formats. |
NGUYEN H*, Poch O
*IGBMC, France |
D - Databases, Ontologies, and Text Mining |
|
D 02
D2 |
Technological advances in genotyping and large-scale sequencing currently allow the identification of an increasing number of variations in the human genome, opening the way to personalized medicine. However, interpretation of these variations in terms of phenotypic causality remains a challenge. Association analysis requires tracking current knowledge about each variant of interest, both at the clinical and molecular level. The scientific literature is one primary source to be explored and text mining methodologies can help with the cumbersome task of finding relevant publications.
We developed a procedure which (1) queries PubMed for mutations and polymorphisms in all human genes, (2) filters the retrieved documents for precise mutation information using a set of fine-grained regular expressions, and (3) extracts data such as the type of variation (substitution, indel), the mutated site and its position in the genomic or protein sequence. Whenever possible, the extracted protein site is grounded on the corresponding sequence provided by UniProtKB. For the sites which cannot be grounded, the reliability of variant attribution is assessed by measuring the proximity of the queried gene mention with the extracted mutation pattern.
For each gene, the results are recorded in a tab-delimited file which displays the PubMed ID of the retrieved document, the wild type and mutated sites, the position in the sequence, the extracted pattern, as well as quality control indicators such as the sequence grounding and the gene mention proximity. When available, phenotypic information is displayed, coming either from the UniProtKB annotation if the site is grounded, or from MeSH terms of disease category provided by the PubMed abstract.
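The filtering, extraction and grounding steps described above can be illustrated with a small sketch. The regular expression below is a deliberately simplified, hypothetical pattern for protein substitutions (for example "p.Arg117His" or "G551D"); it is not one of the fine-grained expressions used in the procedure, and the function names and output fields are assumptions made for illustration.

```python
import re

THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
AA3 = "|".join(THREE_TO_ONE)
AA1 = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical pattern: three-letter form (optionally prefixed by "p.") or one-letter form.
SUBSTITUTION = re.compile(
    rf"\b(?:p\.)?(?P<wt3>{AA3})(?P<pos3>\d+)(?P<mut3>{AA3})\b"
    rf"|\b(?P<wt1>[{AA1}])(?P<pos1>\d+)(?P<mut1>[{AA1}])\b"
)


def extract_substitutions(text):
    """Return wild-type residue, position, mutant residue and matched pattern per hit."""
    hits = []
    for m in SUBSTITUTION.finditer(text):
        if m.group("wt3"):
            wt = THREE_TO_ONE[m.group("wt3")]
            pos = m.group("pos3")
            mut = THREE_TO_ONE[m.group("mut3")]
        else:
            wt, pos, mut = m.group("wt1"), m.group("pos1"), m.group("mut1")
        hits.append({"wild_type": wt, "position": int(pos), "mutant": mut,
                     "pattern": m.group(0), "start": m.start()})
    return hits


def is_grounded(hit, protein_sequence):
    """True if the wild-type residue sits at the stated 1-based position of the sequence."""
    i = hit["position"] - 1
    return 0 <= i < len(protein_sequence) and protein_sequence[i] == hit["wild_type"]


def gene_proximity(text, gene, hit):
    """Character distance from the mutation pattern to the nearest gene mention, or None."""
    starts = [m.start() for m in re.finditer(re.escape(gene), text)]
    return min((abs(s - hit["start"]) for s in starts), default=None)
```

A one-letter pattern such as this also matches strings like cell line names, which is one reason sequence grounding and gene mention proximity are useful as quality control indicators.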
|
Veuthey A*, Bridge A, Bougueleret L, Xenarios I
*SIB-Swiss Institute of Bioinformatics, Switzerland |
D - Databases, Ontologies, and Text Mining |
|
D 03
D3 |
SwissRegulon portal (www.swissregulon.unibas.ch) is a repository of databases and bioinformatics tools related to transcription regulatory processes. It includes:
SwissRegulon: A database of genome-wide annotations of regulatory sites. We currently have annotations for 17 prokaryotes and 3 eukaryotes (including human and mouse) in our collection.
PhyloGibbs: An algorithm for inferring regulatory motifs and regulatory sites from collections of DNA sequences, including multiple alignments of orthologous sequences from related organisms.
ISMARA: Integrated System for Motif Activity Response Analysis is a free online tool that models genome-wide expression data in terms of our genome-wide annotations of regulatory sites.
TCS: A database of predicted two-component signaling interactions across bacterial genomes. |
Pachkov M*, Balwierz P, van Nimwegen E
*University of Basel, Switzerland |
D - Databases, Ontologies, and Text Mining |
|
D 04
D4 |
The Universal Protein Resource UniProt provides the scientific community with a stable, comprehensive, classified, richly and accurately curated catalog of proteins. UniProtKB, the knowledgebase produced by the consortium, offers a central access point for integrated protein information with cross-references to multiple resources. It is composed of a manually curated section, Swiss-Prot, and its automatically curated complement, TrEMBL. The manual curation of the human proteome is a priority of the consortium and includes the integration of information extracted from the literature and the thorough analysis of protein sequences. We continually revisit human UniProtKB/Swiss-Prot entries as knowledge evolves, updating functional and sequence annotation including variants and their association with diseases. We also increase the number of manually reviewed alternative products, and correct erroneous sequences. The update is done in the frame of collaborations with external resources including HGNC for gene nomenclature, OMIM for diseases, and HAVANA, Ensembl, RefSeq and the Consensus CoDing sequence projects (CCDS) for sequence curation. We revise gene model annotations in conjunction with these resources in order to improve the quality, coverage, and consistency of the sequences we provide. Furthermore, we produce Gene Ontology annotations to ease the retrieval and exchange of biological knowledge, optimizing the curation of each piece of data we go through. All human protein-coding genes have been manually reviewed and are described within 20,226 entries as of release 2013_01 of UniProtKB/Swiss-Prot. More than 50’000 alternative products that have not been manually curated yet are available in the TrEMBL section of UniProtKB and can be retrieved as part of the complete human proteome. |
Breuza L*
*SIB-Swiss Institute of Bioinformatics, Switzerland |
D - Databases, Ontologies, and Text Mining |
|
D 05
D5 |
UniProtKB/Swiss-Prot, the manually curated section of the UniProt Knowledgebase (http://www.uniprot.org/) records information on human protein variants and genetic diseases. Relevant data are manually retrieved from publications, particularly focusing on characterized single amino-acid polymorphisms (SAPs), their functional consequences, and association with disease. The complete index of all SAPs and their classification is available at http://www.uniprot.org/docs/humsavar. The current release contains over 67’400 SAPs, classified into disease–associated variants, variants of unknown pathological significance, and benign polymorphisms. In total, 6% of all UniProt variants are associated with annotations describing their impact on protein function.
In order to facilitate the integration of our curated variant data with that from other resources, UniProt SAPs are mapped to reference nucleotide sequences from RefSeq and Locus Reference Genomic (LRG) sequences (http://www.lrg-sequence.org/) and are submitted to specialized Locus Specific Databases (LSDBs) as well as to the dbSNP repository. Part of this work has been carried out in the frame of Gen2Phen (http://www.gen2phen.org/), a collaborative project aiming to unify genetic variation databases.
Curated information on variants is linked to disease descriptions in UniProtKB/Swiss-Prot records (annotated in the “Involvement in disease” subsection of the “General Annotation” section). An index of genetic diseases containing disease names, synonyms, descriptions, and cross-references to OMIM phenotypes and MeSH terms, is available at http://www.uniprot.org/docs/humdisease. We also plan to standardize the annotation of functional consequences of variants using selected terms from existing ontologies.
|
Famiglietti L*
*SIB-Swiss Institute of Bioinformatics, Switzerland |
D - Databases, Ontologies, and Text Mining |
|
E 01
E1 |
The transformer (tra) gene is a key regulator in the signaling hierarchy controlling all aspects of somatic sexual differentiation in Drosophila and other insects. This central role leads to strong and varying selection pressures and dynamic turnover of the regulatory mechanisms at the base of the signaling cascade among insect lineages. In the honey bee Apis mellifera, tra underwent gene duplication followed by adaptive and balancing selection in one of the paralogs. We report that six of the seven sequenced ants also have two copies of tra. Surprisingly, the two paralogs are always more similar within species than among species. Comparative sequence analyses indicate that this pattern is due to ongoing concerted evolution after an ancestral duplication rather than independent duplications in each of the six species. In particular there is strong support for inter-locus recombination between the paralogs of the ant Atta cephalotes. In the five species where the relative location of paralogs is known, they are adjacent to each other in four cases and separated by only few genes in the fifth case. Because there have been extensive genomic rearrangements in these lineages, this suggests selection acting to conserve their synteny. In three species we also find a signature of positive selection in one of the paralogs. In three honey bee species where the tra gene is also duplicated, the copies are adjacent and in at least one species there was recombination between paralogs. These results suggest that concerted evolution plays an adaptive role in the evolution of this gene family. |
Privman E*, Wurm Y, Keller L
*University of Lausanne, Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 02
E2 |
Comparative genomics studies estimated that at least 5% of the human genome is under purifying selection, while only 2% of it codes for proteins. This implies that there is a repertoire of functionally relevant elements at least as large as the repertoire of protein-coding genes. However, the function of the majority of these elements, termed Conserved Non-Coding sequences (CNCs), remains poorly understood. Their evolutionary history also remains enigmatic: only a handful of the vertebrate conserved elements were identified outside the Vertebrata subphylum. Changes in evolutionary rates in some CNCs were also observed between lineages. This could be indicative of an additional selection constraint, possibly due to neofunctionalization. The identification of such elements can be an opportunity to assess the adaptive innovations provided by CNCs. We set up a pipeline, OrthoGA, that makes use of the rich sampling of available genomes to retrieve conserved non-coding sequences. Glocal alignments are made on the basis of colinear blocks, in order to gain sensitivity in the alignment of distantly related species. |
Dousse A*
*University of Geneva, Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 03
E3 |
Different regions of protein-coding DNA evolve under different selective constraints and may exhibit varying biases. Here we present a new approach for maximum likelihood (ML) phylogeny inference using multiple codon models for different gene regions. This enables researchers to create realistically complex scenarios by introducing model configurations that combine existing models on a per-site basis. These models can share some of their parameters to allow for more flexible hypothesis testing. CodonPhyML estimates one single ML tree that best fits models and data. A researcher could, e.g., use one instance of M0 for the tandem repeat regions in an MSA and another instance of M0 for the rest of the data, and test whether the ratio of nonsynonymous to synonymous changes is significantly different between those two regions. To illustrate the methodology we infer the origin of LRR regions from the type III effectors GALA-LRR of phytopathogenic R. solanacearum. These bacterial LRRs are hypothesized to have originated by a lateral transfer.
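One common way to frame the two-region hypothesis above is a likelihood ratio test between nested configurations: a null in which both regions share one nonsynonymous/synonymous ratio, and an alternative in which each region has its own. The sketch below assumes that the two maximized log-likelihoods have already been obtained from separate fits and that the configurations differ by exactly one free parameter, so that a chi-square distribution with one degree of freedom is a reasonable approximation. The function name is hypothetical and the snippet does not call CodonPhyML.

```python
from math import erfc, sqrt


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one free parameter.

    lnl_null: maximized log-likelihood with a single shared ratio
    lnl_alt:  maximized log-likelihood with one ratio per region
    Returns (statistic, p_value); the p-value uses the chi-square(1)
    survival function sf(x) = erfc(sqrt(x / 2)).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))
    return stat, erfc(sqrt(stat / 2.0))
```

A small p-value would suggest that the two regions evolve under different selective constraints; with more than one extra parameter the degrees of freedom change and this closed form no longer applies.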
The method is implemented in CodonPhyML, an extension of PhyML that allows the use of codon models for estimating maximum likelihood trees and testing hypotheses on multiple sequence alignments. |
Zoller S*, Anisimova M
*ETHZ, Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 04
E4 |
The OMA (Orthologous MAtrix) database is one of the largest resources for identifying orthologs among publicly available complete genomes. We are now making its clustering pipeline available as a stand-alone program that can be used to identify orthologous sequences in arbitrary datasets. To speed up analyses of their own datasets in combination with publicly available genomes, users can download precomputed Smith-Waterman alignment results from the OMA Browser, since computing these alignments is the most time-intensive task of the pipeline. OMAstandalone is freely available for Linux and Mac OS X and can be downloaded from http://omabrowser.org/standalone.
|
Altenhoff A*, Zoller S, Dalquen D, Gonnet G, Dessimoz C
*SIB, ETH Zurich, Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 05
E5 |
Tandem repeats (TRs) are consecutive duplicates of genomic sequences. They represent one of the most frequent sequence features in both coding and non-coding DNA. TRs evolve through expansion and deletion of repeat units, with mutation rates found to be six orders of magnitude higher than point mutation rates. It has been argued that these high mutation rates may lead to a pool of variation also in protein TRs, constituting a source for rapid adaptation to fast changing environments. A fast succession of TR unit expansions and deletions on the population scale would eliminate conservation of TRs across species. We examined the conservation of human TRs across the eukaryotic clade, tracing the evolutionary history of every single repeat unit by means of a comparative phylogenetic analysis. Our results reveal a high and complete conservation of a large number of TRs within the eukaryotic clade, thus showing a strong discrepancy with common beliefs on TR evolution. |
Schaper E, Gascuel O, Anisimova M*
*ETH Zurich, Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 06
E6 |
Tandem repeats (TRs) are often present in proteins with crucial functions, responsible for resistance and pathogenicity, and associated with infectious or neurodegenerative diseases. This motivates numerous studies of TRs and their evolution, requiring accurate multiple sequence alignment. TRs may be lost or inserted at any position of a TR region by replication slippage or recombination. However, current methods assume fixed domain boundaries and are nevertheless of high complexity. We present a new global graph-based alignment method that does not restrict TR events by TR unit boundaries. TR indels are modeled separately, and penalized using the phylogeny-aware alignment algorithm. This ensures enhanced accuracy of reconstructed alignments, disentangling TRs and measuring indel events and rates in a biologically meaningful way. Our method detects not only duplication events, but all changes in TR regions due to recombination, strand slippage, and other events inserting or deleting TR units. We evaluate our method by simulation incorporating TR evolution, by either sampling TRs from a profile hidden Markov model or by mimicking strand slippage with duplications. |
Szalkowski A*
*ETH Zürich, Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 07
E7 |
Tandem repeats in eukaryotic gene promoters can change gene expression drastically due to their extremely low stability. We hypothesized that unstable tandem repeats in promoters increase expression divergence along the primate phylogeny. A search for tandem repeats in promoter regions of 13,000 human, chimpanzee and macaque orthologous genes revealed that 30 % of primate promoters contain tandem repeats. Genes driven by these repeat-containing promoters show significantly higher rates of expression divergence. More specifically, we found a significant correlation between repeat instability and expression divergence. We showed a similar association in gene duplicate expression divergence. This relation might explain gene expression divergence in more special cases, as in disease formation. |
Bilgin T*, Wagner A
*University of Zurich, Switzerland |
E - Evolution, Phylogeny, and Comparative Genomics |
|
E 08
E8 |
Positive selection can increase the rate at which deleterious mutations accumulate in a hitchhiking region. However, the importance of balancing selection in increasing the rate at which deleterious mutations accumulate has not been investigated. Here we investigate how strong balancing selection at the human HLA genes influences the evolutionary dynamics of closely linked loci. Our expectation is that strong selection at HLA loci may interfere with the efficacy of selection in removing deleterious variants at closely linked loci. By analyzing the ratio of deleterious to synonymous polymorphisms we were able to show that loci close to the strongly selected HLA genes show significantly increased levels of polymorphism and harbor an excess of deleterious variants. Surprisingly, when we applied the McDonald-Kreitman test by incorporating divergence with respect to the rhesus genome, we found that a substantial part of the deleterious variation reaches fixation over long timespans, suggesting that selection at the HLA genes may be interfering with both the transient patterns of polymorphism and substitution processes.
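For readers unfamiliar with the McDonald-Kreitman framework mentioned above, its core quantities can be computed from four counts. The sketch below is a generic illustration, not the analysis of this study: it takes nonsynonymous and synonymous polymorphism counts (pn, ps) and nonsynonymous and synonymous fixed differences (dn, ds), and returns the neutrality index NI = (pn/ps)/(dn/ds) together with alpha = 1 - NI. The function name and the example counts are hypothetical.

```python
def mcdonald_kreitman(pn, ps, dn, ds):
    """Neutrality index and alpha from polymorphism (p) and divergence (d) counts.

    pn, ps: nonsynonymous and synonymous polymorphisms within the species
    dn, ds: nonsynonymous and synonymous fixed differences to the outgroup
    NI > 1 indicates an excess of nonsynonymous polymorphism relative to divergence.
    """
    if ps == 0 or dn == 0 or ds == 0:
        raise ValueError("ps, dn and ds must be non-zero to form the ratios")
    ni = (pn / ps) / (dn / ds)
    alpha = 1.0 - ni
    return ni, alpha


# Invented counts for illustration only.
ni, alpha = mcdonald_kreitman(pn=20, ps=10, dn=15, ds=30)
```

Significance of the underlying 2x2 table is usually assessed separately, for example with a Fisher exact or chi-square test.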
|
Meyer D*, Mendes FH
*University of Sao Paulo, Brazil |
E - Evolution, Phylogeny, and Comparative Genomics |
|
F 01
F1 |
The Protein Model Portal (PMP) has been developed as an open platform to foster effective use of molecular models in biomedical research by providing convenient and comprehensive access to structural information for a specific protein. For the first time both experimental structures and theoretical models for a given protein can be searched simultaneously, and analyzed for structural variation. The current release, which is updated at least once a month, allows searching 19.5 million model structures for 4.4 million distinct UniProt entries (UP release 2013_06).
Ultimately, the accuracy of a structural model determines its utility for specific applications. Hence, model quality estimation tools assist in evaluating the accuracy of generated models. We, thus, present new developments in Protein Model Portal supporting model validation and quality estimation, which consist of (1) continuously extended service interfaces to several established modeling and model quality estimation tools (2) a novel analysis tool for protein structure variation for both models and experimental structures and (3) the CAMEO (Continuous Automated Model EvaluatiOn, www.cameo3d.org) system for the continuous evaluation of servers predicting 3D protein structures, ligand binding site residues and the recent extension to model quality assessment programs (MQAPs). By providing a comprehensive view on structural information, the Protein Model Portal not only offers a unique environment to apply consistent assessment and validation criteria to the complete set of structural models available for a specific protein, but also allows continuous assessment of the modeling and quality estimation services registered with CAMEO.
Visit us at www.proteinmodelportal.org!
|
Haas J*, Roth S, Bordoli L, Schwede T
*Biozentrum University of Basel & SIB, Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 02
F2 |
In our assessment of the CASP9 TBM category we introduced the Local Distance Difference Test (lDDT), which evaluates how well the inter-atomic distances in the target protein structures are reproduced in the prediction models. Here we introduce an improved version, which includes checks of the stereo-chemical quality of protein structures and allows the use of multiple reference structures. |
Mariani V*, Biasini M, Barbato A, Schwede T
*Swiss Institute of Bioinformatics / Biozentrum, Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
F 03
F3 |
Interactions between proteins and their ligands play crucial roles in many biological processes, such as metabolism, signaling, transport, regulation or molecular recognition. Understanding the molecular basis of protein-ligand interactions is thus of great interest, not only for modeling complex biological systems but also for applications in drug discovery. Computational methods have become increasingly important for investigating biological systems at an atomistic level. Our approach aims at better understanding the molecular basis of disease related viral methyltransferases, their interactions with small molecules and their catalytic mechanism.
Dengue fever is one of the most important, rapidly emerging infectious diseases in many areas of the world causing significant mortality and morbidity in humans. Currently, neither vaccines nor specific drug treatments are available. The dengue virus methyltransferase is a viral enzyme crucial for its replication and is thus an attractive target for drug discovery. Despite this fact, little is known about the mechanism underlying its function. Hence, we apply computer simulations to investigate the effect of protein sequence variations. Molecular dynamics (MD) simulations and mixed quantum mechanics / molecular mechanics (QM/MM) calculations are employed to investigate the mechanisms of the enzymatically catalyzed reaction at an atomistic level. Based on a structural model of the target protein in complex with its RNA substrate, the impact of mutations on ligand binding, geometric arrangements and reaction energy barriers are evaluated computationally. Based on these results, novel mutations were suggested and subsequently validated experimentally. |
Schmidt T*, Mahi M, Meuwly M, Schwede T
*SIB & Biozentrum, University of Basel, Switzerland |
F - Macromolecular Structure, Dynamics and Function |
|
G 01
G1 |
To better understand how genetic variation influences inter-individual variation in food perception, food choice and metabolism we have recruited a panel of 600 ethnically-diverse adults from Sao Paulo, Brazil.
Panelists underwent extensive taste sensitivity and food preference tests, provided information on food habits and medical history and donated samples for metabolomic analysis. Here we report the results of GWAS on the taste perception and urine metabolism phenotypes.
Analysis of the taste data identified a new locus for the perceived bitterness of caffeine (p = 8.6 × 10^-12) and showed that this locus is independent of the previously reported loci for the perceived bitterness of quinine and the bitter compound PROP. We further find that the effect size of the quinine locus depends on quinine concentration. This now brings to three the number of loci impacting human bitter taste perception. By contrast we found no evidence for a genetic impact on the perception of sweet, sour, salty or umami compounds.
For the analysis of the urine NMR data we employed a genome-wide metabolome-wide approach that allows the extraction of an NMR metabolite signature for each tested SNP. Many of the identified associations replicate previously identified metabolite-gene associations, demonstrating the reliability of this untargeted approach. Using cross-replication with an equivalent data set available from the CoLaus cohort we further identified two new associations that link urine concentrations of lysine (p = 6.9 × 10^-44) and fucose (p = 1.2 × 10^-33) to genetic loci for chronic kidney disease and Crohn’s disease, respectively.
|
Genick U*, Ledda M, Rueedi R, Kutalik Z, Bergmann S
*n/a, Switzerland |
G - Mutations, Variations, and Population Genomics |
|
G 02
G2 |
Many single nucleotide polymorphisms influence common disease susceptibility but, to date, even cumulatively they explain only part of the heritability. As large and rare CNVs were also found to be associated with common diseases/traits, it was suggested that part of the missing heritability could be accounted for by rare variants with intermediate penetrance. The extent of this contribution, however, remains largely unknown. To address this question, we are collecting cohorts genotyped on Metabochip and other Illumina platforms to identify new rare or short CNVs associated with BMI and other complex traits.
We assessed CNV detection sensitivity, specificity and optimal filtering parameters of PennCNV calls on two cohorts of 300 and 1’000 unrelated adults genotyped on both Metabochip and OmniExpress or Omni2.5. Assuming that CNVs called in high probe-density regions of the Metabochip are genuine, we examined the concordance with results of the OmniExpress or Omni2.5 platform. We assessed how different filtering parameters, such as length, number of probes, confidence score, influence true positive and false discovery rates (TPR, FDR). We then defined thresholds that gave an optimized CNV call reliability. TPR and FDR typically decreased as the minimum length threshold increased, revealing the difficulty in detecting CNVs smaller than 20kb (TPR=0.034, FDR=0.775) on Omni chips.
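The effect of filtering parameters on TPR and FDR described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: reference CNVs (for example those called in high probe-density Metabochip regions) are treated as truth, a call counts as concordant if it overlaps a reference CNV on the same chromosome by at least one base, and the class, field and parameter names are hypothetical rather than PennCNV output fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CNV:
    chrom: str
    start: int
    end: int
    n_probes: int = 0
    confidence: float = 0.0

    @property
    def length(self):
        return self.end - self.start


def overlaps(a, b):
    """True if two CNVs share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def tpr_fdr(calls, truth, min_length=0, min_probes=0, min_confidence=0.0):
    """Filter calls by the given thresholds, then compare them with the reference set.

    TPR = reference CNVs recovered by a kept call / all reference CNVs
    FDR = kept calls without any reference overlap / all kept calls
    """
    kept = [c for c in calls
            if c.length >= min_length
            and c.n_probes >= min_probes
            and c.confidence >= min_confidence]
    found = sum(any(overlaps(t, c) for c in kept) for t in truth)
    false_calls = sum(not any(overlaps(c, t) for t in truth) for c in kept)
    tpr = found / len(truth) if truth else float("nan")
    fdr = false_calls / len(kept) if kept else float("nan")
    return tpr, fdr
```

Scanning a grid of thresholds with such a function is one way to choose filters that balance the two rates; a reciprocal-overlap criterion would be a stricter alternative to the single-base overlap assumed here.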
We performed a preliminary genome-wide CNV association meta-analysis (N=8295) based on our filtering. We found several promising hits to replicate in additional Metabochip-genotyped cohorts. To the best of our knowledge, this is the first effort to discover small CNVs associated with adult BMI.
|
Macé A*, Männik K, Magi R, Schurmann C, Teumer A, Homuth G, Jacquemont S, Beckmann J, Metspalu A, Kutalik Z, Reymond A
*UNIL, Switzerland |
G - Mutations, Variations, and Population Genomics |
|
G 03
G3 |
While allele specific expression (ASE) is expected to result from genetic regulatory variants, a proper estimation and dissection of the causes has not been performed to date. In this study we used RNA-seq data from fat, LCLs and skin from ~400 female MZ and DZ twin pairs (2330 RNA-seq samples in total) to quantify ASE and to dissect its underlying causes. ASE may be caused by genetic or epigenetic/environmental factors. To measure the relative contribution of the underlying causes of allelic expression we estimated the variance components of the ASE ratios using the identity-by-descent (IBD) status of the twin pairs at the ASE site and the identity-by-state (IBS) status at the best eQTL. We found that about 53% of the variance in ASE is due to the effect of the best eQTL, 5% to the additive effect of the other genetic variants in cis, 24% to the interaction between cis and trans variants and 16% to the individual environment. The additive trans and the shared environmental effects were negligible. There were small differences among tissues. The sum of all the genetic effects gives an average heritability estimate of 80% for fat, 89% for LCL and 84% for skin. Our results show a complex genetic architecture for allelic expression that identifies GxG and putative GxE effects. We utilized the twin structure of our sample to look for examples of GxE interactions. |
Buil A*, Brown A, Viñuela A, Davies M, Zheng HF, Richards JB, Small K, Durbin R, Spector TD, Dermitzakis ET
*University of Geneva, Switzerland |
G - Mutations, Variations, and Population Genomics |
|
G 04
G4 |
DNA methylation is an essential epigenetic mark whose role in gene regulation and whose dependency on genomic sequence and environment are not yet fully understood. In this study we provide novel insights into the mechanistic relationships between genetic variation, DNA methylation and transcriptome sequencing data in three different cell types of the GenCord human population cohort. We find that the associations between DNA methylation and gene expression variation among individuals are likely due to different mechanisms from those establishing methylation-expression patterns during differentiation. Furthermore, cell-type differential DNA methylation may delineate a platform in which more local inter-individual changes may respond to or act in gene regulation. We show that, unlike genetic regulatory variation, DNA methylation alone does not significantly drive allele-specific expression. Finally, inferred mechanistic relationships using genetic variation as well as correlations with TF abundance reveal both a passive and an active role of DNA methylation in regulatory interactions influencing gene expression.
|
Gutierrez-Arcelus M*, Lappalainen T, Montgomery SB, Buil A, Ongen H, Yurovsky A, Bryois J, Padioleau I, Romano L, Bielser D, Planchon A, Falconnet E, Borel C, Letourneau A, Makrythanasis P, Gagnebin M, Guipponi M, Gehrig C, Antonarakis SE, Dermitzakis ET
*University of Geneva, Switzerland |
G - Mutations, Variations, and Population Genomics |
|
G 05
G5 |
Genetic variation can affect the mapping of RNA-seq reads, because reads that carry the nonreference allele have a lower probability of mapping correctly to the reference genome. We simulated this allelic read mapping bias and examined its influence on the quantification of gene expression and on eQTL discovery.
We simulated all the potential 50 bp RNA-seq reads (1.248 B) with haplotypes for 8.6 M SNPs and 872 K indels (MAF>1%) from the 1000 Genomes Project. After mapping to the reference genome we found a >5% difference in mapping of reference and nonreference reads in 12.58% of the SNPs and 45.56% of the indels, as well as 63 M biased genomic loci.
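One way to flag a biased variant is sketched below in Python. The abstract does not define how the ">5% difference" is computed, so this sketch assumes it is the absolute difference between the fractions of reference-allele and nonreference-allele reads that map correctly. The function names and counts are hypothetical.

```python
def mapping_rate_difference(ref_mapped, ref_total, alt_mapped, alt_total):
    """Absolute difference between reference and nonreference mapping rates."""
    if ref_total <= 0 or alt_total <= 0:
        raise ValueError("read totals must be positive")
    return abs(ref_mapped / ref_total - alt_mapped / alt_total)


def is_biased(ref_mapped, ref_total, alt_mapped, alt_total, threshold=0.05):
    """True if the two mapping rates differ by more than `threshold` (assumed 5%)."""
    diff = mapping_rate_difference(ref_mapped, ref_total, alt_mapped, alt_total)
    return diff > threshold


# Toy counts of simulated reads overlapping two hypothetical variants.
strongly_biased = is_biased(ref_mapped=100, ref_total=100, alt_mapped=90, alt_total=100)
mildly_affected = is_biased(ref_mapped=98, ref_total=100, alt_mapped=96, alt_total=100)
```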
We investigated whether the observed bias can affect gene quantifications and eQTL associations in real data. We used 185 individuals with RNA-seq from LCLs and imputed genotypes from 6.9 M SNPs. We quantified 78,595 exons and discovered 3,372 eQTLs at 10% FDR. After filtering away reads mapped to biased positions we measured 78,281 exons with very high (99.65%) correlation and we mapped 3,323 eQTLs, of which 3,253 are shared; 119 were lost and 70 were gained after filtering. Comparison of original and filtered p-values showed only slight differences. However, some of the eQTLs that were lost are likely to be false associations, as their significance dropped dramatically.
To summarize, allelic mapping bias does not severely affect eQTL associations. Nevertheless, some of them are likely to be false positives and correcting for these effects will lead to more accurate results.
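The shared/lost/gained comparison reported above amounts to simple set operations. The Python sketch below uses invented identifiers and a hypothetical helper, `compare_eqtl_sets`, and then checks that the counts quoted in the abstract are mutually consistent.

```python
def compare_eqtl_sets(original, filtered):
    """Split two collections of eQTL identifiers into shared, lost and gained."""
    original, filtered = set(original), set(filtered)
    return {
        "shared": original & filtered,   # found before and after filtering
        "lost": original - filtered,     # found only before filtering
        "gained": filtered - original,   # found only after filtering
    }


# Toy identifiers (hypothetical exon:SNP pairs).
before = {"exon1:rs1", "exon2:rs2", "exon3:rs3", "exon4:rs4"}
after = {"exon1:rs1", "exon2:rs2", "exon5:rs5"}
result = compare_eqtl_sets(before, after)

# Consistency check of the counts quoted in the abstract.
n_original, n_filtered, n_shared = 3372, 3323, 3253
n_lost = n_original - n_shared     # eQTLs present only before filtering
n_gained = n_filtered - n_shared   # eQTLs present only after filtering
```

With the abstract's figures, `n_lost` and `n_gained` evaluate to 119 and 70, matching the reported numbers.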
|
Panousis N*, Gutierrez-Arcelus M, Dermitzakis E, Lappalainen T
*University of Geneva, Switzerland |
G - Mutations, Variations, and Population Genomics |
|
H 01
H1 |
This R package helps with quality checks, visualizations and analysis of mass spectrometry data coming from proteomics experiments. The package is developed, tested and used at the Functional Genomics Center Zurich. We use this package mainly for prototyping, teaching, and having fun with proteomics data. But it can also be used to do solid data analysis for small-scale data sets. |
Panse C*, Grossmann J, Barkow S
*Functional Genomics Center Zürich, Switzerland |
H - Protein Interactions, Molecular Networks, and Proteomics |
|
I 01
I1 |
To understand the function of hundreds of RNA-binding proteins (RBPs) encoded in animal genomes, it is important to identify their target RNAs. Although the binding specificity is generally accepted to be well described in terms of the nucleotide sequence of the binding sites, other factors such as the structural accessibility of binding sites or their clustering, enabling the binding of RBP multimers, are also believed to play a role. Here we focus on GLD-1, a translational regulator of C. elegans, whose targets have been studied with a variety of methods such as CLIP, RIP-Chip, profiling of polysome-associated mRNAs and biophysical determination of binding affinities of GLD-1 for short nucleotide sequences. We show that a simple biophysical model explains the binding of GLD-1 to mRNA targets to a large extent and that taking into account the accessibility of putative target sites significantly improves the prediction of GLD-1 binding, particularly due to a more accurate prediction of binding in transcript coding regions. Relating GLD-1 binding to translational repression and stabilization of its target transcripts we find that binding sites along the entire transcripts contribute to functional responses, in particular binding sites located in the coding region of transcripts appear to function in translation repression. Finally, biophysical measurements of GLD-1 affinity for a small number of oligonucleotides appear to allow an accurate reconstruction of the sequence specificity of the protein. This approach can be applied to uncover the specificity and function of other RBPs. |
Brümmer A*, Zavolan M
*Biozentrum, University of Basel & SIB, Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 02
I2 |
The cancer genome projects have released a large number of recurrently mutated genes in cancer tissue. For most of these genes the function in cancer development is still unclear. We use RNAi and image-based genetic interaction screens to gain insight into the functional relationships of cancer genes. We show that recurrently mutated cancer genes cluster together more often than "normal" genes. Within the clusters, the individual genes are rarely mutated. Clustering them reduces the effort needed to (i) understand them and (ii) target them by drugs. In the future this may help to stratify cancer populations, and genetic interaction screens will provide knowledge about drug sensitivity and resistance. |
Fischer B*
*EMBL, Germany |
I - Regulation, Pathways, and Systems Biology |
|
I 03
I3 |
The recruitment of RNA Pol-II to specific sites in the genome called promoters is an essential step in eukaryotic gene regulation. Recently introduced genome-wide chromatin profiling assays have revealed a common chromatin architecture of eukaryotic promoters consisting of a nucleosome-free region bound by Pol-II and a positioned +1 nucleosome occurring at a conserved distance downstream from the transcription start site (TSS). In other respects, promoters are quite variable. Some have very focused initiation site patterns while others have highly dispersed ones. Promoters also differ by the presence or absence of core promoter elements such as the TATA-box.
The role of the positioned +1 nucleosome in the Pol-II recruitment process is not well understood. Specifically, the timing and causal relationship between nucleosome binding and Pol-II binding remains unclear. Here we show that TATA-less promoters have a strong sequence-intrinsic nucleosome positioning signal in the +1 nucleosome region, in both vertebrates and flies. This signal essentially consists of 10 bp periodic dinucleotide distributions reminiscent of those reported for yeast promoter nucleosomes. The strength of the signal is inversely proportional to the degree of TSS dispersion. Interestingly, the nucleosome-positioning signal is completely absent in TATA-box containing promoters. Together, these findings suggest that TATA-box binding and DNA sequence-induced nucleosome positioning are two mutually exclusive pathways of Pol-II recruitment and TSS selection in eukaryotic promoters. |
Dreos R*, Bucher P
*SIB - EPFL, Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 04
I4 |
For the most part metazoan genomes are highly methylated and harbour only small regions with low or absent methylation. In contrast, partially methylated domains (PMDs), recently discovered in a variety of cell lines and tissues, do not fit this paradigm as they show partial methylation for large portions (20%-40%) of the genome. While in PMDs methylation levels are reduced on average, we found that at single CpG resolution, they show extensive variability along the genome ranging from 0% to 100% in a roughly uniform fashion with only little similarity between neighboring CpGs. A comparison of various PMD-containing methylomes showed that these seemingly disordered states of methylation are strongly conserved across cell types for virtually every PMD. Comparative sequence analysis suggests that DNA sequence is the main determinant of these methylation states. This is further substantiated by a purely sequence based model which can predict 31% of the variation in methylation. The model revealed CpG density as the main driving feature, followed by various dinucleotides immediately flanking the CpG and a minor contribution from sequence preferences reflecting nucleosome positioning. Taken together we provide a reinterpretation for the nucleotide-specific methylation levels observed in PMDs, demonstrate their conservation across tissues and suggest that they are mainly determined by specific DNA sequence features. |
Gaidatzis D*, Burger L, Murr R, Lerch A, Dessus-Babus S, Schuebeler D, Stadler M
*Friedrich Miescher Institute for Biomedical Research, Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 05
I5 |
Defining active regulatory regions in the genome is a crucial step towards the understanding and modeling of transcriptional regulation. We have recently shown in mouse stem cells and neuronal progenitors that transcription factor binding to active regulatory regions leads to defined reduction in DNA methylation allowing for the identification of active regulatory regions in otherwise methylated parts of the genome. Here we present a computational method for the unbiased identification of such footprints from large-scale bisulfite-sequencing data. Our approach partitions the genome into fully methylated (FMRs), unmethylated (UMRs) and low-methylated regions (LMRs), while accounting for false methylation calls due to single-nucleotide variations. As an additional feature, our approach detects partially methylated domains, which represent a fourth class of methylation pattern that needs to be discriminated from LMRs and UMRs. By applying our method to publicly available mouse and human methylation datasets, we find that whereas methylation levels are mostly conserved at UMRs, which include many active promoters, methylation is highly dynamic at LMRs. These latter regions lie distal to transcription start sites, correlate between related cell types and show motif enrichments for tissue-specific transcription factors. These findings extend and generalize our previous results and suggest that the presented method provides a robust and reproducible approach for unbiased segmentation of basepair bisulfite methylomes and the discovery of regulatory elements from DNA methylation data. |
Burger L*, Gaidatzis D, Schuebeler D, Stadler M
*Friedrich Miescher Institute for Biomedical Research, Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
I 06
I6 |
Nucleosomes are the basic unit of chromatin, comprising a stretch of DNA of length 147 bp wrapped around a histone octamer. Since 70-90% of the eukaryotic genome is packaged into nucleosomes, they play a crucial role in modulating accessibility of transcription factor binding sites (TFBSs). Consequently, nucleosome positioning has profound effects on gene expression in eukaryotes.
Biophysical modeling predicts that competition between nucleosomes and transcription factors (TFs) for binding to nearby sites on the genome can induce both positive and negative cooperativity in TF binding. In particular, we show that the cooperative effect depends periodically on the distance between TFBSs, with positive cooperativity for sites less than 40 bp apart, negative cooperativity for larger distances up to one nucleosome length, and again positive cooperativity for distances just above one nucleosome length.
A comprehensive statistical analysis of TFBS positioning for 158 TFs of Saccharomyces cerevisiae shows that many pairs of TFs have positioned their binding sites so as to optimize positive cooperativity of their binding. Moreover, this positioning is most significant for a number of TFs that have already been implicated in opening chromatin. In summary, our results show that the "grammar" of the regulatory code in yeast promoters is shaped to a significant extent by nucleosome-mediated cooperativity of TFs. |
Ozonov E*, van Nimwegen E
*Biozentrum University of Basel & SIB, Switzerland |
I - Regulation, Pathways, and Systems Biology |
|
K 01
K1 |
Neisseria meningitidis is associated with septicemia and meningitis, occurring as endemic infections. The Gram-negative bacterium displays a high genetic diversity, and to date no strict core pathogenome has been defined. In prokaryotes, potential associations of epigenetic modifications with phenotypes are still poorly characterized. DNA methylation has been predominantly studied in the context of restriction modification (R-M) systems.
Here we exploited sequence homology to characterized genes to predict DNA methyltransferases in 6 completely sequenced strains of N. meningitidis.
Recent studies have linked DNA methylation with a complex genetic system termed the ‘phasevarion’ (phase-variable regulon), in which mutations in simple tandem repeat units control the expression of DNA methyltransferases. We present a novel method to infer the precise number of repeat units at specific tandem repeat loci exploiting increasing read lengths resulting from recent versions of large scale sequencing assays.
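The idea of reading the repeat number directly from reads that span the whole repeat tract can be sketched in Python as follows. This is an illustration only and not the method presented in the abstract: the flanking sequences, the repeat unit and the function names are hypothetical, and sequencing errors are ignored.

```python
import re
from collections import Counter


def count_repeat_units(read, left_flank, right_flank, unit):
    """Number of tandem copies of `unit` between the two flanks.

    Returns None if the read does not span the complete repeat tract,
    i.e. if both flanking sequences are not found around the repeats.
    """
    pattern = re.compile(
        re.escape(left_flank)
        + "((?:" + re.escape(unit) + ")*)"
        + re.escape(right_flank)
    )
    match = pattern.search(read)
    if match is None:
        return None
    return len(match.group(1)) // len(unit)


def repeat_count_histogram(reads, left_flank, right_flank, unit):
    """Tally the repeat number over all reads that span the locus."""
    counts = Counter()
    for read in reads:
        n = count_repeat_units(read, left_flank, right_flank, unit)
        if n is not None:
            counts[n] += 1
    return counts


# Toy reads with invented flanks and an invented 4-bp repeat unit.
LEFT, RIGHT, UNIT = "GACC", "GGAT", "AGCC"
reads = [
    "TT" + LEFT + UNIT * 5 + RIGHT + "CA",  # spans the tract: 5 units
    "TT" + LEFT + UNIT * 5 + RIGHT + "CA",  # spans the tract: 5 units
    "AA" + LEFT + UNIT * 6 + RIGHT + "TT",  # spans the tract: 6 units
    UNIT * 2 + RIGHT + "CA",                # left flank missing: ignored
]
histogram = repeat_count_histogram(reads, LEFT, RIGHT, UNIT)
```

Requiring both flanks in the same read is what makes longer reads useful here: a read that ends inside the tract gives only a lower bound on the repeat number and is discarded in this sketch.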
We have utilized single-molecule real-time (SMRT) sequencing technology (Pacific Biosciences) to establish genome wide DNA modification profiles of two closely related N. meningitidis strains. DNA modifications, and in particular DNA methylation as the most common DNA modification, are thereby detected based on a delay in the kinetics of the DNA synthesis in vitro. The methylated sequences as defined by the SMRT sequencing largely correspond to the (predicted) target sequences of DNA methyltransferases identified in N. meningitidis, and their phase-variable state. Restriction digest using methylation-sensitive enzymes further corroborates the SMRT results. Our approaches reveal a high diversity in DNA methylation between strains. Divergent DNA methylation profiles might thus link to different phenotypic consequences. |
Sater MRAS, Wang G, Clark T, Roeltgen K, Lamelas A, Mane S, Korlach J, Pluschke G, Schmid C*
*Swiss Tropical and Public Health Institute, Switzerland |
K - Sequencing and Sequence Analysis |
|
K 02
K2 |
Promoter identification is a key step in studying the regulation of gene expression. Recently, a novel differential RNA sequencing (dRNA-seq) method was developed to discover bacterial transcription start sites (TSSs) at a genome-wide scale. It uses 5’ mono-phosphate-dependent terminator exonuclease (TEX), which specifically degrades 5’ mono-phosphorylated RNA species such as processed RNA, mature rRNAs and tRNAs, whereas primary transcripts remain intact. This approach results in an enrichment of primary transcripts, allowing TSSs to be identified by comparison of the TEX-treated libraries to untreated control libraries. So far, an automated computational method to identify TSSs based on dRNA-seq data has not been available, and TSS identification has been done to a great extent manually. To support future analyses of dRNA-seq data, we here introduce a rigorous computational method that helps to identify a large proportion of bona fide TSSs with relative ease. Our method is based on quantifying the 5’ enrichment of transcription start sites and also the significance of their expression relative to nearby putative TSSs. We have benchmarked our method on several recently published data sets and demonstrated that it enables accurate and automated TSS identification. |
Jorjani H*, Zavolan M
*Biozentrum, Switzerland |
K - Sequencing and Sequence Analysis |
|
K 03
K3 |
Deep Sequencing technology has become a powerful research tool with a wide range of applications. We developed a pipeline for the downstream analysis of deep sequencing data termed 'Quantify and Annotate Short Reads in R (QuasR)', which is freely available via Bioconductor. QuasR is an integrated start-to-end analysis solution within the programming language R. In order to facilitate the utilisation of QuasR for scientists without knowledge of R, we started to embed QuasR into the Galaxy framework. Galaxy is increasingly used by a broad community for (pre-)processing of Deep Sequencing data. Galaxy allows easy access to our pipeline by a web based graphical user interface while maintaining the various assets of QuasR such as keeping track of numerous analysis parameters and ensuring compatibility of the data and full access to the generated data e.g. for a fine-grained analysis. |
Hundsrucker C*, Lerch A, Gaidatzis D, Hotz H, Stadler M
*Friedrich Miescher Institute for Biomedical Research, Switzerland |
K - Sequencing and Sequence Analysis |
|
K 04
K4 |
Recently, Single Molecule, Real-Time (SMRT, Pacific Biosciences) DNA sequencing has been used to generate nonhybrid, finished microbial genome assemblies. Instead of using shorter, more accurate reads from second-generation sequencers to correct errors in the long SMRT sequencing reads, the hierarchical genome-assembly process (HGAP) reconstructs long, accurate reads by alignment and preassembly of SMRT raw reads, followed by genome assembly using a long-read assembler such as Celera or Allora. Here we compared the Escherichia coli K12 MG1655 assembly obtained by nonhybrid PacBio sequencing with that obtained by Illumina sequencing, currently the most accurate platform. A finished genome assembly with the expected genome size could already be achieved with low-coverage PacBio data, while consensus accuracy correlated with PacBio data coverage. Illumina MiSeq data produced a genome assembly with higher consensus accuracy, but with more gaps and a smaller assembled size. Higher MiSeq data coverage improved consensus accuracy and assembly integrity only slightly. Consensus errors in the PacBio genome assembly were mainly InDels, while in the Illumina genome assembly SNPs were more frequently observed. Although the PacBio assembly was superior in terms of resolving long repeats, the Illumina genome assembly yielded more correctly predicted protein-coding genes, due to the higher consensus accuracy and the non-repetitiveness of coding regions. |
Qi W*, Patrignani A, Poveda L, Schlapbach R
*Functional Genomics Center Zurich, Switzerland |
K - Sequencing and Sequence Analysis |
|
L 01
L1 |
Protein modeling is widely used in life science when no experimental structures are available. For this purpose, protein structure predictions (or models) have been demonstrated to be as useful as experimentally determined structures for many biomedical applications. The usability of a 3D model for the biological problem at hand is, however, strictly dependent on its accuracy. In fact, depending on both the specific target protein and the applied modeling method, the accuracy of the structural models may vary significantly. The critical assessment of protein structure prediction (CASP) has been reviewing the participating modeling servers every two years using a selected set of targets. Taking inspiration from two previously developed but discontinued evaluation systems, we have set up a new Continuous Automated Model EvaluatiOn system named CAMEO. Currently it assesses the performance of 3D protein structure (CAMEO 3D) and ligand binding residue (CAMEO LB) prediction servers on a weekly basis. Here we would like to introduce the CAMEO Quality Estimation (CAMEO QE) category, which consists of a benchmark of different widely-used local model quality estimation tools which are available as standalone packages.
CAMEO assists both scientists interested in modeling their protein of interest and developers of new methods. The former, relying on the retrospective analysis provided by CAMEO, can select the most suitable tool for a given modeling problem. Method developers can benefit from contemporaneously using CAMEO to benchmark their new developments and to compare the performance of their production servers with those of the other registered participants. |
Barbato A*, Haas J, Schmidt T, Schwede T
*Swiss Institute of Bioinformatics, University of Basel, Switzerland |
L - Technology and Software |
|
L 02
L2 |
SWISS-MODEL is a widely-used web service for protein homology modeling. Around 2000 models are built every day. Since these models are primarily motivated by a particular biological research question, the biological context of these models plays an important role. Current efforts in the development of SWISS-MODEL focus on increased biological relevance of the models, and an improved responsive user experience. For new users, an automated modeling pipeline performs template identification, template selection, and model building without user intervention. Experienced users are given full control of the modeling steps: in a new template selection step, biological knowledge can be incorporated into the selection process. Models are built in their correct oligomeric state, and include relevant ligands and co-factors. The models are presented in a modern web interface, providing 3D visualisation of the built models.
|
Biasini M, Waterhouse A*, Bienert S, Arnold K, Schwede T
*Biozentrum University of Basel & SIB, Switzerland |
L - Technology and Software |
|