Conference Poster Presentations
Presenters of odd-numbered posters should be at their posters during the first poster session on Monday (morning and lunch session); presenters of even-numbered posters should be at theirs during the poster session on Tuesday (morning and lunch session). Each poster will have a space 1 meter wide by 1.5 meters tall (39 x 59 inches). The conference will provide material for hanging posters. More info can be found here.
# | Title | Authors | Topic | |
---|---|---|---|---|
A01 |
Spatiotemporal dynamics of microtubules shape many cellular processes such as division or migration. Here, we investigate how spindle pole body segregation in Saccharomyces cerevisiae arises from microtubule dynamics and their regulation. Specifically, we aim to develop a mechanistic model for the interactions of bud-directed astral microtubules with their environment, e.g., the spindle pole body they are emanating from, plus-tip proteins, kinesin and myosin motor proteins, or actin cables emanating from the bud. Resolving these interactions involves modeling their effect on the local, stochastic microtubule dynamics – especially the microtubule tip – as well as their reaction-diffusion dynamics on a cellular scale. To this end, we are constructing stochastic reaction-diffusion models of the budding yeast cell in 2D and 3D with embedded microtubules, based on the reaction-diffusion master equation (RDME). We implement them in a cross-platform, modular and high-performance C++ RDME solver framework that currently supports the next subvolume method. Model inference relies on data available from both in vitro and in vivo experiments. In an in vivo setting, ground truth data is usually available from fluorescence microscopy. However, image analysis is not performed in a consistent manner between different data sources. We developed an interface from the reaction-diffusion solver framework to our previously-published virtual microscope. This enables physically accurate in silico simulations of fluorescence microscopy experiments, given a particular model geometry and photometry, as well as the microscopy setup used. In principle, this allows for model inference given in vivo imaging data, and we currently address the associated computational challenges. |
Widmer L*, Stelling J
*Swiss Federal Institute of Technology Zürich (ETH Zurich) & SIB Swiss Institute of Bioinformatics, Switzerland |
Bioimaging, and Spatial-temporal Modeling |
|
A02 |
Accurate analysis of cell morphodynamics paves the way to a better understanding of many biological processes that have direct implications for human health. This challenging task typically entails developing robust cell segmentation and lineage tracking algorithms. In this work, our focus consists of two parts. First, we tackle the problem of membrane-based segmentation, where only cell walls/membranes are stained. For this purpose, we prototype several segmentation algorithms based on multiscale watershed segmentation and the Fast Marching Method. Second, we present a simple yet robust lineage tracking algorithm, which achieves over 91% tracking accuracy on selected Cell Tracking Challenge datasets. This cell tracker was demonstrated at the International Symposium on Biomedical Imaging (ISBI) in April 2015. |
Demirel Ö*, Zhang X, Beati I, Majer P, Malmström L, Kunszt P
*University of Zurich, Switzerland |
Bioimaging, and Spatial-temporal Modeling |
|
A03 |
The size and shape of organs is species-specific and even in species in which organ size is strongly influenced by environmental cues, such as nutrition or temperature, it follows defined rules. Therefore, mechanisms must exist to ensure a tight control of organ size within a given species, while being flexible enough to allow the evolution of organs with different size in different species. We have combined computational modelling and quantitative measurements to define conditions for robust growth control in the Drosophila eye disc. We identify two growth laws that are consistent with the growth data and that would explain the extraordinary robustness and evolutionary plasticity of the final adult eye size. These two growth laws correspond to very different growth mechanisms and further experiments will be required to distinguish between these two candidate growth laws. |
Vollmer J*, Fried P, Sánchez M, Aguilar-Hidalgo D, Lopes C, Casares F, Iber D
*SIB Swiss Institute of Bioinformatics, Switzerland |
Bioimaging, and Spatial-temporal Modeling |
|
A04 |
There is already a fairly good understanding of characteristic phenotypic properties and behavior of different cell types, but much less is known about the cellular reaction to perturbation. In this work, we seek to investigate the hypothesized presence of elementary states—the idea that cellular responses to perturbations fall into a phenotypically limited set. For this purpose a clustering approach is applied. The challenge of this project lies in the quantity and quality of the data. The data consists of images that show phenotypic results of pathogenic perturbation (bacterial or viral) to a cell population that has undergone RNA interference (RNAi) knockouts. These knockouts are gene-specific and are induced by the introduction of synthetically designed small-interfering RNAs (siRNAs) to a cell culture. In a screen, phenotypic features of single cells are evaluated by microscopy and image analysis using Cell Profiler. The effects of pathogenic perturbation given a specific siRNA are assessed via the phenotypic responses of roughly 1800 single cells per siRNA. There are around 100 phenotypic features measured, such as cell shape, cell size, DNA content, and actin content. It is our hope that we can discover a clustering in the responses of the cells to perturbation, either caused by RNAi or pathogen. |
Diekmann M*, Beerenwinkel N
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland |
Bioimaging, and Spatial-temporal Modeling |
|
B01 |
Mathematical modeling and optimization are valuable in the development of new drugs for human metabolic disorders. Most optimal drug design approaches are split into two stages, identification and decision-making, to find optimal targets. Moreover, such a two-stage design uses achievement of the therapeutic goal as the design specification. However, side effects have been reported due to toxic metabolites produced after drug use. In this study, we developed a fuzzy equal metabolic adjustment method in which the detection of candidate enzyme targets is combined with decision-making strategies to create a unified optimization framework for determining satisfactory targets. Both achieving the therapeutic goal and minimizing side effects are considered simultaneously in the fuzzy optimization framework for detecting drug targets. An existing generalized mass action model of dopamine metabolism is used as a case study to remedy two types of enzymopathies, caused by absence of tyrosine hydroxylase (ATH) and decreased activity of vesicular monoamine transporter 2 (DAV). In the computational results, the optimal drug design achieved nearly 100% satisfaction for both therapeutic and detoxication effects in the ATH disease by adjusting 3 enzyme targets. For the DAV disease, by contrast, we found 100% satisfaction for the therapeutic effect and 38% for the detoxication effect. With adjustment of 5 enzyme targets, an overall satisfaction higher than 90% for curing both diseases could be achieved. The detailed problem formulation and computational results for the approach will be discussed at the conference. |
Hsu K, Wang F*
*National Chung Cheng University, Taiwan |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B02 |
Human Immunodeficiency Virus (HIV) infects 35 million people and is the cause of AIDS, one of the most deadly infectious diseases with over 39 million fatalities. The medical and scientific communities have produced a number of outstanding results in the past 30 years, including the development of over 25 anti-retroviral drugs and over 82,000 publications. To help users access virus and host genomic data in a useful way, we have created an HIV resource that organizes knowledge and aims to give a broad view of the HIV life cycle and its interaction with human proteins. This resource has been created following an extensive review of the literature. The life cycle is annotated with controlled vocabulary linked to dedicated pages in ViralZone, Gene Ontology, and UniProt resources. The HIV proteins in the cycle are linked to UniProt and the BioAfrica proteome resource. In total, approximately 3,400 HIV-host molecular interactions have been described in the literature, which represents about 240 partners for each viral protein. This list has been reduced to 57 essential human-virus interactions by selecting interactions with a confirmed biological function. These are all described in a table and linked to publications, sequence databases and ontologies. Lastly, this resource also summarizes how antiretroviral drugs inhibit the virus at different stages of the replication cycle. We believe that this is the first online resource that links together high-quality content about virus biology, host-virus interactions and antiviral drugs. This resource is publicly accessible at the ViralZone website (http://viralzone.expasy.org/). |
Hulo C, Masson P, Druce M, Bouguelleret L, Xenarios I, de Oliveira T, LeMercier P*
*SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B03 |
Somatic mutation detection is a key and challenging step in the analysis of cancer sequencing data. Standard variant calling pipelines perform badly when confronted with heterogeneous, highly mutated data with abnormal chromosomal composition. We aimed to reduce calling errors in cancer data by combining the best parts of standard variant detection pipelines used in our group with software dedicated to somatic mutation detection. We used a synthetic tumor/normal dataset with a known count of 7903 somatic mutations from the ICGC-TCGA DREAM Mutation Calling Challenge. We assessed the calling performance of 4 different pipelines by comparing their ability to maximize correct mutation calls while minimizing false positive and false negative errors. Our 2 standard pipelines were a method developed in our group (internalTestCaller) and a combination of the latest BWA/GATK programs. We also performed similar analyses with 2 cancer-dedicated pipelines: VarScan, developed at Washington University, and MuTect from the Broad Institute of Harvard and MIT. Our standard pipelines recovered most mutations correctly but also many additional, presumably false positive variants. The accuracy (mean of the sensitivity and specificity) for BWA/GATK was 0.55. Our internalTestCaller achieved an accuracy of 0.73 without applying any statistical filtering methods. VarScan had an accuracy of 0.81, and MuTect performed best with an accuracy of 0.87. However, a significant number of errors were present in even the best case. To reduce calling errors without losing too many correct calls, we combined a sensitive version of VarScan with standard MuTect parameters. This reduced the number of errors by half, while keeping 99.8% of MuTect’s correct calls. In order to also reduce errors due to alignment, we applied two filtering strategies assessing read alignment stability using different aligners and reference genomes. 
Combining optimal alignment with VarScan and MuTect increased call accuracy but was accompanied by a slight decrease in global sensitivity. By further filtering calls unique to VarScan we were able to obtain a good overall correct call rate. This new pipeline increased accuracy to 0.95. Combining optimal alignment with multiple variant callers required more computational power and storage capacity, but it significantly decreased error calls. As this new pipeline was developed using the synthetic ICGC-TCGA dataset, we cannot exclude the possibility of over-fitting the data. Testing this new pipeline on real datasets will be the best way to correctly assess its performance. |
Jan M*
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
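The B03 abstract defines accuracy as the mean of sensitivity and specificity. The following is a minimal sketch of that metric only, not of the authors' pipeline; the function name and the counts in the example are hypothetical and are not taken from the DREAM challenge data.

```python
def mean_sensitivity_specificity(*, tp, fp, tn, fn):
    """Accuracy as defined in the abstract: mean of sensitivity and specificity.

    tp/fn count true somatic mutations that were called/missed;
    tn/fp count non-mutated sites that were correctly rejected/wrongly called.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative site")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts, for illustration only.
example = mean_sensitivity_specificity(tp=90, fp=20, tn=80, fn=10)
print(round(example, 2))
```

Averaging the two rates keeps a caller from scoring well simply by calling many variants, which is the failure mode the abstract reports for the standard pipelines.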
B04 |
Despite recent technological advances in genomic sciences, our understanding of cancer progression and its driving genetic alterations remains incomplete. We introduce TiMEx, a generative probabilistic model for the de novo detection of mutual exclusivity patterns of various degrees across carcinogenic alterations, which can indicate pathways involved in cancer progression. We regard tumorigenesis as a dynamic process, and base our model on the temporal interplay between the waiting times to alterations, characteristic for every gene and alteration type, and the observation time. For analyzing large datasets comprising many genes, we propose a three-step procedure. First, we apply TiMEx to estimate the degrees of mutual exclusivity between all gene pairs. Second, candidate groups are identified as maximal cliques of genes sharing a significant minimum degree of mutual exclusivity. Finally, we apply TiMEx to test the candidate groups for mutual exclusivity. In simulation studies, we show that our model outperforms previous methods for detecting mutual exclusivity. On large-scale biological datasets, TiMEx identifies gene groups with stronger functional biological relevance than other methods, while also proposing many new candidates for biological validation. TiMEx possesses several advantages over previous methods, including a novel generative probabilistic model of tumorigenesis, direct estimation of the probability of a mutual exclusivity interaction, computational efficiency, as well as high sensitivity in detecting gene groups involving low-frequency alterations. |
Constantinescu S*, Szczurek E, Beerenwinkel N
*Swiss Federal Institute of Technology Zürich (ETH Zurich) & SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
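The second step of the B04 procedure, identifying candidate groups as maximal cliques, can be sketched as follows. This is an illustration under stated assumptions, not the TiMEx implementation: the pairwise scores and gene names are made up, a fixed threshold stands in for the significance criterion, and the clique enumeration is the basic Bron-Kerbosch recursion.

```python
def exclusivity_graph(pair_scores, threshold):
    """Link two genes when their (hypothetical) pairwise score reaches the threshold."""
    adjacency = {}
    for (gene_a, gene_b), score in pair_scores.items():
        adjacency.setdefault(gene_a, set())
        adjacency.setdefault(gene_b, set())
        if score >= threshold:
            adjacency[gene_a].add(gene_b)
            adjacency[gene_b].add(gene_a)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques


# Made-up pairwise degrees of mutual exclusivity between four placeholder genes.
scores = {
    ("geneA", "geneB"): 0.90, ("geneA", "geneC"): 0.80, ("geneB", "geneC"): 0.85,
    ("geneC", "geneD"): 0.70, ("geneA", "geneD"): 0.10, ("geneB", "geneD"): 0.20,
}
groups = maximal_cliques(exclusivity_graph(scores, threshold=0.5))
print(sorted(sorted(group) for group in groups))
```

Each clique returned here would be a candidate group for the final testing step described in the abstract.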
B05 |
Cancer results from the accumulation of mutations in cells that deregulate cell growth and function. Many mutations lie in the cellular DNA and are thus transmitted to cancer daughter cells. Because of these mutations, cancer cells could elicit an immune response, since a significant fraction of the mutations result in non-self protein segments that can be displayed on the cell surface through HLA class I molecules. The relationship between cancer mutations and the response from the immune system, however, remains unclear. Our project therefore aims to study this relationship through the integration and computational analysis of large-scale cancer genomics datasets obtained from the International Cancer Genome Consortium. Our ongoing efforts combine RNA-seq, exome mutation and HLA binding prediction data obtained from thousands of patients with tumors originating in 22 different tissues. Our preliminary results show that the mutational load correlates with CD8+ T-cells in some tissues and CD4+ T-cells in other tissues. We also observe some unexpected anti-correlation between immune infiltration and the number of mutations in some other tissues. These differences in the immune system’s response can have interesting implications for how different tumors might respond to immunotherapy treatments such as immune checkpoint blockade. |
Racle J*, Gfeller D
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B06 |
Overall survival of cancer patients depends on disease etiology, the patient's general condition, genetic factors and treatments. Currently, both survival and molecular data are available in public databases for thousands of patients. Using TCGA, we investigate the role of different molecular markers related to immune infiltration in cancer patient survival. First, we study the possible influence of covariables (percentage of lymphocytes, percentage of monocytes, age and gender) on overall survival, using univariate and multivariate analyses. Except for gender, we observe an influence of these covariables on some cancers. Then, for several immune-related genes, we calculate the Cox score to measure the correlation between gene expression levels and patient survival. Cancers for which these genes have the most significant effect are brain, kidney, skin and uterine cancers. We further use clustering techniques to identify subgroups of patients with similar expression profiles. These subgroups may help predict overall survival in other patients. |
Aouadi I*, Racle J, Gfeller D
*Ludwig Institute for Cancer Research (LICR), Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B07 |
Viral diseases such as that caused by the highly pathogenic avian influenza virus are an important public health and economic burden. In response, the United Nations’ Food and Agriculture Organization (FAO) has developed “Empres-i” (Global Animal Disease Information System), a web portal and database that compiles epidemiological data on animal disease outbreaks throughout the world, and serves as a global repository for many important livestock diseases (e.g. avian flu, foot and mouth disease, swine fever). Combining these epidemiological records of virus outbreaks (virus type, host, location) with georeferenced environmental variables (e.g. human population density, host species density, surface of land covered by water bodies) allows the calibration of statistical models from which outbreak likelihood risk maps can then be derived (e.g. Stevens and Pfeiffer, 2011). Such risk maps are useful for planning epidemiological monitoring and optimizing the allocation of often limited resources. This poster presents Riskmod, a prototype of a web-based risk mapping tool that allows users to generate risk maps based on robust and well-established disease mapping methods, directly in a web browser. |
Engler R*, Liechti R, Kuzsnetsov D, Xenarios I, Gilbert M
*SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B08 |
The European research project PreciseSADs aims to reclassify Systemic Autoimmune Diseases (SADs) based on genetic and molecular biomarkers. In this scope, we developed a method to perform unbiased Single Nucleotide Polymorphism (SNP) clustering. Geographical ancestry is known to be a major contributor to genetic variation. Due to this bias and the large data dimensionality, unsupervised clustering of SNPs is usually performed on candidate genes and not genome-wide. Our goal is to develop a method for genome-wide unsupervised SNP clustering. Our test dataset contained 4,212 systemic lupus erythematosus patients genotyped on Illumina 1M microarrays. After quality control, minor allele frequency filtering, and tag SNP selection, we performed Principal Component Analysis on the 300,000 remaining SNPs. For each principal component significantly explaining variance, strongly contributing SNPs were selected and spatially close ones were summarised by haplotypes. Approximately 400 SNPs were selected and 200 haplotypes inferred. On this new set, we obtained ancestry-independent clusters with several unsupervised methods, such as hierarchical and density-based clustering. |
Charlon T*, Martínez-Bueno M, Di Cara A, Wojcik J, Voloshynovskiy S, Alarcón-Riquelme M E
*Quartz Bio, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B09 |
Cell type-specific regulatory circuits reveal variable modular perturbations across complex diseases
Mapping the molecular circuits that are perturbed by genetic variants underlying complex traits and diseases remains a great challenge. Here we integrate human promoter and enhancer activity data with transcription factor (TF) sequence motifs to infer a unique panel of ~400 cell type and tissue-specific regulatory networks. We find that shared regulatory programs mirror developmental and functional relationships between different cell types, tissues, and organs, with the immune and nervous system showing the greatest regulatory complexity of all human cells and tissues. Integration with 37 genome-wide association studies (GWASs) shows that disease-associated genetic variants — including variants that do not reach genome-wide significance — often perturb genes that are densely inter-connected within regulatory circuits, and these perturbed modules are highly specific to disease-relevant cell types or tissues. Our results demonstrate that cell type-specific regulatory networks are key to understand the fine-scale mechanism of genes underlying complex diseases. |
Marbach D*, Lamparter D, Quon G, Kellis M, Kutalik Z, Bergmann S
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B10 |
Due to the process of mutation and selection, a tumor is composed of various subclones with different genotypes and phenotypes. It is crucial to investigate the subclonal structure and to understand the dynamics of their interplay for improving treatment success. We analyzed two biopsies of each renal cell carcinoma (RCC) together with a matched normal sample from the same individual. Our analysis is based on next-generation sequencing data of the exomes of 16 RCCs from 16 patients. We performed variant calling and pairwise comparison of the variations found in the two tumor biopsies. On average two thirds of the mutations in a patient were private to one of the two samples. This finding points towards a high intra-tumor heterogeneity. However, this number might be confounded by the variant calling and filtering process. Indeed, when checking whether the private mutations would have at least one read in the respective other sample which supports the private variant, we obtained a different picture: Counting these private variants with support in the other sample as potentially shared mutations, the fraction of private mutations decreased to less than one third. Pairwise comparison of the frequencies of the SNVs showed that some of them differed remarkably. The results showed that most ancestor clones might still exist at varying frequencies in the two samples. The private mutations represent new clones that emerged in some samples. Ultra-deep sequencing of those genes harboring the potentially shared mutations will enable a more detailed investigation of the subclonal tumor structure. |
Hofmann A*, Beisel C, Behr J, Schraml P, Moch H, Beerenwinkel N
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
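The re-counting step in the B10 abstract, where a private variant with at least one supporting read in the other biopsy is treated as potentially shared, can be sketched as below. This is an illustrative sketch only: the variant identifiers, read counts and function name are hypothetical, and the authors' actual variant calling and filtering are not reproduced.

```python
def private_fraction(calls_a, calls_b, reads_a=None, reads_b=None, min_reads=1):
    """Fraction of variants private to one of two biopsies.

    calls_a / calls_b: sets of variant identifiers called in each biopsy.
    reads_a / reads_b: optional dicts mapping a variant to the number of reads
    supporting it in that biopsy, even where it was not called. When given,
    a variant called in only one biopsy but supported by at least `min_reads`
    reads in the other is counted as potentially shared.
    """
    reads_a = reads_a or {}
    reads_b = reads_b or {}
    all_variants = calls_a | calls_b
    if not all_variants:
        return 0.0
    only_a = {v for v in calls_a - calls_b if reads_b.get(v, 0) < min_reads}
    only_b = {v for v in calls_b - calls_a if reads_a.get(v, 0) < min_reads}
    return (len(only_a) + len(only_b)) / len(all_variants)


# Hypothetical calls for one patient.
biopsy_1 = {"v1", "v2", "v3", "v4"}
biopsy_2 = {"v1", "v5", "v6"}
print(private_fraction(biopsy_1, biopsy_2))
print(private_fraction(biopsy_1, biopsy_2,
                       reads_a={"v5": 1}, reads_b={"v2": 3, "v3": 0}))
```

Raising `min_reads` makes the rescue stricter, which is one way to probe how strongly the private fraction depends on the calling and filtering thresholds.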
B11 |
The risk of coronary heart disease (CHD) is higher in South Asians than in Europeans. Although lifestyle and congenital factors have been proposed to increase the risk of myocardial infarction (MI), there is limited knowledge of the genetic basis of the elevated incidence of CHD in these populations. Here we report an RNA-seq analysis of monocytes from 71 cases of confirmed acute MI and 77 healthy individuals from the Pakistan Risk Of Myocardial infarction study. For all individuals we performed RNA-seq (median 42.8M reads, 75 bp paired-end) and have genome-wide SNP data. Principal Component Analysis based on gene expression reveals clear differentiation between cases and controls, suggesting different expression patterns. We identified 5244 differentially expressed genes, with up-regulated genes having more divergent effect sizes. Specific regulated pathways were not found since MI is an unpredicted event with no specific regulation. In order to find genetic variants that affect gene expression levels, we performed expression quantitative trait locus (eQTL) analysis. We discovered 4799 eQTLs (5% FDR, significance level based on permutations) within ± 1 Mb of the TSS of the genes. We also identified 179 variants affecting alternative splicing (asQTLs; FDR 5%). Finally, we found 935 MI-specific eQTL genes, and we are conducting enrichment analysis in MI GWAS SNPs for MI-specific eQTLs in order to identify putative causative variants. Overall, these findings will allow us to investigate further the genetic architecture of CHD in the Pakistani population and better understand the relationship between genetic regulatory variation and gene expression. |
Panousis N*, Tuna S, Lataniotis L, Rasheed A, Shah N, Danesh J, Dermitzakis E, Saleheen D, Deloukas P
*University of Geneva & SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B12 |
The initiation and progression of cancer is caused by the accumulation of multiple aberrations in different genes, and systems biology tries to understand the biological functions of these genes. The goal of our study is to develop and apply computational optimization-based approaches to perform functional analysis of cancer genomics data. In particular, we are interested in developing new methods based on Steiner Tree Problems (STP) and graph-related approaches. The Prize-collecting Steiner Tree Problem (PCSTP) is a generalization of the STP and is well known in combinatorial optimization. It has been successfully applied to solve real problems such as the design of fiber-optic and gas distribution networks. In our work, we concentrate on its application in biology to perform a functional analysis of genes by employing genomics data. Briefly, in a large gene-gene interaction network, the PCSTP attempts to find a neighborhood or sub-network where genetic aberrations and mutations are mostly concentrated. For example, by using gene expression data one may try to find a connected neighborhood where many genes are differentially expressed. In the same manner, by combining the expression data with the survival data and solving the PCSTP, one can try to find a sub-network that locates interesting genes that are differentially expressed and are correlated with patient survival statistics. Because the PCSTP is NP-hard, it is computationally costly to find exact solutions for very large interaction networks. In our study, we propose a method that efficiently scales up to large interaction graphs. |
Akhmedov M*, Montemanni R, Kwee I
*IDSIA, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B13 |
Somatic tumor mutation data is a valuable but not sufficient source to uncover new cancer genes and pathways, as the genomic alterations of same-clinical-type tumors are extremely diverse. Based on graph diffusion methods, we identify groups of mutated genes (modules) with similar effects on the expression, protein abundance or phosphorylation of downstream target genes. We applied the analysis to breast cancer data from The Cancer Genome Atlas (TCGA). Many genes of the modules have been previously associated with breast cancer. Our approach addresses the common phenomenon of mutational heterogeneity and simultaneous homogeneity of expression patterns between patients. Based on known regulatory network connections, we generate statistically sound hypotheses of how common expression changes in many patients can be achieved by mutations in various genes of the network. Our approach allows for detecting network modules of mutated genes that cover a large number of samples with significantly different mutations and are closely connected to the same differentially expressed genes. It can potentially handle different sources of differential expression data and different types of interactions. |
Dimitrakopoulos C*, Behr J, Beerenwinkel N
*Swiss Federal Institute of Technology Zürich (ETH Zurich) & SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B14 |
Protein structures are entities onto which a vast amount of biological information, such as biochemical function and phenotypic traits, can be mapped. Searches for protein structures thereby lend the ability to simultaneously find each type of information. One of our goals is to provide technologies that traverse information mapped to protein structures in a way that connects human diseases with their etiological factors. Here, we present updates to the KB-Rank search engine, which is accessible via http://protein.tcmedc.org/KB-Rank/ and http://sbkb.org. Literature abstracts mapped through UniProt and phenotypic descriptions of structures from model organism resources are added for text matching. Once structures are retrieved by text search, a representative range of protein functions is used for their prioritization. The prioritization score of a structure is the product of its bit array, corresponding to the presence or absence of different functions, and the matrix of similarly constructed arrays for all structures retrieved by the search. The logic is that there are salient protein functions relevant to the given disease, and structures with those functions are the most relevant. In an example search with the word asthma, we identify apoptosis and muscle contraction as cellular pathways of high priority. We describe how the literature confirms these pathways as being important in the etiology of the disease. Sharing open source information is essential to the project, and we look to further increase collaborative efforts to augment the required data integration and validation. This work was supported in part by a grant from NIGMS (U01 GM093324). |
McLaughlin W*, DePietro P, Julfayev E
*The Commonwealth Medical College, United States of America |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
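The bit-array prioritization described in the B14 abstract can be illustrated with a small sketch. The abstract does not say how the vector-matrix product is reduced to a single score, so the code below assumes one plausible reading: a structure's score is the sum of the overlaps between its bit array and the bit arrays of all retrieved structures. The structure names, function columns and function name are hypothetical, and this is not the KB-Rank code.

```python
def prioritization_scores(bit_arrays):
    """Score each structure by how widely its functions are shared.

    bit_arrays maps a structure name to a list of 0/1 flags, one per function.
    Assumed reading of the abstract: score = bit array times the matrix of all
    retrieved bit arrays, summed over structures. That equals the dot product
    of the bit array with the per-function counts across the result set.
    """
    arrays = list(bit_arrays.values())
    n_functions = len(arrays[0])
    if any(len(bits) != n_functions for bits in arrays):
        raise ValueError("all bit arrays must have the same length")
    # How many retrieved structures carry each function.
    function_counts = [sum(bits[k] for bits in arrays) for k in range(n_functions)]
    return {
        name: sum(bit * count for bit, count in zip(bits, function_counts))
        for name, bits in bit_arrays.items()
    }


# Hypothetical search result: columns could stand for three protein functions.
retrieved = {
    "structure_1": [1, 1, 0],
    "structure_2": [1, 0, 0],
    "structure_3": [0, 0, 1],
    "structure_4": [1, 1, 0],
}
ranked = sorted(prioritization_scores(retrieved).items(), key=lambda kv: -kv[1])
print(ranked)
```

Under this reading, functions carried by many retrieved structures act as the salient ones, so structures that carry them rise to the top of the ranking.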
B15 |
Viral Hemorrhagic Septicemia and Infectious Hematopoietic Necrosis are serious fish diseases, causing high losses in fish farms. It is therefore important to ensure early detection of the responsible viruses, to improve biosecurity of the aquaculture industry. The aim of this project is to develop a rapid in-field immuno-assay-based screening test. This requires generating highly sensitive and specific antibodies against virus antigens. In order to ensure high sensitivity and early-onset detection of the infection, we selected those viral proteins shown to be most abundant in early and late responses to infection. We also ensured that no homologous sequences to these proteins existed in Eukaryotes. Then, using the publicly-available sequences for all the isolates of the virus, we identified variant amino acids that should be avoided for epitope selection, as the antibody should recognize all the virus isolates. Known and predicted secondary structures, as well as post-transcriptional modifications were also identified, as they may hinder antibody generation. We are currently performing sequence-based in silico epitope mapping to (i) predict continuous epitopes, which are formed by adjacent amino acids on the sequence, and (ii) predict discontinuous antigenic residues based on state-of-the-art methods (CBTope, BEST, BEEPro, CBEP). Potential improvements of these methods are also under investigation. Final selection of candidate epitopes from the predicted antigenic residues will take into consideration amino-acid variants, secondary structures, disordered regions and post-translational modifications. |
Neves A*, Peña CA
*University of Applied Sciences - Western Switzerland, HEIG-VD, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B16 |
Postnatal brain development is important in conferring risk and resilience to mental illness. 75% of all psychiatric disorders have an age of onset before 24 years. We are assessing the predictive value of patient similarity networks (PSN) in several applications for neurodevelopmental disorders. Our approach is to use supervised clustering based on patient data. As the input, we create PSN computed based on various data types, including SNPs, CNVs, epigenomics, pathways and clinical data. For class prediction, we use an algorithm that applies label propagation to networks integrated by ridge regression (Ref 1). We use cross-validation to identify networks that robustly separate the two classes. Precision-recall based on the “blind” test is used to assess the accuracy of the classifier. We have started applying this approach to predict case/control status in autism using rare CNV data (Ref 2, N=4,236). Using cross validation, we identified pathways of neurodevelopmental relevance, such as axon guidance, as important predictors of case status. This finding is consistent with autism biology. We are now extending input networks to include information about miRNA targets, regulatory elements and known disease risk loci. A second application is the identification of PSN underlying performance in neurocognitive tests of relevance to schizophrenia and autism. (Ref 3, N=8,719). We eventually aim to integrate other layers of brain-related measures, such as functional neuroimaging. 1. Mostafavi et al. (2008). Genome Biology 2008, 9:S4. 2. Pinto et al. (2010) Nature| 466. 3. Robinson et al. (2015). Mol Psychiatry 20:454. |
Pai S*, Bader G
*University of Toronto, Canada |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B17 |
Networks have become a popular way to conceptualize a system of interacting elements, such as electronic circuits, social communication, metabolism or gene regulation. Network inference, analysis, and modeling techniques have been developed in different areas of science and technology, such as computer science, mathematics, physics, and biology, with an active interdisciplinary exchange of concepts and approaches. However, some concepts seem to belong to a specific field without clear transferability to other domains. In this work we propose to adopt and adapt the concepts of influence and investment from social network analysis to biological problems, and in particular to apply this approach to infer causality in the tumor microenvironment. We showed that constructing a bidirectional network of influence between cell-cell communication molecules allowed us to determine the direction of inferred regulations at the expression level and to correctly recapitulate cause-effect relationships described in the literature. This methodology constitutes a step forward in the design of combined cancer target therapies based on the integrated use of experimental data, clinical observations, and computational approaches. |
Crespo I*, Doucey M, Xenarios I
*SIB Swiss Institute of Bioinformatics, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B18 |
In recent years, numerous studies have identified key patterns of immune activity associated with patient outcome in several cancer types. Immune cell activity in tumors, linked to tumor destruction, often corresponds to the immune cell infiltration profile of the tumor. Immunotherapy strategies can be guided by knowledge of such infiltration activity of immune cells, and it appears that activated subsets of immune cells may influence therapy responses. Some tools exist that attempt to digitally dissect stromal and immune gene signatures from whole transcriptome data. However, novel algorithms are needed to identify the specific subsets from aggregate data, so that the specific phenotypes of hematopoietic cell subsets in tumor transcriptome profiles are revealed at high resolution. This is a particular algorithmic challenge for closely related cell types with critical differences that are predictive of the immune contexture of the tumor and its response to therapy. We have developed a computational approach to monitor and gauge immune cell activity in tumors at high resolution. We then grade the immune component of cancer transcriptomes based on a computational score for several different immune cell subtypes. To this end, we integrated a genome-wide ranked immune subtype relevance score for all human genes. We developed this using a combined approach, utilizing text mining and network characteristics of immune cells. We explored the resulting signatures of detailed immune cell subsets from patient expression profiles in melanoma to demonstrate the power of applying this computational strategy to identify prognostic markers. Such bioinformatics methods for exploring the dynamics of immune cell infiltration during cancer progression may increase our understanding of the immunopathology of a tumor and guide the design of novel cancer immunotherapies. |
Clancy T*, Hovig E
*Institute for Cancer Research, Oslo University Hospital, Norway |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B19 |
The recent development of single-cell sequencing techniques revealed that the genetic make-up of tumors is better described as a heterogeneous cell population than a monoclonal cell mass. Sequencing data of a large number of single cells can be used to reconstruct the evolutionary history of a tumor population. A key task in this procedure is to populate the ancestral states in the phylogenetic tree with mutations that split the cell population into subclones. Recurrent mutation orders observed in multiple tumor instances may lead to a better understanding of the mutational patterns associated with a specific tumor type, and help with the identification of tumor subtypes. Currently one of the major challenges in analyzing single cell sequencing data is its low quality due to the limited amount of DNA obtainable from a single cell. The main sources of error are a high allelic dropout rate and an increased false discovery rate compared to bulk sequencing. We introduce a probabilistic approach to estimate the mutation history of a heterogeneous tumor based on single-cell sequencing data that can deal with various error-types in the data. The method is evaluated in a simulation study and used to reconstruct mutation histories of different tumor types. |
Jahn K*, Kuipers J, Beerenwinkel N
*ETH Zurich, Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B20 |
A hallmark of the Guide to PHARMACOLOGY database (GtoPdb) is expert curation of ligand-target binding data (PMID 24234439). Ligands include approved medicines, clinical candidates, drug research leads, receptor ligands and tool compounds. Mechanistic relationship mapping has now grown to ~1300 human proteins, ~5500 small molecules, ~1200 peptides, ~50 clinical antibodies and ~12000 binding constants (e.g. IC50, Ki or Kd). This facilitates analysis of molecular pharmacology from both the ligand and target perspectives. Results are presented here for the drugged, druggable (i.e. with leads) and tractable (i.e. at least some chemical starting points) target landscape. This is defined by our own high-stringency data capture but can be compared with other sources. Our recent UniProt cross-links enable detailed target analysis, together with Venn diagram generation and the PANTHER resource for Gene Ontology (GO) and pathway analysis. We have used these to explore differences between our ~300 primary targets of approved drugs, the set of ~950 targets with quantitative binding data and a further ~350 proteins with non-quantitative but pharmacologically important interactions. Utility is demonstrated by analysing our own linked Swiss-Prot set and comparing it to other target-centric sources. For example, a query for proteins with transmembrane content gives results of 68% for targets of approved drugs, 40% for non-quantitative interactions and an average of 56% for all 1300 human proteins. Comparative figures for DrugBank and ChEMBL target proteins were 42% and 45%, respectively. Additional data will be presented for GPCRs, channels, kinases, proteases, other target classes, secreted vs transmembrane proportions, intersects with pathway enzymes and links to orphan disease genes. Results will also show GtoPdb utility for addressing drug discovery questions. For example, we executed the following Boolean series: which targets have endogenous peptide interactions? (plus) exogenous synthetic peptide interactions? (plus) synthetic molecule interactions? (plus) are the targets of approved drugs? The four-way intersect produced 23 proteins, whose pathways could be determined using PANTHER. We also sliced target sets by ligand binding affinities (<0.1, <1.0, <10 and <100 nM). Not unexpectedly, this indicated receptor enrichment for each step of increased potency. This work enables researchers to move from in silico analysis to experimental studies of target validation, pathway intervention points and functional genomics perturbations. This applies to all our annotated ligands and extends beyond ~300 approved (drugged) targets out to ~1000 proteins covering possible future tractable targets. |
Southan C*, Sharman J L, Pawson A J, Benson H E, Faccenda E, Davies J A
*University of Edinburgh, United Kingdom |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B21 |
HIV drug resistance development is a consequence of viral evolution. This evolutionary process is characterized mainly by the accumulation of resistance mutations, i.e., mutations that confer a selective advantage under the selective pressure of antiviral drugs. Models of HIV viral evolution have been shown to improve the prediction of therapy response. We introduce a new model called the observed time conjunctive Bayesian network (OT-CBN) that describes the accumulation of genetic events (mutations) under partial temporal ordering constraints [1]. Consequently, according to this model, evolution follows only a subset of all possible mutational pathways from the wild type, the genotype carrying no mutation, to the fully resistant genotype, the genotype carrying all resistance mutations. The OT-CBN model uses sampling time points of genotypes in addition to genotypes themselves to estimate model parameters. We developed an expectation-maximization algorithm to obtain approximate maximum likelihood estimates by accounting for this additional information. We have shown the superiority of the new model in comparison to the previous viral evolutionary models on several applications to HIV drug resistance datasets. In addition, we will show how this model can be used to derive an informative genotypic predictor for HIV treatment outcome. [1] Hesam Montazeri, Huldrych F. Günthard, Wan-Lin Yang, Roger Kouyos, and Niko Beerenwinkel, Estimating the dynamics and dependencies of accumulating mutations with applications to HIV drug resistance, To appear in Biostatistics, 2015. |
Montazeri H*, Beerenwinkel N
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B22 |
Cancer is a disease of evolution whose process is characterized by the accumulation of somatic alterations to the genome, which selectively make a cancer cell fitter to survive [1]. Our understanding of progression models for cancer, i.e., the identification of sequences of mutations that lead to the emergence of the disease, remains incomplete. The problem of reconstructing such progression models is not new; in fact, several methods to extract progression models from cross-sectional samples have been developed, such as [2, 3, 4]. In this work, we propose a novel algorithm called CAPRI (CAncer PRogression Inference) to reconstruct DAGs that model the sequences of mutations characterizing cancer evolution. To the best of our knowledge, the existing techniques are based either on correlation or on maximum likelihood. In contrast, we perform the reconstruction by exploiting the notions of probabilistic causation in the spirit of Suppes’ causality theory [5]. We note that in the context of biological systems and cancer progression, the notion of causality can be interpreted as the selective advantage conferred by the occurrence of a mutation. In those settings, we prove the correctness of our algorithm and, on synthetic data, we show that our approach outperforms the state of the art. Moreover, for real cancer datasets, we highlight biologically significant differences between the progressions we infer and those of competing techniques. [1] Hanahan D., Weinberg R.A. (2011). Hallmarks of cancer: The next generation. Cell 144: 646–674. [2] Vogelstein B., Fearon E.R., et al. (1988). Genetic alterations during colorectal-tumor development. New England Journal of Medicine 319: 525–532. [3] Desper R., Jiang F., et al. (1999). Inferring tree models for oncogenesis from comparative genome hybridization data. Journal of Computational Biology 6: 37–51. [4] Beerenwinkel N., Eriksson N., et al. (2007). Conjunctive Bayesian networks. Bernoulli: 893–909. [5] Suppes P. (1970). A probabilistic theory of causality. North Holland Publishing Company. [6] CAPRI: Efficient Inference of Cancer Progression Models from Cross-sectional Data, (2015), Bioinformatics, 2015-05-13 |
Ramazzotti D, Caravagna G, Olde Loohuis L, Graudenzi A, Korsunsky I, Paroni A, De Sano L, Mauri G, Mishra B, Antoniotti M*
*Università degli Studi di Milano Bicocca, Italy |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
B23 |
MicroRNAs are a recently discovered class of small noncoding functional RNAs. These molecules mediate post-transcriptional regulation of gene expression in a sequence-specific manner. MicroRNAs are now known to be key players in a variety of biological processes and have been shown to be deregulated in a number of cancers. The discovery of virus-encoded microRNAs, especially from a family of oncogenic viruses, has attracted immense attention to the possibility that microRNAs are critical modulators of viral oncogenesis. The host-virus crosstalk mediated by microRNAs, messenger RNAs and proteins is complex and involves different cellular regulatory layers. In this commentary, we describe models of microRNA-mediated viral oncogenesis. |
Rengasamy P*, Selvam G, Veerasamy B
*Dr.Murugan's Bioinformatics Media & Research Centre, India |
Bioinformatics of Health and Disease, Biomarkers and Personalized Medicine |
|
C01 |
Venoms are complex mixtures composed of numerous proteins that act synergistically to incapacitate prey. Their study is important in the health domain because 1) they are excellent drug leads and 2) it permits the development of treatments against envenomation. Biological information on these proteins is extracted from the literature, compiled, and continuously updated in UniProtKB, a general knowledgebase on proteins. The UniProtKB/Swiss-Prot animal toxin annotation program (http://www.uniprot.org/program/Toxins), which is specifically dedicated to venom proteins, provides access to more than 6,000 entries from all venomous taxa. However, the format of UniProtKB does not allow the indication of more general information such as venom composition, venom lethal dose or protein family description. In addition, good working knowledge is necessary to rapidly access the information of interest. To address these issues, we are developing VenomZone, a highly cross-linked web portal that aims to provide an overview of venoms from the six major venomous taxa (snakes, scorpions, spiders, cone snails, insects and sea anemones). Key features of VenomZone include information on venoms, links to several websites (the most important are on venom clinical effects and antivenoms) and description pages on venom activity, pharmacological target protein families and venom protein families. Moreover, these pages link to the related entries of UniProtKB/Swiss-Prot, which are (re-)annotated in parallel with the creation of VenomZone. Three types of pages are available to access and compare information: taxonomy (currently 80 pages), activity (40 pages) and venom protein family (30 pages). VenomZone is available online at http://venomzone.expasy.org/. |
Jungo F*, de Castro E, Bougueleret L, Xenarios I, Poux S
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C02 |
Co-expression networks have proven effective at assigning putative functions to genes based on the functional annotation of their co-expressed partners, in candidate gene prioritization studies and in improving our understanding of regulatory networks. The growing number of genome resequencing efforts and genome-wide association studies often identify loci containing novel genes and there is a need to infer their functions and interaction partners. To facilitate this we have expanded GeneFriends, an online database that allows users to identify co-expressed genes with one or more user-defined genes. This expansion entails an RNA-seq-based co-expression map that includes genes and transcripts that are not present in the microarray-based co-expression maps, including over 10,000 non-coding RNAs. The results users obtain from GeneFriends include a co-expression network as well as a summary of the functional enrichment among the co-expressed genes. Novel insights can be gathered from this database for different splice variants and ncRNAs, such as microRNAs and lincRNAs. Furthermore, our updated tool allows candidate transcripts to be linked to diseases and processes using a guilt-by-association approach. GeneFriends is freely available from http://www.GeneFriends.org and can be used to quickly identify and rank candidate targets relevant to the process or disease under study. |
van Dam S*
*University of Liverpool, United Kingdom |
Databases, Ontologies, and Text Mining |
|
C03 |
Lipids are a large and diverse group of biological macromolecules with roles in membrane formation, energy storage, and signaling. Lipid composition and metabolism are tightly regulated in response to cellular signals and lipid availability, with dyslipidemia a common occurrence in cardiovascular disease, hypertension, diabetes, and many other diseases. A more complete understanding of the roles of lipids in human health will require the integration of quantitative measurements of lipidome composition with knowledge of lipid metabolic pathways, enzymes, and interacting proteins. To facilitate this task we have developed a new knowledge resource for lipids and their biology: SwissLipids. SwissLipids provides a hierarchical lipid classification that links mass spectrometry (MS) analytical outputs to more than 244,000 lipid structures and to information on lipid metabolic pathways, enzymes, and subcellular and tissue location curated from over 600 publications. SwissLipids provides a reference namespace for lipidomic data publication, data exploration and hypothesis generation, and is fully mapped to existing knowledge resources such as ChEBI, LIPID MAPS and HMDB. SwissLipids is continually updated with new lipid classes and expert-curated knowledge as this becomes available, and can be accessed at http://www.swisslipids.org/. |
Aimo L, Liechti R, Hyka-Nouspikel N, Niknejad A, Gleizes A, Gotz L, Kuznetsov D, David F, van der Gout G, Xenarios I, Riezman H, Bougueleret L, Bridge A*
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C04 |
UniProtKB/Swiss-Prot (http://www.uniprot.org) provides the scientific community with a collection of information, expertly curated from the scientific literature, on protein variants. Priority is given to single amino-acid polymorphisms (SAPs) found in human proteins, their functional consequences and their association with diseases. UniProt release 2015_04 includes close to 72,000 human SAPs, 8,500 of which are enriched by free-text descriptions of the functional characteristics of the variant. To ease access to this knowledge and to make it computer-readable, we are restructuring the annotations using controlled vocabulary. By combining terms from the Variation Ontology (VariO) and the Gene Ontology (GO), we can describe the large spectrum of effects caused by SAPs on proteins. We use VariO terms to indicate which protein property is affected, such as its structure, expression, processing, or function. GO terms are used to specify which biological process, protein function, or subcellular location is impacted. A limited number of controlled attributes complete the annotations, defining how the protein property is affected, e.g. increased, decreased or missing. Currently, 20% of the SAPs reported in UniProtKB/Swiss-Prot have been reviewed, producing over 2,800 structured annotations. We plan to provide this new structured format to our users by the end of 2015. |
Famiglietti L*, Breuza L, Neto T, Gehant S, Redaschi N, Bougueleret L, Xenarios I, Poux S, Consortium U
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C05 |
Science is a global endeavor that requires the ongoing exchange of ideas and research findings. Efficient access to reliable research data is of paramount importance for the advancement of science. Research data are published in scientific papers as figures, which do not allow re-analysis of the data and are inaccessible to systematic data mining or search. It is thus currently extremely difficult to verify whether an experiment has already been published or to compare the data from related studies. As research output continues to grow massively, new solutions are urgently required to maximize the efficiency of research investments. To address these issues, the SourceData project has been initiated by several partners to develop the necessary editorial tools and workflows that enable the biocuration of figures by data editors during the production phase of the publication process. The SourceData curation tool allows curators to 1) delimit coherent experimental units within a figure; 2) efficiently tag biochemical entities in figure captions; and 3) normalize entities and specify their role in the experimental design. The resulting semantic information can be used by researchers to find, compare and combine data from different sources to generate new hypotheses and stimulate novel discoveries. SourceData aims to transform published articles into open and enriched resources that make published figures and the underlying source data searchable and available for reuse. By improving the accessibility and discoverability of published data, SourceData will help improve the transparency and reuse of scientific output. |
Liechti R*, Götz L, Niknejad A, Xenarios I, El-Gebali S, George N, Lemberger T
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C06 |
HAMAP (High-quality Automated and Manual Annotation of Proteins) is a system for the classification and annotation of protein sequences that leverages the expert curated knowledge of UniProtKB/Swiss-Prot. HAMAP is composed of a library of family profiles for protein classification and rules for functional annotation. HAMAP can be used directly for the annotation of protein sequences via our web interface at http://hamap.expasy.org which accepts individual protein sequences or complete microbial proteomes for annotation. HAMAP is also a mainstay of the UniProt automatic annotation pipeline, providing annotation for millions of unreviewed protein sequences in UniProtKB/TrEMBL. HAMAP is continuously updated by expert curators with new family profiles and annotation rules as new protein families are characterized, and the rules are applied anew with each UniProtKB release, keeping the propagated annotation up-to-date. |
Pedruzzi I*, Rivoire C, Auchincloss AH, Coudert E, Keller G, de Castro E, Baratin D, Cuche BA, Bougueleret L, Poux S, Redaschi N, Xenarios I, Bridge A
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C07 |
Rhea (http://www.ebi.ac.uk/rhea) is a comprehensive and non-redundant resource of expert-curated biochemical reactions described using species from the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules. Rhea has been designed for the functional annotation of enzymes and the description of genome-scale metabolic networks, providing stoichiometrically balanced enzyme-catalyzed reactions, transport reactions and spontaneously occurring reactions. Rhea reactions are used as a reference for the reconciliation of genome-scale metabolic networks in the MetaNetX resource (www.metanetx.org) and serve as the basis for the computational generation of a library of theoretically feasible lipid structures and analytes in SwissLipids (www.swisslipids.org). Rhea reactions are extensively curated with links to source literature and are mapped to other publicly available metabolic resources such as EcoCyc/MetaCyc, KEGG, Reactome and UniPathway. Here we describe recent developments in Rhea, which include the provision of reactions involving complex macromolecules such as proteins, nucleic acids and other polymers, website developments, and substantial growth of Rhea through sustained literature curation efforts. Together these developments will significantly improve our ability to represent and model metabolic diversity. |
Morgat A*, Lombardot T, Axelsen KB, Aimo L, Niknejad A, Hyka-Nouspikel N, Coudert E, Redaschi N, Bougueleret L, Steinbeck C, Xenarios I, Bridge A
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C08 |
In UniProtKB/Swiss-Prot, biocurators combine information derived from 3D-structures and the scientific literature to provide a summary of protein function and active sites, physiologically relevant ligand binding sites, post-translational modifications, subcellular location, and protein-protein interactions. Information about well-characterized proteins is propagated to close family members. The result is provided in a structured format that facilitates querying and information retrieval. Here we present our work on the biocuration of potassium channels, such as KCNA2, where 3D-structures provide important information on the mechanisms underlying channel activity, gating, selectivity, tetramerization, and membrane topology, information which complements that provided by physiological studies of in vivo protein function, expression and regulation. UniProtKB entries facilitate access to information derived from protein 3D-structures via cross-references to PDB, the Protein Model Portal and related resources. In April 2015, UniProtKB/Swiss-Prot contained 112’000 cross-references to PDB, corresponding to over 21’800 entries, primarily from model organisms. Manual biocuration based on protein 3D-structures has high priority. About 25% of the 20’000 human entries have a cross-reference to PDB, and the majority of these have at least one matching literature citation. |
Hinz U*, Stutz A, Bougueleret L, Xenarios I, Poux S
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C09 |
Introduction: Harvesting web content for text mining or data mining is not a trivial task, especially when accurate, comprehensive and frequently updated information is needed. Building a resource to support clinical decision-making for rare diseases (RDs) involves all of these challenges. The Web is well known to contain information that is very useful to humans but is unfortunately noisy and difficult to extract. Objectives: 1/ Compare harvesting methods for a collection of nine legacy rare disease resources on the Web (described in [1]) in terms of coverage and availability. 2/ Assess whether the Common Crawl (CC) database [2] is a good alternative to direct spidering and harvesting methods. Methods: We used a webometrics approach. For each of the nine RD web sites, we counted the number of pages identified 1/ by Google robots, 2/ by commercial SEO harvesters such as MajesticSeo and Ahrefs, 3/ by a new CC database, 4/ by the FindZebra search engine [1] and 5/ by a local Nutch harvester. Results: We found a maximum of 829’666 pages, from Ahrefs, for the nine sources. As is usual in webometrics, we observed huge variations depending on the acquisition channel for the same sources. However, we found that CC provides a good alternative to direct crawling for most biomedical sources in the field of RDs, although we identified a few important gaps. Conclusion: For biomedical content on RDs, CC appears to be a revolutionary initiative in data harvesting: it provides a robust alternative to classical methods, clearly simplifies the task and brings transparency and reproducibility to Web information science. |
Gaudinat A*, Gobeill J, Patrick R
*Geneva School of Business Administration (HEG) & SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
C10 |
The Orthologous MAtrix (OMA) project comprises a method and an associated database for inferring evolutionary relationships among complete proteomes, currently 1706 of them. In addition to inferring orthologous relations between many genomes, the OMA database provides different groupings that serve different purposes: OMA cliques of orthologs, Hierarchical Orthologous Groups, and orthologous pairs. Moreover, if your genome of interest is not included in the OMA database, you can use the standalone version to run your own analysis. Finally, OMA has been continuously maintained for more than 10 years. Here, we present its recently introduced features. |
Altenhoff A*, Skunca N, Glover N, Train C, Sueki A, Piližota I, Gori K, Tomiczek B, Müller S, Redestig H, Gonnet G, Dessimoz C
*SIB Swiss Institute of Bioinformatics, Switzerland |
Databases, Ontologies, and Text Mining |
|
D01 |
Selectome (http://selectome.unil.ch/) is a database of positive selection, based on a branch-site likelihood test. This model estimates the number of non-synonymous substitutions (dN) and synonymous substitutions (dS) to evaluate the variation in selective pressure (dN/dS ratio) over branches and over sites. Special care is taken to minimize false positives, with a thorough quality control procedure on multiple sequence alignments. We continuously benchmark multiple sequence alignment and filtering methods to use high quality and reliable alignments in our pipeline. We present here our most recent results on multiple sequence alignment quality control in Selectome. Proux et al 2009 Nucl. Acids Res. 37: D404-D407 Moretti et al 2014 Nucl. Acids Res. 42: D917-D921 |
Moretti S*, Robinson-Rechavi M
*SIB Swiss Institute of Bioinformatics, Switzerland |
Evolution, Phylogeny, and Comparative Genomics |
|
D02 |
The visualization of massive datasets, such as those resulting from comparative metatranscriptome analyses or the analysis of microbial population structures using ribosomal RNA sequences, is a challenging task. We developed a new method called CoVennTree (Comparative weighted Venn Tree) that simultaneously compares up to three multifarious datasets by aggregating and propagating information from the bottom to the top level and produces a graphical output in Cytoscape. With the introduction of weighted Venn structures, the contents and relationships of various datasets can be correlated and simultaneously aggregated without losing information. We demonstrate the suitability of this approach using a dataset of 16S rDNA sequences obtained from microbial populations at three different depths of the Gulf of Aqaba in the Red Sea. CoVennTree has been integrated into the Galaxy ToolShed and can be directly downloaded and integrated into the user instance. |
Lott S*, Voß B, Hess W, Steglich C
*University of Freiburg, Germany |
Evolution, Phylogeny, and Comparative Genomics |
|
D03 |
Quest for Orthologs (QfO) is a community effort whose goal is to improve and benchmark orthology predictions. The SIB Swiss Institute of Bioinformatics provides a collection of Gold Standard gene trees (SwissTree) for the quality assessment of inferred gene relationships. As the interpretation of gene trees assumes prior knowledge on species phylogenies, we investigated the consistency between existing species trees by comparing the relationships of 147 QfO reference organisms from six Tree of Life (ToL) / species tree projects: the NCBI taxonomy, Opentree of Life, the sequenced species/species Tree of Life (sToL), the 16S rRNA database, and trees published by Ciccarelli et al in 2006, and by Huerta-Cepas et al in 2014. The study reveals that each species tree suggests a different phylogeny. Our results provide ways of using the species consensus tree (swisstree.vital-it.ch/species_tree), for instance, for different benchmarking purposes. In times of rapid growth of genome data, quality control and reference data sets have become critical. Community approaches, such as QfO, and gold standard data sets like SwissTree are an asset that a research laboratory would not be able to maintain up-to-date on its own. Such activities require tight interaction between research and service provider partners. |
Boeckmann B*, Marcet-Houben M, Rees JA, Forslund K, Huerta-Cepas J, Muffato M, Yilmaz P, Xenarios I, Bork P, Lewis SE, Gabaldón T
*SIB Swiss Institute of Bioinformatics, Switzerland |
Evolution, Phylogeny, and Comparative Genomics |
|
D04 |
Although it has recently been learnt that some pseudogenes can be functional, they are still mostly considered relics of evolution, that is, stretches of DNA that are nothing but residues: the product of retrotransposition, segmental duplication or gene decay. In fact, if a pseudogene yields a functional product, it is no longer considered a pseudogene but a functional gene (e.g. the lncRNA Xist); a processed pseudogene with intact coding frames and functionality right after its biogenesis is not called a pseudogene but a retrogene. There are also pseudogenes that can be translated into a functional protein under specific conditions and subject to genetic variation; they are referred to as “polymorphic pseudogenes”. The terminology is confusing and leads to misunderstandings. That said, it is true that most pseudogenes do not seem to have any functionality, but if the evidence seriously contradicts the presumed non-functionality, they can end up no longer being referred to as pseudogenes. The very same has long been accepted for paralogous genes arising from duplicated pseudogenes. On top of that, detection and classification of new “functional pseudogenes” does happen, but it is rare and not fully understood. In this work, pseudogene biogenesis by means of reverse transcriptase is revisited and studied in relation to chromatin architecture, with the goal of obtaining a better definition/classification and discovering new evidence of functionality for pseudogenes. |
Muro EM*
*Johannes Gutenberg University and Institute of Molecular Biology, Germany |
Evolution, Phylogeny, and Comparative Genomics |
|
D05 |
Despite abundant experiments and diverse data available to study aging, the mechanisms of aging are still poorly understood. To tackle this, we are interested in evolutionarily conserved marks associated with aging, from short-lived model organisms to long-lived species such as humans. We present an analysis of publicly available aging datasets from human and mouse tissues to characterize gene expression changes during aging. We characterized co-modules showing the level of gene expression conservation between homologous tissues. Meta-analysis across different tissues in mouse and human shows an overall down-regulation of age-related gene expression profiles. We identified the biological processes involved using gene set enrichment analysis for these tissues, and found that age-related gene expression changes in skeletal muscle and brain involve mitochondrial pathways and the inflammatory response, respectively. These tissues are known to be important in aging. However, there is only a weak positive correlation between aging effects in homologous human and mouse tissues. The co-module identification showed a connection to immune response processes in brain tissue shared between human and mouse. Our study provides a framework for further comparative analysis of aging across different species. |
Komljenovic A*, Robinson-Rechavi M
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Evolution, Phylogeny, and Comparative Genomics |
|
D06 |
A wide range of phylogenetic methods model the evolution of discrete traits using a continuous-time Markov chain. The likelihood computations for these models are done using Felsenstein's tree pruning algorithm, and most of the computation time is spent on matrix exponentiation and partial likelihood computations. The evaluation of the likelihood function therefore has a critical impact on the overall performance of phylogenetic methods. We propose here a method to speed up the evaluation of matrix exponentiation and partial likelihoods by reducing the number of states in a continuous-time Markov chain without losing the dimensionality of the models. We used state aggregation techniques to selectively combine states of the instantaneous rate matrix. Depending on the particular model used, a number of aggregation strategies may be employed. Maximum reduction is achieved when all the states unobserved at the tips of the tree are aggregated into a single state. We implemented the aggregation optimization in FastCodeML (Valle et al., 2014, Bioinformatics), which uses the branch-site model (Yang and Nielsen, 2002, Mol Biol Evol) to infer positive selection along positions of a protein-coding gene. We use biological data as well as simulations to show that the proposed approximation does not lead to a bias in the parameter values or in positive selection detection while giving a twofold speedup. We also measure the speedup for varying tree sizes and alignment lengths and discuss the applicability of the optimization to a number of phylogenetic methods. |
Davydov I*, Robinson-Rechavi M, Salamin N
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Evolution, Phylogeny, and Comparative Genomics |
|
D07 |
Sweta Talyan, Miguel A. Andrade-Navarro, and Enrique M. Muro (Johannes Gutenberg University and Institute of Molecular Biology, Ackermannweg 4, 55128 Mainz, Germany). Pseudogenes are extant genomic loci that closely resemble their parental functional genes but cannot be translated into functional proteins because of deleterious mutations such as frameshift disruptions or premature stop codons. They are classified according to their biogenesis (retrotransposition, DNA duplication or gene decay) as processed, duplicated or unitary, respectively. While duplicated pseudogenes can maintain the parental gene structure and all regulatory regions, processed pseudogenes retain neither the 5' upstream regulatory regions nor the introns. Recent studies confirm previous evidence of highly tissue-specific transcriptional activity for about 13% of all human pseudogenes; for some of those, regulatory roles have been found. Most methods for ab initio detection and classification of pseudogenes were developed in about the same period: Pseudopipe (Zhang et al. 2003, 2006, Zhang and 2004), Retrofinder (Baertsch et al. 2008) and the method from Torrents et al. (2003). These methods are still the norm and rely mostly on homology. They mainly differ in the parental gene query representative, the use of parameters and thresholds, and the incorporation of amino acid substitution replacement (Ka/Ks) measurements. These methods were developed at an early stage of human genome annotation, when little sequencing information from other organisms was available. We present a novel method for genome-wide pseudogene prediction that takes advantage of the information provided by the annotation of all genomes sequenced to date, improving the current pseudogene annotation. |
Talyan S*
*Institute of Molecular Biology, Germany |
Evolution, Phylogeny, and Comparative Genomics |
|
D08 |
Recent advances in sequencing technology are making it possible to sequence the genomes of individual cells. Single-cell sequencing provides an opportunity to survey the genomic heterogeneity of cells within an organism, both in healthy development and in cancer. Most cell divisions introduce mutations as a result of DNA replication errors. Consequently, the history of cell divisions in the organism is encoded in the genomes of individual cells, and could be reconstructed by phylogenetic methods. We present a method for reconstructing evolutionary histories of cells while accounting for high levels of allelic dropout often found in single-cell sequencing data. The problem can be reduced to finding a series of graph cuts in a certain graph. Through simulations, we show that our method outperforms standard phylogenetic methods for this task. |
Truszkowski J*, Goldman N, Tavare S
*EMBL-EBI and Cancer Research UK Cambridge Institute, University of Cambridge, United Kingdom |
Evolution, Phylogeny, and Comparative Genomics |
|
D09 |
We present MLgsc, a software package for classifying protein or nucleotide sequences into taxa. One program in the package trains a classifier, another performs classification itself, and a third is used for leave-one-out cross-validation. Training uses (1) a multiple alignment of sequences from the reference taxa, and (2) a phylogenetic tree of the reference taxa, and produces a hierarchy of clade models. The phylogeny defines the inner taxa and guides model-building, and serves as a decision tree in the classification process. Classification walks through the tree from the root, comparing the query sequence to each model at a given node, and discarding all but the best-matching node. The classifier's error rate is around 1%, depending on region parameters. It was compared to four other classifiers implementing different methods, and was found to be at least as fast, and sometimes faster by more than an order of magnitude. MLgsc has a simple command-line interface for the Unix shell, which makes it easy to integrate in analysis pipelines, for example in targeted metagenomics. MLgsc is freely available and open-source. |
Junier T*, Hervé V, Wunderlin T, Junier P
*SIB Swiss Institute of Bioinformatics, Switzerland |
Evolution, Phylogeny, and Comparative Genomics |
|
E01 |
DNA in bacterial chromosomes and bacterial plasmids is supercoiled. DNA supercoiling is essential for DNA replication and gene regulation. However, the density of supercoiling in vivo is roughly half that of deproteinized DNA molecules isolated from bacteria. What, then, are the specific advantages of the reduced supercoiling density that is maintained in vivo? Using Brownian dynamics simulations and atomic force microscopy, we show here that, thanks to physiological DNA–DNA crowding, DNA molecules with reduced supercoiling density are still sufficiently supercoiled to stimulate interaction between cis-regulatory elements. On the other hand, weak supercoiling permits DNA molecules to modulate their overall shape in response to physiological changes in DNA crowding. This plasticity of DNA shapes may have a regulatory role and be important for the postreplicative spontaneous segregation of bacterial chromosomes. |
Benedetti F*, Racko D, Japaridze A, Dorier J, Kwapich R, Burnier Y, Dietler G, Stasiak A
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E02 |
We have created communication material and workshops dedicated to drug design for high school students, high school teachers and the public at large. Our aim is to introduce, in an engaging and challenging way, concepts such as 3D structure, protein function, diseases and the role played by bioinformatics in drug discovery and development. The workshops include the construction of drug models with chemical kits (“sticks and balls”) together with the manual docking of 3D-printed small molecules into a 3D-printed structure representing the target protein to scale. Participants can then design drugs in silico, test their interactions with the target proteins and predict their pharmacodynamic and pharmacokinetic properties through web-based tools developed by the SIB Swiss Institute of Bioinformatics, namely SwissTargetPrediction and SwissADME. These tools have been embedded into a user-friendly interface (www.atelier-drug-design.ch). 150 participants attended our pilot workshops and became almost addicted to the activity, painstakingly designing molecules with improved properties compared to existing ones. We will present the first feedback and discuss the importance of using bona fide bioinformatics tools in such activities. |
Blatter MC*, Daina A, Baillie Gerritsen V, Marek D, Palagi PM, Xenarios I, Schwede T, Michielin O, Zoete V
*SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E03 |
Recently, fascinating insights into the architecture of interphase chromatin have been revealed by "all vs all" chromosome conformation capture (3C) experiments, among them HiC and Single Cell HiC. Low-resolution models of chromatin structure have been obtained from 3C data mostly by minimizing a restraint energy function. The distance restraints are derived from 3C contact frequencies and imposed on a coarse-grained polymer model. But this procedure is problematic: the conversion of contact frequencies to distances as well as the minimization protocol involve unknown parameters that have to be chosen ad hoc. Moreover, it does not allow the uncertainty of the chromatin structure to be quantified in a statistically sound way. To address these issues, we use the Inferential Structure Determination (ISD) approach, which originated in NMR structure determination of proteins. ISD views structure determination as a problem of statistical inference and estimates the unknown parameters along with structural models. By applying ISD to HiC data, we are able not only to obtain a meaningful ensemble of chromatin structures, but also to estimate the magnitude of errors and important parameters of the polymer model. We investigate several methods to back-calculate data from the structure and rank them using Bayesian model comparison. |
Carstens S*, Nilges M, Habeck M
*Institut Pasteur, France |
Macromolecular Structure, Dynamics and Function |
|
E04 |
Cysteines not only stabilize protein folds by forming disulfide bonds or coordinating metal ions, they also play an important role in the regulation of protein function by enabling redox-sensitive conformational changes. Here, we present three examples of the analysis of the role of cysteines in protein structure and function: 1) MD simulations of two cysteine-rich domains (CRDs) of about 25 residues from Hydra proteins that share the same cysteine sequence distribution but adopt different disulfide bond patterns and folds. The MD data nicely complement earlier published NMR residual dipolar coupling (RDC) data. Additional MD simulations may provide information about the redox potentials of the cysteines and the reshuffling of intra- to intermolecular disulfide bonds upon formation of the extremely stable capsule wall. 2) The analysis, by NMR, fluorescence and MD simulations, of the redox potential and of the structure and dynamics of the free oxidized as well as the oxidized and reduced membrane-associated states of the FATC domain of the central cell growth regulator ‘target of rapamycin’ (TOR), which has two conserved cysteines. 3) The NMR analysis of the redox-sensitive rubredoxin domain (RD) that is N-terminal of the catalytic domain of the mycobacterial kinase G (PknG). This project would also benefit from additional MD simulation data, for example to simulate the unfolding upon oxidation and metal release, which may facilitate substrate access to the kinase active site. The RD can further interact with membrane mimetics. As for the TOR FATC domain, MD simulations may help to determine the immersion properties. |
Dames SA*
*Technische Universität München, Germany |
Macromolecular Structure, Dynamics and Function |
|
E05 |
Drug discovery has been profoundly changed by the use of computational methods that help make rational decisions at the different steps of the process. A typical in silico drug design pipeline may be seen as the interplay between several main activities, e.g. hit finding, lead optimization and selection of the molecules to be tested experimentally. A large number of techniques can be used, which can, in turn, be divided into ligand-based and structure-based approaches. The SwissDrugDesign project is an ambitious initiative that aims at providing a comprehensive web-based in silico drug design environment to the worldwide scientific community. Its purpose is to offer a large collection of tools covering all aspects of computer-aided drug design, including both ligand-based and structure-based approaches. Several components of SwissDrugDesign are already online. SwissDock is a web service dedicated to the docking of small molecules to protein active sites. SwissParam provides topology and force field parameters for small organic molecules for use with CHARMM and GROMACS. SwissSidechain gathers information about hundreds of commercially available non-natural sidechains for peptide design. The SwissBioisostere database collects more than 4.5 million molecular replacements for lead optimization. SwissTargetPrediction allows the prediction of possible targets of a query small molecule. Several other tools are being finalized, such as SwissADME, which calculates physicochemical parameters for small molecules in relation to pharmacokinetic, pharmacodynamic and druglikeness properties. The interoperability between these tools and their simplicity of use will create a comprehensive environment able to assist the user through a complete computer-aided drug design pipeline. |
Zoete V*, Bovigny C, Daina A, Michielin O
*SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E06 |
We address the challenges of treating polarization and covalent interactions in docking by developing a hybrid quantum mechanical/molecular mechanical (QM/MM) scoring function based on the semiempirical self-consistent charge density functional tight-binding (SCC-DFTB) method and the CHARMM force field. To analyze the pitfalls of classical docking algorithms and to benchmark the success of our QM/MM docking algorithm, we created a publicly available dataset of high-quality X-ray structures of zinc metalloproteins (http://www.molecular-modelling.ch/resources.php). For zinc-bound ligands (226 complexes), the QM/MM scoring yielded a substantially improved success rate compared to the classical scoring function (77.0% vs 61.5%), while for allosteric ligands (55 complexes) the success rate remained constant (49.1%). For some therapeutically relevant enzyme classes from the zinc binders data set, such as carbonic anhydrase 2 and ADAM metalloproteinase domain 17 (ADAM17), the QM/MM approach yielded a particularly high success rate. The results of our study suggest that including information from QM/MM calculations during docking is advantageous and that this increased accuracy can be useful for drug design applications. We are now using the insights gained in this project to derive a robust protocol for on-the-fly QM/MM docking. Towards this goal, we have coupled the QM/MM scoring function with the classical docking code Attracting Cavities (manuscript in preparation) and tested its performance on different benchmark sets. References: [1] Chaskar, P.; Zoete, V.; Röhrig, U. F. Toward on-the-Fly Quantum Mechanical/Molecular Mechanical (QM/MM) Docking: Development and Benchmark of a Scoring Function. J. Chem. Inf. Model. 2014, 54, 3137-3152. |
Chaskar P*, Zoete V, Roehrig U
*Swiss Federal Institute of Technology Lausanne (EPFL) & SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E07 |
Protein structure homology modelling has become a routine technique to generate 3D models for proteins when experimental structures are not available. SWISS-MODEL is a widely used automated protein structure homology-modelling server. Around 2000 models are built every day for scientists around the world. Current development aims to improve the prediction of the correct oligomeric state of proteins. During evolution, the quaternary structure of proteins is less conserved than their tertiary structure. This implies that even within the same protein family we can often observe a range of different oligomeric assemblies, which poses a challenge for the modelling and prediction of protein structures. Here, we present our approach to predict the oligomeric structure of target proteins, the results of the validation, and the overall performance of our predictor. |
Bertoni M*, Waterhouse A, Bienert S, Arnold K, Studer G, Bordoli L, Schwede T
*University of Basel - Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E08 |
Drug development is a very costly and failure-prone endeavor, with about 81 % of drug candidates failing. Major reasons for discontinuation are lack of drug efficacy and off-target binding. Despite increased efforts, the rate at which pharmaceutical companies bring drugs to the market has stagnated over the last decades, suggesting limitations in the current drug development model. While the development of a drug costs more than one billion dollars, a repositioned drug costs only 40 % of that. We present a drug repositioning pipeline based on the analysis of protein structures and network analysis. Our approach contributes to successful drug development by providing a comprehensive view of a drug’s target space. Starting from a known drug target, other targets are reliably identified by binding site comparison. Identified off-targets will help to reduce attrition rates, since potential adverse drug reactions are discovered early during drug development. On the other hand, new therapeutic targets will be identified, forming the basis for drug repositioning. Our pipeline would enrich drug development in the early stage by providing a means to optimize drugs against specific targets. Moreover, drugs shelved due to lack of efficacy could be subject to repositioning, using the identified targets. |
Haupt J*, Daminelli S, Salentin S, Schroeder M
*TU Dresden, Germany |
Macromolecular Structure, Dynamics and Function |
|
E09 |
Indoleamine 2,3-dioxygenase 1 (IDO1) is a key regulator of immune responses and therefore an important therapeutic target for the treatment of diseases that involve pathological immune escape, such as cancer. Over the last 10 years, the search for IDO1 inhibitors has been intensely pursued both in academia and in pharmaceutical companies, and many IDO1 inhibitor scaffolds have been described. However, a significant number of reported compounds contain problematic functional groups suggesting that enzyme inhibition could be the result of undesirable side reactions instead of selective binding to IDO1. Here, we describe issues in the employed experimental protocols, review and classify reported IDO1 inhibitors based on structural data and cheminformatics filters, and suggest different approaches for confirming viable inhibitor scaffolds. |
Roehrig U*, Zoete V, Michielin O
*SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E10 |
Activatory and inhibitory lymphocyte receptors are important modulators of immune responses. They activate the immune system in reactions against foreign antigens, or inhibit reactions that are overwhelming, and/or directed against healthy tissues and cells. They also regulate T cell reactions against tumors, demonstrated by the findings that the blockade of inhibitory receptors can lead to powerful immune responses against cancer. Specifically, novel treatments with antibodies that block the inhibitory receptors CTLA-4 or PD-1 have brought great progress for the treatment of patients with melanoma, lung cancer and further malignancies. Another potentially interesting inhibitory receptor is the B and T lymphocyte attenuator (BTLA) that suppresses T cell reactions upon interaction with its ligand, the herpes virus entry mediator (HVEM) protein. Here we present the detailed computational analysis of interactions in the HVEM-BTLA complex. We subjected the crystal structure of the complex to molecular dynamics simulation and calculated the interaction energy between the partners. We analyzed in detail the contributions of the residues to the interaction energy and their importance for the structural stability. The computational analysis results are in good agreement with published experimental alanine scanning results. The presented data can be useful for the design of peptide or small molecule inhibitors of BTLA-HVEM interactions. |
Iwaszkiewicz J*, Zoete V, Derré L, Rodziewicz-Motowidlo S, Speiser D, Michielin O
*SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E11 |
Bacterial diseases are a major threat to humans, amplified by ever-growing antibiotic resistance. Vaccines are the most efficient preventive measures against life-threatening bacterial infections. Their production typically involves either attenuating a pathogen or unspecific chemical conjugation of a glycan with a carrier protein. A novel alternative approach consists of cloning the sugar synthesis gene cluster from the pathogen into Escherichia coli, and functional co‐expression of a carrier protein and the N‐oligosaccharyltransferase PglB. This procedure enables production of well‐defined glycoconjugate vaccines in prokaryotic cells. This approach is, however, limited by the narrow substrate specificity of PglB. Based on a homology model of the trans‐membrane protein PglB from Campylobacter jejuni, a structure-guided protein engineering approach was chosen to achieve higher glycosylation rates for several antigens that are not processed at all (or only with low turn-over numbers) by the wild-type enzyme. For assessing the local model quality of trans‐membrane proteins such as PglB, we have developed QMEANBrane, a statistical potential of mean force tailored to models of membrane proteins. QMEANBrane assigns local quality estimates on a per‐residue basis. |
Haas J*, Studer G, Ihssen J, Kowarik M, Wacker M, Thöny-Meyer L, Schwede T
*SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E12 |
Despite huge advances in homology detection and modelling performance in general, the core technologies for homology modelling have not changed much in the last two decades. The aim of our work is to improve the second step in homology modelling, namely building a model from a template structure and a target-template alignment. To this end, we have developed a modular modelling framework based on state-of-the-art methods, which allows us to easily test new ideas to improve overall modelling quality. For example, we are working on incorporating structural information from several template structures as well as information from low-resolution experimental techniques. The framework developed so far is currently under evaluation and will replace the core homology modelling engine in SWISS-MODEL. It provides a wide variety of novel tools for modelling regions lacking template information as well as for sidechain reconstruction and molecular mechanics tasks. |
Studer G*, Bienert S, Johner N, Schwede T
*SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
E13 |
Protein structure modeling is nowadays successfully used in the life sciences when no experimental data are available for the structure of a given protein of interest. Having reliable protein structure predictions is fundamental, since it has been extensively demonstrated that a high-quality 3D structural model can be used for the same purposes as an experimental structure. Since 2012, the Continuous Automated Model EvaluatiOn (CAMEO, http://cameo3d.org) has been automatically assessing servers that predict protein structures and ligand binding site residues. Central to the workflow is the weekly PDB pre-release of sequences for entries that are going to be released the following week. These sequences form the target set on which the services are assessed. Lately, 3D protein structure quality estimation tools (MQAP) have also been included in the benchmark, given their pivotal importance in determining the biological applicability of a 3D protein structure model. CAMEO has a two-fold aim: to push the community towards the development of more reliable methods, and to provide the life science community with an unbiased view of the state-of-the-art protein structure prediction tools, depending on which aspect is most important for the question at hand. Here we present the latest results of the CAMEO assessment for the categories 3D (protein structure prediction) and QE (model quality estimation). |
Barbato A*, Haas J, Roth S, Bertoni M, Waterhouse A, Bienert S, Schwede T
*SIB Swiss Institute of Bioinformatics, Switzerland |
Macromolecular Structure, Dynamics and Function |
|
F01 |
Characterization of the phenotypic effect of mutations provides evidence on which variants of unknown significance (VUS) can be evaluated. We have annotated the phenotypes caused by missense mutations in BRCA1 associated with increased susceptibility to breast and ovarian cancers. Using the information derived from 87 publications, the function, the phenotype and the binding properties of 385 unique missense mutations were captured according to their impact on gene ontology function and mammalian phenotype ontology, resulting in 1106 different annotations. Each annotation is supported by detailed experimental evidence. Well-characterized and assessed functions of BRCA1 include its ubiquitin-protein ligase activity and its role in DNA repair, as well as its transcriptional regulation activity, response to DNA damage, and UBE2D1, BARD1 and BRIP1 binding. These data provide the most comprehensive resource on phenotypes of BRCA1 variants. We are expanding this work to other disease-causing genes: BRCA2, as well as genes implicated in Lynch syndrome (MSH2, MSH6, MLH1). |
Cusin I*, Zahn M, Bairoch A, Gaudet P
*SIB Swiss Institute of Bioinformatics, Switzerland |
Mutations, Variations, and Population Genomics |
|
F02 |
DNA methylation assessed by SMRT sequencing is linked to mutations in Neisseria meningitidis isolates
The Gram-negative prokaryote Neisseria meningitidis features extensive genetic variability. To date, proposed virulence genotypes are also detected in isolates from asymptomatic carriers, indicating more complex mechanisms underlying variable colonization modes of N. meningitidis. We applied the SMRT sequencing method from Pacific Biosciences to assess the genome-wide DNA modification profiles of two closely related N. meningitidis strains of serogroup A. The resulting DNA methylomes revealed high divergence, represented by the detection of shared target motifs and of one novel strain-specific DNA methylation target motif. The positional distribution of these methylated target sites within the genomic sequences displayed clear biases, which suggests a functional role of DNA methylation related to the regulation of genes. DNA methylation in N. meningitidis has a likely underestimated potential for variability, as evidenced by a careful analysis of the ORF status of a panel of confirmed and predicted DNA methyltransferase genes in an extended collection of N. meningitidis strains of serogroup A. Based on high-coverage short sequence reads, we find phase variability to be a major contributor to the variability in DNA methylation. Taking into account the phase-variable loci, the inferred functional status of DNA methyltransferase genes matched the observed methylation profiles. Towards elucidating the as yet incompletely characterized functional consequences of DNA methylation in N. meningitidis, we reveal a prominent colocalization of methylated bases with Single Nucleotide Polymorphisms (SNPs) detected within our genomic sequence collection. These findings suggest a more diverse role of DNA methylation and Restriction-Modification (RM) systems in the evolution of prokaryotic genomes. |
Sater M*, Lamelas A, Roeltgen K, Wang G, Clark TA, Mane S, Korlach J, Pluschke G, Schmid CD
*Swiss Tropical and Public Health Institute (Swiss TPH) & SIB Swiss Institute of Bioinformatics, Switzerland |
Mutations, Variations, and Population Genomics |
|
F03 |
SNPs (Single Nucleotide Polymorphisms) are among the most common forms of DNA variation and are observed in 1% of the population. It is thought that SNPs may be localized on residues critical for protein-protein interactions, called hotspots. By affecting protein-protein interactions, SNPs may change the stability and formation of protein complexes. In this study, the correspondence between SNPs and hotspots is investigated. SNP data were taken from the LS-SNP dataset. A protein-protein interface dataset from Piface containing 130209 interfaces was analyzed. In total, 150 SNPs and 3200 hotspots were found on contacting residues of 103 protein chains containing SNPs. The analysis showed that 36% of these 150 SNPs were hotspots. 50 of the 150 SNPs were putatively destabilizing, and 20 of them were hotspots. The average temperature factors (B-factors) were calculated for the 103 protein chains containing SNPs. In addition, the average B-factor of SNPs in a chain and the average B-factor of hotspot SNPs were calculated. In 73.78% of protein chains, SNPs have a lower average B-factor than their corresponding chains, and in 88.89% of the chains hotspot SNPs have an average B-factor lower than or equal to the SNP average. This indicates that hotspot SNPs are less flexible than SNPs that are not hotspots. |
Ozdemir S*
*Koç University, Turkey |
Mutations, Variations, and Population Genomics |
|
F04 |
Macaque monkeys are a key model species for various fields of biomedical research such as simian immunodeficiency virus pathogenesis, transplantation biology, drug development and safety testing. Cynomolgus monkeys (Macaca fascicularis) are the most widely used non-human primate species for drug safety testing in pharmaceutical companies, and experimental results might be influenced by variation in biological processes among the individuals sampled. Knowledge of genetic factors contributing to variability in biological drug responses could help to design better experimental approaches, which in turn would help to reduce, refine or even replace animal experiments. We investigate the importance and implications of genetic variation for cellular processes using genome-wide information on copy number variation (CNV) and gene expression from 24 Cynomolgus monkeys originating from four different populations used in pharmaceutical research (Mauritius, Vietnam, China and the Philippines). Using aCGH data and a CNV calling pipeline combining three different methods for CNV calling, we assess copy number variation among our cohorts. These results are combined with gene expression data from five different tissues (heart, kidney, liver, lung and spleen) to map expression quantitative trait loci (eQTLs). We discover eQTLs in all tissues, mostly acting in a tissue-specific manner. Of interest, many of these loci are found in the kidney, a key organ for drug excretion. Using further downstream analyses, we will attempt to obtain information on cellular processes possibly affected by these gene regulatory changes and to draw conclusions about potential implications for drug safety testing. |
Gschwind A*, Heckel T, Certa U, Reymond A
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Mutations, Variations, and Population Genomics |
|
F05 |
Oaks are among the most dominant trees in the forests of the Northern Hemisphere. The genus Quercus consists of about 600 extant species. It is believed that the oak tree on the campus of the University of Lausanne, Switzerland, was planted in memory of Napoleon Bonaparte crossing through the Dorigny campus in 1800. We aim to study how much the oak has changed over 236 years and to understand the genetic contributions to variability. In contrast to animals, progenitor cells of all somatic and reproductive structures in plants are produced in apical meristems, which independently acquire mutations, particularly in long-lived individuals of large stature such as trees. These local genetic alterations can then be transmitted to successive generations. The extent to which the oak has undergone such alterations is as yet unknown. We have sequenced and assembled the genome of the oak and selected leaves from two divergent locations on the tree to evaluate the amount of genetic variation that has accumulated over more than two hundred years. We are interested in the potential differences that appear within the same organism during its life, with silent, moderate or large impact on evolution or health. Currently, we are looking at Single Nucleotide Polymorphisms across the two branches, and we have around 4 million sites that are called heterozygous across the branches. We are further investigating the alterations that have occurred on one branch but not on the other. |
Sarkar N*, Reymond A, Robinson-Rechavi M
*Université de Lausanne, Switzerland |
Mutations, Variations, and Population Genomics |
|
G01 |
KRAB-containing zinc finger proteins (KRAB-ZNFs) constitute the largest family of transcriptional regulators encoded by higher vertebrates. KRAB-ZNFs consist of an N-terminal KRAB domain, which acts as a transcriptional repressor domain by binding to its universal cofactor KAP1, and a C-terminal region made up of tandem repeats of C2H2 zinc finger motifs with sequence-specific DNA-binding potential. ZNF808 is among the longest KRAB-ZNFs, with 23 C2H2 zinc finger repeats, and according to a ChIP-Seq analysis it binds to a specific DNA sequence 44 bp in length. In this work we predict which ZNF domains of ZNF808 could bind to the DNA target. In addition, through an automatic procedure we produce and assess all possible consistent DNA-ZNF808 complex models to define how ZNF808 binds to the DNA and, more generally, to understand the structural determinants of KRAB-ZNF/DNA recognition. |
Kalantzi A*, Dal Peraro M
*Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland |
Protein Interactions, Molecular Networks, and Proteomics |
|
G02 |
Quantitative mass spectrometry (MS) employing stable isotope labeling with amino acids in cell culture (SILAC) for differential metabolic labeling is a frequently used fundamental tool for global analyses of (sub)proteomes, including systematic studies of changes in protein expression, protein synthesis, various posttranslational modifications, and protein-protein interactions. The use of state-of-the-art MS instrumentation results in large amounts of quantitative data, requiring a powerful computational framework for preparation, analysis, and interpretation of these data. We here present the PROVIS system, built for storage, processing, and evaluation of SILAC-based quantitative MS data analyzed with MaxQuant. PROVIS, based on R scripts, allows for fully automated data analysis. It provides statistical analyses for quality control of the experimental data as well as for classification of proteins significantly altered in abundance between different experimental conditions, for example the identification of specific protein interaction partners against a background of co-purified contaminants. Algorithms for the automated prediction of outliers were implemented to ensure the highest sensitivity without losing specificity. The analysis pipeline of PROVIS allows for the fast and easy generation of various modes of data presentation, which facilitates straightforward data analysis and interpretation. Moreover, PROVIS provides software solutions for connectivity-, centroid-, and distribution-based clustering to determine patterns of significantly altered or constant protein abundances across several experiments. The PROVIS system and workflow were evaluated and optimized based on experimental SILAC interaction data. In addition, PROVIS is currently being extended to handle different experimental setups such as siRNA-mediated protein knockdown experiments or label-free approaches. |
Peikert C*, Drepper F, Oeljeklaus S, Warscheid B
*Institut für Biologie II (Biochemie), Universität Freiburg, Germany |
Protein Interactions, Molecular Networks, and Proteomics |
|
G03 |
Reverse engineering allows de novo inference of gene networks in the absence of any a priori knowledge. This computationally intensive strategy is applicable, on a large scale, to gene regulatory networks because of the vast amount of data accumulated with gene array and RNAseq experiments. However, no such high-content high-throughput experimental technology exists for signaling networks. Thus, de novo network reconstruction for large biochemical networks is achieved by confronting the experimental data with an interaction subspace constrained by available literature evidence. SIGNOR (http://signor.uniroma2.it), the SIGnaling Network Open Resource, was developed to support experimental approaches based on multi-parametric analysis of cell systems. Typically, in such approaches, the state of a cell and its dynamic changes are revealed by monitoring the activation of key sentinel proteins. The experimental results are then compared with a literature-derived logic network, thereby allowing for context-specific network optimization. SIGNOR offers a large network of experimentally validated logic relationships between signaling proteins that can be used as an a priori model for optimization strategies. The core of SIGNOR is a collection of approximately 11800 manually annotated logic relationships between over 2800 proteins participating in signal transduction. As a use case, we have used the SIGNOR network to interpret the proteomic data produced by “The Cancer Genome Atlas” project. We show that the analysis of this information in the context of a large a priori logic network allows: i) classification of cancer-specific sub-profiles and ii) identification of different signaling perturbations in different cancers. |
Pirrò S*, Perfetto L, Briganti L, Calderone A, Cerquone Perpetuini A, Iannuccelli M, Langone F, Licata L, Marinkovic M, Mattioni A, Pavlidou T, Peluso D, Petrilli LL, Posca D, Santonico E, Silvestri A, Spada F, Castagnoli L, Cesareni G
*University of Rome "Tor Vergata", Italy |
Protein Interactions, Molecular Networks, and Proteomics |
|
G04 |
In structural biology, medicine, and drug design, one often needs to know the 3D structure of a target protein-protein complex. The Protein Data Bank [1] can help, but only a small fraction of its structures involve complexes. Consequently, without good prior knowledge of the binding sites, computational protein docking methods become necessary. Often, existing methods first explore the six dimensional (6D) search space of rigid body motions in order to locate possible positions of one protein with respect to the other, typically using a simple energy function. They then refine the candidate solutions using a more precise energy function (and possibly also including flexibility). However, important solutions may be missed right at the first search step. Our research focuses on improving the first stage predictions. We believe it will be very useful in virtual screening applications and it will also improve the subsequent refinement searches. Here, we present a protein docking algorithm that uses a more precise energy function to explore the 6D search space. It combines a very fast FFT-accelerated exhaustive search with a detailed data-driven model of the binding free energy. More precisely, it uses 3D Gauss-Laguerre expansions to represent each protein, the spherical polar Fourier transform to compute energy overlap integrals rapidly [2], and a convex optimization technique to learn the interaction potential. We describe the chemistry and geometry of each protein with 20 spherical 3D grids, and 210 pair-wise distance-dependent interaction potentials that have been learned from known 3D protein-protein interfaces using a convex optimization technique. The method runs in 5-10 minutes on a modern laptop for a mid-size protein complex. 
When tested on 195 bound protein hetero-dimer complexes, non-homologous to the training set, the method achieves correct rank-1 solutions in over 51% of cases, and it produces a correct solution within the top 10 solutions in 67% of cases. Here, a “correct solution” means one with a ligand root-mean-squared deviation (RMSD) of less than 5 Å from the native solution. We then computed the success rate according to the separation distance between the centres of mass of the two proteins. With a separation distance up to 15 Å, a “correct solution” is found within the top 10 in 100% of the cases, compared with 88% for distances up to 25 Å and 73% for distances up to 35 Å. A low-resolution interaction potential of 1.5 Å instead of 1 Å is less accurate but has a success rate of 100% for separation distances up to 10 Å. This shows that our potential correctly predicts the interactions and that it could achieve even better results if it were not limited in its search range by the spherical sampling grid. Because many complexes have separation distances greater than 20 Å, we are now working on a multi-centre definition of the potentials in order to correctly predict the structures of protein complexes starting from their unbound structures. [1] H.M. Berman et al. The protein data bank. Acta Cryst. (2002), D58, 899-907. [2] D.W. Ritchie, D. Kozakov, and S. Vajda, Accelerating and Focusing Protein-Protein Docking Correlations Using Multi-Dimensional Rotational FFT Generating Functions, Bioinformatics (2008), 24 (17): 1865-1873. |
Neveu E*, Ritchie D, Grudinin S, Popov P
*INRIA, LJK, University of Grenoble Alpes, France |
Protein Interactions, Molecular Networks, and Proteomics |
|
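The evaluation criterion in the G04 abstract (a pose counts as correct when its ligand RMSD to the native pose is below 5 Å, and a complex counts as a success when a correct pose appears among the top N ranked solutions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function names `rmsd` and `top_n_success_rate` are hypothetical, the coordinates are assumed to be already matched atom by atom, and no superposition step is applied.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-squared deviation between two equally long lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (a - b) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for a, b in zip(point_a, point_b)
    )
    return math.sqrt(squared / len(coords_a))


def top_n_success_rate(ranked_rmsds_per_complex, n, cutoff=5.0):
    """Fraction of complexes with at least one pose below `cutoff` among the first n ranks.

    `ranked_rmsds_per_complex` holds, for each complex, the ligand RMSDs of its
    predicted poses ordered from best-scored to worst-scored.
    """
    hits = sum(
        1
        for ranked in ranked_rmsds_per_complex
        if any(value < cutoff for value in ranked[:n])
    )
    return hits / len(ranked_rmsds_per_complex)


# Hypothetical toy data: three complexes, two ranked poses each.
toy_rankings = [[2.0, 9.0], [8.0, 3.0], [12.0, 11.0]]
rank1_rate = top_n_success_rate(toy_rankings, n=1)
top2_rate = top_n_success_rate(toy_rankings, n=2)
```

With the toy rankings above, only the first complex succeeds at rank 1, while the first two succeed within the top 2, which mirrors how the rank-1 and top-10 percentages in the abstract relate to each other.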
G05 |
Despite decades of research, the molecular mechanisms that lead from somatic mutations to complex diseases like cancer remain elusive. The use of protein interaction networks has emerged as a powerful tool, but interaction maps are often incomplete and lack the protein structural data needed for the functional interpretation of cancer mutations. Here we introduce a new protein-protein interaction database that exploits protein structural data to extend and enrich the current landscape of protein interactions. Binding sites and interactions are predicted on the basis of hidden Markov sequence alignments and graph matching of protein structures from the Protein Data Bank. First results indicate a superior performance to known interaction databases, with a smaller number of false negatives and a larger number of true positives. Using a first prototype of the database, we mapped over 566,000 cancer mutations from The Cancer Genome Atlas (TCGA) onto proteins from the interaction database. Of these mutations, over 138,000 could be identified at binding sites. Most of these binding site mutations were found in proteins involved in cellular signaling and organization. One advantage of the new interaction database is its amino-acid-level data on binding sites, which make it possible to distinguish distinct from mutually exclusive binding sites. For TCGA mutations, we found most mutations to be located at distinct binding sites. However, we also found a large fraction of mutations in mutually exclusive binding sites, which are particularly interesting as they alter multiple interactions simultaneously and thus have a greater impact on interaction networks. |
Kahraman A*, Szklarczyk D, von Mering C
*University of Zurich, Switzerland |
Protein Interactions, Molecular Networks, and Proteomics |
|
H01 |
Mammalian gene expression displays widespread circadian oscillations. Rhythmic transcription underlies the core clock mechanism, but it cannot explain numerous observations made at the level of protein rhythmicity. We have used ribosome profiling in mouse liver to measure the translation of mRNAs into protein around-the-clock and at high temporal and nucleotide resolution. Transcriptome-wide, we discovered extensive rhythms in ribosome occupancy. Cycling proteins produced from non-oscillating transcripts revealed rhythmicity in specific pathways (notably in iron metabolism) and indicated feedback to the rhythmic transcriptome (through novel rhythmic transcription factors). Globally, translation efficiencies spanned a broader range than reported for cells, and showed pronounced signatures of regulation by upstream open reading frames (uORFs) and microRNAs. Moreover, we identified uORF translation as a novel mechanism within the core clock circuitry. In summary, our data offer a framework for understanding the dynamics of translational regulation, circadian gene expression, and metabolic control in a solid mammalian organ. |
Arpat B*, Janich P, Gatfield D
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Regulation, Pathways, and Systems Biology |
|
H02 |
Over the past decade, toxicologists have applied in-vitro screening to identify and prioritize chemicals potentially harmful for living organisms. Improvements in in-vitro assays, leveraging the mechanistic understanding of the cellular processes involved in the complex response to toxicants, have allowed models to be generated and more reliable predictions to be made. Genome-wide molecular measurements (e.g., microarrays), often used to infer high-level gene regulatory networks, are becoming more prominent in systems toxicological assessment. Are omics measurements sufficiently informative to predict/quantify the activity of cellular pathways involved in toxicity response? Answering this question is one of the upcoming goals of the Systems Toxicology challenge, which is part of the IMPROVER initiative (http://sbvimprover.com/). Over the past few years, IMPROVER has assembled a community of researchers and applied crowd-based approaches to address key questions, such as which biological processes observed in animal models are translatable to humans, and to what extent. For the upcoming challenges, participants will be provided with transcriptomics data after in-vitro single-compound perturbation and asked to assess the activity of cellular pathways involved in toxicity responses. Phenotypic high-content analysis readouts will serve as the reference for scoring predictions. A second challenge will assess methods aimed at identifying exposure biomarkers. Participants will be provided with transcriptomics data from blood samples and asked to develop a method and provide a gene signature that best classifies subjects according to their smoking history. This initiative represents a step forward in integrating omics data into toxicological assessments. |
Belcastro V*, Poussin C, Boue S, Martin F, Sewer A, Titz B, Ivanov N, Peitsch M, Hoeng J
*PMI, Switzerland |
Regulation, Pathways, and Systems Biology |
|
H03 |
Adjuvants are compounds added to a vaccine formulation to enhance its immunogenicity. However, very few adjuvants are licensed for use in humans (four by the FDA, five by the European Medicines Agency) and in most instances, the mechanisms of adjuvant action remain poorly understood. Therefore, it is necessary to establish a standard method for evaluating adjuvants, which should be useful to predict immunological outcomes and facilitate the development of novel adjuvants. In an attempt to characterise and evaluate diverse adjuvants, comprehensive mouse transcriptomes for 7 different adjuvants were captured using gene expression microarrays. We employed bioinformatics approaches to understand the functional relevance of the genes associated with adjuvants. First, we identified adjuvant-featured genes that were differentially expressed in the individual adjuvant-administered mice relative to control mice by using an unsupervised learning technique. Subsequently, we retrieved the PPIs for the differentially regulated genes corresponding to each adjuvant cluster and inferred PPI networks using TargetMine, an integrated database for drug and target discovery. Next, we investigated the expanded gene sets for the enrichment of specific biological themes such as KEGG pathways. Our analysis was able to highlight specific pathways whose components were activated or repressed in response to specific adjuvants. Likewise, we identified genes and pathways that were specifically modulated by individual adjuvants. Our analysis also highlighted biological processes that appeared to be influenced by multiple adjuvants, thereby providing detailed insights into the basal mechanisms underlying the mode of action of different adjuvants. Our findings are crucial from the standpoint of biomarker discovery and the identification of potential therapeutic targets that could facilitate the development of newer vaccines and therapeutic strategies for improved clinical outcomes. |
Tripathi L*, Ito J, Aoshi T, Ishii K, Mizuguchi K
*National Institutes of Biomedical Innovation, Health and Nutrition, Japan |
Regulation, Pathways, and Systems Biology |
|
H04 |
Stem cells are central to emerging concepts in health, medicine and therapy. Human embryonic stem cells (hESCs) in particular harbour great potential for future application in biomedical research. They can be differentiated into all cell types of the adult human body [1,2], providing an invaluable source of cells for basic research and the development of therapeutic approaches. In recent years, a plethora of studies have defined in great detail the extracellular factors involved in the maintenance of pluripotency in culture, the transcription factor networks and the chromatin state of hESCs [3]. However, how these layers of regulation intersect and interact to regulate pluripotency and cell fate specification remains enigmatic. The FGF [4] and ACTIVIN [5] signaling pathways play crucial roles in pluripotency and cell fate specification of hESCs. We therefore dissected the interplay of ACTIVIN and FGF signaling and identified their target genes in hESCs using genome-wide approaches (microarrays, RNA-seq and small RNA-seq). Common target genes are currently being identified and validated thanks to the development of new bioinformatic tools integrating data from our laboratory and from publications. The different computational approaches will be presented, as well as preliminary candidate genes. [1] Smith, A. EMBO Mol. Med. 1, 251–4 (2009). [2] Young, R. Cell 144, 940–54 (2011). [3] Chen, K. G., et al. Cell Stem Cell 14, 13–26 (2014). [4] Lanner, F. & Rossant, J. Development 137, 3351–60 (2010). [5] Tam, P. P. L. & Loebel, D. A. F. Nat. Rev. Genet. 8, 368–81 (2007). |
Spies D*, Renz P, Beyer T, Ciaudo C
*ETH Zürich, Switzerland |
Regulation, Pathways, and Systems Biology |
|
H05 |
In mathematical modeling of biochemical reaction networks with the frequently used nonlinear ordinary differential equation (ODE) models, parameter estimation and uncertainty analysis are major tasks. In this context, the term sloppiness has recently been introduced for an unexpected characteristic of nonlinear ODE models from the literature. In particular, a broadened eigenvalue spectrum of the Hessian matrix of the objective function, covering orders of magnitude, is observed, although no such hierarchy of parameter uncertainties was expected a priori. In this work, it is shown that sloppiness originates from structures in the sensitivity matrix arising from the properties of the model topology and the experimental design. It will be clarified that the intensity of the sloppiness effect is controlled by the design of experiments, i.e., by the data. Thus, we conclude that the assignment of sloppiness to a model as a general characteristic is incomplete without discussing experimental design aspects. Furthermore, we validate this proposition by presenting strategies that use optimal experimental design methods to circumvent the sloppiness issue, and we show results of non-sloppy designs for a benchmark model. |
Tönsing C*
*University of Freiburg, Germany |
Regulation, Pathways, and Systems Biology |
|
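The dependence of the Hessian eigenvalue spectrum on the experimental design, as described in the H05 abstract, can be illustrated with a toy calculation. The sketch below is an assumption-laden example rather than the benchmark model from the poster: it uses a hypothetical two-parameter model y(t) = exp(-k1 t) + exp(-k2 t), approximates the Hessian of a least-squares objective by the Gauss-Newton form H = SᵀS (S being the sensitivity matrix, unit noise assumed), and compares two hypothetical sets of measurement times.

```python
import math


def sensitivities(times, k1, k2):
    """Rows of the sensitivity matrix dy/dk for y(t) = exp(-k1*t) + exp(-k2*t)."""
    return [(-t * math.exp(-k1 * t), -t * math.exp(-k2 * t)) for t in times]


def hessian_eigenvalues(times, k1=1.0, k2=3.0):
    """Eigenvalues (largest, smallest) of the 2x2 Gauss-Newton Hessian H = S^T S."""
    rows = sensitivities(times, k1, k2)
    a = sum(s1 * s1 for s1, _ in rows)   # H[0][0]
    b = sum(s1 * s2 for s1, s2 in rows)  # H[0][1] == H[1][0]
    c = sum(s2 * s2 for _, s2 in rows)   # H[1][1]
    trace, det = a + c, a * c - b * b
    lam_max = 0.5 * (trace + math.sqrt(trace * trace - 4.0 * det))
    lam_min = det / lam_max  # product of eigenvalues equals the determinant
    return lam_max, lam_min


def eigenvalue_spread(times):
    """Ratio of largest to smallest eigenvalue; larger means a 'sloppier' spectrum."""
    lam_max, lam_min = hessian_eigenvalues(times)
    return lam_max / lam_min


# Two hypothetical designs for the same model and the same parameter values.
spread_late = eigenvalue_spread([0.5, 1.0, 2.0, 4.0])
spread_early = eigenvalue_spread([0.1, 0.2, 0.4, 0.8])
print(f"late sampling:  {spread_late:.1f}")
print(f"early sampling: {spread_early:.1f}")
```

The model and parameters are identical in both calls; only the sampling times change, yet the eigenvalue spread differs, which is the point the abstract makes about sloppiness being a joint property of model and data.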
H06 |
Neurogenesis involves the function of a plethora of transcription factors that become available at distinct stages of neuronal development. Using bioinformatics tools and data mining approaches, combined with genome-wide transcriptome profiling during neurogenesis, we shortlisted a set of transcription factors that are critical for the transcriptional reprogramming underlying neuronal development. One such factor was JA1, which is known for its neurogenic potential. However, its genomic targets and mode of gene regulation during neurogenesis have not been explored. Here we show that ectopic expression of JA1 is sufficient to cause neuronal differentiation of ES cells. In line with this, we find that JA1 induces the entire neuronal differentiation program. We find that JA1 directly binds to the proximal and distal regulatory regions of key neurogenesis genes to induce neuronal fate. We next implement Bayesian modelling to infer the chromatin and transcription factor landscape at JA1 target sites in ES cells prior to JA1 occupancy. We find that in the absence of JA1, its target genomic regions are heterochromatic in nature and are enriched for repressor proteins. Following expression, JA1 is able to target its genomic sites in a highly sequence-specific fashion. Such targeting results in a loss of heterochromatin as well as repressor proteins and a gain of features typical of active chromatin. Interestingly, JA1 targeting to intergenic regions was sufficient to create active enhancers that drive gene activation. Overall, our comprehensive findings uncover the entire gene regulatory program through which JA1 specifies the neuronal fate and reveal how this function involves reprogramming the transcription factor and chromatin landscapes at its target sites. |
Pataskar A*, Jung J, Smiolowski P, Straub T, Tiwari V
*IMB Mainz, Germany |
Regulation, Pathways, and Systems Biology |
|
H07 |
One of the steps in the maturation of eukaryotic pre-mRNAs is 3' end cleavage and polyadenylation. In recent years, it has become clear that 3' end processing is quite dynamic, with different transcript forms being generated from a given gene in different tissues and different cell states. Moreover, highly proliferative states such as cancer systematically favour cleavage and polyadenylation at 5' proximal sites, which modulates the malignancy of the cells [1]. These observations renewed the interest in 3' end processing and its dynamic regulation. Our group has previously uncovered that the level of cleavage factor I (CF Im), a core component of the 3' end processing machinery, strongly influences the choice of polyadenylation sites [2]. To further understand the mechanisms underlying this regulatory effect as well as to uncover additional factors that regulate 3' end processing, we constructed a genome-wide catalogue of polyadenylation sites that were observed across a large set of experiments. Furthermore, we developed a computational approach to infer regulators of polyadenylation site usage and to study their mechanism of action. Applying this approach to 3' end processing site data obtained from cells that were exposed to various treatments, we uncovered a position-dependent effect of CF Im binding on the choice of poly(A) sites. [1] Sandberg, R., Neilson, J. R., Sarma, A. & Sharp, P. A. Science (2008). [2] Martin, G., Gruber, A. R., Keller, W. & Zavolan, M. Cell Rep. (2012). |
Schmidt R*
*University of Basel - Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland |
Regulation, Pathways, and Systems Biology |
|
H08 |
Parameter estimation is often the bottleneck in biological system modeling. For ordinary differential equation (ODE) models, the challenge in this estimation has been attributed not only to the lack of parameter identifiability, but also to computational issues such as finding globally optimal parameter estimates over a high-dimensional search space. Recent methods using an incremental estimation approach can alleviate the computational difficulty by performing the parameter estimation one reaction at a time. However, incremental estimation strategies usually require data smoothing and are known to produce biased parameter estimates. In this work, we developed a new parameter estimation method called integrated flux parameter estimation (IFPE). We employed the integral form of the ODE such that we could compute the integral of reaction fluxes from time-series concentration data without data smoothing. Here, we formulated the parameter estimation as a nested optimization problem. In the outer optimization, we minimized model prediction errors over parameters associated with a subset of reactions labeled as independent. The dimension of the independent reaction subset was equal to the degrees of freedom in the calculation of integrated fluxes (IFs) from concentration data. We selected the independent reactions such that, given their IF values, the IFs of the remaining (dependent) reactions could be uniquely determined. Meanwhile, in the inner optimization, we estimated the model parameters associated with the dependent reactions, one reaction at a time, by minimizing the dependent IF prediction errors. We demonstrated the performance of the IFPE method using case studies of ODE models of metabolic networks. In the case studies, IFPE significantly outperformed standard simultaneous parameter estimation in terms of computational efficiency and scaling. 
In comparison to the incremental parameter estimation (IPE) method, IFPE produced parameter estimates with significantly lower bias and did not require time-series data smoothing. The advantages of IFPE over IPE, however, came at the cost of a small increase in computational time. |
Liu Y*, Gunawan R
*Swiss Federal Institute of Technology Zürich (ETH Zurich) & SIB Swiss Institute of Bioinformatics, Switzerland |
Regulation, Pathways, and Systems Biology |
|
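The idea in the H08 abstract of working with the integral form of the ODE, so that fluxes are integrated from concentration data rather than obtained by differentiating smoothed data, can be sketched on a deliberately simple case. The example below is not the IFPE method itself: it assumes a hypothetical single first-order reaction dx/dt = -k x, for which the integral form reads x(t) - x(0) = -k ∫₀ᵗ x(s) ds, and it estimates k by linear least squares using trapezoidal integration of the sampled concentrations. All function names are illustrative.

```python
import math


def cumulative_trapezoid(times, values):
    """Running trapezoidal integral of `values` over `times`, starting at 0."""
    integrals = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        integrals.append(integrals[-1] + 0.5 * dt * (values[i] + values[i - 1]))
    return integrals


def estimate_decay_rate(times, conc):
    """Least-squares estimate of k from x(t) - x(0) = -k * integral_0^t x(s) ds.

    No derivative of the data is needed, so no smoothing step is required.
    """
    integrals = cumulative_trapezoid(times, conc)
    numerator = sum(area * (x - conc[0]) for area, x in zip(integrals, conc))
    denominator = sum(area * area for area in integrals)
    return -numerator / denominator


# Synthetic, noise-free data generated with an assumed true rate of 0.7.
times = [0.1 * i for i in range(51)]
conc = [math.exp(-0.7 * t) for t in times]
k_hat = estimate_decay_rate(times, conc)
```

In this single-reaction case the regression is linear in k. The nested optimization described in the abstract addresses the general situation in which several reactions share the same concentration data.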
H09 |
BACKGROUND: The amoeba D. discoideum is a bacterial predator. Its response to bacteria is relevant to infections in humans because the underlying human mechanisms likely evolved from pathways that primitive eukaryotes use to defend against bacteria. At present, though, there is little information on which genes in D. discoideum are responsible for coordinating bacterial recognition and resistance. A handful of these genes were recently identified through a screen for mutants that can grow on either Gram-positive or Gram-negative bacteria. Our aim was to extend this list by computationally proposing new gene candidates, considering a plethora of available information. RESULTS: Our new method Collage prioritizes genes based on a large collection of heterogeneous data. In a case study on Dictyostelium, we started from four bacterial response genes and 14 different data sets ranging from gene expression to pathway and literature information. Collage proposed eight candidate genes that were tested in the wet lab. Mutations in all eight candidates reduced the ability of the amoebae to grow on Gram-negative bacteria. This is a remarkably accurate result, since only about a hundred of the 12,000 Dictyostelium genes are estimated to be responsible for bacterial response. CONCLUSIONS: Collage builds on our collective penalized matrix tri-factorization approach to data fusion, which simultaneously compresses heterogeneous data and retains them in the original domain space. Collage can consider any data set represented as a matrix, including attribute-based representations, ontologies, associations and networks. Its high accuracy in the bacterial resistance study of D. discoideum shows promise for future applications. |
Zitnik M*, Nam EA, Dinh C, Kuspa A, Shaulsky G, Zupan B
*University of Ljubljana, Slovenia |
Regulation, Pathways, and Systems Biology |
|
H10 |
A major difficulty in genome-scale metabolic network reconstructions and comparisons is integrating data from different resources, which may use different nomenclatures and conventions for metabolites and reactions. To address this issue, we have developed MNXref, a precompiled automatic reconciliation of many of the most commonly used metabolic resources (ChEBI, Rhea, KEGG, MetaCyc, BRENDA, BiGG, The SEED, UniPathway, BioPath, HMDB, LipidMaps) (1). Based on MNXref, we designed MetaNetX.org, a website for accessing, analyzing, and manipulating genome-scale metabolic networks (GSMs) as well as biochemical pathways (2). It consistently integrates data from various public resources and makes the data accessible in a standardized format using the MNXref reconciliation. Currently, it provides access to hundreds of GSMs and pathways that can be interactively visualized, compared (two or more), analyzed (e.g. detection of dead-end metabolites and reactions, flux balance analysis, or simulation of reaction and gene knockouts), manipulated, and exported. Users can also upload their own metabolic models, choose to automatically map them into the common MNXref namespace, and subsequently make use of the website's functionality. New methods and data developed within MetaNetX, a project supported by the SystemsX.ch initiative, are also provided through this website. (1) Bernard, T., Bridge, A., Morgat, A., Moretti, S., Xenarios, I., and Pagni, M. (2014). Reconciliation of metabolites and biochemical reactions for metabolic networks. Briefings in Bioinformatics 15, 1. (2) Ganter, M., Bernard, T., Moretti, S., Stelling, J., and Pagni, M. (2013). MetaNetX.org: a website and repository for accessing, analysing and manipulating metabolic networks. Bioinformatics 29, 815–816. |
Pagni M*, Moretti S, Martin O, Tran TVD, Bernard T, Ganter M, Bridge A, Morgat A, Xenarios I, Stelling J
*SIB Swiss Institute of Bioinformatics, Switzerland |
Regulation, Pathways, and Systems Biology |
|
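One of the analyses mentioned in the abstract above, detection of dead-end metabolites, can be illustrated with a minimal sketch. This is not MetaNetX code: the toy network, the `boundary` set and the function name `dead_end_metabolites` are assumptions made for illustration, and the definition used (a non-boundary metabolite that is never produced, never consumed, or that takes part in a single reaction only) is a common simplification.

```python
from collections import defaultdict


def dead_end_metabolites(reactions, boundary=()):
    """Return non-boundary metabolites that cannot carry steady-state flux.

    reactions: dict mapping a reaction name to (substrates, products, reversible).
    boundary:  metabolites exchanged with the environment (hypothetical input).
    """
    produced, consumed = set(), set()
    membership = defaultdict(set)  # metabolite -> names of reactions it occurs in
    for name, (substrates, products, reversible) in reactions.items():
        consumed.update(substrates)
        produced.update(products)
        if reversible:  # a reversible reaction can run in both directions
            consumed.update(products)
            produced.update(substrates)
        for metabolite in list(substrates) + list(products):
            membership[metabolite].add(name)
    dead = []
    for metabolite, names in membership.items():
        if metabolite in boundary:
            continue
        one_sided = metabolite not in produced or metabolite not in consumed
        if one_sided or len(names) < 2:
            dead.append(metabolite)
    return sorted(dead)


# Toy network (illustrative only, not taken from any database).
toy = {
    "r1": (["A"], ["B"], False),
    "r2": (["B"], ["C"], False),
    "r3": (["B"], ["D"], True),   # D occurs only here, so it is a dead end
    "r4": (["C"], ["E"], False),
    "r5": (["C"], ["F"], False),  # F is produced but never consumed
}
print(dead_end_metabolites(toy, boundary={"A", "E"}))
```

In a real reconstruction the boundary set would come from the model's exchange reactions, and namespace reconciliation would have to happen first so that the same metabolite is not counted under two identifiers.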
H11 |
Regulatory networks are becoming widely accepted as a useful mathematical tool to build qualitative models of regulatory processes. Boolean networks, as opposed to more quantitative modeling frameworks, are particularly interesting since they can cope with relatively large systems and do not suffer from the lack of data on the stoichiometry and kinetics of biochemical reactions. Boolean networks can be used to predict the behavior of the underlying biological system by performing in silico perturbations of any combination of genes and measuring the stable phenotypes reached by the network. Using Boolean networks to perform in silico experiments is much faster and cheaper than in vivo or in vitro experiments and, although it does not eliminate the need for such experiments, it can significantly reduce the number needed to find, for example, therapeutically interesting combinations of treatments. In this work, we present a method to infer Boolean regulatory networks using as much orthogonally generated data as possible, ranging from a prior knowledge network (PKN) obtained from the literature to training sets obtained from in vivo or in vitro experiments (gene expression, FACS, ...). |
Dorier J*, Niknejad A, Crespo I, Roller A, Tarditi A, Pradel LP, Maisel D, Berntenis N, Liechti R, Ebeling M, Xenarios I
*SIB Swiss Institute of Bioinformatics, Switzerland |
Regulation, Pathways, and Systems Biology |
|
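The idea of perturbing a Boolean network in silico and reading off its stable states, as described in the abstract above, can be sketched in a few lines. The three-gene network, the rule set `RULES` and the function names are hypothetical and chosen only for illustration; the sketch uses synchronous updates and exhaustive enumeration, which is feasible only for small networks, and it does not represent the inference method of the poster.

```python
from itertools import product

# Hypothetical toy network: A and B repress each other, C follows A.
RULES = {
    "A": lambda s: not s["B"],
    "B": lambda s: not s["A"],
    "C": lambda s: s["A"],
}


def step(state, rules, fixed):
    """Apply one synchronous update; perturbed genes keep their forced value."""
    new_state = {gene: bool(rule(state)) for gene, rule in rules.items()}
    new_state.update(fixed)
    return new_state


def stable_states(rules, fixed=None):
    """Enumerate all states and return the fixed points as 0/1 tuples.

    fixed: dict gene -> bool describing knock-outs (False) or knock-ins (True).
    """
    fixed = fixed or {}
    genes = sorted(rules)
    free = [g for g in genes if g not in fixed]
    found = []
    for values in product([False, True], repeat=len(free)):
        state = dict(zip(free, values))
        state.update(fixed)
        if step(state, rules, fixed) == state:
            found.append(tuple(int(state[g]) for g in genes))
    return found


print(stable_states(RULES))                      # unperturbed network
print(stable_states(RULES, fixed={"A": False}))  # knock-out of A
```

States are reported in the order (A, B, C). The unperturbed toy network has two stable states, and knocking out A leaves only the one in which B is active, which is the kind of phenotype change an in silico perturbation screen would look for.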
H12 |
Various modelling frameworks have been devised to study complex biological systems. In this context, the logical formalism has proven to be well adapted for large signalling-regulatory networks [1]. In the case of multi-cellular processes, such as tissue pattern formation, one has to consider intracellular regulatory networks and cell-to-cell communication that together drive the behaviour of the whole. We present our framework to tackle the logical modelling of simple (one layer) epithelia defined as multi-cellular systems involving intracellular regulatory models, cell-to-cell communication, as well as other environmental cues. Such epithelial models are defined as cellular automata, where the behaviour of each cell is governed by its (logical) regulatory model, subject to input signals from neighbouring cells. Neighbouring relations are defined through appropriate functions, which qualitatively describe signalling ranges and integration. This framework has been implemented in the form of a software tool called EpiLog, freely available at http://ginsim.org/epilog, which enables the definition, simulation and visualisation of epithelial models. A graphical interface allows the user to configure a 2D grid of hexagonal cells, with each being attributed a logical model (loaded as an SBML file) and integration functions that define cell-cell communication. Simulation parameters include: initial configuration (state) of the grid, updating scheme (synchronous, α-asynchronous [2] and priorities), as well as perturbations (e.g. gene knock-out or knock-in in all or a subset of the cells). Simulation results in successive states of the grid that can be visualized and recorded. We illustrate the use of our framework on the regulatory control of the Drosophila eggshell pattern formation, where a group of specialised cells shape the dorsal appendages [3]. References: [1] C. Chaouiya et al. (2013) BMC Systems Biology, 7:135. [2] N. Fatès (2013) LNCS 8155, 15:30. [3] A. Fauré et al. (2014) PLoS Computational Biology, 10(3), e1003527. |
Varela P*, Fauré A, Chaouiya C, Monteiro P
*INESC-ID, Portugal |
Regulation, Pathways, and Systems Biology |
|
H13 |
Identifying SNPs that interfere with transcription factor-target site interactions is important for understanding regulatory genetic variation. Here, we define a framework exploiting the fact that TF binding is partly predictable from DNase I hypersensitivity (DGF) assays. Our method requires: (i) genotypes and DGF data for a cohort of individuals, (ii) one ChIP-seq experiment per TF for the same cell type and (iii) a position weight matrix (PWM) for the TF. We scan the genome with the PWM to generate a list of predicted binding sites. Next, we train a predictor using one ChIP-seq experiment to predict TF occupancy from DGF data. We then use this model to predict TFBS occupancy in other individuals. We then compile a candidate list of SNPs that lie within a predicted TFBS, cause a significant difference in PWM score and are highly polymorphic. Finally, we perform genotype-phenotype correlation for each SNP-TFBS pair using DGF-inferred TF occupancy as the phenotype. As a proof of concept, we chose CTCF. Data in this study include genotypes for 63/42 Yoruba individuals from HapMap Phase III/the 1000 Genomes Project, DGF profiles for the same individuals (GEO/GSE31388), CTCF ChIP-seq data from ENCODE (GEO/GSE33213), and the CTCF PWM from JASPAR (MA0139.1). We obtained a good prediction (Pearson’s correlation, R=0.78) of CTCF binding using Multivariate Adaptive Regression Splines. Using these predicted values, we identified 84 SNPs (found in both the HapMap and 1000 Genomes data) associated with differential CTCF binding (FDR≤10%). This general framework can be applied to other factors, thereby reducing the costs and potentially speeding up the discovery of regulatory SNPs. |
Kumar S*, Dreos R, Ambrosini G, Bucher P
*Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland |
Regulation, Pathways, and Systems Biology |
|
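Two steps of the framework above, scanning a sequence with a PWM and measuring how much a SNP changes the PWM score of a site, can be sketched as follows. The count matrix, the threshold and all function names are hypothetical; this is not the CTCF matrix MA0139.1, and a real scan would also cover the reverse strand and use a background model fitted to the genome.

```python
import math


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into a log2-odds PWM."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in "ACGT"
        })
    return pwm


def score_site(site, pwm):
    """Sum the log-odds of each base of a site with the same width as the PWM."""
    return sum(pwm[i][base] for i, base in enumerate(site))


def scan(sequence, pwm, threshold):
    """Return (position, score) for every forward-strand window above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        s = score_site(sequence[start:start + width], pwm)
        if s >= threshold:
            hits.append((start, s))
    return hits


def snp_score_change(ref_site, alt_site, pwm):
    """PWM score difference caused by replacing the reference allele."""
    return score_site(alt_site, pwm) - score_site(ref_site, pwm)


# Hypothetical 3-bp motif "TGA" observed in 8 aligned sites.
toy_pwm = log_odds_pwm([{"T": 8}, {"G": 8}, {"A": 8}])
print(scan("CCTGACC", toy_pwm, threshold=3.0))
print(snp_score_change("TGA", "TGC", toy_pwm))
```

SNPs with a large negative score change inside a predicted site would be the ones retained as candidates before the genotype-phenotype correlation step.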
H14 |
Much of the recent interest in variants affecting expression (eQTL) has focused on understanding their relationship with risk loci for disease. Currently, expression studies are reporting eQTL found in many different tissues. By combining a complete ascertainment of eQTL and the tissues they act in with information from GWAS, we should have a better concept of mechanisms of disease. We have taken RNA-seq data from four tissues (skin, fat, whole blood and LCLs) in 800 individuals and deconvolved expression into "tissue specific" and "cross tissue" components. These two new phenotypes were used to identify cis eQTL, both acting across tissues and in specific tissues, and we compare the results to those of a standard cis mapping treating tissues independently (referred to as standard analysis). We find more than 91% of the genes have a cross tissue eQTL (FDR < 0.05); the standard analysis found 84% of genes had an eQTL in at least one tissue. More cross tissue eQTL (38%) lie within 20kb of the transcription start site than tissue specific eQTL (30%), consistent with eQTL acting in multiple tissues being often located in promoter regions. Replication rates across tissues were higher for tissue shared eQTL, (median replication rate of 93% compared to 69% for tissue specific eQTL). Further work will integrate these tissue shared and specific eQTL with GWAS signals, to help prioritise the search for new risk loci, and to infer mechanisms and important tissues in disease pathology. |
Brown A*, Viñuela A, Buil A, Spector T, Dermitzakis E
*University of Geneva & SIB Swiss Institute of Bioinformatics, Switzerland |
Regulation, Pathways, and Systems Biology |
|
I01 |
Pacific Biosciences sequencing allows for the simultaneous detection of DNA methylation, in particular m6A and m4C (Flusberg et al., 2010). The motifs around the methylation sites are provided by the SMRT pipeline in GFF format. However, visualization of these motifs, methylated or non-methylated, for example on a bacterial genome, is not part of the pipeline. Circos is a software package for producing publication-quality images of large-scale data (Krzywinski et al., 2009); however, mastering its numerous configuration files requires skills that are not easy for biologists to acquire. We developed a tool written in Perl and an associated web site called PACMAN (PacBio Methylation Analyzer) that allows users to easily create images with Circos through a user-friendly interface. The required files are 1) the genome or draft in FASTA format and 2) the motifs file in GFF format (from the SMRT pipeline). The counts of each motif are calculated according to a customizable sliding window and normalized by their expected frequency. PACMAN generates a publication-quality image of the selected methylated motif counts, their locations and the non-methylated locations, on one or both strands of the DNA. PACMAN is available on this web site: http://www.unifr.ch/bugfri/pacman References: Flusberg, B.A., Webster, D.R., Lee, J.H., Travers, K.J., Olivares, E.C., Clark, T.A., Korlach, J., and Turner, S.W. (2010). Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods 7, 461–465. Krzywinski, M., Schein, J., Birol, I., Connors, J., Gascoyne, R., Horsman, D., Jones, S.J., and Marra, M.A. (2009). Circos: An information aesthetic for comparative genomics. Genome Res. 19, 1639–1645. |
Loetscher A, Falquet L*
*University of Fribourg & SIB Swiss Institute of Bioinformatics, Switzerland |
Sequencing and Sequence Analysis |
|
I02 |
X-ray crystal structures have revealed that numerous membrane proteins, such as GPCRs or some secondary transporters, despite the lack of any detectable sequence similarity between them, still share very similar 3D structures. Moreover, some proteins were originally categorized into different sequence families and yet after their structural models became available, it has been revealed that they may share the same evolutionary ancestor. One of the representative examples is the LeuT-fold transporters. Their core structure consists of two units of 5TM helices. That these two units are related implies that LeuT-like transporters evolved from gene-duplication and fusion events. However, the lack of significant sequence similarity requires sensitive sequence search methods for analyzing their evolution. To this end, we developed a software application called AlignMe, which can use various types of input information, such as residue hydrophobicity, to perform pairwise alignments of sequences and/or of hydropathy profiles of (membrane) proteins. Here, we describe a modification of the dynamic programming algorithm that allows positions in the input sequences to be connected, or constrained. This novel feature allows the user to define any number of so-called “anchors” with varying strength for improving the quality of pairwise alignments in challenging cases lacking notable sequence similarity. Information about possible anchors can be obtained from experimental studies, expert knowledge of specific motifs or even from the alignments of hydropathy profiles. There are manifold applications in homology modeling as well as in the context of mutagenesis experiments. |
Khafizov K*, Staritzbichler R, Ivanov M, Stamm M, Forrest LR
*Moscow Institute of Physics and Technology, Russian Federation |
Sequencing and Sequence Analysis |
|
I03 |
As the majority of TFs bind to DNA in a sequence specific manner, computational methods for motif discrimination have been critically important for the prediction of TFBSs. The interaction of TFs and DNA results from a complex interplay between nucleotide-amino acid synergy and the intrinsic topological DNA structure. We previously developed a sequence-based flexible hidden Markov model approach for TFBS prediction, the Transcription Factor Flexible Model (TFFM). The method capitalizes on ChIP-seq data with a refined approach that captures the internal relationships between positions. Existing software, DNAshape, computes four DNA shape features associated with a DNA sequence: the minor groove width, the roll, the propeller twist, and the helix twist. We hypothesized that combining both the DNA sequence and the four DNA shape features will enhance the TFBS predictions. We developed a new model which combines the sequence and the structure information of the DNA at TFBSs to extend the TFFMs. Our results on 400 human ENCODE ChIP-seq data sets show that adding the DNA shape features to the TFFM scores in a machine learning framework improves the prediction of TFBSs relative to using TFFM scores alone across almost all data sets. We highlighted that the observed significant improvements in TFBS predictions are not specific to the TFFMs as they are reproduced when combining the classical position weight matrix scores with DNA shape information in the same framework. The results highlight that incorporating DNA shape information is more beneficial to specific TF families, providing new insights to TF-DNA interactions. |
Mathelier A*, Wasserman W
*University of British Columbia, Vancouver, BC, Canada |
Sequencing and Sequence Analysis |
|
I04 |
The 3' ends of most RNA polymerase II-generated transcripts are processed by cleavage and addition of a poly(A) tail of ~70 nucleotides. Many groups have developed experimental methods that allow identification and quantification of mRNA 3' ends at a genome-wide scale. Currently, 14 different protocols have contributed more than 4.5 billion reads that pertain to putative pre-mRNA 3' end processing sites in mouse and human. Although each protocol comes with its own advantages and disadvantages, a common problem arises from the priming of oligo-T sequences at A-rich regions other than the poly(A) tail. Such technical limitations, together with widely differing computational analyses of the resulting data sets, have led to vastly different numbers of 3' end processing sites being reported in different systems. In human, between 280,000 [1] and 1,287,130 [2] polyA sites have been identified. Given the pervasiveness of alternative polyadenylation and its importance for cell physiology [3], we constructed the “PolyAsite” (http://www.polyasite.unibas.ch) resource, which integrates the majority of the large-scale data available to date. We have further developed a computational approach to identify novel polyA signals. Supported by various validation steps, PolyAsite is a reliable resource for studying mRNA metabolism and dynamics. [1] Derti et al., 2012, Genome Res., 22: 1173–1183 [2] Lin et al. 2012, Nucleic Acids Res., 40: 8460–71 [3] Gruber et al. 2013, Wiley Interdiscip. Rev. RNA 5: 183–196 |
Gruber A*, Gruber AR, Schmidt R, Belmadani M, Zavolan M
*University of Basel - Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland |
Sequencing and Sequence Analysis |
|
I05 |
Evaluation of a probabilistic partitioning approach to systematically refine ChIP-seq peaks location
Genomic profiling of regions bound by a transcription factor or bearing histones with specific modifications has yielded many biological insights in the last few years. Obtaining good-quality ChIP-seq peaks is still a challenge, as simply putting a threshold on peak height or false discovery rate does not necessarily select the best peaks. Recently, a probabilistic partitioning method has been proposed [Nair, et al. Bioinformatics 2014], which is able to separate the “good” from the “bad” peaks using the signal shapes, and subsequently shift the coordinates to best align the peaks on their shapes. That study showed successful application of the approach to CTCF. Here, we applied this approach to a transcription factor (glucocorticoid receptor, GR) in three cell lines, as well as to histone marks, to assess its systematic use in our analysis pipelines. For GR, the partitioning yields good results, although using the presence of a motif as an indicator may be misleading: the GR motif is not particularly enriched in two cell lines, suggesting indirect binding. For histone marks, the results are disappointing, mainly because the shape of the signal does not follow a Gaussian. Furthermore, we asked whether using only the “good” class influences the motif discovery step, compared to using the complete list of peaks. We indeed found that this improves the results of motif discovery, as the “bad” class no longer affects the enrichment of overrepresented sequences in our peak calling results. |
Tirado Magallanes R*, Hernandez C, Thieffry D, Thomas-Chollier M
*École Normale Supérieure, France |
Sequencing and Sequence Analysis |
|
I06 |
Correlation or Mutual Information between pairs of alignment columns arises from evolutionary constraints acting on a multitude of protein or nucleic acid positions. Given the current growth of sequence information from a diverse range of species, this type of co-evolutionary signal can be confidently detected and exploited in predictive computational methods. However, the co-evolutionary signal needs to be gauged against the background Mutual Information, which is linked to the finite alignment depth and compositional biases. We present a precise formula for the computation of the background Mutual Information that is independent of the residue pairing in the chosen alignment columns. For illustrative purposes, the formula was applied to (i) a multiple sequence alignment and (ii) stacked sequences of a 1D-encoded molecular dynamics trajectory of the switch II region of the human Ras protein. A comparison of the new formula with the commonly used Mutual Information of randomised columns illustrates the validity of both analytical and numerical methods. |
Kleinjung J*, Coolen ACC
*The Francis Crick Institute, United Kingdom |
Sequencing and Sequence Analysis |
|
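The numerical baseline referred to in the abstract above, the Mutual Information of randomised columns, can be sketched as follows. The analytical background formula of the poster is not given in the abstract and is not reproduced here; the function names, the number of shuffles and the toy columns are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[a] * count_y[b]))
    return mi


def shuffled_background(col_x, col_y, n_shuffles=1000, seed=0):
    """Average MI after randomly permuting one column (numerical background)."""
    rng = random.Random(seed)
    shuffled = list(col_y)
    total = 0.0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += mutual_information(col_x, shuffled)
    return total / n_shuffles


# Toy columns from a hypothetical alignment of depth 8.
x = "AAAACCCC"
y = "GGGGTTTT"
observed = mutual_information(x, y)
background = shuffled_background(x, y)
print(observed, background)
```

Because the alignment is shallow, the shuffled background is clearly above zero even though shuffling destroys any real coupling, which is the finite-depth effect the abstract says the signal has to be gauged against.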
I07 |
Assembling reads into finished genomes is a challenging task, exacerbated by sequence repeats and the inability of short reads to capture sufficient genomic information to resolve those problematic regions. Emerging nanopore technologies and the long reads they produce show great promise in this regard. In this work, we present LINKS, our Long Interval Nucleotide K-mer Scaffolder. LINKS is a scalable algorithm that makes use of the information in error-rich long reads, without the need for read alignment or base correction. We demonstrate its use in scaffolding applications on several datasets. We show how the contiguity of an ABySS E. coli K-12 genome assembly could be increased over five-fold by the use of beta-released Oxford Nanopore Ltd. (ONT) long reads, and how LINKS leverages long-range information in S. cerevisiae W303 ONT reads to yield an assembly with less than half the errors of competing applications. We also provide a proof-of-concept on how LINKS scales to larger genomes by re-scaffolding the colossal white spruce assembly draft (PG29, 20 Gbp) using orthogonal datasets. We expect LINKS to have broad utility in harnessing the potential of long reads in connecting high-quality sequences of small and large genome assembly drafts. |
Warren RL, Vandervalk BP, Yang C, Birol I*
*BC Cancer Agency, Canada |
Sequencing and Sequence Analysis |
|
I08 |
Profiling the transmitted founder virus remains a challenge in the HIV-1 field, owing to the paucity of studies that analyze the relationship between the viral swarm of the transmitter and that of the recipient. Here we aim to determine whether transmission of HIV-1 is a stochastic process by reconstructing HIV-1 haplotypes from Zürich Primary HIV Infection Study (ZPHI) and Swiss HIV Cohort Study (SHCS) individuals belonging to 8 transmission pairs. SHCS Drug Resistance Database HIV-1 pol genotypic resistance test sequences were used to determine whether individuals enrolled in the SHCS or ZPHI were possible transmitters for ZPHI individuals. Pairs were identified via phylogenetic clustering of sequences (distance < 1.5%) and validated with clinical data. Full-length sequencing of plasma viral RNA was performed with the Illumina MiSeq Reagent Kit v2 (500 cycles) at the time point nearest transmission. Statistical modeling and gene-wise haplotype reconstruction were performed (for genes: p24, gp120, and gp41) to determine the stochasticity. Gene-wise haplotypes for the given pairs clustered together; when the Shannon entropy of the transmitter was >0.02, the recipient’s haplotypes formed a nested cluster. Key regions of interest: the major haplotype was transmitted in 5/8 pairs for p24, and in 1/8 and 3/8 pairs for gp120 and gp41 respectively - genes known to harbor a high degree of diversity. We have successfully reconstructed the HIV-1 viral haplotypes of 8 transmission pairs. The monophyletic clustering of haplotypes further confirms the pairs. Modeling of the observed haplotypes in the pairs did not reveal any significant indication of stochasticity or selection. |
Campbell N*, Seifert D, Leemann C, Kuster H, Braun D, Weber R, Günthard HF, Beerenwinkel N, Metzner KJ
*UniversitätsSpital Zürich, Switzerland |
Sequencing and Sequence Analysis |
|
I09 |
High-throughput genomics has revolutionised biological research. However, while the number of sequenced genomes grows by the day, quality assessment of the resulting assembled sequences remains complicated and mostly limited to technical measures like N50. We propose a measure for quantitative assessment of genome assembly and annotation completeness based on evolutionarily informed expectations of gene content. We implemented the assessment procedure in open-source software, with large lineage-specific sets of Benchmarking Universal Single-Copy Orthologs, named BUSCOs. Our lineage-specific sets contain thousands of highly conserved single-copy genes, allowing high-resolution quality quantification of newly sequenced genomes, transcriptomes and annotated gene sets. The initial set of gene annotations generated by BUSCO provides an excellent source of information for training gene-finding programs, a crucial step in the annotation of any new genome. |
Simao Neto F*, Waterhouse R, Ioannidis P, Kriventseva EV, Zdobnov EM
*University of Geneva & SIB Swiss Institute of Bioinformatics, Switzerland |
Sequencing and Sequence Analysis |
|
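The contrast drawn in the abstract above, between a technical measure such as N50 and a completeness measure based on expected gene content, can be made concrete with a short sketch. The `completeness` function is a deliberately simplified stand-in and not the BUSCO algorithm; the gene identifiers, the copy-number input and the category names are assumptions made for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


def completeness(expected_genes, copies_found):
    """Classify expected single-copy genes by how many copies were detected.

    expected_genes: iterable of gene identifiers expected in the lineage.
    copies_found:   dict gene -> number of copies detected (hypothetical input).
    Returns fractions of single-copy, duplicated and missing genes.
    """
    expected = list(expected_genes)
    tally = {"single": 0, "duplicated": 0, "missing": 0}
    for gene in expected:
        n = copies_found.get(gene, 0)
        if n == 0:
            tally["missing"] += 1
        elif n == 1:
            tally["single"] += 1
        else:
            tally["duplicated"] += 1
    return {k: v / len(expected) for k, v in tally.items()}


print(n50([10, 8, 5, 4, 3]))
print(completeness(["g1", "g2", "g3", "g4"], {"g1": 1, "g2": 2, "g3": 0}))
```

An assembly can have a high N50 and still miss many expected genes, which is why the two kinds of measure are complementary.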
I10 |
Small nucleolar RNAs (snoRNAs) are a subclass of non-coding RNAs known to have a major role in post-transcriptional processing of other non-coding RNAs, mostly ribosomal RNAs. Recently, these non-coding RNAs have been implicated in several other processes ranging from microRNA-based silencing to alternative splicing. A crucial prerequisite for gaining a deeper understanding of these processes is a comprehensive map of snoRNA gene loci. In this work we describe an up-to-date catalog of human snoRNA gene loci combining data from various database sources, de novo prediction and extensive literature review. Moreover, we provide curated genomic coordinates of currently annotated snoRNAs and give insights into the plasticity of snoRNA gene expression as well as their processing patterns by analysing small RNA-seq data from the ENCODE project. We also provide a list of dysregulated snoRNAs in cancer cell lines and their associated targets. |
Jorjani H*, Gruber A, Zavolan M
*University of Basel - Biozentrum, Switzerland |
Sequencing and Sequence Analysis |
|
I11 |
Currently, more than 40 sequence tandem repeat detectors are published, providing heterogeneous, partly complementary, partly conflicting results. We present TRAL, a tandem repeat annotation library that allows running and parsing of various detection outputs, clustering of redundant or overlapping annotations, several statistical frameworks for filtering false positive annotations, and importantly a tandem repeat annotation and refinement module based on circular profile hidden Markov models. Availability and implementation: TRAL is an open-source Python3 library and is available, together with documentation and tutorials via http://www.vital-it.ch/software/tral. |
Schaper E*, Korsunsky A, Messina A, Murri R, Pečerska J, Stockinger H, Zoller S, Xenarios I, Anisimova M
*SIB Swiss Institute of Bioinformatics, Switzerland |
Sequencing and Sequence Analysis |
|
I12 |
The transcriptional landscape of the mammalian genome comprises a variety of different well-known RNA types, such as protein-coding mRNAs, long noncoding RNAs or microRNAs. Spatial and temporal changes in expression of these RNAs are thought to have contributed to the evolution of species- or lineage-specific phenotypic features. Circular RNAs (circRNAs) were recently discovered to represent a rather abundant class of transcripts. However, given that previous studies assessed circRNAs mainly in cell lines from a few individual species, the evolutionary dynamics of circRNAs remain poorly understood. To study the functional and evolutionary relevance of circRNAs, we generated comprehensive RNA sequencing datasets (total RNA; enzymatically-enriched for circRNAs) for three organs (cerebellum, liver, testis) across five species (human, rhesus macaque, mouse, rat, opossum) that represent three mammalian lineages (primates, rodents, marsupials). We combined experimental and computational approaches to thoroughly predict and annotate circRNAs on a genome-wide scale across these species on the basis of these data. I will present our current pipeline for the detection and transcript reconstruction of circRNAs. |
Gruhl F*, Janich P, Gatfield D, Kaessmann H
*University of Lausanne & SIB Swiss Institute of Bioinformatics, Switzerland |
Sequencing and Sequence Analysis |
|
J01 |
In mammals, new genetic screens using retroviral (or transposon) gene-trap vectors followed by PhiT-seq (Phenotypic interrogation via Tag sequencing) in a haploid genome have revolutionized the investigation of molecular networks responsible for various biological processes. However, a standardized and dedicated bioinformatics pipeline remains to be developed for analyzing the massive amount of sequencing data that is being generated. Here we describe VISITs (Vector Integration Sites Identification from PhiT-seq), a bioinformatics tool automating the identification of genes and intergenic regions enriched with independent insertion sites of a gene-trap vector. Given the genome-aligned reads, VISITs first automatically removes duplicates and multiple-hit reads, then counts insertion sites on compiled gene annotations or uses a sliding window approach to scan the whole genome. To further improve sensitivity and specificity, transcriptomic data from human haploid cells were integrated. Finally, statistical tests are performed to generate a candidate gene list and BED files as output. Using VISITs, we first investigated the effect of the duplicate definition and of sequencing depth on the analysis of PhiT-seq data. Applied to a public dataset (Jae, et al., 2013), VISITs recovered previously identified genes as well as additional candidates that appear biologically relevant. |
Yu J*, Monfort A, Ciaudo C
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland |
Technology and Software |
|
J02 |
FAIRDOM aims to establish a support and service network for European Systems Biology projects. As a joint action, the project aims at setting up an internationally sustained Data and Model Management Service and supporting research projects funded by ERA-Net for Systems Biology Applications (ERASysAPP), while extending its services to the wider European systems biology community. FAIRDOM builds on the experience and expertise of two existing systems biology management platforms: SEEK, developed as part of the SysMO-DB initiative, and the openBIS platform, developed as part of the SyBit project. |
Kuzyakiv R*, Krebs O, Wolstencroft K, Stanford N, Golebiewski M, Owen S, Nguyen Q, Bacall F, Morrison N, Straszewski J, Barillari C, Ramakrishnan C, Kunszt P, Rinn B, Snoep J, Müller W, Goble C
*Universität Zürich, Switzerland |
Technology and Software |
|
J03 |
Eoulsan [1] is a versatile open source framework that can reproducibly analyse huge amounts of sequencing data. Especially dedicated to service platforms, this workflow manager automates the analysis of a large number of samples by simply using a list of analysis steps (and their parameters) and an experimental design. Moreover, it is available on a wide choice of computational infrastructures: Hadoop clusters, cloud computing or any standard workstation. We present here Eoulsan 2 [2], a major update of our tool that enhances the original Eoulsan concepts with a new workflow manager. It allows the reuse of wrappers developed for the Galaxy platform [3], optionally enhanced with a link to a Docker [4] image packaging the tool to execute. Besides RNA-Seq, Eoulsan 2 now natively comes with new modules dedicated to ChIP-Seq analyses. Finally, in order to deploy Eoulsan on more diverse computational infrastructures, we plan to support Condor, TORQUE and other cluster schedulers before the end of 2015, in addition to Eoulsan's current Hadoop features. Our framework provides an integrated and flexible solution for high throughput sequencing data analyses, from standalone workstations to clusters and cloud computing. With its modular structure and its parallel data processing, Eoulsan takes up the challenge of the massive amounts of data produced by high throughput sequencing and brings a simple solution for reproducibility of analyses in bioinformatics. [1] Jourdren et al. Bioinformatics 2012. [2] http://transcriptome.ens.fr/eoulsan2 [3] Goecks et al. Genome Biol. 2010. [4] http://docker.com |
Perrin S, Hernandez C*, Thomas-Chollier M, Le Crom S, Jourdren L
*Institut de Biologie de l'Ecole Normale Supérieure, France |
Technology and Software |
|
J04 |
The characterization of interactions in protein-ligand complexes is essential for research in structural bioinformatics, drug discovery and biology. However, comprehensive tools are not freely available to the research community. Here, we present the protein-ligand interaction profiler (PLIP), a novel web service for fully automated detection and visualization of relevant non-covalent protein-ligand contacts in 3D structures, freely available at projects.biotec.tu-dresden.de/plip-web. The input is either a Protein Data Bank structure, a protein or ligand name, or a custom protein-ligand complex (e.g. from docking). In contrast to other tools, the rule-based PLIP algorithm does not require any structure preparation. It returns a list of detected interactions at the single-atom level, covering seven interaction types (hydrogen bonds, hydrophobic contacts, pi-stacking, pi-cation interactions, salt bridges, water bridges and halogen bonds). PLIP stands out by offering publication-ready images, PyMOL session files to generate custom images and parsable result files to facilitate subsequent data processing. The full Python source code is available for download on the website. PLIP’s command-line mode allows for high-throughput interaction profiling. |
Salentin S*, Schreiber S, Haupt VJ, Adasme MF, Schroeder M
*Technische Universität Dresden, Germany |
Technology and Software |
|
J05 |
When designing, discovering or developing a drug, potency on the target protein is only one side of the problem. Indeed, absorption, distribution, metabolism, and excretion (ADME) must be optimized for the compound to reach the biotarget in sufficient concentration. To this end, SwissADME gives access to a collection of predictive models to compute small-molecule properties. Physicochemical properties: particular emphasis is put on multiple predictions for lipophilicity and water solubility, to enable consensus approaches. Druglikeness filters: simple rules based on the predicted physicochemical properties estimate oral bioavailability. In addition, the “Egan Egg” is a simple 2D map useful for the optimization of absorption. Pharmacokinetic behaviors (e.g. brain permeation or P-glycoprotein substrate) are predicted by binary classification models relying on physicochemical properties. The straightforward chemical interpretation efficiently supports the design of molecules with an improved PK profile. We added procedures to evaluate whether a structure is a medicinal-chemistry-friendly molecule. Potentially problematic fragments are identified by cutting the structure into fragments and comparing them to molecular moieties known from the literature to be problematic. Moreover, synthetic accessibility is estimated by a score combining a molecular complexity penalty for specific chemical features (e.g. large and fused ring systems or many stereocenters) with contributions for about 450,000 fragments obtained by analyzing the structures of 12 million real molecules present in vendors’ stocks. The assumption is that the more frequent a fragment is, the easier a synthesis involving this fragment. SwissADME is part of the SwissDrugDesign environment developed by the Molecular Modeling Group at SIB Swiss Institute of Bioinformatics. |
Daina A*, Michielin O, Zoete V
*SIB Swiss Institute of Bioinformatics, Switzerland |
Technology and Software |
|
J06 |
Any biological system carries out its function through an elaborate interaction of a multitude of molecular components that together form complex biological networks. Visualization of these networks is of great interest in modern biology as it helps to gain insight into complex biological processes. Today, various desktop-based applications exist that are able to visualize large-scale graphs. However, the emergence of the internet as the default software platform triggered a shift towards Rich Internet Applications. These provide a rich and interactive user experience in a cross-platform manner through a standard web browser. Visualization of large-scale graphs benefits from this paradigm as web standards facilitate interactivity and scalability. We created VisualGraphX, a web-based visualization tool for large-scale graphs that follows the rich-internet paradigm and enables users to explore the data efficiently and interactively. Furthermore, it has been developed as a visualization plugin for the Galaxy platform and can be enabled through the visualizations registry. We demonstrate its general applicability using a metagenomic dataset that has been analysed with CoVennTree. |
Schäfer R*, Voß B
*University of Freiburg - Institute of Biology III, Germany |
Technology and Software |
|
J07 |
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and functional data. The UniProt website (http://www.uniprot.org), with approximately 500'000 unique visitors per month, provides the primary access point to the scientific community for exploiting this rich resource. In September 2014, we published a major redesign of the website's interface that was the result of extensive research conducted by user experience analysts and involving over 350 users worldwide with varied research backgrounds and use cases. We focused primarily on creating easier navigation, improving the visibility and usability of existing functionality and structuring annotation data better for improved findability. The home page now presents tiles for each dataset to allow quick access, links to tools and help for users. All search result pages include filters to help users narrow down their results and offer the possibility to save data for later analysis, as well as options to customize columns in the results table. Simple text queries can be seamlessly refined into advanced queries with the help of autocompletion and an improved advanced query builder. The UniProt protein entry view has been restructured to better tie together related annotations and data under new intuitive headings. We will present these enhancements and showcase some of the new features that are currently under development. |
Bansal P*, Gasteiger E, Bolleman JT, Pundir S, Watkins X, Bingley M, Martin MJ, Redaschi N, Bougueleret L, Xenarios I, Consortium U
*SIB Swiss Institute of Bioinformatics, Switzerland |
Technology and Software |
|
J08 |
Flow and mass cytometry are experimental techniques used for the characterization of cell properties at the single cell level. While flow cytometry is currently able to measure up to 16 cell markers, the recently introduced mass cytometry technology is able to measure up to 40 cell markers (Bendall et al., 2011). The Spanning Tree Progression of Density Normalized Events (SPADE) algorithm (Qiu, 2012; Qiu et al., 2011) has been proposed to analyze mass cytometry data. Briefly, SPADE is an agglomerative hierarchical clustering-based algorithm combined with a density-based down-sampling procedure to automatically identify clusters of similar cells based on their marker expression intensities. We present here two complementary R packages designed for the visualization and analysis of mass cytometry data and SPADE results. SPADEVizR is an R package dedicated to new visualization and statistical methods for SPADE results. We demonstrate that the proposed methods, such as parallel coordinates, multidimensional scaling or streamgraph representations, offer new ways to represent features, similarities and kinetics of SPADE clusters. Moreover, the proposed statistical methods allow automatic identification of SPADE clusters with high biological information and their integration with additional biological variables. Complementary to this, we have developed CytoAnnot, another R package for automatic annotation of cell populations in cytometry profiles or in SPADE clusters. Using a reference set of previously defined cell populations, CytoAnnot is able to automatically identify and label cells or SPADE clusters having similar profiles. Moreover, CytoAnnot can also be used to discover uncharacterized cell populations. These R packages are valuable for bioinformaticians and biologists aiming to automatically mine and explore results analyzed with SPADE. |
Tchitchek N*, Pejoski D, Platon L, Le Grand R, Beignon A
*CEA, France |
Technology and Software |
|
J09 |
Quantitative trait loci (QTLs) are genomic regions containing variants (SNPs, indels, CNVs) that are linked to quantitative phenotypes. The most common QTLs are eQTLs, which associate variants with gene expression. These genomic tags are particularly useful for biologists because they point to regulatory regions of genes such as promoters and enhancers. eQTLs linking variants to nearby genes are called cis-eQTLs, and long-range associations (>1Mb), which can even be on different chromosomes, are called trans-eQTLs. Existing QTL browsers struggle to provide an efficient interface that allows users to easily navigate and switch between trans- and cis-eQTL views. Here we present the SwissQT web browser prototype, which provides a user-friendly view of and easy navigation through molecular QTL data. We especially focused our efforts on displaying trans-eQTLs linking a gene to many variants on several chromosomes. Our main aim is to rapidly provide the scientific community with a reference browser for eQTLs but also for other types of molecular QTLs including chromatin state, histone mark and protein level QTLs. |
Howald C*, Masselot A, Götz L, Xenarios I, Dermitzakis ET
*University of Geneva, Switzerland |
Technology and Software |
|
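The cis/trans distinction described in abstract J09 can be illustrated with a minimal sketch. This is not SwissQT code: the `Association` record, its field names, and the use of a single gene reference coordinate (here called `gene_pos`) are assumptions made for illustration. Only the 1 Mb cutoff and the rule that associations across chromosomes count as trans come from the abstract.

```python
from dataclasses import dataclass

# 1 Mb cutoff quoted in the abstract; associations beyond it are treated as trans.
CIS_WINDOW_BP = 1_000_000


@dataclass(frozen=True)
class Association:
    """Hypothetical variant-gene association record (names are illustrative)."""
    variant_chrom: str
    variant_pos: int
    gene_chrom: str
    gene_pos: int  # assumed reference coordinate of the gene, e.g. its start


def classify_eqtl(assoc: Association, window: int = CIS_WINDOW_BP) -> str:
    """Return 'cis' for a nearby variant on the same chromosome, else 'trans'."""
    if assoc.variant_chrom != assoc.gene_chrom:
        return "trans"  # different chromosomes are always long-range
    distance = abs(assoc.variant_pos - assoc.gene_pos)
    return "cis" if distance <= window else "trans"
```

A browser view could then group the associations of one gene by this label, which is one way to switch between cis and trans displays.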
J10 |
The European project MD-Paedigree aims to develop a paediatric digital repository, where several clinical centres pursue improved interoperability of paediatric biomedical information, data and knowledge by developing together a set of reusable and adaptable models. As part of this project, we describe here the development of a case-based retrieval service to help physicians identify patients similar to their own. A set of about 25’000 cases provided by one of the clinical centres of the project was stored locally. Then, MeSH descriptors were automatically assigned to the discharge summaries. Cases were then indexed using an Apache Solr engine. Finally, a web-based interface was developed. The user inputs a query, to which MeSH descriptors are also automatically assigned, and the retrieval engine outputs similar cases. A medical expert manually evaluated the system. The top-10 answers for 40 randomly selected cases were manually assessed and precision was computed. The first case returned by the system is considered relevant in almost two thirds of the cases (P1=63%), while about half of the top-10 returned cases are judged relevant (P10=54%). However, for 8 out of 40 queries, the system found no similar case in the top-10. This tool is of particular importance in helping physicians provide more predictive, individualized, effective and safer paediatric healthcare. Indeed, retrieving cases similar to those of their patients will enable them to see the different treatments used for patients like theirs, as well as the resulting outcomes. |
Pasche E*, Gobeill J, Ruch P
*HES-SO HEG, Switzerland |
Technology and Software |
|
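The precision figures reported in abstract J10 (P1 and P10) are averages of precision at rank k over the evaluated queries. The sketch below shows one common way to compute them. The function names and the boolean judgement lists are illustrative assumptions, not the authors' evaluation code, and dividing by k even when fewer than k cases are returned is a design choice of this sketch.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k ranked results judged relevant.

    `relevance` holds one expert judgement per returned case, best rank first.
    Missing results (fewer than k returned) count as non-relevant here.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(bool(r) for r in relevance[:k]) / k


def mean_precision_at_k(judgements: Sequence[Sequence[bool]], k: int) -> float:
    """Average precision at k over all evaluated queries."""
    return sum(precision_at_k(r, k) for r in judgements) / len(judgements)


def queries_without_hit(judgements: Sequence[Sequence[bool]], k: int) -> int:
    """Number of queries with no relevant case among the top-k results."""
    return sum(1 for r in judgements if not any(r[:k]))
```

With 40 judged queries, `mean_precision_at_k(judgements, 1)` and `mean_precision_at_k(judgements, 10)` would correspond to the P1 and P10 values quoted above, and `queries_without_hit(judgements, 10)` to the count of queries with no similar case in the top-10.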
J11 |
The number of new biological papers continues to grow and the “manual” bio-curation process is time consuming. This work aims to propose a tool that fully supports the task of automatically annotating scientific documents. At present, the challenge is that many constraints hinder the steps of the traditional workflow. (1) The low usage of nomenclature standards by researchers often leads to ambiguous or inaccurate descriptions of biological objects (proteins, genes, constructs, reagents) by automated tools. (2) Ambiguous semantics are often more easily resolved by experts than by machines. (3) The information is dispersed throughout the research article. (4) Specific information is frequently “drowned” by general statements, especially in the Introduction and Discussion sections of the article, while relevant information may be found in the Methods section or the figure legends. Our application would simplify the work of annotators by highlighting articles that present novel or unannotated findings. We introduce a new pathway that turns this scattered information into a reliable output that annotators only have to validate before it is recorded. While harvesting is an important first step in the workflow, prioritization of results plays a major role. This aspect is handled by a ranking module customized for specific axes (diseases, GO, PPI, …). We are thus developing an end-user pipeline that avoids redundancy with knowledge already processed. We also want to keep the annotation of new papers fast and relevant while updating older ones. |
Mottin L*, Gobeill J, Ruch P
*Geneva School of Business Administration (HEG) & SIB Swiss Institute of Bioinformatics, Switzerland |
Technology and Software |
|
J12 |
The PoSeNoGap project focuses on the design and implementation of efficient genomic data formats and tools to enable genome analysis applications to scale to large datasets. The proposed solutions shall exploit the available parallelism of the platforms operated by PASC partners. The project is promoting investigations on current issues in genomic data representation, compression and transmission within MPEG, an ISO/IEC working group specialized in digital media representation and compression. This activity has led to the definition of a reference genomic dataset to be used to assess the performance of existing bioinformatic tools and to identify requirements for a standard representation of genome data. |
Petraglio E, Thoma Y, Alberti C, Mattavelli M, Kuznetsov D, Guex N, Iseli C, Stockinger H, Schuepbach T, Topolsky I, Zerzion D*, Xenarios I
*SIB Swiss Institute of Bioinformatics, Switzerland |
Technology and Software |
|
J13 |
Although MATLAB is a prevalent and well-supported numerical computing environment used by modelers in biology, a growing body of researchers find it discouragingly difficult to convert MATLAB ODE models into open and widely-used formats in systems biology. Of these formats, SBML (the Systems Biology Markup Language) has become the de facto standard for exchanging models between software tools in the community. We introduce MOCCASIN (Model ODE Converter for Creating Awesome SBML INteroperability), an open-source tool that uses a combination of heuristics and user assistance to convert ODE models written in MATLAB into SBML format. In order to allow for structural analyses of SBML output, MOCCASIN computes a system of reactions with the same ODE semantics by inferring well-formed reactions whenever possible. In this manner, exported SBML files are correctly written as a set of formal reactions, with well-identified reactants, products, modifiers and stoichiometries for each reaction. Current enhancements to MOCCASIN will allow for the interpretation of flow control constructs, SBML encoding of MATLAB comments, pooling of models from several MATLAB files and generation of the appropriate SED-ML (the Simulation Experiment Description Markup Language) files, which are used to encode simulation parameters essential for reproducing simulations across platforms. |
Gomez H*, Hucka M, Keating S, Iber D
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland |
Technology and Software |
|
J14 |
Transcription factor binding motifs (TFBMs) are classically represented either as consensus strings or as position-specific scoring matrices (PSSMs). Thousands of TFBMs are available in specialized databases (Jaspar, Transfac, CisBP) but they can also be discovered ab initio from genome-scale data sets (promoters of co-expressed genes, ChIP-seq peaks) using different and complementary motif discovery algorithms. For different reasons, TFBM collections usually contain groups of similar motifs: motifs bound by homologous TFs, non-homologous TFs having a similar DNA-binding domain, redundant motifs discovered using alternative algorithms, etc. In order to interpret such results, there is an increasing need for tools that highlight groups of similar motifs within motif collections. We developed matrix-clustering, a tool that clusters motifs, segments the distinct clusters into single trees, displays intra-cluster relationships with user-friendly logo trees and performs multiple alignments at each level of these trees. The tool includes flexible user-selectable parameters to highlight different levels of similarity, by combining several inter-motif similarity metrics and by combining thresholds on several of these metrics. matrix-clustering also supports the simultaneous clustering of motifs from different motif collections, which can be useful to compare full motif databases. We illustrate the potential of this tool with two case studies: (I) clustering of redundant TFBMs discovered from ChIP-seq peaks using several motif discovery algorithms and (II) inter-database comparisons of motif collections. Availability: matrix-clustering is integrated in the software suite Regulatory Sequence Analysis Tools (RSAT, http://rsat.eu/ ). It can be used via its web site or as a stand-alone application. |
Castro J*, Thomas-Chollier M, van Helden J
*Aix-Marseille Université, France |
Technology and Software |
|
J15 |
The SwissRegulon portal (www.swissregulon.unibas.ch) is a repository of databases and bioinformatics tools related to transcription regulatory processes. It includes: SwissRegulon: a database of genome-wide annotations of regulatory sites. We currently have annotations for 17 prokaryotes and 3 eukaryotes (including human and mouse) in our collection. PhyloGibbs: an algorithm for inferring regulatory motifs and regulatory sites from collections of DNA sequences, including multiple alignments of orthologous sequences from related organisms. ISMARA: the Integrated System for Motif Activity Response Analysis is a free online tool that models genome-wide expression data in terms of our genome-wide annotations of regulatory sites. TCS: a database of predicted two-component signaling interactions across bacterial genomes. |
Pachkov M*, van Nimwegen E, Balwierz P
*Swiss Institute of Bioinformatics, Biozentrum Basel, Switzerland |
Technology and Software |
|
J16 |
Multi-dimensional genomic data sets combining DNA-seq and ChIP-/RNA-seq require methods that rapidly correlate thousands of molecular phenotypes with millions of genetic variants while correctly controlling for multiple testing in order to discover quantitative trait loci (QTLs). To this end we developed FastQTL, new software that implements the most popular QTL mapping strategy in a user- and cluster-friendly manner, together with key improvements which make the permutation procedure fast and accurate. By modeling the permutation process using a beta distribution trained via maximum likelihood estimation on a small number of permutations (typically 100 to 1,000), we obtain a good approximation of the tail of the non-parametric null distribution, which allows us for the first time to accurately estimate corrected p-values of association at extremes of significance (~10^-20 and below). We performed a comprehensive evaluation of FastQTL on RNA-seq (phenotype) and DNA-seq (genotype) data generated by the two largest eQTL studies to date: Geuvadis [1] and GTEx [2]. These comprise 11 distinct data sets with 14K to 35K quantified genes and 6.8M to 10.8M variant sites defined for 83 to 373 samples. We find that our approach provides accurate estimates of small corrected p-values that cannot be feasibly achieved by standard or adaptive permutation strategies. In addition, a reanalysis of these GTEx eQTL sets can now be performed in minutes on a compute cluster, dramatically faster than other methods with no loss of power. Source code, binaries and comprehensive documentation are available at http://fastqtl.sourceforge.net/. References 1. Lappalainen et al. Nature 501, 506-511 (2013). 2. Lonsdale, J. et al. Nat. Genet. 45, 580-585 (2013). |
Delaneau O*
*University of Geneva & SIB Swiss Institute of Bioinformatics, Switzerland |
Technology and Software |
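The beta approximation described in abstract J16 can be sketched in a few lines. This is an illustration under stated assumptions, not FastQTL's implementation. The abstract specifies maximum likelihood fitting, whereas this sketch swaps in simple moment matching to stay within the Python standard library. The function names are hypothetical, and the series used for the beta CDF is only suitable for small p-values (roughly, when `b * x` is small). The idea is that the smallest nominal p-values obtained from a modest number of permutations are modeled as Beta(a, b), and the corrected p-value of the observed best association is the fitted CDF evaluated at its nominal p-value.

```python
import math
import random
import statistics


def fit_beta_moments(values):
    """Estimate Beta(a, b) shape parameters by moment matching.

    Stand-in for the maximum likelihood fit named in the abstract.
    """
    m = statistics.fmean(values)
    v = statistics.pvariance(values)
    common = m * (1.0 - m) / v - 1.0
    return m * common, (1.0 - m) * common


def beta_cdf_small_x(x, a, b, max_terms=1000):
    """Regularized incomplete beta I_x(a, b) via its power series in x.

    Uses I_x(a, b) = x^a / B(a, b) * sum_n (1-b)_n x^n / (n! (a+n)).
    Intended for small x only; for large b*x the alternating terms cancel badly.
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    log_beta = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    prefactor = math.exp(a * math.log(x) - log_beta)
    coeff = 1.0  # (1-b)_n * x^n / n! for n = 0
    total = 0.0
    for n in range(max_terms):
        term = coeff / (a + n)
        total += term
        if abs(term) <= 1e-17 * abs(total):
            break
        coeff *= (n + 1 - b) * x / (n + 1)
    return prefactor * total


def toy_adjusted_p(nominal_p, n_tests=200, n_perm=1000, seed=0):
    """Toy example: simulate null minimum p-values, fit, and adjust.

    Independent uniform p-values stand in for real permutation results.
    """
    rng = random.Random(seed)
    null_min_p = [min(rng.random() for _ in range(n_tests)) for _ in range(n_perm)]
    a, b = fit_beta_moments(null_min_p)
    return beta_cdf_small_x(nominal_p, a, b)
```

Calling `toy_adjusted_p(1e-12)` returns a corrected p-value far below 1/1000, the resolution limit of counting exceedances directly among 1,000 permutations, which is the benefit the abstract describes.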