A01 |
Scotti E*, Boué S, Hoeng J
*Philip Morris International R&D, Switzerland
Risk assessment in the context of 21st-century toxicology relies on the development of relevant computational approaches for the extraction of mechanistic knowledge from big data. Crowdsourcing is a powerful approach to solving scientific problems and independently verifying methods, results, and conclusions. Attracting relevant scientific contributions depends on the interest of the scientific questions formulated and the incentives put forward. Based on these principles, the sbv IMPROVER project has organized crowdsourced challenges covering a broad range of scientific questions since 2011. For instance, the Diagnostic Signature Challenge aimed to identify robust gene expression signatures and classification models in four disease areas. The Species Translation Challenge sought to refine our understanding of the limits of rodent models as predictors of human biology. The Systems Toxicology Challenge aimed to identify signatures predictive of smoking exposure or cessation status. In addition to benchmarking computational methods, crowdsourcing can also be leveraged for the curation of scientific literature. Our Network Verification Challenges (NVC) have been designed to encourage broad participation in the refinement of causal network models, requiring a lower level of expertise, with participation encouraged through a live leaderboard. A novel instance of the NVC will focus on the verification of biological network models involved in the xenobiotic transformation of toxicants in the liver, which should be of great interest for toxicological and pharmacological assessment. Finally, we are developing a step-wise microbiomics challenge, which should leverage the knowledge of diverse scientific communities to tackle, in turn, computational, biological, and medical topics related to the microbiome.
|
Biocuration, Databases, Ontologies, and Text Mining |
online |
A02 |
Pan C, Lin W*
*Academia Sinica, Taiwan
microRNAs (miRNAs) play important regulatory roles in cellular functions and developmental processes. They are also implicated in human oncogenesis and could serve as potential cancer biomarkers. Our laboratory has been working on the discovery of miRNAs using computational pipelines as well as NGS sequencing data. In previous studies, we established comprehensive 5p-arm and 3p-arm miRNA annotations and applied them for a thorough interrogation of arm-specific miRNA cancer expression profiles. We utilized The Cancer Genome Atlas (TCGA) miRNA expression datasets and explored the differential expression patterns of 5p-arm and 3p-arm miRNAs. Following ANOVA statistical analysis, differentially expressed 5p-arm and 3p-arm miRNAs could be identified in various cancer types, and we identified several miRNAs significantly modulated in each cancer type. While 5p-arm and 3p-arm miRNAs are known to often be expressed at different levels during maturation, a few 5p-arm/3p-arm miRNA pairs were identified as significantly modulated together in several cancer types. This implicates these miRNAs in oncogenesis and suggests that they could serve as universal human cancer biomarkers. We then established an interactive web resource to assist biologists in exploring the unique expression profiles of individual miRNAs in different cancer types. Our goal is to better visualize miRNA expression using visual analytics techniques, mainly based on the D3 JavaScript libraries. Through an advanced interactive visual user interface, our web tools allow users to explore the multidimensional miRNA expression data in TCGA. Furthermore, we also developed an iOS mobile app to visualize miRNA expression information across multiple cancer types.
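To make the statistical step concrete, the following minimal sketch (our illustration, not the authors' pipeline; the file and column names are hypothetical placeholders) runs a one-way ANOVA testing whether a single 5p-arm miRNA differs in expression across cancer types:

```python
# Illustrative only: one-way ANOVA for one 5p-arm miRNA across cancer types.
# "tcga_mirna_expression.csv" and its columns are hypothetical placeholders.
import pandas as pd
from scipy.stats import f_oneway

expr = pd.read_csv("tcga_mirna_expression.csv")  # columns: sample, cancer_type, hsa-miR-21-5p, ...
groups = [g["hsa-miR-21-5p"].to_numpy()
          for _, g in expr.groupby("cancer_type")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```
|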
Biocuration, Databases, Ontologies, and Text Mining |
|
A03 |
Akarsu-Egger H*, Falquet L
*Unifr, Biochemistry Unit & SIB Swiss Institute of Bioinformatics, Switzerland
Ensembl Bacteria (http://bacteria.ensembl.org) consists of completely sequenced genomes from eubacteria and archaea whose sequences have been deposited in the INSDC and have at least 50 CDS annotations. Assembly sets are randomly grouped into collections of up to 250 genomes. The collections of genomes are then passed to the INSDC annotation import pipeline for loading into Ensembl. Each collection is a separate MySQL database. Currently, Ensembl Bacteria contains more than 173 collections, totalling more than 40'000 genomes.
Many toxin-antitoxin (TA) operons are not fully annotated by the INSDC pipeline. The reasons include the mixed quality of the InterPro models describing the various families and the high variability among these proteins, with possibly entirely new families of toxins or antitoxins.
We present here a pipeline for mining the MySQL databases covering more than 40'000 genomes. We improve the annotation of known TA families, expand the catalogue of known TAs, and detect new TA families.
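As an illustration of the mining step, the collection databases can be enumerated as in the sketch below (host and port are our assumption based on the public Ensembl Genomes MySQL mirror documentation; the actual pipeline is more involved):

```python
# Illustrative sketch; assumes the public Ensembl Genomes MySQL mirror
# (mysql-eg-publicsql.ebi.ac.uk:4157) documented by Ensembl Genomes.
import pymysql

conn = pymysql.connect(host="mysql-eg-publicsql.ebi.ac.uk",
                       port=4157, user="anonymous")
with conn.cursor() as cur:
    # Each Ensembl Bacteria collection is a separate core database.
    cur.execute(r"SHOW DATABASES LIKE 'bacteria\_%\_collection\_core\_%'")
    collections = [row[0] for row in cur.fetchall()]
print(len(collections), "collection databases")
```
|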
Biocuration, Databases, Ontologies, and Text Mining |
|
A04 |
Famiglietti ML*, UniProt Consortium
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
We are at the dawn of a new era of personalized genomic medicine in which advances in human healthcare will be powered by the integration of data from many sources, including structured electronic patient records and data linking genomic variants to computable descriptions of functional and clinical impact. Here we describe work performed in UniProtKB/Swiss-Prot that aims to standardize the curation and provision of variant data using a range of ontologies, including VariO, GO, and ChEBI. Our focus on variants with functional impact demonstrated using biochemical assays makes UniProtKB/Swiss-Prot variant data highly complementary to that from resources which use genetic data (such as pedigree analyses or genome-wide association studies) to link variants to specific diseases, phenotypes, or traits. UniProtKB/Swiss-Prot currently provides more than 8,000 variants with curated functional impact.
Keywords: UniProtKB/Swiss-Prot, Database, Expert curation, Variants, Ontology, Genetic diseases
|
Biocuration, Databases, Ontologies, and Text Mining |
|
A05 |
Zahn M*, Gateau A, Gleizes A, Cusin I, Michel P, Bairoch A, Gaudet P, Lane L
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
As the reference knowledgebase for the Human Proteome Project (HPP), neXtProt integrates mass spectrometry (MS) data from proteomics experiments and upgrades the protein existence value to “evidence at protein level (PE1)” for entries matching the criteria agreed upon with the HPP. The latest PeptideAtlas Human and Phosphoproteome data, as well as manually curated data from MS experiments reported in the literature, are loaded into neXtProt. Manual spot checks using random examples and special cases allow systematic errors to be quickly identified. For example, when a curated MS publication refers to a new type of PTM, we spot-check that an entry retrieved by searching with this term contains the PTM and that it is found in the PEFF file for the entry. With the advent of the neXtProt RDF data model and SPARQL querying in 2015, global checks are now carried out at each neXtProt HUPO reference release. We have thus tracked the evolution of (1) the number of MS peptides mapping to entries (quantitative metric), (2) the percentage of mapped MS peptides that are proteotypic (quality metric), and (3) the percentage of entries with an MS peptide mapping (coverage metric). Our metrics indicate a gradual increase in the quality of the MS data integrated in neXtProt. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A06 |
Aimo L*, Liechti R, Hyka-Nouspikel N, Götz L, Niknejad A, Gleizes A, Kuznetsov D, David FPA, van der Goot G, Riezman H, Bougueleret L, Xenarios I, Bridge A
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
SwissLipids (www.swisslipids.org) is designed for life science researchers performing targeted and untargeted lipidomic analyses who wish to interpret their data in the light of prior knowledge of lipid structures and biology.
Lipids are a diverse group of biological molecules with fundamental roles in membrane formation, energy storage, and signaling. The lipidome of an individual cell may contain thousands of lipids whose levels are tightly regulated by cellular signaling and nutritional status. Pathologies like cancer, diabetes, cardiovascular and neurodegenerative diseases, as well as infections, alter the lipidome composition, making lipids a rich source of potential biomarkers and drug targets.
High-throughput mass spectrometry-based platforms provide a means to study lipidome composition. However, the gap between lipidomic data and prior knowledge of lipid biology limits their interpretation and integration with other 'omics data types. To facilitate this task, we developed SwissLipids, a manually curated knowledge resource for lipids and their biology. SwissLipids links mass spectrometry outputs to around 300,000 possible structural variants of over 180 lipid classes and to expert-curated knowledge of metabolism (using www.rhea-db.org), enzymes (using www.uniprot.org), lipid functions, protein interactions, and occurrence in human and model organisms. Annotations are sourced from over 1,300 peer-reviewed articles and are linked to source publications with supporting text and evidence codes. Mapping of identifiers from other resources such as LIPID MAPS and HMDB to structures and enzymatic reactions in SwissLipids is also available. SwissLipids is updated with new knowledge on a daily basis. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A07 |
Pasche E*, Mottaz A, Mottin L, Gobeill J, Teixeira D, Stockinger H, Singer F, Toussaint N, Stekhoven D, Ruch P
*HES-SO Geneva & SIB Swiss Institute of Bioinformatics, Switzerland
Personalized medicine in oncology relies on the use of treatments targeting specific genetic variants. However, identifying the variants of interest is a tedious task: a patient usually presents several thousand genetic variants. We propose here a system to automatically rank the genetic variants of a given patient based on their occurrences in the biomedical literature. Our system receives as input a patient with a diagnosis and a set of text files containing all identified mutations. Data are pre-processed to remove non-coding SNVs, which reduces the number of variants by a factor of 5. The remaining variants are then submitted to our search engine, where each variant is assigned a score. For the sake of tuning and evaluation, we use 5,702 variants from 5 different patients. This set of patients was manually curated by a tumor board: the variants are associated with a pathogenicity and a possible chemotherapy. The effectiveness of our variant ranker reaches 81% (mean reciprocal rank), with a precision at rank 5 of 62%. This means that almost two thirds of the variants ranked in the top-5 positions by our system were judged clinically relevant by the tumor board. Further, it should be noted that among the top-ranked mutations not identified as relevant, there might be some interesting variants, because the coverage of our curated benchmark is likely partial. While we are currently focusing our efforts on the ranking of genetic variants, complementary steps will include the automated processing of the literature to extract potentially beneficial treatments.
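The two reported measures can be computed as in this minimal sketch (our illustration; `ranked` is the system's ordered variant list and `relevant` the tumor-board-curated set):

```python
# Illustrative implementations of the two evaluation metrics reported above.
def reciprocal_rank(ranked, relevant):
    # 1/rank of the first clinically relevant variant; 0 if none is retrieved
    for rank, variant in enumerate(ranked, start=1):
        if variant in relevant:
            return 1.0 / rank
    return 0.0

def precision_at_k(ranked, relevant, k=5):
    # fraction of the top-k ranked variants judged relevant
    return sum(v in relevant for v in ranked[:k]) / k

# Mean reciprocal rank: average of reciprocal_rank over the 5 patients.
```
|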
Biocuration, Databases, Ontologies, and Text Mining |
|
A08 |
Mottin L*, Pasche E, Gobeill J, Teodoro D, Rech de Laval V, Gleizes A, Michel P, Bairoch A, Gaudet P, Ruch P
*HES-SO / HEG Geneva, Battelle campus & SIB Swiss Institute of Bioinformatics, Switzerland
The curation and maintenance of molecular biology databases is labour-intensive. While text mining is gaining impetus among curators, its integration into curation workflows has not yet been widely adopted. The SIB Text Mining and CALIPHO groups joined forces to design a new curation support system named nextA5, through which we explore the integration of novel triage services to support the curation of two types of biological data: PPIs and PTMs. The recognition of PPIs and PTMs poses a special challenge, as it requires not only the identification of biological entities (proteins or residues) but also that of particular relationships (e.g. binding or position). Prioritizing papers for these tasks thus requires the development of different approaches. We defined two sets of descriptors to support automatic triage (for PPIs and PTMs). All occurrences of these descriptors were marked up in MEDLINE and indexed, thus constituting a semantically annotated version of MEDLINE. These annotations were then used to estimate the relevance of a particular article with respect to the chosen annotation type. This relevance score was combined with a local vector-space search engine to generate a ranked list of PMIDs. We also evaluated a query refinement strategy, which adds specific keywords (such as “binds” or “interacts”) to the original query. Compared to PubMed, the search effectiveness of the nextA5 triage service is improved by 190% for the prioritization of papers with PPI information and by 260% for papers with PTM information. Thus, combining advanced retrieval and query refinement strategies with automatically enriched MEDLINE content effectively improves triage in complex curation tasks such as the curation of PPIs and PTMs. Prototype: http://candy.hesge.ch/nextA5 Publication: doi:10.1093/database/bax040 |
Biocuration, Databases, Ontologies, and Text Mining |
|
A09 |
Pedreira T*, Monteiro P, Teixeira M, Chaouiya C
*Instituto Gulbenkian de Ciência, Portugal
With the rise of high-throughput techniques, the amount of collected data regarding regulatory interactions in the budding yeast Saccharomyces cerevisiae has greatly increased. We review all the documented information publicly available in the YEASTRACT database and analyse structural properties of the full network and of a sub-network involved in multi-drug resistance (MDR). Here, we assess the current knowledge concerning the distribution of regulatory interactions among transcription factors and the effects of these interactions. Furthermore, for each transcription factor, we aim to identify its potential global activatory or inhibitory role, and how it might contribute to specific biological functions. Data supporting the regulatory interactions come from DNA binding and/or expression evidence experiments. We distinguish between these, which sheds light on the current understanding of the regulatory interactions exerted by each transcription factor and unveils a lack of information on a small subset of transcription factors. This analysis allows us to suggest further efforts to extend the full regulatory network of Saccharomyces cerevisiae. Finally, by considering the MDR sub-network embedded in the full network, we present a method to identify candidates potentially involved in drug resistance, a process of clinical relevance for closely related pathogenic species. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A10 |
Gao B*, Baudis M
*UZH, Institute of Molecular Life Sciences & SIB Swiss Institute of Bioinformatics, Switzerland
The Beacon project, initiated by the Global Alliance for Genomics and Health (GA4GH), is a pioneering effort to test the willingness of international data portals to share genetic data. It defines a web service that accepts queries for the existence of a single allele and responds with only “yes” or “no”. Currently, over 50 Beacons have been lit up, and they form a beacon network for integrated queries. However, the simple information a beacon service can provide limits the applications of the network. BeaconPlus extends the single simple query of the original Beacon to three types of queries: variant, structural, and metadata. It also responds with more information and implements the GA4GH schema in the backend.
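A Beacon-style allele query is sketched below with Python requests (the endpoint URL is a hypothetical placeholder; parameter names follow the GA4GH Beacon v1 draft and may differ between implementations):

```python
# Illustrative Beacon allele query; the URL is hypothetical, and parameter
# names follow the GA4GH Beacon v1 draft specification.
import requests

resp = requests.get("https://example.org/beacon/query", params={
    "referenceName": "17",
    "start": 41244981,
    "referenceBases": "G",
    "alternateBases": "A",
    "assemblyId": "GRCh37",
})
print(resp.json().get("exists"))  # the original Beacon answers only yes/no
```
|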
Biocuration, Databases, Ontologies, and Text Mining |
|
A11 |
Kumari B*, Kumar R, Kumar M
*University of Delhi South Campus, India
Palmitoylation is a post-translational modification (PTM) of eukaryotic proteins in which a lipid moiety is covalently attached to the protein. A number of studies have already established the significance of palmitoylation for the biological functions of multiple protein classes and for many human physiological disorders, including neurodegenerative diseases, cancer, and X-linked mental retardation. Palmitoylation mostly occurs on cysteine, but non-cysteine residues sometimes also get palmitoylated. Here, we present an association rule mining approach for the detection of statistically overrepresented amino acids near the palmitoylation site. For palmitoylation involving glycine, the association pattern <Met,–1><Cys,1><Leu,2><Gly,3><Asn,4><Ser,5><Lys,6> was detected at support levels of 5%–50%. Near the serine residues undergoing palmitoylation, we identified 6 unique association rules: (I) <Glu,–7><Cys,6><Lys,–5><Cys,–4><His,–3><Gly,–2><Gly,1><Ser,2><Cys,3><Thr,7><Cys,8><Trp,9>; (II) <Cys,–6><Lys,–5><Cys,–4><His,–3><Gly,–2><Val,–1><Gly,1><Ser,2><Cys,3><Thr,7><Cys,8><Trp,9>; (III) <Cys,–6><Lys,–5><Cys,–4><His,–3><Gly,–2><Gly,1><Ser,2><Cys,3><Thr,7><Cys,8><Trp,9>; (IV) <Cys,–6><Cys,–4><His,–3><Gly,–2><Gly,1><Ser,2><Cys,3><Thr,7><Cys,8><Trp,9>; (V) <Gly,1><Ser,2><Cys,3><Thr,7><Cys,8><Trp,9>; and (VI) <Gly,1><Ser,2><Cys,8><Trp,9>. Surprisingly, for cysteine palmitoylation the maximal associations comprised only 2 residues: <Leu,–3><Gly,–1>; <Gly,–1><Ser,3>; <Leu,–5><Leu,–3>; <Leu,–4><Leu,–3>. Our data demonstrate that an abundance of glycine, leucine, serine, and cysteine favors palmitoylation. Among the types of palmitoylated residues, in-silico prediction methods are available only for cysteine. Therefore, we developed a prediction method, RAREPalm, to find potential glycine and serine residues on which palmitoylation may occur. RAREPalm is based on a support vector machine and is accessible in the form of a web server and software (http://proteininformatics.org/mkumar/rarepalm).
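For reference, the support of a residue-position pattern is simply the fraction of aligned sequence windows that contain it; a minimal sketch (our illustration, not the RAREPalm implementation) follows:

```python
# Illustrative support computation for residue-position patterns.
# Each window maps relative positions (0 = modified residue) to residues.
def support(pattern, windows):
    hits = sum(all(w.get(pos) == aa for aa, pos in pattern) for w in windows)
    return hits / len(windows)

windows = [{-1: "Met", 1: "Cys", 2: "Leu"},   # toy aligned windows
           {-1: "Met", 1: "Cys", 2: "Ala"}]
print(support([("Met", -1), ("Cys", 1)], windows))  # -> 1.0 (found in both)
```
|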
Biocuration, Databases, Ontologies, and Text Mining |
|
A12 |
Teodoro D*, Mottin L, Pasche E, Gobeill J, Neomi Arighi C, Ruch P
*HES-SO / HEG Geneva, Battelle campus & SIB Swiss Institute of Bioinformatics, Switzerland
Advances in the biomedical sciences are increasingly dependent on knowledge encoded in curated biomedical databases. In particular, the Universal Protein Resource (UniProt) provides the scientific community with a comprehensive, high-quality, and accurately annotated protein sequence knowledgebase. Currently, in UniProtKB the classification of the computationally mapped bibliography is based on the underlying sources. This approach is limited, as it relies on each individual source providing the classification, so a more systematic classification approach is needed. We investigate the use of automated classifiers based on Doc2Vec and a Multilayer Perceptron (MLP) to help UniProt systematically classify the scientific biomedical literature according to 11 UniProt categories: Expression, Family & Domains, Function, Interaction, Names, Pathology & Biotech, PTM/processing, Sequences, Structure, Subcellular location, and Miscellaneous. Using a collection of 200,000 documents, we compare the deep learning approach to several machine learning classification methods: Naïve Bayes, kNN, random forest, and logistic regression. The baseline model, based on Naïve Bayes, reached a mean precision of 0.7474 (F1 score of 0.6618). Apart from the kNN model (precision of 0.7374), all the other models outperformed the baseline method. Random forest achieved a precision of 0.8032 (F1 score of 0.7015), logistic regression a precision of 0.8110 (F1 score of 0.7521), and the MLP a precision of 0.8292 (F1 score of 0.7980). We believe that such an approach could be used to systematically categorize the computationally mapped bibliography, which represents a significant share of the publications in UniProt, and help improve productivity in certain biocuration tasks.
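A minimal sketch of such a pipeline using gensim and scikit-learn (our reconstruction of the general shape, not the authors' code; `texts` and `labels` are placeholder training data):

```python
# Illustrative Doc2Vec + MLP pipeline; `texts` and `labels` are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.neural_network import MLPClassifier

docs = [TaggedDocument(words=t.lower().split(), tags=[i])
        for i, t in enumerate(texts)]
d2v = Doc2Vec(docs, vector_size=200, min_count=2, epochs=20)
X = [d2v.dv[i] for i in range(len(docs))]   # document embeddings (gensim >= 4)
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300).fit(X, labels)
```
|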
Biocuration, Databases, Ontologies, and Text Mining |
|
A13 |
Huang Q*, Baudis M
*UZH, Institute of Molecular Life Sciences & SIB Swiss Institute of Bioinformatics, Switzerland
Malignant neoplasias are based on the accumulation of mutations in cells during the lifetime of an individual (“somatic mutations”), which can be influenced by inherited (“germline”) genome variations. As tumor types and incidences differ among human populations, the genetic background of individuals could be one factor influencing somatic variation and subsequent tumorigenesis. In recent years, a large number of cancer genome studies have been published, encompassing thousands of tumor series analyzed by various genome screening techniques. However, most studies have focused on individual tumor types and have been limited to the genomic backgrounds of a few human populations. So far, a systematic analysis and integration of the multiple available data sources is lacking. In this project, we perform a meta-analysis of the curated oncogenomic data from the arrayMap database, derived from various types of genomic arrays, and combine genomic profiles with epidemiological data to evaluate the population specificity of genome variations in cancer. From sequencing data of 26 populations worldwide from the 1000 Genomes Project, we extract the SNP markers corresponding to Affymetrix platforms and use them for subsequent sample analysis. First, we show that, using admixture analysis, population classification is accurate even from low-resolution arrays (10k markers). This will append genome-derived population information to the Progenetix database as an additional layer beyond the geographic location of the publication-affiliated institute. As a next step, we will link different types of chromosomal aberrations (e.g. CN-LOH) to the identified population groups to discover potential population-specific oncogenic patterns. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A14 |
Carrio Cordo P*, Baudis M
*UZH, Institute of Molecular Life Sciences & SIB Swiss Institute of Bioinformatics, Switzerland
Screening for somatic mutations in cancer has become integral to diagnostic and target evaluation for personalized therapeutic approaches. arrayMap is a curated oncogenomic resource, focusing on copy number aberration (CNA) profiles derived from genomic arrays. The information has been processed from data accessed through NCBI’s Gene Expression Omnibus (GEO), EBI’s ArrayExpress, and, importantly, through targeted mining of publication data. Whereas this database contains raw probe data sets, the parental project, Progenetix, allows for genome variant analysis from additional sources and serves as metadata reference.
arrayMap underwent improvements to facilitate the meta-analysis of cancer-related genome data and clinical use. Recently, we have expanded the scope and depth of the arrayMap database. In a systematic mining of genomic array data from NCBI's GEO, we obtained an additional ~22'000 data sets potentially related to somatic mutations in cancer (cancer specimens or associated reference profiles). We also expanded the publication database of original cancer genome profiling studies to now more than 3'000 individual articles. The resulting comprehensive resource consisting of Progenetix and arrayMap contains information for more than 400 ICD-O entities and 63'000 genomic array profiles.
Moreover, the new hierarchical representation of the data into individuals, biosamples, and experiments, and its association to metadata, allows for further meta-analysis at different knowledge levels. Interestingly, under an epistemological paradigm, our data collections reflect knowledge gaps in the cancer genome research landscape and highlight geographic biases, which can guide the direction of future studies. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A15 |
Lombardot T*, Morgat A, Axelsen K, Aimo L, Niknejad A, Nouspik N, Ignatchenko A, Coudert E, Redaschi N, Bougueleret L, Xenarios I, Bridge A
*Unige, CMU & SIB Swiss Institute of Bioinformatics, Switzerland
Rhea (http://www.rhea-db.org) is an expert-curated resource of biochemical reactions designed for the annotation of enzymes and of genome-scale metabolic networks and models. Rhea uses the ChEBI (Chemical Entities of Biological Interest) ontology of small molecules to precisely describe reaction participants and their chemical structures. All reactions are balanced for mass and charge and are linked to source literature and other functional vocabularies such as the Enzyme Classification of the IUBMB. The latest release of Rhea includes data on over 10,000 reactions (curated from a similar number of publications), which can be browsed and searched interactively (by chemical names, structures, identifiers, and more), accessed through (RESTful) web services, and downloaded (in RD/RXN, CMLReact, and BioPAX formats) under the terms of a Creative Commons CC-BY license.
Rhea provides reaction data for a number of ELIXIR core resources (such as ChEBI) and deposition databases (such as MetaboLights), and many other resources including the SwissLipids knowledgebase for lipid biology (http://www.swisslipids.org) and the metabolic modelling platform MetaNetX (http://www.metanetx.org).
Here we describe recent and forthcoming developments in Rhea. These include a reaction classification that leverages the chemical structure ontology of ChEBI and an internal RDF/SPARQL pipeline to complement and extend the Enzyme Classification of the IUBMB, as well as the forthcoming use of Rhea as an annotation vocabulary for UniProt. Together these developments will enhance the utility of Rhea for the annotation of enzymes and metabolic networks and for the integrated analysis of metabolomics and other datasets. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A16 |
Argoud-Puy G*, IMGT®, the international ImMunoGeneTics information system®, UniProt Consortium
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
Here we describe a collaboration between UniProt and IMGT®, the international ImMunoGeneTics information system®, that aims to provide representative sequences for functional immunoglobulin (IG) genes of the human reference genome through UniProtKB/Swiss-Prot. Immunoglobulin heavy and light chains are created during B cell maturation by DNA rearrangements in the IG multigene loci of variable (V), diversity (D), and joining (J) genes, resulting in V-(D)-J rearranged genes which are then spliced to constant (C) genes, producing an enormous number of potential sequence combinations. In UniProtKB/Swiss-Prot we have curated representative peptides corresponding to the functional germline D- and J-region genes as well as the products of all functional V- and C-region genes. This set of 141 UniProtKB/Swiss-Prot entries is identical to the reference genome (GRCh38), uses official nomenclature from IMGT/GENE-DB (approved by HGNC and endorsed by NCBI Gene and the IUIS Nomenclature Subcommittee for Immunoglobulins and T cell receptors), and is directly linked to the IMGT® resource, which provides a comprehensive human IG genomic repertoire of 927 known alleles from 462 functional and non-functional genes, along with a wealth of additional information concerning immunoglobulins (antibodies).
|
Biocuration, Databases, Ontologies, and Text Mining |
|
A17 |
Ilmjärv S*, Bolleman J, Liechti R, Xenarios I, Krause K
*Unige, Department of Pathology and Immunology & SIB Swiss Institute of Bioinformatics, Switzerland
Toxicological characterization of chemical compounds is important to ensure the protection of human health. A compound's dose-dependent effects in a complex system can be measured with various methods, ranging from high-throughput omics techniques to targeted bioassays that measure a single output. Presently, there is no simple way to access all in vitro compound data in a quick and synoptic manner. Data are fragmented across many different resources, and interested parties need to invest considerable time and effort to develop the expertise required to navigate these systems efficiently. To this end, we have developed Toxgram, a web-based interface that serves as a single access point to compound data. It takes advantage of standardized ontologies and Resource Description Framework technologies to integrate expert-curated activity data as well as experimental metadata from omics experiments retrieved from multiple resources. This information is displayed in a user-friendly aggregated view that provides a systematic overview of the concentrations used and the experiments performed with the compound of interest in different in vitro systems. Overall, Toxgram will allow researchers to focus on data analysis and interpretation instead of collection and curation. It will also contribute to public health by allowing faster and better identification and management of compound data. Toxgram can be accessed at toxgram.vital-it.ch. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A18 |
Pedruzzi I*, Rivoire C, Auchincloss AH, Coudert E, Keller G, Masson P, de Castro E, Baratin D, Cuche BA, Bougueleret L, Poux S, Redaschi N, Xenarios I, Bridge A
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
HAMAP (High-quality Automated and Manual Annotation of Proteins) is an automatic system for the classification and functional annotation of protein (and whole-proteome) sequences. HAMAP classifies protein sequences into families using expert-curated profiles and annotates them using rules which specify relevant annotations (such as protein names, function, domains and sequence features, GO terms, and keywords) and the conditions under which they apply (such as the presence of specific functional residues). HAMAP is designed to achieve high specificity while avoiding overannotation, a major source of database errors. HAMAP rules use UniProtKB/Swiss-Prot data as a template for the definition and annotation of families; originally developed to support the curation of microbial proteins, the scope and content of HAMAP have been continually extended to cover the eukaryotic and, lately, also viral protein families of UniProtKB/Swiss-Prot. HAMAP data and tools are made freely available at http://hamap.expasy.org and are also incorporated into InterPro and the UniRule pipeline of UniProt, providing annotation of UniProtKB/Swiss-Prot quality for over 8 million unreviewed protein sequences in UniProtKB/TrEMBL. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A19 |
Genick U*
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland
The MIDATA Health Data Cooperative (midata.coop) seeks to unlock the value of personal data for research and personalized health by putting individuals in control of their own data. With smartphones, in-home diagnostic instruments, fitness trackers, etc., each citizen generates a wealth of data every day, much of it with medical relevance. Aggregating these data with information about medical payments, accidents, and hospital and doctor's visits would give a very comprehensive and detailed view of a person's health and the factors affecting it. On a population scale, such a data set would be an invaluable source for research, public policy making, and personalized health. Currently, this aggregation cannot take place, both for technological and, rightfully so, for legal reasons. The Swiss-based MIDATA Health Data Cooperative is based on the realization that only individuals themselves have the right to aggregate and use all the data that exist about them. The MIDATA cooperative has developed a data- and consent-management platform and an organizational structure that allow individual citizens to aggregate all (health) data that exist about them. The data are stored in encrypted form, and only the patient has the key. This allows individuals to use their data for their own benefit and gives them fine-grained control over what data (if any) they share, with whom (e.g. researchers, doctors, family), and for what purpose. The MIDATA cooperative is now operational; the first pilot projects have been completed and several more are about to launch. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A20 |
Gasteiger E*, UniProt Consortium
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
The Universal Protein Resource (UniProt) is a comprehensive resource for protein sequence and functional data. The centerpiece of UniProt is the Knowledgebase (UniProtKB), which is composed of the expert-curated UniProtKB/Swiss-Prot section and its automatically annotated complement, UniProtKB/TrEMBL. Swiss-Prot contains over 550,000 sequence entries that combine manually verified sequences with experimental evidence derived from biochemical and genetic analyses, 3D structures, mutagenesis experiments, and information about protein interactions and post-translational modifications. TrEMBL provides a further 88 million sequences that have been largely derived from high-throughput sequencing of DNA and are annotated by our rule-based automatic annotation systems. UniProt contains data for about 60,000 species with completely sequenced genomes, organized into "proteomes".
UniProtKB is complemented by the UniProt Reference Clusters (UniRef) databases that cluster protein sequences at different levels of sequence identity to speed up sequence similarity searches, and the UniProt Archive (UniParc) which provides a complete set of known sequences, including historical obsolete sequences.
All these databases are available on the UniProt website at http://www.uniprot.org, where they can be browsed and queried seamlessly, both interactively and programmatically. The website was designed using a user-centric approach and also includes services such as similarity search (BLAST), multiple sequence alignment, identifier mapping, and exact peptide search.
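As a sketch of programmatic access (the query and column parameters below follow the uniprot.org REST interface of that era; treat the example as illustrative):

```python
# Illustrative programmatic query against the uniprot.org REST interface.
import requests

params = {"query": "gene:BRCA1 AND reviewed:yes AND organism:9606",
          "format": "tab", "columns": "id,entry name,length"}
r = requests.get("https://www.uniprot.org/uniprot/", params=params)
print(r.text.splitlines()[:3])   # header plus first matching entries
```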
At this conference, we also present separate posters about the curation of immunoglobulins and variants in UniProtKB.
|
Biocuration, Databases, Ontologies, and Text Mining |
|
A21 |
Albrecht F*, List M, Bock C, Lengauer T
*Max Planck Institute for Informatics, Germany
While large amounts of epigenomic data are publicly available, their retrieval in a form suitable for downstream analysis is a bottleneck in current research. In a typical analysis, users are required to download huge files that span the entire genome, even if they are only interested in a small subset (e.g. promoter regions) or an aggregation thereof. Moreover, complex operations on genome-level data are not always feasible on a local computer due to resource limitations.
The DeepBlue Epigenomic Data Server mitigates this issue by providing a powerful interface and API for filtering, transforming, aggregating, and downloading data from several epigenomic consortia, making it an ideal resource for bioinformaticians who seek to integrate up-to-date epigenomics resources into their workflows.
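For instance, a first contact with the server might look like the following sketch (the XML-RPC entry point and the anonymous user key are assumptions based on the DeepBlue documentation; command names may differ between versions):

```python
# Illustrative first call to the DeepBlue XML-RPC API; entry point and
# anonymous key are assumptions based on the DeepBlue documentation.
import xmlrpc.client

server = xmlrpc.client.ServerProxy("http://deepblue.mpi-inf.mpg.de/xmlrpc",
                                   allow_none=True)
status, genomes = server.list_genomes("anonymous_key")
print(status, genomes[:3])   # DeepBlue replies with (status, payload) pairs
```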
We present two projects that utilize the DeepBlue API to enable users not proficient in scripting or programming languages to analyze epigenomic data in a user-friendly way: (i) an R/Bioconductor package that integrates DeepBlue into the R analysis workflow, where the extracted data are automatically converted to suitable R data structures for downstream analysis and visualization within the Bioconductor framework; and (ii) a web interface that enables users to search, select, filter, and download the epigenomic data available in DeepBlue.
DeepBlue was well received by the International Human Epigenome Consortium and has already attracted much attention from the epigenomic research community, with currently 90 registered users and more than a million anonymous data requests since its release in 2015. The web interface and the API documentation, including usage examples and use cases, are available at http://deepblue.mpi-inf.mpg.de/. The DeepBlueR package is available at http://deepblue.mpi-inf.mpg.de/R. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A22 |
Komljenovic A, Roux J, Wollbrett J*, Bastian FB, Robinson-Rechavi M
*Unil, Department of Ecology and Evolution & SIB, Switzerland
Bgee is a database for retrieving and comparing gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. Data are integrated and made comparable between species thanks to dedicated curation and ontology tools. Bgee currently includes 29 animal species and is available at http://bgee.org/. We will present BgeeDB, an R package for using Bgee. It includes a collection of functions to import the normalized and annotated expression data into R. BgeeDB facilitates downstream analyses, such as gene expression analyses with other Bioconductor packages. Moreover, BgeeDB includes a new gene set enrichment test for the preferred localization of gene expression in anatomical structures (“TopAnat”). The novelty of TopAnat is that the terms tested come from the Uberon anatomical ontology and that all associations between genes and ontology terms are experimentally supported. Along with the classical Gene Ontology enrichment test, this test provides a complementary way to interpret gene lists.
Package: http://www.bioconductor.org/packages/BgeeDB/ Reference: Komljenovic A, Roux J, Robinson-Rechavi M and Bastian FB. BgeeDB, an R package for retrieval of curated expression datasets and for gene list expression localization enrichment tests [version 1; referees: 1 approved, 1 approved with reservations, 1 not approved]. F1000Research 2016, 5:2748. Highlights: BgeeDB allows easy access to reference data, both to analyze new questions and to integrate into analyses with personal data, e.g., reference healthy expression data from the same tissue as tumor data, or consistent data from several organisms. |
Biocuration, Databases, Ontologies, and Text Mining |
online |
A23 |
Rech de Laval V*, Wollbrett J, Niknejad A, Moretti S, Echchiki A, Roux J, Bastian FB, Robinson-Rechavi M
*SIB & UNIL & UNIGE, Switzerland
Bgee is a database to retrieve and compare gene expression patterns in multiple animal species, produced from multiple data types (RNA-Seq, Affymetrix, in situ hybridization, and EST data). It is based exclusively on curated healthy wild-type expression data (e.g., no gene knock-out, no treatment, no disease), to provide a comparable reference of normal gene expression. We present the Bgee 14 update, which notably includes: - curation of the very large GTEx experiment (re-annotation of 10k samples as "healthy" or not); - data annotated and integrated for 29 species, of which 12 are new to this release, with a focus on mammals and flies. All data are integrated and made comparable between species thanks to calls of presence/absence of expression and of differential over-/under-expression, integrated along with information on gene orthology and on homology between organs. As a result of this integration, Bgee is capable of detecting the preferred conditions of expression of any single gene, accommodating any data type and species. These condition rankings are highly specific, even for broadly expressed genes. Bgee also provides a new type of gene list enrichment analysis tool, TopAnat, capable of detecting the preferred conditions of expression of a list of genes. Bgee is available at http://bgee.org/
|
Biocuration, Databases, Ontologies, and Text Mining |
online |
A24 |
Furrer L*, Rinaldi F
*Institute of Computational Linguistics, University of Zurich & SIB Swiss Institute of Bioinformatics, Switzerland
The “biomedical annotation metaserver” (BeCalm) shared task explored the prospects of an inter-university cloud connecting various expert systems for biomedical named entity recognition. Within the challenge, the “technical interoperability and performance of annotation servers” (TIPS) task focused on technical aspects. Participants were asked to provide an online annotation service that ran permanently for an evaluation period of two months. Participating systems were polled with requests for documents that had to be processed on the fly. The systems were evaluated in terms of performance (speed, annotation volume) and reliability (server up-time).
We participated in the BeCalm TIPS task with an annotation service built on top of our existing entity recognition system. The annotation server is a web application tailored to the needs of the task, using the OntoGene/BioMeXT biomedical entity recognition suite as a software library. The core module uses a knowledge-based strategy for term matching and entity linking. The server’s architecture allows parallel processing of annotation requests for an arbitrary number of documents from mixed sources.
We obtained the best results in 4 of 6 metrics (single best for “average response time” and “mean time per document volume”, shared first place for “mean time between failures” and “mean time to repair”), showing that our tool is both very fast and stable. With additional experiments and an internal evaluation based on the server's log files, we determined that the greatest part of the response time is due to network latency, whereas the actual processing time accounts for an almost negligible fraction.
|
Biocuration, Databases, Ontologies, and Text Mining |
|
A25 |
Rinaldi F*
*Swiss Institute of Bioinformatics and Institute of Computational Linguistics, University of Zurich, Switzerland
Faced with decreasing funding, life science databases struggle to keep pace with the constantly increasing amount of published results. Traditional approaches based on careful human review of published papers guarantee a high quality of database entries and cannot easily be replaced by automated technologies, but they are slow and not cost-effective. However, NLP technologies that can support this process have been applied to the scientific literature for many years. In the life science domain, several community-organized evaluation campaigns carried out in the past few years have shown a steady improvement in results. Nevertheless, there is still widespread skepticism about the possibility of using such tools in a curation pipeline.
We distinguish three levels of adoption of text mining technologies for curation purposes:
- Digitally assisted curation: human curation with the support of a specifically developed software environment
- Semi-automated curation: curation with the support of a text mining tool, which provides candidates for the annotation, to be validated by a human expert.
- Automated curation: annotation candidates automatically extracted by text-mining tools
In all cases, it is essential that the underlying text mining tools are embedded in a suitable software environment with an ergonomic user interface, so that they can serve as an effective aid to the curation process. We argue that although text mining tools on their own would not be easily usable in a curation pipeline, their integration into a supportive environment can lead to a remarkable increase in the efficiency of the curation process. |
Biocuration, Databases, Ontologies, and Text Mining |
|
A26 |
Sidiropoulos K*, Viteri G, Sevilla C, Jupe S, D’Eustachio P, Stein L, Ping P, Hermjakob H, Fabregat A
*EMBL-EBI, United Kingdom
Reactome (http://reactome.org) is a free, open-source, curated and peer-reviewed knowledge base of biomolecular pathways. Pathways in Reactome are organized hierarchically, grouping related detailed pathways (e.g. Translation, Protein folding, and Post-translational modification) into larger domains of biological function like Metabolism of proteins. While we provide a hierarchical pathway browser as a key element of the Reactome web interface, the relationships and connectivity between high-level pathways were previously not represented well. In addition, options for re-use of the manually laid out low-level pathway diagrams were limited, as they were only downloadable as PNG images. Following intensive user experience testing by external users, we implemented a series of major visual enhancements to make Reactome more interactive and user-friendly: 1: In the detailed pathway diagrams, sub-pathways are now visually highlighted through shaded boxes. 2: Detailed pathway diagrams are now downloadable as PowerPoint™ slides, with pathway elements rendered as connected PowerPoint™ objects, allowing scientists to edit, modify, and re-use them to present their own pathway-related research results in presentations and publications. 3: The relationships between high-level nodes in the Reactome hierarchy, for example between Adaptive Immune System, Innate Immune System, and Cytokine Signalling in Immune System, are now visualised through textbook-style diagrams developed by a professional illustrator. However, these diagrams are not static PNG images but dynamic SVG graphics, allowing fast zooming and navigation, clicking to link to sub-pathways, as well as overlay of aggregated pathway analysis results. Both the diagrams and their graphic components are open data and are released as a reusable library for biomolecular visualisation to the scientific community.
|
Biocuration, Databases, Ontologies, and Text Mining |
|
B01 |
Stricker G*, Engelhardt A, Schulz D, Schmid M, Tresch A, Gagneur J
*Technical University of Munich, Germany
Chromatin immunoprecipitation followed by deep sequencing (ChIP-Seq) is a widely used approach to study protein–DNA interactions. Often, the quantities of interest are the differential occupancies relative to controls, between genetic backgrounds, treatments, or combinations thereof. However, current methods for differential occupancy analysis of ChIP-Seq data rely on binning or sliding-window techniques, for which the choice of window and bin sizes is subjective. Here, we present GenoGAM (Genome-wide Generalized Additive Model), which brings the well-established and flexible generalized additive models framework to genomic applications using a data parallelism strategy. We model ChIP-Seq read count frequencies as products of smooth functions along chromosomes. Smoothing parameters are objectively estimated from the data by cross-validation, eliminating the ad hoc binning and windowing needed by current approaches. GenoGAM provides base-level and region-level significance testing for full factorial designs. Application to a ChIP-Seq dataset in yeast showed increased sensitivity over existing differential occupancy methods while controlling the type I error rate. By analyzing a set of DNA methylation data and illustrating an extension to a peak caller, we further demonstrate the potential of GenoGAM as a generic statistical modeling tool for genome-wide assays.
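Schematically, for a two-condition comparison the model class described here can be written as follows (notation ours, a sketch rather than the exact GenoGAM parameterization):

$$ y_i \sim \mathrm{NB}(\mu_i, \theta), \qquad \log \mu_i = o_i + f_{\mathrm{ref}}(x_i) + z_i \, f_{\mathrm{diff}}(x_i), $$

where $y_i$ is the read count at genomic position $x_i$, $o_i$ is a sequencing-depth offset, $z_i \in \{0,1\}$ encodes the condition, and the smooth functions $f$ are penalized splines whose smoothing parameters are chosen by cross-validation; $f_{\mathrm{diff}}$ then directly represents the log differential occupancy along the chromosome.
|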
Computational biology driving experimental design |
|
B02 |
Baar T*
*University Clinic of Cologne, Germany
Our work aims to understand the connection between the reproductive success of Alu elements and their sequence characteristics. With an average length of 300 nucleotides and about one million copies, Alu elements are the most abundant retrotransposable elements in the human genome. In order to replicate, Alu elements are transcribed into RNA, which folds into a secondary structure and attaches itself to the ribosome's exit tunnel. Once a LINE1 retrotransposon is translated, the Alu RNA hijacks the LINE1 retrotransposition mechanism to be reinserted into the genome. While Alu elements contain a bipartite RNA polymerase III (Pol III) promoter, it is under debate whether they are Pol III or Pol II transcripts. For example, we do not observe a significant correlation between Pol III binding to the DNA and Alu transcription. Using a multiple alignment of all Alu sequences and a generalized linear model (and, alternatively, a random forest), we relate Alu sequence variation to transcription activity. We apply metabolic RNA labeling to measure Alu transcription and degradation rates under standard and Pol II-inhibiting conditions. We find strong evidence that Alu transcription is not a side product of Pol II gene transcription. Further, Alu transcripts appear to be surprisingly stable, with an average half-life exceeding one hour. Lastly, we identify several loci in the Alu consensus sequence that are particularly relevant to Alu transcription. |
Computational biology driving experimental design |
online |
B03 |
Österle S*, Widmer L, Mustafa H, Berinpanathan N, Roberts TM, Stelling J, Panke S
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland
Small peptide tags are commonly used tools for protein purification and immunodetection, or as recognition sequences for proteases. Usually, tags are placed at the N- or C-terminus, but some applications require internal placement. Because an internal tag is more likely to interfere with protein function by disrupting its structure, the insertion site needs to be chosen wisely. Here we present GapMiner, a pipeline to predict internal tagging sites in proteins in silico; we refer to the propensity of a site to tolerate a peptide insertion, while keeping the protein functional and presenting the tag in an accessible way, as the “taggability” of this site. To predict taggability we chose four features: length and sequence variability among homologs at the site of potential insertion, preservation of the protein's secondary structure after insertion, and relative surface accessibility. For an accurate computation of length variability, we developed an insert-length-resolving profile hidden Markov model of protein clusters that models the empirical distribution of insert lengths at each residue. We trained a balanced random forest classifier on these four features and on (non-)permissive labels derived from literature data and database annotation in UniProtKB/Swiss-Prot. By testing at most the top three sites predicted by GapMiner, we could insert a Strep-tag and a protease hydrolysis tag into five essential proteins of Escherichia coli. We believe that GapMiner can accelerate the design of internal tagging sites and reduce the number of sites that need to be experimentally tested, and in this way can accelerate the engineering of proteins for applications in molecular biology.
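A minimal sketch of the classification step (our illustration, not GapMiner itself; scikit-learn's class weighting approximates a balanced random forest, and the feature values are placeholders):

```python
# Illustrative class-weighted random forest over the four described features:
# length variability, sequence variability, secondary-structure preservation,
# relative surface accessibility (values below are placeholders).
from sklearn.ensemble import RandomForestClassifier

X = [[0.8, 0.6, 0.9, 0.7],
     [0.1, 0.2, 0.3, 0.2]]
y = [1, 0]   # 1 = permissive (taggable) site, 0 = non-permissive
clf = RandomForestClassifier(n_estimators=500,
                             class_weight="balanced_subsample").fit(X, y)
print(clf.predict_proba([[0.5, 0.5, 0.8, 0.6]]))
```
|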
Computational biology driving experimental design |
|
B04 |
Modamio Chamarro J*, Zâgare A, Sompairac N, Danielsdottir A, Noronha A, Preciat G, Ghaderi S, Wiltgen L, Krecke M, Merten D, Roy A, Rouquaya M, Garcia B, Fiscioni L, Puente A, Rodriges M, Prendergast M, Thiele I, Fleming R
*Luxembourg Centre for Systems Biomedicine (LCSB), Luxembourg
Mitochondria are the main energy producers in cells; therefore, mitochondrial dysfunction is commonly associated with a wide range of diseases. In cells with high energetic demands, mitochondria play an essential role in meeting these demands. In some diseases with dysfunctional mitochondria, such as Parkinson's disease, mutations in signalling proteins are tightly linked to the deregulation of metabolic pathways and, hence, of energy production.
Here, we present the MitoMap, a manually drawn mitochondrial map combining several types of molecular interactions. On the one hand, it compiles all mitochondrial reactions from the latest version of the human metabolic reconstruction, ReconX, accounting for ~1000 mitochondrial reactions. On the other hand, the map also includes manually curated molecular interactions, some of which were extracted from the latest version of the PDmap (http://minerva.uni.lu/MapViewer/). Among a wide range of functionalities, the multi-scale visualisation of the MitoMap allows the amount of information displayed to be adapted to the zoom level.
Computational predictions from the Constraint-Based Reconstruction and Analysis Toolbox (COBRA Toolbox) can also be plotted onto the map, allowing the visualisation of steady-state fluxes through the different mitochondrial reactions.
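As an illustration of such a prediction, a flux balance analysis can be run with cobrapy, the Python counterpart of the MATLAB COBRA Toolbox (the SBML file name below is a hypothetical placeholder):

```python
# Illustrative flux balance analysis with cobrapy (Python counterpart of the
# MATLAB COBRA Toolbox); the SBML file name is a hypothetical placeholder.
import cobra

model = cobra.io.read_sbml_model("recon_mito_subnetwork.xml")
solution = model.optimize()          # steady-state flux distribution
print(solution.objective_value)      # value of the model's objective
print(solution.fluxes.head())        # per-reaction fluxes, usable as map overlay
```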
Ongoing work, in the context of the SysMedPD project (http://sysmedpd.eu/) is aimed at computational prediction of mitochondrial targets to slow the progression of neurodegeneration in the subset of Parkinson's disease patients with overt mitochondrial dysfunction. For illustration, here we present an example of the outputs obtained from Parkinson's disease computational models with dysfunctional mitochondria versus control. |
Computational biology driving experimental design |
online |
B05 |
Zeidler S*, Wingender E
*University Medical Center Göttingen, Germany
The regeneration of cardiomyocytes in humans is limited to a maximum of 50% renewal per lifetime, or less than 1% per year, resulting in an accumulation of cardiovascular diseases in the elderly. With the goal of replacing damaged heart muscle and establishing a well-defined basis for investigating cardiogenesis and cardiac diseases, a reliable human organoid model is required. Currently, animal models and monolayer (2D) models are frequently used, but these models lack several functional aspects, such as a high force of contraction. In the last few years, a few 3D models have become available that cope with such issues, but the regulatory background of the 3D models is not yet completely understood. In our study, we compared a 2D model with a 3D model at the RNA-seq level to identify processes and master regulators promoting cardiogenesis. Further, we determined protocol-specific and general processes that interfere with cardiogenesis. The results could be used in the future to regulate the identified processes in more detail and to improve the quality of the generated organoids. |
Computational biology driving experimental design |
|
B06 |
Rafiqi UN*, Gul I, Nasrullah N, Saifi M, Dash P, Abdin MZ
*Centre for transgenic plant development/Jamia Hamdard, India
Background: The content of the antimalarial drug artemisinin in Artemisia annua L. is relatively low compared to the demand for artemisinin-based malaria treatment. One of the best approaches to increase artemisinin production is metabolic engineering. Aim: The present study represents the first effort to clone and characterize the E-β-farnesene synthase (BFS) and E-β-caryophyllene synthase (BCS) enzymes using a computational approach, and further to check the effects on artemisinin content of down-regulating these genes. Methodology: Full gene sequencing and a detailed in-silico analysis were performed to comprehend the functional and structural properties of these enzymes. Unique sequences were taken to design RNAi constructs, and transformants were analysed for artemisinin content. Results and conclusion: The deduced amino acid sequences of both enzymes possess two important and highly conserved aspartate-rich motifs, DDxxD and NSE/DTE, and lack an N-terminal signal peptide. Using the PSIPRED server, secondary structure analysis revealed that BCS contains 68% α-helices, 14.05% β-sheets, and 15.36% random coils, whereas BFS contains 40.21% α-helices, 25.48% β-sheets, and 34.32% random coils. In the predicted three-dimensional models of the BCS and BFS proteins, we found that 77.7% and 77.4% of amino acid residues, respectively, were in favoured regions. Using the alignment as input, four different structural models were generated by the I-TASSER server. Both structures were validated using Ramachandran plots. After transformation, we observed a significant increase in artemisinin content compared to untransformed plants; further results are underway. A thorough analysis of these two candidate genes involved in terpene biosynthesis revealed several interesting aspects of their sequences, brought novel information about their structures and substrate binding sites, and paves the way to essential insights concerning terpene biosynthesis and its regulation in the production of artemisinin.
|
Computational biology driving experimental design |
|
B07 |
Hajseyed Nasrollah ZS*, Tresch A, Fröhlich H
*Institute of Medical Statistics and Computational Biology, Germany
Data-based learning of the topology of molecular networks, e.g. via Dynamic Bayesian Networks (DBNs), has a long tradition in bioinformatics. In this context, the majority of methods take gene expression as a proxy for protein expression, which is problematic in principle. Further, most methods rely on observational data, which complicates the aim of causal network reconstruction. Nested Effects Models (NEMs; Markowetz et al., 2005) have been proposed to overcome some of these issues by distinguishing between a latent (i.e. unobservable) signaling network structure and observable transcriptional downstream effects, in order to model targeted interventions of the network.
The goal of this project is to develop a more principled and flexible approach for learning the topology of a dynamical system that is only observable through transcriptional responses to combinatorial perturbations applied to the system. More specifically, we focus on the situation in which the latent dynamical system (i.e. the signaling network) can be described as a network of binary state variables with logistic activation functions. We show how candidate networks can be scored efficiently in this case and how topology learning can be performed via Markov chain Monte Carlo (MCMC). In future work, we plan to extend our method to incorporate multi-omics data and apply it to patient samples to identify disease-related networks.
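One way to write such dynamics (notation ours, a sketch rather than the authors' exact parameterization): each latent node $x_i$ is binary, and its activation probability is logistic in the states of its regulators,

$$ P\big(x_i^{(t+1)} = 1 \mid x^{(t)}\big) = \sigma\Big(b_i + \sum_{j \in \mathrm{pa}(i)} w_{ij}\, x_j^{(t)}\Big), \qquad \sigma(u) = \frac{1}{1 + e^{-u}}, $$

where $\mathrm{pa}(i)$ denotes the parents of node $i$ in a candidate topology, $w_{ij}$ are edge weights, and $b_i$ is a basal activity; a targeted perturbation clamps the state of the perturbed node. The quantity sampled over by MCMC would then be the likelihood of the observed transcriptional effects under each candidate topology.
|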
Computational biology driving experimental design |
|
B08 |
Salentin S*, Adasme MF, Heinrich JC, Schroeder M
*Technische Universität Dresden, Germany
Drug resistance is an important open problem in cancer treatment. In recent years, the heat shock protein Hsp27 (HSPB1) was identified as a key player driving resistance development. Hsp27 is overexpressed in many cancer types and influences cellular processes such as apoptosis, DNA repair, recombination, and the formation of metastases. As a result, cancer cells are able to suppress apoptosis and develop resistance to cytostatic drugs.
To identify Hsp27 inhibitors we follow a novel structure-based drug repositioning approach. We exploit a similarity between a predicted Hsp27 binding site and a viral thymidine kinase to generate lead inhibitors and repositioning candidates for Hsp27. We characterise the binding of a known inhibitor using the interaction patterns computed by our tool PLIP, and exploit this knowledge to assess better binders.
Several compounds, among them six leads, were verified experimentally. They bind Hsp27, down-regulate its chaperone activity, and inhibit the development of drug resistance in cellular assays. In summary, we make two important contributions: first, we put forward novel leads, which inhibit Hsp27 and tackle drug resistance; second, we demonstrate the power of structure-based drug repositioning. The identified compounds will now undergo preclinical studies. |
Computational biology driving experimental design |
|
B09 |
De Oliveira L*
*LANE, Departement of Genetics and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
We introduce a model for the mass transfer of molecular activators and inhibitors in two media separated by an interface, and study its interaction with the deformations exhibited by the two-layer skin tissue where they occur. The mathematical model results in a system of nonlinear advection-diffusion-reaction equations including mass cross-diffusion, coupled with an interface elasticity problem. We propose a Galerkin method for the discretisation of the governing equations, involving a suitable Newton linearisation, partitioned techniques, a non-overlapping Schwarz alternating scheme, and high-order adaptive time-stepping algorithms. The accuracy and robustness of the proposed partitioned numerical methods are assessed through numerical experiments, and illustrative tests in 2D and 3D are provided to exemplify the coupling effects between the mechanical properties and the reaction-diffusion interactions involving the two separate layers. |
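For orientation, a generic form of such a coupled system (notation ours, not taken from the study) for an activator u and an inhibitor v with cross-diffusion and an advective velocity w could read:

    \begin{align*}
      \partial_t u + w \cdot \nabla u &= \nabla \cdot \left( d_u \nabla u + d_{uv} \nabla v \right) + f(u,v),\\
      \partial_t v + w \cdot \nabla v &= \nabla \cdot \left( d_{vu} \nabla u + d_v \nabla v \right) + g(u,v),
    \end{align*}

with transmission conditions across the interface and coupling to the elasticity problem through the deformation of the two layers.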
Computational biology driving experimental design |
|
B10 |
Campos Martin R*
*Max Planck Institute for Plant Breeding, Germany
Post-translational modifications (PTMs) of histones are highly conserved among organisms and affect chromatin structure and eukaryotic transcription. Among the PTMs, lysine methylation of histone H3 has been extensively studied. Specifically, trimethylation of lysine 4 (H3K4me3) in promoter regions, lysine 36 (H3K36me3) within open reading frames (ORFs), and lysine 79 (H3K79me3) throughout ORFs have been linked to active transcription.
However, little is known about proteins with potential domains to bind such modifications and their function in gene regulation. A detailed study of the potential binders and their interactions with the methylated histones will extend our knowledge about gene regulation from an epigenetic perspective.
In this project, we fitted a Generalized Additive Model (GAM) to ChIP-seq data from three histone methylations (H3K4me3, H3K36me3, and H3K79me3), their respective methyltransferases (Set1, Set2, and Dot1), and six proteins that contain binding domains for the previously mentioned PTMs (Asr1, Set4, Pdp3, Nto1, Rad9, and Ioc4). The signals extracted by the GAM were analyzed with a bidirectional Hidden Markov Model to segment the genome into discrete states. We reveal known and novel relations between readers and PTMs. Further, the Viterbi paths assigned to the genes were clustered hierarchically using Hamming distance. The results hint at a relationship between histone modifiers and readers and “memory gene looping”, a mechanism that maintains the transcription of active genes. |
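The final clustering step can be sketched as follows (toy data; the real inputs are the per-gene Viterbi state paths from the bidirectional HMM):

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import pdist

    # Toy stand-in: 50 genes, each a path of 100 discrete HMM states.
    paths = np.random.default_rng(1).integers(0, 4, size=(50, 100))

    # Pairwise Hamming distances between state paths, then hierarchical clustering.
    d = pdist(paths, metric="hamming")
    tree = linkage(d, method="average")
    clusters = fcluster(tree, t=5, criterion="maxclust")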
Computational biology driving experimental design |
|
B11 |
Leite D*, Peña-Reyes C, Brochet X, Resch G, Que Y
*School of Engineering and Management Vaud, Switzerland
The emergence and rapid dissemination of antibiotic resistance worldwide hinders medical progress and threatens a return to the pre-antibiotic era, renewing interest in phage therapy. This therapy uses viruses (phages) that specifically infect and kill bacteria during their life cycle to reduce or eliminate the bacterial load. However, as phages are highly strain-specific, the challenge is to find suitable matches to a bacterium within a fully-characterized phage library. Currently, scientists perform phage selection by means of infection tests that may take several days of lab work. We address this challenge by combining genomic feature extraction and machine-learning predictive modelling.
To this end, we created a dataset containing more than 1000 known phage-bacteria interactions with their genomes, based on public data from the GenBank and phagesdb.org databases. From these genomes we extracted features, including the distribution of protein-protein interaction scores and the proteins' amino-acid frequencies and chemical composition, to build a quantitative dataset on which to train our predictive machine-learning models.
Our approach attains, on average, performance values of around 90% in terms of F-measure, accuracy, specificity, and sensitivity. In addition, these results are obtained in much less time than the corresponding in-vitro experiments. These promising results encourage us to investigate further features to extract as well as additional predictive models (e.g., a weighted ensemble-learning voting system). We will also enlarge our phage-bacteria interaction database so as to increase its predictive value and enable prediction at the bacterial strain level. |
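The abstract does not name the specific learners, so as a sketch only: a cross-validated classifier over the kind of quantitative features described above might look like this (random placeholder data; the random forest is our arbitrary choice):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_validate

    # Placeholder feature matrix: one row per phage-bacterium pair
    # (interaction-score distributions, amino-acid frequencies, etc.).
    rng = np.random.default_rng(2)
    X = rng.normal(size=(1000, 40))
    y = rng.integers(0, 2, size=1000)  # 1 = phage infects the bacterium

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    scores = cross_validate(clf, X, y, cv=5, scoring=["f1", "accuracy", "recall"])
    print({k: v.mean() for k, v in scores.items() if k.startswith("test_")})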
Computational biology driving experimental design |
online |
B12 |
Pachkov M*, Balwierz P, Arnold P, Gruber A, Zavolan M, van Nimwegen E
*Unibas, Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland
Understanding the key players and interactions in the regulatory networks that control gene expression and chromatin state across different cell types and tissues remains one of the central challenges in systems biology. Our laboratory has pioneered a number of methods for automatically inferring core gene regulatory networks directly from high-throughput data by modeling gene expression and chromatin state measurements in terms of genome-wide computational predictions of regulatory sites for hundreds of transcription factors and micro-RNAs (PMID:19377474, PMID:24515121). These methods have now been completely automated in an integrated webserver, called ISMARA (ismara.unibas.ch), that allows researchers to analyze their own data by simply uploading raw datasets, and provides results in an integrated web interface as well as in downloadable flat form. For any dataset, ISMARA infers the key regulators in the system, their activities across the input samples, the genes and pathways they target, and the core interactions between the regulators.
More recently, we similarly developed CRUNCH, a completely automated system for ChIP-seq analysis, which provides a rigorous standardization of all steps in ChIP-seq analysis, from quality control to read mapping, fragment length estimation and peak identification, and includes novel procedures for identifying complementary sets of regulatory motifs that jointly explain the binding data (doi:10.1101/042903).
We believe that, by empowering experimental researchers to apply cutting-edge computational systems biology tools to their data in a completely automated manner, ISMARA and CRUNCH can play an important role in developing our understanding of regulatory networks across metazoans. |
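In outline (our notation), the motif activity response analysis underlying ISMARA models the expression E_{ps} of promoter p in sample s in terms of the predicted number of binding sites N_{pm} for motif m and unknown motif activities A_{ms}, which are inferred by regularized linear regression:

    \[
      E_{ps} \approx c_s + b_p + \sum_{m} N_{pm} A_{ms}
    \]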
Computational biology driving experimental design |
|
B13 |
Katzir I*, Cokol M, Aldridge B, Alon U
*Weizmann Institute of Science, Israel
Finding potent multi-drug combinations against cancer and bacterial infections is a pressing therapeutic challenge; however, screening all combinations is difficult because the number of experiments grows exponentially with the number of drugs and doses. To address this, we recently developed a mathematical model which predicts the effects of three or more antibiotics or anti-cancer drugs at all doses based only on measurements of drug pairs at a few doses, without need for mechanistic information. The model provides accurate predictions on previous data for up to four antibiotic combinations, and on experiments on the response matrix of three cancer drugs at eight doses per drug. To further test the model beyond four drugs and for clinically relevant pathogens, we performed experiments on drug combinations at multiple doses in two organisms: E. coli and M. tuberculosis. We measured all 45 pair combinations of ten drugs in E. coli and M. tuberculosis, and tested predictions for combinations of three to five drugs. We find that the dose model works well in both E. coli and M. tuberculosis. We also use the model to find new synergistic combinations of three to five drugs for M. tuberculosis.
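For context only, the simplest null model for combinations, Bliss independence, multiplies single-drug survivals; the dose model above improves on such baselines by using pair measurements. A two-line sketch of that baseline:

    # Bliss-independence baseline (NOT the authors' dose model): the predicted
    # fractional survival of a combination is the product of single-drug survivals.
    def bliss(survivals):
        out = 1.0
        for s in survivals:
            out *= s
        return out

    print(bliss([0.8, 0.5, 0.6]))  # ~0.24 (toy numbers)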
|
Computational biology driving experimental design |
online |
B14 |
Bahai A*, McHardy A
*Helmholtz Centre for Infection Research, India
B-cell epitopes are the specific sites on antigens that can bind to antibodies and are important antigenic determinants. The identification and characterization of B-cell epitopes is of great importance to immunologists, facilitating the design of peptide-based vaccines, the development of new immunodiagnostic tests, and antibody production. Various computational methods for epitope prediction (COBEpro, BCPred, DiscoTope, ElliPro, etc.) have been developed in recent years, but their predictive performance is far from ideal, and epitope prediction remains a challenging task in immunology.
In this study, a database of known viral epitopes of influenza, HCV, HIV and measles virus was compiled, and a separate set of epitopes binding to broadly neutralizing antibodies (bNAbs) was collated as well. Several amino-acid propensity scales (hydrophilicity, hydrophobicity, flexibility, etc.) were combined with the amino-acid composition and structural properties (relative accessible surface area, depth, pKa, etc.) of these epitopes to create a set of 581 features. We then trained various classification models on these features to develop a machine-learning model that best distinguishes epitopes from non-epitopes. Principal component analysis was used for feature selection, and a support vector machine with an RBF kernel was finally selected for classification. The model was validated with five-fold cross-validation and benchmarked against existing methods. We present our preliminary results here. In the future, this machine-learning model will be extended to classify epitopes binding to bNAbs, which could facilitate the design of therapeutic vaccines against these viruses. |
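A minimal sketch of the described pipeline (PCA followed by an RBF-kernel SVM with five-fold cross-validation); the data here are random placeholders for the 581-feature matrix, and the scaling step is our addition:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 581))   # placeholder propensity/structure features
    y = rng.integers(0, 2, size=400)  # 1 = epitope, 0 = non-epitope

    model = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel="rbf"))
    print(cross_val_score(model, X, y, cv=5).mean())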
Computational biology driving experimental design |
|
B15 |
Crespo I*, Doucey M, Xenarios I, Coukos G
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
For decades, cancer research has focused on understanding the neoplastic transformation of normal cells into cancerous ones from a cell-centric perspective. However, it is increasingly evident that the surrounding tumor microenvironment (TME) is equally important for tumor growth, progression and dissemination. The TME, though, is a complex system of interacting elements strongly intertwined with the normal processes of the surrounding host tissue. Cancerous cells, different types of infiltrating immune cells and resident tissue cells interact with each other and with extracellular matrix components, rendering tumor growth and progression, as well as the underlying antitumor immune response, very difficult to anticipate. Mathematical and computational models may help in describing, explaining and predicting cancer in a new generation of experimental design assisted by computer simulations. Here we describe an experimental and computational platform, developed in collaboration between Vital-IT-SIB and LICR in Lausanne, to model the TME and predict the response to treatments in specific patients, and to identify potential targets for novel checkpoint blockade therapies. The platform is based on the integration of experimental (ex vivo) information and mechanistic knowledge from both unperturbed (untreated) and perturbed (treated) tumors, in an attempt to simulate the interplay between cancer-driving mechanisms at baseline and the mechanisms of response to single and combined checkpoint blockades. The platform is currently being applied to investigate novel immunotherapies in lung, colorectal and ovarian cancer. |
Computational biology driving experimental design |
online |
B16 |
Betz A*, Zupanic A, Stelling J
*EAWAG/ETH, Switzerland
Before a chemical is released onto the market, companies have to provide information on its toxicity. Possible synergistic and antagonistic effects with other chemicals are not taken into account, owing to the high number of experiments that would be required. Therefore, interest has emerged in methods that predict the joint effect of two or more chemicals based on single-chemical experiments. The mixture toxicity models available today are simple and their accuracy is low. We want to improve upon these methods by introducing metabolic modelling, in the form of Flux Balance Analysis (FBA), to ecotoxicology. FBA is a linear programming framework that allows the calculation of equilibrium metabolic reaction fluxes on a genome scale, based on the stoichiometry of the metabolic network and nutrient uptake rates. We want to predict mixture effects on growth by integrating gene expression as constraints into the FBA and then combining the effects that the chemicals have on different metabolic pathways. In the first phase of our research we are assessing the accuracy of FBA for quantitative toxicity predictions of single-chemical exposure. Here, we present predictions of growth reduction in the green alga C. reinhardtii upon exposure to silver ions and three herbicides. |
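For reference, the standard FBA linear program: given the stoichiometric matrix S, steady-state fluxes v satisfy S v = 0 within bounds (into which nutrient uptake rates and, here, expression-derived constraints enter), and a growth objective c^T v is maximized:

    \begin{align*}
      \max_{v} \;& c^{T} v \\
      \text{s.t. } \;& S v = 0,\\
      & v_{\min} \le v \le v_{\max}
    \end{align*}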
Computational biology driving experimental design |
|
B17 |
Yu J*, Bodak M, Salinas D, Ngondo P, Wischnewski H, Ciaudo C
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland
RNA interference (RNAi) effector proteins are essential for mouse early development. Canonical and non-canonical functions of these proteins have recently been described in several biological systems, but the independent or combinatorial contribution of each protein has not been assessed in a uniform context. In order to understand how individual RNAi genes influence the transcriptome of mouse embryonic stem cells (mESCs), we profiled RNAi-deficient mESCs from the same genetic background using RNA, small RNA and RNA immunoprecipitation (RIP) sequencing. Surprisingly, among all expressed genes, only a small proportion (1.3%) was commonly misregulated upon loss of RNAi genes. Integrative analysis identified a novel microprocessor-independent miRNA and revealed a specific group of genes escaping direct miRNA regulation. To systematically dissect the primary and secondary effects of RNAi genes, we implemented a computational approach combining multi-omics data, which reconstructed a regulatory network for the genes of interest. We found that Sirt6, an H3K27ac deacetylase, could mediate the secondary effects of miRNA loss. We further extended this network approach to transposable elements (TEs) and identified Gata1 as a potential regulator. Finally, we developed a web service populated with multi-omics data to facilitate data-driven hypothesis generation. |
Computational biology driving experimental design |
|
B18 |
Mandal M, Mjelle R, Chawla K*, Tica V, Dima S, Popescu I, Sætrom P
*Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology (NTNU), Norway
Hepatocellular carcinoma (HCC) is one of the deadliest and most common types of cancer, with very limited therapeutic options available. Liver cancer diagnosis and treatment options currently depend on tumor size and staging, so methods for robust patient stratification are important. MicroRNAs are known to play a role in the regulation of central oncogenic and anti-apoptotic liver cancer signaling pathways, and the expression levels of specific miRNAs have been found to correlate with clinicopathological parameters and treatment responses in liver cancer patients. Despite these advances, robust markers for the diagnosis and prognosis of HCC patients are still lacking. In this study we developed a classifier for HCC based on miRNA expression in tumor tissue samples that is robust to the batch effects commonly present in such data. We used two approaches: first, a classification model based on the paired structure of the data; second, a method that uses unsupervised analyses to identify differences between batches and linear modeling to correct for them. Both strategies relied on support vector machines (SVMs), and we used both cross-validation and an independent test set to measure the performance of our classification strategies. |
Computational biology driving experimental design |
|
D01 |
Echchiki A*, Roux J, Robinson-Rechavi M
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
Alternative splicing contributes to transcriptome diversity in eukaryotes and is thought to be a major driver of phenotypic diversity. Surveying the patterns of alternative splicing and understanding its regulation is an essential step towards assessing its biological role in genome evolution and its functional consequences. To date, we still lack detailed insight into alternative splicing patterns, even in model species, due to technical limitations of the established sequencing platforms. Nowadays, transcriptome profiling and analysis are mostly done using second-generation sequencing technologies, which provide short-read data. The major limitation of these technologies is that assembly pipelines are necessary to infer transcript identity. Third-generation sequencing technologies, providing long-read data, have recently caught the attention of the community for RNA analysis. Their major interest is the potential to provide reads as long as the input transcript, bypassing the reconstruction step; long-read sequencing technologies thus promise new insights into alternative splicing. In this work, we sequenced the Drosophila melanogaster transcriptome using three sequencing platforms: Illumina HiSeq, providing short reads, and PacBio RS II and Oxford Nanopore MinION, both providing long reads. We provide a benchmark of short- and long-read technologies for the study of alternative splicing, through an analysis of their agreement in recovering isoform diversity within a given RNA sample. |
Emerging applications of sequencing |
online |
D02 |
Singer F*, Grob L, Irmisch A, Levesque MP, Toussaint N, Stekhoven D
*ETHZ, NEXUS Personalized Health Technologies & SIB Swiss Institute of Bioinformatics, Switzerland
High-throughput genomics has changed the way biomedical research is performed. The transition from directed testing of a few specific targets to the analysis of comprehensive high-throughput data offers tremendous possibilities, particularly for the diagnosis of patients with rare diseases, for tumors lacking known targetable mutations, or for patients for whom routine diagnostic and treatment paradigms have failed. Despite this great potential, the use of high-throughput techniques to expand standard diagnostics is not well established in the clinic. Establishing high-throughput molecular diagnostics for clinical use requires specific protocols accounting for stringent quality control, privacy issues, and thorough process documentation. To this end, we, a group of bioinformaticians, statisticians, and cancer biologists, have collaborated to develop a workflow for the molecular profiling of matched tumor and normal samples to improve clinical decision support. In order to gain a more comprehensive understanding of the tumor, we have recently begun to also include transcriptomic data in the analysis. Using publicly available transcriptome data as a reference, this will allow us to assess over- or underexpression of genes of interest and to complement genome measurements with gene regulatory data. In addition to the identification of somatic variants, expression changes, and gene fusions, our workflow links the detected alterations to possible treatment options. The analysis results are summarized in a concise and clearly structured clinical report designed to form the basis for discussions in a clinical molecular tumor board. Here, we showcase the designed workflow on dermatology samples. |
Emerging applications of sequencing |
|
D03 |
Hatje K*, Badi L, Berntenis N, Friedensohn S, Hoflack J, Schmucki R, Sturm G, Wells I, Zhang J
*F. Hoffmann-La Roche Ltd, Switzerland
Reference expression data from different tissues or cell types enable the characterization of any transcriptomic dataset with respect to tissue composition. We used GTEx as a reference atlas for gene expression in human tissues to generate tissue marker gene signatures. These signatures enabled us to understand the sources of biological variability and to identify mislabeled or contaminated tissue samples in several expression datasets, including samples of mixed or unknown cell composition. The tissue signatures were validated using both a 10-fold cross-validation on the GTEx sample collection itself and the FANTOM5 tissue atlas as an independent reference dataset. To complement the gene-level signatures, we computed tissue-specificity statistics at the exon and junction level, which highlighted tissue-specific splicing events. The tissue-specificity annotations of exons and junctions have been condensed into GFF3 (general feature format) tracks for visualization in commonly used genome viewers.
Tissue specificity of gene expression is relevant not only to better understanding human biology, but also to translating findings from related model species. During drug development, tissue expression patterns across species are essential to plan and understand preclinical toxicology studies. We therefore generated an mRNA and miRNA expression tissue atlas of cynomolgus (Macaca fascicularis), minipig (Sus scrofa), rat (Rattus norvegicus) and zebrafish (Danio rerio). Tissue-enriched mRNAs and miRNAs were identified for every species, and a clustering analysis was performed to test whether mRNA and miRNA expression profiles are more tissue-specific than species-specific. Several miRNAs so far absent from public repositories were identified in all species. |
Emerging applications of sequencing |
|
D04 |
Arpat B*, Liechti LA, Gatfield D
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
During translation, ribosomes traverse the linear mRNA template at non-uniform speeds. Conceivably, at sufficiently low speeds of translation, newly arriving ribosomes can stack behind a pausing ribosome, increasing the local density of ribosomes. Such transient pausing of ribosomes potentially represents an important regulatory mechanism, as it could affect a variety of processes, such as folding of the nascent polypeptide, the efficiency of protein biosynthesis, and ribosomal frameshifting. Because few techniques provide genome-wide, high-resolution data on ribosome pausing and stacking, understanding their causes and regulatory effects has been challenging.
Here, we report on disome profiling, a variant of ribosome profiling, as a new approach to study translational pausing. A disome is formed by two adjacent ribosomes on an mRNA that sterically exclude nucleases and hence protect approximately a 60 nt stretch of the transcript. By transcriptome-wide sequencing of disome footprints, we were able to map exact locations of stacked ribosomes. We validated our approach by demonstrating that disome footprint densities correlated with the presence of signal peptides that are recognized by the signal recognition particle, binding of which has been known to induce an 'elongation arrest'. We applied established techniques from information theory and machine learning to different features of disome footprints, such as their size and location, to identify specific regulatory factors of translational pauses, including transcript structure, codon usage, tRNA abundance and nascent peptide sequence. Latest results on the application of disome profiling in understanding the kinetics of translation will be presented. |
Emerging applications of sequencing |
|
D05 |
Ullate Agote A*, Milinkovitch MC, Tzika AC
*LANE, Departement of Genetics and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
Very few genomic resources are available for reptiles, considering the impressive variety of phenotypes present in the more than 10,000 species of this clade. Hence, additional genome sequencing projects are vital to improve our understanding of their evolution, development and diversification. We promote the corn snake (Pantherophis guttatus), an oviparous snake with a wide range of colour and colour-pattern morphs, as an appropriate model species for evolutionary developmental studies in squamates. Besides our developmental work and colour-trait mapping studies, we have also published a draft genome of this species (Ullate-Agote et al., 2014). Since then, we have worked on a higher-quality assembly that combines: (i) 250 bp Illumina paired-end reads for contig assembly with DISCOVAR de novo, (ii) multiple mate-pair libraries of different fragment sizes for initial scaffolding, (iii) transcriptomic data to improve gene connectivity, and (iv) BioNano genome maps to generate megabase-long super-scaffolds and to correct mis-assemblies and gap-length estimations. We built a final genome assembly of 1.94 Gbp, close to the expected genome size of 1.85 Gbp. With an N50 of 1.38 Mbp (L50 = 279 sequences), we achieved a more than 300-fold improvement over the first version, making the corn snake genome one of the highest-quality snake genomes available. We are now integrating 10x Genomics linked reads to differentiate haplotypes and facilitate the assembly of a more contiguous genome. Combined with our ongoing genome annotation and comparative transcriptomic projects (made available at www.reptilomics.org), this genome should prove useful for the genomics, transcriptomics, proteomics and herpetology communities alike. |
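As a reminder of the metric quoted above, the N50 is the length x such that sequences of length ≥ x cover at least half of the total assembly; a small self-contained implementation:

    def n50(lengths):
        """N50: the length x such that sequences of length >= x
        cover at least half of the total assembly."""
        lengths = sorted(lengths, reverse=True)
        half, running = sum(lengths) / 2.0, 0
        for x in lengths:
            running += x
            if running >= half:
                return x

    print(n50([1_380_000, 900_000, 500_000, 120_000]))  # toy scaffold set -> 900000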
Emerging applications of sequencing |
|
D06 |
Brito F*, Cordey S, Kaiser L, Preynat-Seauve O, Zdobnov E
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
Shotgun metagenomic sequencing gives us an in-depth overview of the microbial genomes in a patient sample (bacteria, viruses, archaea, etc.) all in one go. Contrary to routine assays, which only test for the presence of specific organisms, metagenomic sequencing is an open approach that characterizes the whole content of a sample, making it suitable for the detection of previously uncharacterized emergent infectious diseases. In order to assess the safety of platelet concentrates for transfusion, we shotgun-sequenced total RNA and DNA from 10 platelet pools (30 donors each) and used our metagenomics analysis pipeline, ezVIR, to detect viruses based on a comprehensive and curated database of clinically relevant pathogens. Unlike our recently published metagenomic analysis of red blood cells and plasma from donors, where we found a case of an overlooked pathogen in a donor (Astrovirus MLB2, recently associated with cases of meningitis in immunocompromised patients), we did not find any clinically relevant viruses in the platelet concentrates, which are often transfused to immunocompromised patients. Although these results could be affected by a loss of sensitivity on pooled data, we found several expected commensal viruses serving as positive controls (Anellovirus, Pegivirus, human Papillomavirus and Merkel cell Polyomavirus), which confirms the quality of these libraries. We also identified several expected false-positive results of different origins (same-lane cross-talk, reagent contaminants and ambiguous reads). Our current findings suggest that the donor pools are safe, presenting only viruses that are not a major risk to patients. |
Emerging applications of sequencing |
|
D07 |
Rands CMD*, Starikova E, Zdobnov E
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
The spread of antibiotic resistance (AR) and virulent pathogen strains are major global public health issues. Horizontal gene transfer of AR and virulence genes can occur by several mechanisms, including via phages (bacterial viruses).
We scanned 1,302 human gut metagenomes and metaviromes, in addition to 2,090 phage whole genome sequences, to look for examples of where phages have carried bacterial AR or virulence genes. To achieve this, we developed catalogs of profile Hidden Markov Models (HMMs) with model-specific thresholds annotated for AR and virulence function, and a sliding window approach to identify clusters of phage genes annotated via HMMs.
We identify and characterise several good candidates for possible AR gene mobilisation by phages in the human gut microbiome, including efflux pumps and genes conferring resistance to tetracyclines and beta-lactams. Otherwise, we find that AR genes are rarely co-located with phages. We are able to annotate known virulence genes, such as the Shiga toxin and Panton–Valentine leukocidin operons, in phage whole-genome sequences, and we predict a small number of possible virulence genes, including effector proteins, in human gut metaviromes.
Previous pioneering studies have searched for phages linked to AR and virulence genes, but the availability of more metagenomic data and annotations, combined with our novel methods, allowed us to conduct arguably the most comprehensive search yet. The rare cases we identify where phages may mobilize AR or virulence genes in the human gut are worthy of further investigation. |
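The sliding-window idea can be sketched as follows (window size and threshold are hypothetical; the real inputs are per-gene HMM annotations along a contig):

    # Flag windows in which most genes carry phage annotations from the HMM scan.
    def phage_windows(is_phage_gene, window=8, min_hits=6):
        hits = []
        for i in range(len(is_phage_gene) - window + 1):
            if sum(is_phage_gene[i:i + window]) >= min_hits:
                hits.append((i, i + window))
        return hits

    genes = [0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    print(phage_windows(genes))  # overlapping candidate phage regions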
Emerging applications of sequencing |
|
D08 |
Aguilar Bultet L*, Nicholson P, Rychener L, Dreyer M, Gözel B, Origgi F, Oevermann A, Frey J, Falquet L
*Unibe, Institute of Veterinary Bacteriology & SIB Swiss Institute of Bioinformatics, Switzerland
Listeria monocytogenes (LM) is a foodborne pathogen which affects ruminants and humans. There are two main phylogenetic LM lineages, I and II. Generally, clinical cases are caused by lineage I, while most environmental and food isolates belong to lineage II, and little is known about why lineage I is more virulent. In order to find characteristics that distinguish lineage I from lineage II, we compared 225 strains from the two lineages at the whole-genome level. We showed that isolates of the same lineage are closely related, with more than 99% DNA identity within lineage II and more than 99.5% within lineage I, whereas the two lineages differ from each other by 5.7%. A new approach based on RPKM values (reads per kilobase per million mapped reads) along the whole genome was developed to identify genes predominantly present in lineage I and absent from lineage II. A group of genes with potential virulence functions was identified exclusively in lineage I strains, which are mostly rhombencephalitis isolates from ruminants. The variations between the lineages are due not only to differences in gene content, but also to single nucleotide polymorphisms. A characteristic difference that separates all strains of lineage I from those of lineage II is located in the regulatory 5’UTR region of the prfA gene, a central regulator of the main virulence factors, suggesting a possible role in the regulation of the virulence genes under its control. These differences could be significant during the in vivo infection process. |
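RPKM, in its standard definition, is computed as below, where C is the number of reads mapped to a gene, N the total number of mapped reads, and L the gene length in base pairs:

    \[
      \mathrm{RPKM} = \frac{10^{9} \, C}{N \, L}
    \]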
Emerging applications of sequencing |
|
E01 |
Kanitz R*, Slater R
*Syngenta, Switzerland
Controlling pest populations is key to sustainable agricultural practice. To this end, pesticides are critical elements of crop management, even on ‘organic’ farms. Like any biological entities, pests are subject to evolution and will eventually adapt to the selective pressures posed by pesticides. Keeping these adaptations at bay is the main goal of resistance management efforts, which align society, governments, academia and industry. One preemptive method that can be applied to delay resistance is to diffuse the selective pressure onto different loci by mixing or alternating active substances; multiple (mutational) steps would then be required for an organism to become resistant to the treatment to which it is exposed. Curiously, in the domain of economic entomology, insecticide mixtures have been widely disregarded in favor of alternations. Here we revisit this principle and investigate the relative performance of mixtures and alternations using an in silico approach. Using a short life-cycle rice pest (Chilo suppressalis) as a model species, we ran individual-based simulations of farming practices over up to 50 years, explicitly accounting for environmental variables and the biology of both crop and pest. Our results suggest that mixtures are considerably better at delaying resistance than alternations when the pest species has a short life cycle. That is probably because alternations often expose different cohorts to different single active substances, allowing more survival of partially adapted individuals. Mixtures, on the other hand, tend to eliminate non-adapted and partially adapted individuals in practically the same way.
|
Evolution and Phylogeny |
|
E02 |
Bilgin Sonay T*
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
Short tandem repeats (STRs) are stretches of repetitive DNA elements that cover nearly 1% of the human genome. Their periodic structure induces DNA polymerase slippage, resulting in a high rate of mutations that add or delete repeat elements. STRs can show remarkable variation between individuals, hence they are ideal candidates for differentiating between individuals of the same or closely related species. This extreme polymorphism also posed significant challenges for the sequencing and genotyping of STRs, which meant that they were largely discarded from large-scale analyses of genetic variation. Advances in long-read sequencing technologies and more reliable genotyping algorithms are helping researchers to overcome these challenges and reintroduce STRs into genome-wide studies.
Studies show that many STRs are located within the promoters or enhancers of genes. Such STRs are highly variable and contribute to variation in gene expression. In our previous studies, we have shown that STRs in gene promoters can enhance gene expression divergence both over evolutionary time, hence between species, and throughout cancer development. The importance of the latter finding stems also from its potential use in target identification for immunotherapy. Recently, we have found that STRs can help explain some local adaptations in great apes. Given their greater capacity to mutate, STRs are well able to challenge SNPs when it comes to studying recent evolution. Finally, we also examine possible mechanisms by which STRs may participate in gene regulation through transcription factor binding. |
Evolution and Phylogeny |
|
E03 |
Dib L*, Salamin N, Gfeller D
*Ludwig Centre for Cancer Research, UNIL & SIB Swiss Institute of Bioinformatics, Switzerland
Major histocompatibility complex (MHC) molecules are critical to adaptive immune defence mechanisms in vertebrate species and are encoded by highly polymorphic genes. Polymorphic sites are located close to the ligand-binding groove and give rise to MHC alleles with distinct binding specificities. Some efforts have been made to investigate the relationship between polymorphism and protein stability; however, less is known about the relationship between polymorphism and MHC coevolutionary constraints. Using Direct Coupling Analysis (DCA), we observe that coevolution analysis accurately pinpoints structural contacts, although the protein family comprises fewer than five hundred vertebrate species. Moreover, we show that polymorphic sites in human preferentially avoid coevolving residues. These results suggest that sites displaying high polymorphism may have been selected to avoid those under coevolutionary constraints and thereby maximize their mutability. To assess the pertinence of these results in the presence and absence of ligands, we provide a novel extension of DCA that incorporates the plurality of ligands of each HLA allele when looking for coevolving sites. |
Evolution and Phylogeny |
|
E04 |
Klingen T, Reimering S*, Loers J, Mooren K, Klawonn F, Krey T, Gabriel G, McHardy A
*Helmholtz Centre for Infection Research, Germany
Monitoring changes in the genome of influenza A viruses is crucial to understand their rapid evolution and adaptation to changing conditions, e.g. establishment within novel host species. Selective sweeps represent a rapid mode of adaptation and are typically observed in the evolution of human influenza A viruses. We describe Sweep Dynamics (SD) plots, a computational method combining phylogenetic algorithms with statistical techniques to characterize the molecular adaptation of rapidly evolving viruses from longitudinal sequence data. To our knowledge, it is the first method that identifies selective sweeps, the time periods in which they occurred, and the associated changes providing a selective advantage to the virus. We studied the past genome-wide adaptation of the 2009 pandemic H1N1 influenza A (pH1N1) and seasonal H3N2 influenza A (sH3N2) viruses. The pH1N1 virus showed simultaneous amino acid changes in various proteins, particularly in seasons of high pH1N1 activity. In part, these changes resulted in functional alterations facilitating sustained human-to-human transmission directly after its pandemic emergence. In the evolution of sH3N2 viruses since 1999, we detected a large number of amino acid changes characterizing vaccine strains. Amino acid changes found in antigenically novel strains rising to predominance were occasionally revealed in a selective sweep one season prior to the corresponding WHO recommendation, suggesting the value of the technique for the vaccine strain selection problem. Taken together, our results show that SD plots make it possible to monitor and characterize the adaptive evolution of influenza A viruses by identifying selective sweeps and their associated signatures. |
Evolution and Phylogeny |
|
E05 |
Liu J*, Robinson-Rechavi M
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
Developmental constraints on genome evolution have been suggested to follow either an early conservation model or an “hourglass” model. Both models agree that late development (after the morphological ‘phylotypic’ period; both late embryonic and post-embryonic development) diverges between species, but they disagree on which developmental period is the most conserved. Here, based on a modified “Transcriptome Age Index” approach, we analyzed the constraints acting on three evolutionary traits of protein-coding genes (strength of purifying selection on protein sequences, phyletic age, and duplicability) in four species: C. elegans, D. melanogaster, D. rerio and M. musculus. In general, we found that both models can be supported by different genomic properties. The evolution of phyletic age and of duplicability follows an early conservation model in all species, but sequence evolution follows different models in different species: an hourglass model in both D. rerio and M. musculus, and an early conservation model in D. melanogaster. Further analyses indicate that the stronger purifying selection on sequences in the early development (before the morphological ‘phylotypic’ period) of D. melanogaster and in the middle development (the morphological ‘phylotypic’ period) of D. rerio is driven by the temporal pleiotropy of these genes. In addition, inspired by the “new genes out of the testis” hypothesis, we report evidence that expression in late development is enriched in retrogenes. This implies that expression in late development could facilitate the transcription, and eventually the acquisition of function, of new genes; it thus provides a model for why both young genes and highly duplicable genes tend to be expressed in late development. Finally, we suggest that dosage imbalance could also be one of the factors causing the depleted expression of young genes and of highly duplicable genes in early development, at least in C. elegans.
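For reference, the standard Transcriptome Age Index for a developmental stage s weights each gene's phyletic age ps_i by its expression e_{is}; the study uses a modified version of this index:

    \[
      \mathrm{TAI}_s = \frac{\sum_i ps_i \, e_{is}}{\sum_i e_{is}}
    \]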
|
Evolution and Phylogeny |
|
E06 |
Guillot E*, Goudet J, Robinson-Réchavi M
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
From differential expression to the inference of selection, gene length creates a systematic bias in biological measures. To bridge the gap across the different domains of bioinformatics where this problem occurs, we perform a general study of gene-length bias, including the cases of differential expression analysis and dN/dS analysis. Using simulations, we explore the impact of this bias on Gene Ontology enrichment, and we test a new method to correct for it. |
Evolution and Phylogeny |
|
E07 |
Kuipers J*
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
Large-scale genomic data can help to uncover the complexity and diversity of the molecular basis of cancer and its progression. Statistical analysis of cancer data from different tissues of origin highlights differences and similarities which can guide drug repositioning as well as the design of targeted and precise treatments. Here, we developed an improved Bayesian network model for tumour mutational profiles and applied it to 8,198 patient samples across 22 cancer types from the TCGA database. For each cancer type, we identified the interactions between mutated genes, capturing signatures beyond mere mutational frequencies. When comparing networks, we found genes which interact both within and across cancer types. To detach cancer classification from the tissue type we performed de novo clustering of the pancancer mutational profiles based on the Bayesian network models. We found 22 clusters which significantly improved survival prediction beyond clinical and histopathological information. The models highlight key genes for each cluster that can be used for genomic stratification in clinical trials and for identifying drug targets within strata. |
Evolution and Phylogeny |
|
E08 |
Jahn K*, Kuipers J, Raphael B, Beerenwinkel N
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
The mutational heterogeneity observed within tumours is a key obstacle to the development of efficient cancer therapies. A thorough understanding of subclonal tumour composition and the underlying mutational history is essential to open up the design of treatments tailored to individual patients. Recent advances in next-generation sequencing offer the possibility to analyse the evolutionary history of tumours at an unprecedented resolution, by sequencing single cells. This development poses a number of statistical challenges, such as elevated noise rates due to allelic dropout, missing data, and contamination with doublet samples.
We present SCITE, our probabilistic approach for reconstructing tumour mutation histories from single-cell sequencing data [1], with a focus on two recent extensions: the explicit modelling of doublet samples and a rigorous statistical test to identify the presence of parallel mutations and mutational loss [2]. Looking at several single-cell sequencing datasets from various tumour types, we find strong evidence that such recurrent mutational hits of the same genomic site occur more frequently than would be expected by chance. This result casts severe doubt on the adequacy of the infinite sites assumption, which is typically made in current state-of-the-art models for reconstructing the mutation histories of tumours from single-cell as well as bulk sequencing data.
[1] Jahn, K., Kuipers, J., and Beerenwinkel, N., 2016. Tree inference for single-cell data. Genome Biology, 17:86. [2] Kuipers, J., Jahn, K., Raphael, B., and Beerenwinkel, N., 2017. A statistical test on single-cell data reveals widespread recurrent mutations in tumor evolution. bioRxiv 094722; doi: https://doi.org/10.1101/094722 |
Evolution and Phylogeny |
|
E09 |
Kuznetsov D*, Tegenfeldt F, Waterhouse R, Zdobnov E, Kriventseva E
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
OrthoDB [Zdobnov et al., 2016] (http://www.orthodb.org/) is a comprehensive resource for comparative genomics. OrthoDB employs a best-reciprocal-hit (BRH) algorithm followed by clustering at each major radiation point of the considered species phylogeny. The resulting hierarchical catalog of orthologous groups (OGs) allows users to study target genes at the most relevant taxonomic level. OrthoDB provides evolutionary and functional annotations for the evaluated OGs, including descriptive OG names, rates of ortholog sequence divergence, gene copy-number profiles, homology-related sibling groups, signature GO and InterPro identifiers, and gene architecture profiles. At the gene level, OrthoDB presents filtered sets of collated annotations from UniProt, Ensembl, GenBank, InterPro, GO, COG, as well as from a panel of model organism projects. The OrthoDB data are available via both web and programming interfaces. The GUI provides comprehensive mining tools, including rapid Google-like text search, sequence homology search, species-tree navigation and comparative genomics charts, while the API, including URL-based programmatic access and an RDF/SPARQL console, allows fine-tuned data access for advanced users. The OrthoDB project also accepts, processes and displays private genomic data for registered users. The latest OrthoDB release (v9.1) covers 7806 species, including 330 Metazoa, 227 Fungi, 71 Protozoa and 31 plant species, as well as 3663 Bacteria, 345 Archaea and 3139 Viruses.
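The BRH step can be sketched in a few lines: two genes are called putative orthologs when each is the other's best hit across species (the hit tables below are invented):

    # Best-reciprocal-hit (BRH) orthology calling between two proteomes,
    # given each protein's best hit in the other species (invented tables).
    best_a_to_b = {"geneA1": "geneB7", "geneA2": "geneB3"}
    best_b_to_a = {"geneB7": "geneA1", "geneB3": "geneA9"}

    brh_pairs = [(a, b) for a, b in best_a_to_b.items()
                 if best_b_to_a.get(b) == a]
    print(brh_pairs)  # [('geneA1', 'geneB7')]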
|
Evolution and Phylogeny |
|
E10 |
Gupta SK*, Srivastava M, Dandekar T
*University of Würzburg, Germany
Aspergillus fumigatus is an airborne fungal pathogen which can cause hypersensitivity reactions, mucosal colonization and even life-threatening invasive infections in immunocompromised hosts. New antimycotic drugs against A. fumigatus are challenging to find, hence molecular target-screening pipelines are helpful (1). Furthermore, the concept of synthetic lethality can help to reveal targets that were previously overlooked. The identification of synthetic lethal pairs is critical to proposing such novel lethal targets, but direct screening of all possible synthetic lethal pairs is experimentally laborious. Here, using computational methods, we defined a genome-wide synthetic lethal genetic interaction network of A. fumigatus composed of 2004 genes and 6571 genetic interactions, and we analyse how this network may help to define new therapeutic targets. Our approach used robust sequence homology mapping to derive putative synthetic lethal genetic interactions evolutionarily conserved between A. fumigatus and at least one model eukaryotic organism, including yeast, worm, fly and mouse. To avoid undesired off-target effects in the patient, SL pairs conserved between human and A. fumigatus were removed. We further implemented a confidence score to measure the strength of our predictions and to prioritize pairs for experimental validation. Out of the 6571 predicted synthetic lethal pairs, we identified eleven that were evolutionarily conserved in at least two model organisms.
Reference
Kaltdorf M, Srivastava M, Gupta SK, Liang C, Binder J, Dietl AM, Meir Z, Haas H, Osherov N, Krappmann S, Dandekar T. 2016. Systematic Identification of Anti-Fungal Drug Targets by a Metabolic Network Approach. Front Mol Biosci. 3:22.
|
Evolution and Phylogeny |
|
E11 |
Nevers Y*, Kress A, Ripp R, Poch O, Lecompte O
*CNRS - ICube UMR7357, France
OrthoInspector is one of the leading algorithms for pairwise orthologous relationship prediction. In addition to its independent software package, the OrthoInspector website (www.lbgi.fr/orthoinspector) offers precomputed databases, currently sampling 259 eukaryotes, 1568 bacteria and 120 archaea. Here, we present our latest developments for a new release of the OrthoInspector databases. We started from the UniProt Reference Proteomes, a complete set of non-redundant proteomes representative of a wide range of taxa. Low-quality proteomes were filtered out according to a statistical analysis of their protein content. The resulting set encompassed 4752 proteomes (87% of the original set), i.e. more than 23 million proteins. To handle these massive data sets, we developed different strategies to improve the computationally expensive procedures of orthology prediction. We developed an original protocol to distribute BLASTP all-versus-all searches over the European Grid Infrastructure on segmented BLAST databases, and the subsequent algorithmic steps were optimized for the handling of large data sets. We generated three precomputed orthology databases covering 711 eukaryotes, 3682 bacteria and 179 archaea, a major breakthrough in terms of available species. In parallel to the database extension, we are developing new features for the OrthoInspector software suite. These include SQLite support to simplify local database installation. We are also developing an automatic update procedure to keep the resources up to date with the UniProt Reference Proteomes without performing time-consuming global computations, while enabling interoperability with other databases. Finally, we plan to introduce a definition of ortholog families to complement the pairwise relationships currently supported. |
Evolution and Phylogeny |
|
E12 |
Begum T*, Serrano-Serrano ML, Robinson-Rechavi M
*Unil, Department of Ecology and Evolution & SIB, Switzerland
The “ortholog conjecture” is the widely used hypothesis that orthologs (i.e. genes originating by speciation) share function, whereas paralogs (i.e. genes originating by duplication) do not, allowing the latter to evolve more rapidly. Whether the evolution of gene expression follows this conjecture has recently been controversial, with most recent results supporting it, including our own work on the tissue specificity of gene expression [1]. However, a recent preprint by Dunn et al. [2] contradicted our results using phylogenetic independent contrasts on empirical gene trees. Using tissue specificity as the trait, and the expected number of substitutions (i.e. branch length) as a proxy for time, they found that the rates of gene function evolution are the same for orthologs and for paralogs. Dunn et al. make the correct point that the ortholog conjecture should be tested in a phylogenetic framework, but their analysis does not account for several features of paralog evolution. We repeated their simulations and analysis with additional parameters. Notably, higher sequence evolutionary rates after duplication could lead to a rejection of the ortholog conjecture even in the presence of accelerated trait evolution after duplication. We also explore alternative models of evolution after duplication, and their impact on the inference of the ortholog conjecture by phylogenetic contrasts or by pairwise comparison. The ratio of trait to sequence evolutionary rates plays an important role in the inference of function evolution, and in many cases makes it difficult to prove the ortholog conjecture using phylogenetic comparative methods.
Reference: [1] Kryuchkova-Mostacci N, Robinson-Rechavi M (2016). PLoS Comput Biol. 12: e1005274. [2] Dunn CW, Zapata F, Munro C, Siebert S, Hejnol A (2017). Biorxiv. doi: http://dx.doi.org/10.1101/107177
|
Evolution and Phylogeny |
|
E13 |
Bello C*, Kondrashov F
*Center for Genomic Regulation, Spain
Whole and partial gene duplications play key roles in the evolution of novel genes and generation of new phenotypes. A large portion of the human genome is enriched in segmental duplications that are absent in other primates. However, the contribution of long non-coding RNA (lncRNA) duplications in human evolution remains unclear. Here, we systematically addressed the rate and impact of lncRNA exon duplication in the human genome. We found that 11% of lncRNA exons had at least one highly similar copy in the genome and were significantly prevalent in alternatively spliced lncRNAs. Analysis of promoter single-nucleotide polymorphisms (SNPs) in flanking regions of lncRNAs showed evolutionary constraint indicative of a functional role of recent lncRNAs. Furthermore, the overrepresentation of specific classes of transposable elements (TEs) in exon flanking regions suggest a mechanism for the emergence and regulation of these genes. By integrating expression data and comparing primate genomes we identified 62 human-specific lncRNA genes that recently emerged through exon duplication, half of which were fixed in the human population. Some of these genes displayed tissue-specific expression patterns, including the brain. These results contribute to our understanding of the genomic events that have shaped the evolution of the human genome and prompt future studies of copy number variation in lncRNAs and their effects in disease and genome evolution. |
Evolution and Phylogeny |
|
E14 |
Grbic D*, Milinkovitch MC
*LANE, Departement of Genetics and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
A correct phylogenetic reconstruction is an important component of our understanding of the evolutionary past. In recent years, phylogenetic relationships have most frequently been reconstructed with probabilistic models such as maximum likelihood and Bayesian methods, owing to their flexibility and statistical power. On the downside, this flexibility adds additional layers of parameter optimization on top of the tree topology optimization. With that in mind, it is important to have an efficient and fast tool to reconstruct maximum likelihood trees. MetaPIGA 3 is a software package that employs meta-population evolutionary algorithm metaheuristics combined with fast numerical optimization methods. In this work we present a comparative benchmark analysis of MetaPIGA against other state-of-the-art methods.
|
Evolution and Phylogeny |
|
E15 |
Marass F*, Beerenwinkel N, Yuan K
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
According to the theory of clonal evolution, cancer cell populations evolve by mutation and selection, creating tumours that are composed of several distinct clones. On sequencing, the genomes of multiple subpopulations within the tumour are sampled and profiled at once, and analysis of this heterogeneity is needed to correctly interpret the data and reconstruct the past evolutionary history of the tumour. Access to multiple samples of the same tumour, separated in space or time or consisting of different data types, increases the power to reconstruct their shared evolutionary history. However, as each sample only offers a partial view of the evolutionary process, it remains challenging to correctly integrate and interpret these data, especially across different data modalities. To address this problem, we developed a generative, probabilistic model that explains these partial views as subsets of the same evolutionary process. Leveraging the benefits of mixture and feature allocation models, we let a non-parametric tree form the prior distribution over hierarchically related clones, and view samples as random draws from this distribution. We present a Markov chain Monte Carlo algorithm to obtain samples from the posterior distribution of the model, and show results based on a controlled biological dataset. Our model is suitable for joint deconvolutions of different data types, and it is applicable outside the realm of cancer, e.g. to the sequencing of viruses. |
Evolution and Phylogeny |
|
E16 |
Noble R*, Burley J, Hochberg ME
*ETHZ, D-BSSE & SIB, Switzerland
Intra-tumour genetic heterogeneity is a product of evolution in spatially structured populations of cells. Whereas genetic heterogeneity has been proposed as a prognostic biomarker in cancer, its spatially dynamic nature makes accurate prediction of tumour progression challenging. We use a novel computational model of cell proliferation, competition, mutation and migration to assess when and how genetic diversity is predictive of tumour growth and evolution. We characterize how tissue architecture (cell-cell competition and cell migration) influences the potential for subclonal population growth, the prevalence of clonal sweeps, and the resulting pattern of intra-tumour heterogeneity. We further compare the accuracy of cancer growth forecasts generated using different virtual biopsy sampling strategies, in different tissue types, and when cancer evolution is characteristically neutral or non-neutral. We thus determine the conditions under which genetic diversity is most predictive of future tumour states. Our findings help explain the multiformity of tumour evolution and contribute to establishing a theoretical foundation for predictive oncology. |
Evolution and Phylogeny |
|
E17 |
Mateus I*, Blokesch M
*Swiss Federal Institute of Technology Lausanne (EPFL), Switzerland
Acinetobacter baumannii is a nosocomial pathogen that is capable of developing multidrug resistance (MDR). A. baumannii, as well as certain other Gammaproteobacteria, has the capability to take up DNA from the environment and to incorporate this DNA into its own genome by homologous recombination. This process, called natural competence for transformation, is one of the three major modes of horizontal gene transfer (HGT) in bacteria. HGT is known to lead to the spread of antibiotic resistance genes across bacteria, which is currently considered a major threat to human health. This study therefore asked whether there is a direct link between natural competence and resistance transmission and maintenance. Interestingly, recent reports provided evidence that selfish mobile genetic elements (MGEs) such as integrative conjugative elements, prophages, and insertion sequences can inactivate natural competence and transformation in their bacterial host.
We analyzed 203 publicly available genome sequences of Acinetobacter baumannii for putative MGEs that interrupt important natural competence genes. We then tested whether such competence-interrupted strains displayed a lower load of antimicrobial resistance loci compared to strains with fully functional competence regulons. Our data showed that 61 strains contained a complete competence-linked DNA-uptake machinery, while 142 strains displayed an incomplete machinery, with one or several competence genes interrupted by MGEs. In addition, we observed that strains with a complete DNA-uptake machinery displayed a lower load of antimicrobial resistance loci than strains with incomplete machineries.
These results therefore support a recent hypothesis proposing that natural competence for transformation contributes to genome purification. They also suggest an important role of MGEs in competence inactivation and, as a consequence, in the maintenance of antimicrobial resistance. |
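A minimal sketch of how such a group comparison could be run, assuming per-strain counts of resistance loci are available (the abstract does not name the statistical test used; the Mann-Whitney U test and the counts below are illustrative assumptions):

    # Compare antimicrobial-resistance loci counts between strains with a
    # complete DNA-uptake machinery and strains with MGE-interrupted genes.
    # Counts are invented placeholders, not the study's data.
    from scipy.stats import mannwhitneyu

    complete = [2, 1, 0, 3, 1, 2]       # competence machinery intact
    interrupted = [5, 7, 4, 6, 8, 5]    # competence genes interrupted by MGEs

    stat, p = mannwhitneyu(complete, interrupted, alternative="less")
    print(f"U = {stat}, one-sided p = {p:.3g}")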
Evolution and Phylogeny |
|
E18 |
Marcionetti A*, Rossier V, Bertrand J, Salamin N
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
Clownfishes (family Pomacentridae) are a group of 28 described species whose distinctive characteristic is the mutualistic interaction they maintain with sea anemones. This mutualistic interaction has been identified as the main “key innovation” that opened new ecological niches and triggered the adaptive radiation of clownfishes. Little is known about the genetic mechanisms that allowed clownfishes to adapt to their sea anemone hosts, and there is also very little understanding of the process that led to the diversification of the clownfishes once they had acquired the mutualistic interaction. Taking advantage of next-generation sequencing technologies, we are investigating the genomic basis of the clownfish diversification process. Based on predictions from theoretical models of speciation, we expect the acquisition of a new phenotype, such as the ability to interact with sea anemones, to be driven by positive selection on existing single-copy genes or by neofunctionalization of duplicated genes that are specific to clownfishes.
We sequenced, assembled and annotated the genomes of 10 clownfish species. We used comparative methods and models of molecular evolution to infer the level of selection on all one-to-one orthologous genes of the clownfishes and on clownfish-specific duplicated genes. We identified twelve genes under positive selection that are potentially associated with the adaptation of clownfishes to sea anemones and with their diversification. These preliminary results suggest that, at its early stages, this adaptive radiation was triggered by changes in a few genes of large effect, and our findings corroborate the expectations from theories of adaptive radiation. |
Evolution and Phylogeny |
|
E19 |
Loetscher A*, Hammer C, Fellay J, Zdobnov EM
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
Epstein-Barr virus (EBV) is one of the pathogens responsible for infectious mononucleosis. Albeit usually asymptomatic, EBV infection has often been associated with Burkitt's lymphoma, Hodgkin's lymphoma and nasopharyngeal carcinomas. Our collaboration aims to find new key genomic features involved in this patchwork of diseases. Since EBV infection is lifelong, it is likely that EBV undergoes selective pressure from its host. Therefore, we are applying a genome-to-genome (G2G) approach that consists of associating EBV and human genotypes. An in-depth investigation of EBV genomic diversity is central to such an analysis. My contributions to this analysis are i) a phylogenetic reconstruction of EBV lineages to correct for population stratification and ii) viral variant calling.
Since a higher viral load is found in immunodeficient patients, blood samples from 285 patients included in the Swiss HIV cohort were enriched for EBV and subsequently quality controlled and assembled. De novo assemblies were used to reconstruct a phylogeny alongside 31 published genomes. Phylogenies were assessed by bootstrap and by comparison with a reference tree. Variant calling was performed against two references (AG876 (type 2 EBNA) and B95-8 (type 1)) using an in-house pipeline featuring the consensus of multiple variant callers.
A comparison between the Swiss HIV population and a previous worldwide diversity study shows, as expected, slightly less variability (~200 amino acid variants). Strikingly, we found evidence that our mostly European population contains type 2 EBNA, a trait found almost exclusively in African EBV. |
Evolution and Phylogeny |
|
E20 |
Morgenstern B*, Leimeister C
*Georg-August-University, Germany
Alignment-free methods are often used for phylogeny reconstruction, since they are much faster than traditional alignment-based algorithms. In most of these methods, sequences are represented as word-frequency vectors and, instead of comparing the sequences position by position, their word frequencies are compared. A disadvantage of these approaches is that distance values calculated in this way are not based on stochastic models of evolution; they are only rough measures of sequence dissimilarity.
Recently, we proposed to use `filtered spaced word matches' to estimate distances between two genomic sequences. For a fixed binary pattern representing `match' and `don't-care' positions, we search for pairs of words, one from each of the input sequences, that have matching nucleotides at the `match' positions, while mismatches are allowed at the `don't-care' positions.
To reduce the noise from random background spaced-word matches, we calculate a score for each spaced-word match and discard all matches with scores below a threshold. In this way, one can easily distinguish between homologous spaced-word matches and random background matches.
The spaced-word matches obtained in this way are then used to estimate the fraction of mismatched nucleotides in a full pairwise alignment of the input sequences. Finally, the usual Jukes-Cantor correction is applied to estimate the number of substitutions per sequence position since the sequences diverged.
We show that distances estimated with our approach are more accurate than distances calculated with other word-based methods, and that reliable phylogenetic trees can be calculated from these distances.
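The final estimation step lends itself to a short worked example. A minimal sketch of the Jukes-Cantor correction, assuming the mismatch fraction p has already been estimated from filtered spaced-word matches:

    import math

    def jukes_cantor(p):
        # Substitutions per site from the estimated fraction p of
        # mismatched nucleotides (requires p < 0.75).
        if p >= 0.75:
            raise ValueError("mismatch fraction too high for JC correction")
        return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

    # e.g. a mismatch fraction of 0.10 estimated from spaced-word matches
    print(jukes_cantor(0.10))   # ~0.107 substitutions per site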
|
Evolution and Phylogeny |
|
E21 |
Manni M*, Simao Neto FA, Misof B, Niehuis O, Zdobnov EM
*University of Geneva & SIB Swiss Institute of Bioinformatics, Switzerland
During the last few years, several genome sequencing projects have greatly expanded the number of arthropod species with sequenced genomes, but genomic resources within Entognatha are still scarce. This limits the possibilities to study genome evolution in Hexapoda. In the framework of the i5K initiative, we assembled the draft genome sequence of Campodea augens, a blind soil-dwelling hexapod belonging to the order Diplura (two-pronged bristletails). Genomic information from species of this lineage is paramount for understanding the early evolution of hexapods and their genomes. DNA was sequenced on an Illumina HiSeq 2000 machine using four paired-end short-insert libraries and four mate-pair libraries. After testing several assemblers, the final assembly was performed with the software Platanus. Contigs were scaffolded with SSPACE and gaps were resolved using GapCloser. Additional tools, namely Redundans and AGOUTI, were employed to further improve the assembly quality. The final assembly spans around 1.14 Gbp, a value close to the genome size estimated via flow cytometry (ca. 1.2 Gbp), and harbors around 22,800 predicted genes. Contig and scaffold N50 are 32 and 235 Kbp, respectively. BUSCO analysis indicates that the assembly contains 98% of 1,066 single-copy genes conserved across arthropods. The C. augens genome greatly expands our knowledge of the evolution of gene families and molecular pathways involved in perception, detoxification, and immunity in Hexapoda. Our project was enabled by Nikolaus Szucsich (Natural History Museum, Vienna) and Daniela Bartel (University of Vienna), who kindly provided us with samples of C. augens. |
Evolution and Phylogeny |
|
E22 |
Rama Ballesteros R*, Dib L, Meyer X, Dessimoz C, Robinson-Rechavi M, Salamin N
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
Coevolution is an important phenomenon that can occur at all biological levels. At the molecular level, coevolution has proved to be an important process for detecting protein interactions and for understanding the functional relationships between proteins. Despite the large set of studies investigating coevolution and the associated processes, we still lack a solid understanding of the relevant features characterizing coevolution at the molecular level. We analysed the output of large-scale analyses that estimated coevolution on protein families available in the Selectome database. These analyses used a novel model to estimate coevolution that accounts for the phylogenetic relationships between sequences, and they were run on the BlueGene/Q computer available through CADMOS. The results were stored in a large database called CoevDB, which provides almost 1 terabyte of coevolving pairs of sites in more than 8’000 protein families. Based on these data, on features extracted from the gene sequences, and on structural information from UniProt and PDB, we use machine learning approaches to characterize the pairs involved in coevolution. The present study analyses the coevolving pairs of sites and recognises patterns in the extracted features involved in the biological process of coevolution.
|
Evolution and Phylogeny |
|
F01 |
Moretti S, Tran VD, Burdet F, Lefrançois L, Øyås O, Ganter M, Soldati T, Stelling J, Pagni M*
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
A genome-scale metabolic network (GSMN) is the set of biochemical and transport reactions in an organism, associated with the proteins that catalyse them and the genes encoding those proteins. A GSMN has a double purpose: it is both a repository of knowledge about an organism's metabolism and a model that can be simulated, for example using flux balance analysis (FBA). We propose here a fully automated web service to create a GSMN from an organism's genome. This service was built on top of original algorithms and databases developed during the MetaNetX project (SystemsX.ch). The quality of the automatic reconstruction was benchmarked against experimental data for different organisms. In the framework of the HostPathX project (SystemsX.ch), a GSMN was constructed for Mycobacterium marinum. Gene essentiality was predicted through simulation of single-gene knockout mutants and compared against experimental results from transposon mutagenesis experiments in different growth conditions. Our GSMN for M. marinum harbours a predictive power comparable to that achieved by manually curated GSMNs for M. tuberculosis.
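As an illustration of the simulation side, here is a minimal FBA sketch on a toy three-reaction network using linear programming; the reactions, bounds and knockout below are invented placeholders, not part of the MetaNetX reconstruction:

    import numpy as np
    from scipy.optimize import linprog

    # Stoichiometric matrix S (rows: metabolites A, B; columns: reactions)
    #   R1: -> A      R2: A -> B      R3: B -> biomass
    S = np.array([[1, -1,  0],
                  [0,  1, -1]], dtype=float)
    bounds = [(0, 10), (0, 10), (0, 10)]   # flux bounds per reaction
    c = np.array([0.0, 0.0, -1.0])         # maximize R3 = minimize -R3

    res = linprog(c, A_eq=S, b_eq=np.zeros(2), bounds=bounds)
    print("optimal fluxes:", res.x)

    # A single-gene knockout can be screened by forcing the associated
    # reaction's flux to zero and re-solving:
    ko = linprog(c, A_eq=S, b_eq=np.zeros(2),
                 bounds=[(0, 0), (0, 10), (0, 10)])
    print("growth after knockout of R1:", -ko.fun)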
|
From Genotype to Phenotype and back, in health and disease |
|
F02 |
Milnik A*
*University of Basel, Switzerland
Studies assessing the existence and magnitude of epistatic effects on complex human traits have provided inconclusive results. The study of such effects is complicated by a considerable increase in computational burden, model complexity, and model uncertainty, which in concert decrease model stability. An additional source of significant uncertainty with regard to the detection of robust epistasis is the biological distance between the genetic variation and the trait under study. Here we studied CpG methylation, a genetically complex molecular trait that is particularly close to genomic variation, and performed an unbiased exhaustive search for two-locus epistatic effects on the CpG-methylation signal in two cohorts of healthy young subjects. We report the detection of robust epistatic effects for a small number of CpGs; these CpGs were more likely to be associated with the expression of nearby genes. Our results indicate that higher-order interactions explain only a minor part of the variation in DNA CpG methylation. Interestingly, CpGs with a complex genetic background were more likely to be involved in the regulation of gene expression in cis. |
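For illustration, a single test from such an exhaustive two-locus scan might look like the following sketch, where the epistatic effect is read off the interaction coefficient of a linear model (variable names and simulated data are assumptions, not the study's pipeline):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "meth": rng.normal(size=500),            # CpG methylation signal
        "snp1": rng.integers(0, 3, size=500),    # genotypes coded 0/1/2
        "snp2": rng.integers(0, 3, size=500),
    })

    # Additive terms plus an interaction term; epistasis is assessed via
    # the p-value of the snp1:snp2 coefficient.
    fit = smf.ols("meth ~ snp1 + snp2 + snp1:snp2", data=df).fit()
    print(fit.pvalues["snp1:snp2"])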
From Genotype to Phenotype and back, in health and disease |
|
F03 |
Hashim M*
*United Arab Emirates University, United Arab Emirates
One of the goals of personalized medicine is to tailor therapy based on a patient’s unique genetic information. Unfortunately, the integration of genomic data such as cancer genetics or drug sensitivity into electronic health records (EHRs) has been challenging. This is due to several reasons, including the complexity of genomic data as well as competing standards and ontologies. Indeed, most laboratories deliver genomic data as non-coded text reports in PDF files. We aimed to develop a simplified format for storing and displaying genomic data in EHRs. As part of a university research grant funded project to develop an EHR, a new ‘simplified’ model for the integration of genomic data was developed. Our model consists of four domains that store data in a clinically relevant format: sensitivity to a pharmacologic agent; a genetic variant (such as a cancer subtype) with an associated sensitivity to an agent; a genetic variant (such as BRCA1) with prognostic significance; and any other genomic data (such as variants of unknown clinical significance). Each domain has text-based qualifiers including the associated drug, the level of sensitivity, and clinical recommendations (‘actionability’). We have thus developed a simplified approach to storing and displaying genomic information in EHRs. Our model has the potential to enable personalized precision medicine in current clinical environments.
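A sketch of how the four domains might be represented in software; the field names below are illustrative assumptions, not the authors' actual schema:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class GenomicFinding:
        domain: str                  # "drug_sensitivity", "variant_with_sensitivity",
                                     # "prognostic_variant" or "other"
        description: str
        associated_drug: Optional[str] = None
        sensitivity_level: Optional[str] = None   # e.g. "high", "intermediate"
        recommendation: Optional[str] = None      # clinical 'actionability' note

    record = GenomicFinding(                      # invented example record
        domain="variant_with_sensitivity",
        description="EGFR L858R (lung adenocarcinoma)",
        associated_drug="erlotinib",
        sensitivity_level="high",
        recommendation="consider EGFR tyrosine-kinase inhibitor",
    )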
|
From Genotype to Phenotype and back, in health and disease |
|
F04 |
Yang M*, Vollmuth N, Rudel T, Liang C, Dandekar T
*Biozentrum of Universität Würzburg, Germany
Metabolic adaptation to the host cell is of vital importance for obligate intracellular pathogens such as Chlamydia trachomatis. Because of its unique biphasic developmental cycle and difficulties in genetic manipulation, little is known about how the metabolic activity of C. trachomatis evolves in the host cell. Moreover, even fewer studies have clarified the exact differentiation of metabolic pathways between the infectious elementary body (EB) and the reticulate body (RB). In our work, we reconstructed a genome-scale model of the metabolic network of C. trachomatis and used flux balance analysis (FBA) to identify differences in pathway usage between EB and RB according to proteomics data at three different time points of infection. We find that, among the central pathways, the deficient tricarboxylic acid (TCA) cycle and the fatty acid biosynthesis pathway are predicted to carry comparatively low fluxes. In contrast, the flux levels of the pentose phosphate pathway (PPP) and gluconeogenesis (GNG) are much higher. Glycolysis together with the glycerophospholipid pathway shows higher flux intensity. Moreover, we hypothesize that glutamate, taken up from the host cell, may play an important role in TCA anaplerosis and as an energy source for folate biosynthesis. We also compare the similarities and variation in these pathways between EB and RB. Experimental validation of the predictions is ongoing. In conclusion, our study describes the general metabolic activities of C. trachomatis and the differences between the EB and RB forms by bioinformatics analysis of different data sources. |
From Genotype to Phenotype and back, in health and disease |
|
F05 |
Srivatsa S*, Kuipers J, Schmich F, Beerenwinkel N
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
Reconstructing signalling pathways from experimental measurements and biological prior knowledge is a key issue in computational biology. Nested effects models (NEMs) are a class of probabilistic graphical models, which have been designed to reconstruct pathways from high-dimensional perturbation screens. In RNA interference screens, NEMs assume that the short interfering RNAs (siRNAs) designed to perturb specific genes are strictly on-target. However, it has been shown that most siRNAs exhibit strong off-target effects, which further confound the data, resulting in unreliable reconstruction of networks by NEMs. Here, we present an extension of NEMs called probabilistic combinatorial nested effects models (pc-NEMs), which capitalise on the ancillary siRNA off-target effects information for network reconstruction from combinatorial gene knockdown data. We developed a parameter inference method based on the adaptive simulated annealing algorithm and evaluated the identifiability of pc-NEMs. An extensive simulation study examining the performance of network inference as a function of the number of effects and noise levels demonstrates that pc-NEMs improve the inference of networks over NEMs by utilising the supplementary siRNA off-target effects information. |
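A generic simulated-annealing skeleton of the kind referred to above; the log-likelihood and proposal functions are stand-ins to be supplied by the model, and the schedule is a simple geometric one rather than the adaptive variant used in the paper:

    import math
    import random

    def anneal(state, log_likelihood, propose, t0=1.0, cooling=0.999, n_steps=10000):
        score = log_likelihood(state)
        best, best_score = state, score
        t = t0
        for _ in range(n_steps):
            cand = propose(state)
            cand_score = log_likelihood(cand)
            # accept uphill moves always, downhill moves with Boltzmann probability
            if cand_score >= score or random.random() < math.exp((cand_score - score) / t):
                state, score = cand, cand_score
                if score > best_score:
                    best, best_score = state, score
            t *= cooling    # geometric cooling schedule
        return best, best_score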
From Genotype to Phenotype and back, in health and disease |
|
F06 |
Huang R*, Schmidt TSB, von Mering C, Robinson MD, Soneson C
*UZH, Institute of Molecular Life Sciences & SIB Swiss Institute of Bioinformatics, Switzerland
As interest in microbiome-related effects on human health continues to rise, we aim to develop flexible and robust statistical methods to discover associations between phenotypic outcomes (e.g., disease status) and observed microbial abundances. Community profiling of microbes is often based on 16S ribosomal RNA sequencing, from which operational taxonomic units (OTUs) can be defined as clusters of sequences with high similarity (e.g., thresholded at 97.5%). Current statistical analyses of 16S ribosomal RNA sequencing data are often aimed at pinpointing individual microbes that show differential abundance between predefined groups. However, the large number of microbes with differential abundance can make interpretation difficult. Under certain conditions, the abundances of a whole family of microbes may change; such a family corresponds to an internal node of a phylogenetic tree rather than a leaf (an individual microbe). Performing tests and interpretation at this hierarchical level improves the interpretability of the results. Moreover, aggregating abundances across multiple species increases the power to detect true signals. Building on tools for differential analysis of quantitative RNA sequencing (RNA-seq) data (e.g., the edgeR package), we have developed a flexible bottom-up approach to find the hierarchical level on the tree at which to report results. Applying our method to buccal mucosa samples from the Human Microbiome Project retains 195 OTUs and 528 internal nodes for interpretation, compared to 1,394 OTUs found to be significantly differentially abundant when the tree structure is not taken into account. 127 of the retained internal nodes contain OTUs that would not have been detected individually. The algorithm is flexible, shows promising results in simulation studies, and will have applications in other domains, such as differential analyses of hierarchies of cell types. |
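The bottom-up aggregation step can be sketched in a few lines: OTU counts are summed into their parent nodes so that differential-abundance tests can also be run at internal levels of the tree (the tree and counts below are toy placeholders; the actual method builds on edgeR-style tests):

    import numpy as np

    children = {                  # internal node -> children
        "root": ["fam1", "otu3"],
        "fam1": ["otu1", "otu2"],
    }
    counts = {                    # per-OTU counts across two samples
        "otu1": np.array([5, 7]),
        "otu2": np.array([3, 2]),
        "otu3": np.array([9, 1]),
    }

    def aggregate(node):
        if node not in children:              # leaf: observed OTU counts
            return counts[node]
        counts[node] = sum(aggregate(c) for c in children[node])
        return counts[node]

    aggregate("root")
    print(counts["fam1"])   # summed abundance of otu1 + otu2 -> [8 9]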
From Genotype to Phenotype and back, in health and disease |
|
F07 |
Folkman L*, He X, Borgwardt K
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
Large-scale screenings of cancer cell lines with detailed molecular profiles against libraries of pharmacological compounds are currently being performed in order to gain a better understanding of the genetic component of drug response and to enhance our ability to recommend therapies given a patient's molecular profile. These comprehensive screens differ from the clinical setting in which (1) medical records only contain the response of a patient to very few drugs, (2) drugs are recommended by doctors based on their expert judgment, and (3) selecting the most promising therapy is often more important than accurately predicting the sensitivity to all potential drugs. Current regression models for drug sensitivity prediction fail to account for these three properties. We present a machine learning approach, named Kernelized Rank Learning (KRL), that ranks drugs based on their predicted effect per patient, circumventing the difficult problem of precisely predicting the sensitivity to the given drug. Our approach outperforms several state-of-the-art predictors in drug recommendation, particularly if the training dataset is sparse. Our work phrases personalized drug recommendation as a new type of machine learning problem with translational potential to the clinic. The Python implementation of KRL and scripts for running our experiments are available at https://github.com/BorgwardtLab/Kernelized-Rank-Learning. |
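A toy illustration of why ranking is evaluated differently from regression: a precision-at-k check asks how many of the k recommended drugs are among the truly most effective ones (this metric and the numbers are illustrative only; the paper's own evaluation may differ):

    import numpy as np

    def precision_at_k(predicted_scores, true_sensitivity, k=3):
        rec = set(np.argsort(predicted_scores)[::-1][:k])    # top-k recommended
        best = set(np.argsort(true_sensitivity)[::-1][:k])   # truly best k drugs
        return len(rec & best) / k

    pred = np.array([0.9, 0.1, 0.4, 0.8, 0.2])
    truth = np.array([0.1, 0.8, 0.9, 0.6, 0.7])
    print(precision_at_k(pred, truth))   # 1 of the top 3 overlaps -> 0.33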
From Genotype to Phenotype and back, in health and disease |
online |
F08 |
Alasoo K*
*University of Tartu, Estonia
Genetic variants regulating the relative abundance of specific transcripts of a gene (transcript ratio quantitative trait loci, trQTLs) are increasingly recognized to play an important role in both rare and common disease. However, many human transcripts are still missing from the Ensembl annotation database, and up to 58% of those that are present are truncated at either the 5’ or the 3’ end. Consequently, state-of-the-art methods such as LeafCutter focus on reads spanning exon-exon junctions, which can be quantified without prior knowledge of transcripts. While powerful, such junction reads can detect only a subset of alternative transcription events, omitting variants that regulate intron retention as well as alternative 5’ and 3’ UTRs.
To overcome these limitations, we have developed a new approach, reviseAnnotations, to preprocess incomplete transcript annotations and split them into independent transcription events corresponding to alternative 5’ ends, alternative middle parts or alternative 3’ ends of transcripts. We apply this method to RNA-seq data from human macrophages stimulated with three immunological stimuli (interferon-gamma, Salmonella, and interferon-gamma + Salmonella) and one metabolic stimulus (acetylated LDL), as well as untreated controls, from 84 individuals.
We show that annotation preprocessing increases the power of trQTL detection by ~30% and identifies novel associations that are missed by either whole-transcript or junction-level analyses. In contrast to eQTLs, we find that over 95% of the trQTLs are shared between conditions. Further colocalisation with 32 immune-mediated and metabolic traits revealed that ~70% of overlaps with GWAS hits were detected only at the transcript-ratio level. Our method is freely available as an R package at https://github.com/kauralasoo/reviseAnnotations. |
From Genotype to Phenotype and back, in health and disease |
|
F09 |
Buljan M*, Vychalkovskiy A, van Drogen A, Ciuffa R, Rosenberger G, Gstaiger M, Boutros M, Aebersold R
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland
Cancer genomics data have enabled the detection of candidate genes whose mutation patterns indicate a likely role in the disease. However, individual proteins usually exert their function in the context of interactions with other molecules. Here, we developed an approach for uncovering binding interfaces that accumulate disease mutations at a high rate; these can in turn point to interactions affected during cancer development. For this, we collected stable protein interactors reported in different public resources and analyzed the pairs for which associated structures or structural models were available. In addition, we extended the approach to a novel collection encompassing nearly 7,000 interactions of soluble human kinases. This large dataset was created by affinity-purifying tagged protein kinases and identifying their interactors with mass spectrometry-based proteomics. Subsequently, we assessed mutation enrichment in the assigned interfaces using a logistic regression model. Based on cancer genomics data for more than 10,000 patients, we uncovered a number of genes that are not yet classified as cancer census genes but that display mutational patterns typical of cancer drivers. Among these, several interfaces in epigenetic regulators and proteins involved in ubiquitin signaling exhibited a strong signal for the presence of mutation clusters. In addition, we observed a number of kinase homologs related to cancer-relevant processes that had mutation clusters at equivalent interfaces. In this work we show how, by combining different resources, it is possible to prioritize novel cancer gene candidates and contextualize their roles in the cell. |
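A minimal sketch of the enrichment test: each residue is labelled as interface or non-interface, and a logistic regression asks whether interface residues carry mutations at a higher rate (all data below are simulated placeholders, not the study's model specification):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    in_interface = rng.integers(0, 2, size=1000)      # residue annotation
    p_mut = 0.02 + 0.03 * in_interface                # enriched in interfaces
    mutated = rng.binomial(1, p_mut)

    X = sm.add_constant(in_interface.astype(float))
    fit = sm.Logit(mutated, X).fit(disp=0)
    print(fit.params[1], fit.pvalues[1])   # positive coefficient -> enrichment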
From Genotype to Phenotype and back, in health and disease |
|
F10 |
Kaoma T*, Moussay E, Tranchevent L, Nazarov P, Sarry J, Dittmar G, Janji B, Azuaje F
*Luxembourg Institute of Health, Luxembourg
One of the main limitations in building anti-cancer drug sensitivity models is the large number of genes measured on a low number of samples. To overcome this limitation, different statistical approaches for feature selection have been applied. Feature selection can be improved by restricting the analysis to genes belonging to a biological process known to be relevant to the response induced by one or many anti-cancer drugs. Autophagy is a cellular mechanism involved in the degradation and recycling of proteins and organelles, and has been associated with both chemosensitivity and chemoresistance in many cancers. We used autophagy genes (www.autophagy.lu) to predict drug sensitivity based on the analysis of transcriptomic data from untreated cancer cell lines and their corresponding drug sensitivity information (IC50 values) available at http://www.cancerrxgene.org/. We first combined principal component analysis, k-means and linear regression to find a subset of autophagy genes that best reproduces the variability observed in the entire transcriptomic dataset. We then used this reduced gene list as input to LASSO regression. Prediction models were built for each drug and subsequently validated using the concordance index, yielding on average a concordance index of 0.67±0.04. Our results provide evidence of the potential of autophagy gene expression data and machine learning for accurately predicting anti-cancer drug sensitivity, and motivate us to continue investigating this approach and its applications. |
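A sketch of the described reduction-then-regression pipeline with scikit-learn; the data are random placeholders, and the choice of one representative gene per cluster is a simplification of the authors' selection step:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 200))     # cell lines x autophagy genes
    y = rng.normal(size=100)            # log-transformed IC50 values

    # Cluster genes by their loadings on the top principal components
    # and keep one representative gene per cluster (here simply the first).
    loadings = PCA(n_components=10).fit(X).components_.T     # genes x PCs
    labels = KMeans(n_clusters=20, n_init=10).fit_predict(loadings)
    reps = [np.where(labels == k)[0][0] for k in range(20)]

    model = Lasso(alpha=0.1).fit(X[:, reps], y)
    print("non-zero coefficients:", int(np.sum(model.coef_ != 0)))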
From Genotype to Phenotype and back, in health and disease |
online |
F11 |
Farahbod M*, Pavlidis P
*University of British Columbia, Canada
Context-specific coexpression networks (CSCNs) are RNA coexpression networks derived from condition-specific data sets. Previous work has suggested that such networks improve performance in gene function prediction. One factor that likely contributes to the properties of CSCNs is the presence or absence of expression of genes, as opposed to changes in their coexpression profiles, but this has not been studied in much detail. In the present study of human tissues, we were interested in identifying coexpression “links” between genes that are moderately expressed in all tissues but coexpressed in only one, and in studying their functional properties as compared to other types of cross-tissue coexpression differences.
We collected several expression datasets from five human tissues, constructed an Aggregated Tissue coexpression Network (ATN) for each, and validated the quality of our ATNs using networks built from external datasets. We then defined a method for the identification of Tissue Specific Links (TSLs). We next stratified TSLs into classes based on the mean expression levels of the genes involved, focusing on TSLs between genes showing moderate expression across tissues. Using a Network Based Functional Identity (NBFI) measure for genes (based on Gene Ontology annotations), we show that although these links are a small portion of the TSLs, they do contribute to the NBFI of the genes in many cases. From a slightly different angle, we also study the NBFI shifts across ATNs for individual genes expressed in multiple tissues.
|
From Genotype to Phenotype and back, in health and disease |
|
F12 |
Koletou M*, Gabrani M, Aebersold R, Wild P, Rodriguez Martinez M
*IBM Research / USZ / ETH, Switzerland
Personalized medicine relies heavily on the analysis of patient data, including but not limited to the genomic datasets that are becoming increasingly available. Prostate cancer is the second most frequent cancer type in men, but it is not always possible to make an accurate survival prognosis. Taking this into consideration, prostate cancer will be used as a case study to develop a novel computational framework that searches for prostate cancer specific genomic alterations and studies how they could improve the stratification of prostate cancer into two classes, significant and insignificant disease. The new framework will employ machine learning techniques, mainly focusing on pattern detection. A very promising method to be applied is dictionary learning with sparse coding, an efficient tool that has been used in image processing. Briefly, it can identify genomic alterations that make substantial contributions to the variation of complex traits without relying on exhaustive search; it is therefore computationally efficient and can be applied to smaller patient cohorts. Additionally, the framework will be used to integrate different types of genomic alterations from the TCGA Prostate Adenocarcinoma datasets and an independent cohort, the Zurich Prostate Cancer Outcome Cohort study.
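A minimal sparse-coding sketch with scikit-learn's DictionaryLearning, the technique named above; the data matrix is a random stand-in for patient-by-alteration genomic profiles:

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 120))     # patients x genomic features

    dl = DictionaryLearning(n_components=8, alpha=1.0,
                            transform_algorithm="lasso_lars", random_state=0)
    codes = dl.fit_transform(X)        # sparse per-patient codes
    print("dictionary atoms:", dl.components_.shape)     # (8, 120)
    print("fraction of zero codes:", float(np.mean(codes == 0)))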
|
From Genotype to Phenotype and back, in health and disease |
|
F13 |
Srivastava M*, Bencurova E, Dandekar T
*University of Würzburg, Germany
Dendritic cells (DCs) serve as a bridge between innate and acquired immunity. Upon encountering the fungal pathogen Aspergillus fumigatus, DCs express the pattern recognition receptors (PRRs) toll-like receptor (TLR) 2 and TLR4 and the C-type lectin receptor (CLR) Dectin-1, which recognise A. fumigatus, and release inflammatory mediators including various cytokines and chemokines to guide other immune cells to the site of infection [1]. Aspergillosis, caused by the versatile saprophytic fungus A. fumigatus, is one of the major lethal conditions in immunocompromised patients [2]. While the healthy human immune system is generally able to ward off A. fumigatus infections, immune-deficient patients are highly vulnerable to invasive aspergillosis. Withstanding the regulated immune responses of human DCs is one of the processes vital for the survival of A. fumigatus during infection. In this context, studying the metabolic behavior of the fungal cell can reveal the survival strategies of the pathogen. Here, we established a metabolic model of A. fumigatus central metabolism during infection of dendritic cells and identified its elementary modes. Transcriptome data were integrated to identify the pathways activated when A. fumigatus is challenged with DCs. For both A. fumigatus and DCs, we are able to outline specific metabolic changes directed against the other party, involving different lipid pathways. Finally, we also tested whether additional regulatory pathways, apart from interleukin regulation, are activated in the DCs; these pathways involve chemokine receptors and inflammatory responses. Validation of the predictions was done by qRT-PCR using RNA extracted under these different conditions. |
From Genotype to Phenotype and back, in health and disease |
|
F14 |
Dimitrakopoulos C*, Kumar Hindupur S, Haefliger L, Behr J, Montazeri H, Hall M, Beerenwinkel N
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland
Several types of molecular aberration events, such as genetic aberrations, differential methylation at gene promoter regions, and differential expression of microRNAs (miRNAs), have been associated with cancer. These aberration events are very heterogeneous across cancer patients, and it is poorly understood how they affect the molecular makeup of the cell, including the transcriptome and the proteome. Protein interaction networks can help decode the functional relationship between aberration events and changes in the expression of genes and proteins. We developed NetICS (Network-based Integration of multi-omICS data), a new graph diffusion-based method that integrates disparate molecular data sources on a directed functional interaction network in order to prioritize cancer genes. NetICS prioritizes genes by their mediator effect, defined as the proximity of a gene to upstream aberration events and to downstream differentially expressed genes and proteins in the interaction network. Genes are prioritized for individual tumor samples separately and integrated using a robust rank aggregation technique. NetICS provides a comprehensive computational framework that can aid in explaining the heterogeneity of cancer-related aberration events and the way they affect not only the expression of the genes they hit but also, through gene interactions, the expression of other genes. We demonstrate that NetICS is superior to other methods in predicting known cancer genes when tested on several TCGA datasets. |
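The diffusion step can be illustrated with a random walk with restart on a toy network; the closed-form solution below is in the spirit of insulated graph diffusion, though NetICS's exact formulation may differ:

    import numpy as np

    A = np.array([[0, 1, 1, 0],        # toy adjacency matrix over 4 genes
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    W = A / A.sum(axis=0, keepdims=True)    # column-normalized transitions
    f0 = np.array([1.0, 0.0, 0.0, 0.0])     # gene 0 carries an aberration

    r = 0.4                                 # restart probability
    # walk with restart: f = r * (I - (1 - r) W)^-1 f0
    f = r * np.linalg.solve(np.eye(4) - (1 - r) * W, f0)
    print(f)                                # diffused score per gene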
From Genotype to Phenotype and back, in health and disease |
|
F15 |
Mueller M*, Gfeller D, Coukos G, Xenarios I, Bassani M
*UNIL, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
Neoantigens derived from mutated proteins and presented on the cell surface as HLA-binding peptides are targets for T-cell recognition of cancer cells, and their identification is crucial for the development of innovative cancer immunotherapies. While genome alterations can be identified by genome sequencing, predicting which of these alterations are presented on HLA molecules remains a challenge. Direct identification of neoantigens by mass spectrometry is technically feasible but lacks the sensitivity required for small quantities of tissue. Most attempts to computationally identify potentially immunogenic neoantigens are based on their predicted affinity to HLA molecules and other features such as RNA abundance, ignoring intracellular peptide processing and loading. Here we propose a new data-driven method that makes use of the in vivo presented repertoire of HLA class I and class II binding peptides. We collected and reanalyzed many MS/MS datasets of HLA peptides eluted from various tissues, cell lines and HLA alleles (ipMSDB). Analysis of these data reveals that proteins presented as HLA class I, class II or both stem from different subsets of cellular proteins. ipMSDB peptides are not randomly scattered along the protein sequences, but tend to accumulate in ‘hot-spots’ that reflect the propensity of peptide sequences to be presented, ‘averaged’ over many HLA alleles and cell types. We define features that evaluate how well a potential neoantigen lies within these hot-spots and combine them with other commonly used features for neoantigen immunogenicity ranking. We show that ipMSDB features significantly improve the prioritization of potential neoantigens. |
From Genotype to Phenotype and back, in health and disease |
online |
F16 |
Ernst C*, Hahnen E, Beyer A, Schmutzler RK
*University Hospital Cologne, Germany
Targeted sequencing, which is restricted to the exons of genes known or assumed to be implicated in a specific phenotype, significantly decreases costs, storage requirements, and computation times in comparison to whole-genome and whole-exome approaches. Hence, so-called multi-gene panel approaches have become a widely used tool in clinical diagnostics. Targeted sequencing data are typically characterized by strong biases based on local mappability, GC content, and further factors affecting capture efficiency, as well as by non-linear effects at target edges, resulting in noisy read abundance data. We present an approach for copy number variant (CNV) detection that is tailored to the challenges of multi-gene panel analysis. Our method relies on a generalized additive model (GAM), which models the mean of the observed read count frequencies as the product of two smooth functions: a generic background function that contributes to all samples under consideration, and a sample-specific smooth function. The latter is used for the final CNV calling, as it is assumed to deviate significantly from zero where a CNV exists. We validated our approach on 583 samples from seven sequencing runs that were analyzed with the diagnostic TruRisk® gene panel, comprising 48 genes known or assumed to be implicated in hereditary breast and/or ovarian cancer. We compared the performance of our method to that of three other tools adapted to CNV analysis of panel sequencing data, namely panelcn.mops, VisCap, and CoNVaDING. The evaluation revealed that our approach achieves sensitivities and specificities higher than or close to those achieved by existing tools.
|
From Genotype to Phenotype and back, in health and disease |
|
F17 |
Germain P*, Vitriolo A, Livi CM, Farina L, Ferrari F, Testa G
*European Institute of Oncology, Italy
When analyzing the transcriptomic impact of perturbations (e.g. mutations), a key challenge is to elucidate the pathways and intermediates linking the trigger to the observed transcriptomic changes. To this end, we developed a novel computational approach circumventing two problems of traditional reverse engineering, namely the poor reproducibility of interactions and the lack of interpretability/actionability of the resulting networks. The approach consists in, first, generating a matrix of the probabilities of regulator-target interactions from multiple sources, and then calculating the shortest probabilistic paths connecting the initial perturbation to each differentially-expressed gene. The interaction matrix is a weighted combination of interaction scores from 1) co-expression data (strengthening local interactions by leveraging evidence from external datasets); and 2) protein-DNA interactions. For the latter, we first trained a dual-level random forest to predict, on the basis of transcription factor binding, eQTLs, Hi-C interactions, topological domains, conservation, accessibility, etc., the genes functionally affected by perturbations of the factor. After training on key validation datasets, the model is applied to all available ChIP-seq data to populate the protein-DNA interaction matrix and enable the network reconstruction. We provide proof of principle by defining, in patient-derived models, the pathways mediating the transcriptomic effects of CNVs associated with neurodevelopmental disorders. |
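The shortest-probabilistic-path idea rests on a standard trick: with edge weights set to -log(p), a minimum-weight path maximizes the product of interaction probabilities. A toy sketch (node names invented):

    import math
    import networkx as nx

    G = nx.DiGraph()
    for u, v, p in [("perturbed_TF", "kinase", 0.8),
                    ("perturbed_TF", "geneA", 0.1),
                    ("kinase", "geneA", 0.7)]:
        G.add_edge(u, v, weight=-math.log(p))

    path = nx.shortest_path(G, "perturbed_TF", "geneA", weight="weight")
    prob = math.exp(-nx.shortest_path_length(G, "perturbed_TF", "geneA",
                                             weight="weight"))
    print(path, round(prob, 2))   # ['perturbed_TF', 'kinase', 'geneA'] 0.56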
From Genotype to Phenotype and back, in health and disease |
|
F18 |
Milchevskaya V*, Tödt G, Gibson T
*EMBL, Germany
Rapid and continuous improvement of genomic information affects gene and transcript definitions, which in turn has an impact on genome-wide expression profiling with Affymetrix GeneChips, a platform widely used in such assays. The original probe groupings used to measure transcript abundance on the GeneChips often become outdated as genome annotations are refined: probes may not match the target sequence or may match unspecifically, and multiple probe sets may be assigned to the same gene. These issues lead to inconsistent results and to dependence on the strategy used to aggregate redundant expression measurements into a gene-level value.
Here we show that the variability in aggregation methods used in the literature may affect the results of data analysis, both for the outcome of differential expression tests and for downstream analysis. Genes assigned to multiple probe sets in the original GeneChip annotations are those most affected by differences in processing. Mapping the probe sequences to the most recent genomes for a number of Affymetrix platforms also confirmed that accurate re-grouping of the probes is necessary.
As a solution, we have developed a pipeline that generates a novel probe grouping for an Affymetrix GeneChip based on a user-provided genome reference, and builds an annotation package compatible with standard Bioconductor libraries for further processing. The designed probe sets are gene- or transcript-specific and contain only probes with specific mappings. Thus, erroneous and redundant measurements frequently present in the original annotations are eliminated, and the gene coverage of GeneChips is often increased. Moreover, using the same gene definitions facilitates cross-platform analysis and comparative studies involving different Affymetrix platforms as well as RNA-seq data.
|
From Genotype to Phenotype and back, in health and disease |
online |
F19 |
Giannoula A*, Gutierrez A, Bravo A, Sanz F, Furlong L
*Medical Research Institute of the Hospital del Mar, Spain
Time is a crucial parameter in the study of comorbidities, as it permits the identification of complex disease patterns and thereby allows the progression of disease to be predicted over time. We present a novel time-analysis framework for large-scale comorbidity studies. The disease-history vectors of 643,358 hospitalized patients were extracted from a Catalan health registry and represented as time sequences of ordered disease diagnoses. Pairwise comparisons of all extracted sequences resulted in 3,153 and 3,864 statistically significant comorbidity pairs for men and women, respectively. Their temporal directionality was assessed using the binomial test. Subsequently, a novel unsupervised clustering algorithm based on the Dynamic Time Warping technique was applied, grouping the disease trajectories according to the temporal characteristics they share, irrespective of their lengths, durations and time scales. The disease similarity metric (distance) employed reflects differences between disease trajectories as defined by the International Classification of Diseases (ICD-9). Alternative disease similarity metrics could also be used, revealing different aspects of similarity, such as semantic, phenotypic or genotypic similarity, or a combination of these. Among the most highly populated clusters were those involving trajectories of respiratory and/or circulatory diseases. In summary, we demonstrated through a data-mining approach how patient health information collected in routine clinical practice can be exploited to discover complex disease patterns and to facilitate the prediction of the course of a disease given previous diagnoses. The proposed methodology could serve as the basis for developing disease prediction systems, leading to more efficient and cost-effective clinical management and healthcare. |
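The core of the clustering step, dynamic time warping, fits in a few lines. In this sketch the local distance is a simple absolute difference, standing in for the ICD-9-based disease similarity metric used in the study:

    import numpy as np

    def dtw(a, b):
        n, m = len(a), len(b)
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = abs(a[i - 1] - b[j - 1])        # local distance
                D[i, j] = cost + min(D[i - 1, j],      # insertion
                                     D[i, j - 1],      # deletion
                                     D[i - 1, j - 1])  # match
        return D[n, m]

    # same shape, different lengths -> distance 0
    print(dtw([1, 2, 3, 4], [1, 2, 2, 3, 4]))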
From Genotype to Phenotype and back, in health and disease |
|
G01 |
Jayaprakash N*, Surolia A
*Indian Institute of Science, India
Protein-carbohydrate interactions play a pivotal role in mediating biomolecular recognition. We attempt to unravel their intricacies by understanding how the glycan code is interpreted by a myriad of carbohydrate-binding proteins. Our objective is to decipher lectin-mediated recognition in the ER, which plays a crucial role in ER-mediated quality control (QC). The QC functions in three phases: protein folding, transport, and degradation. Altered protein quality control leads to ER-related storage disorders. Our primary focus is on the cargo transport proteins ERGIC-53 and VIP36, which are necessary for maintaining cellular homeostasis and recognize complex high-mannose N-glycans on folded glycoproteins. In the present study, we employed temperature-based replica exchange molecular dynamics (T-REMD) to decipher the inherent conformational heterogeneity and the binding mechanism of the N-linked glycans with the lectins. The study involves extensive simulations of the two proteins complexed with three high-mannose glycans: Man8B, Man9, and the mono-glucosylated glycan. The process of recognition is captured using MD simulations to achieve mechanistic insights and to characterize the dynamics of the glycans in their native and bound states via dihedral angle analysis. The results indicate that the flipped conformation of the glycans is crucial in differentiating their interaction with the proteins. Similar conformers of the glycans are preferred by ERGIC-53 and VIP36 in their glycan recognition events. ERGIC-53 preferred Man8B while VIP36 preferred Man9, in coherence with previous experimental reports. These simulations provide a computational microscopic preview of the mechanism at both spatial and temporal scales.
|
Macromolecular Structure, Dynamics and Function |
|
G02 |
Roehrig U*, Chaskar P, Zoete V
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
We present a hybrid quantum mechanical/molecular mechanical (QM/MM) on-the-fly docking algorithm, which addresses the challenges of treating polarization and metal interactions in docking. The algorithm is based on our classical docking algorithm Attracting Cavities and relies on the CHARMM force field and the semiempirical SCC-DFTB method. We tested the performance of this approach on three very diverse data sets: (1) the Astex Diverse set of common noncovalent drug/target complexes formed both by hydrophobic and electrostatic interactions; (2) a zinc metalloprotein data set, where polarization is strong and ligand/protein interactions are dominated by electrostatic interactions; and (3) a heme protein data set, where ligand/protein interactions are dominated by covalent ligand/iron binding. Redocking performance of the on-the-fly QM/MM docking algorithm was compared to the performance of classical Attracting Cavities and other popular docking codes. The results demonstrate that the QM/MM code preserves the high accuracy of most classical scores on the Astex Diverse set, while it yields significant improvements on both sets of metalloproteins at moderate computational cost [1,2].
[1] P. Chaskar, V. Zoete, U.F. Röhrig, J. Chem. Inf. Model. 57, 73-84 (2017) https://www.ncbi.nlm.nih.gov/pubmed/27983849
[2] P. Chaskar, V. Zoete, U.F. Röhrig, J. Chem. Inf. Model. 54, 3137-3152 (2014) https://www.ncbi.nlm.nih.gov/pubmed/25296988
|
Macromolecular Structure, Dynamics and Function |
|
G03 |
Haas J*, Behringer D, Gumienny R, Barbato A, Roth S, Schwede T
*University of Basel - Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland
Continuous monitoring of tools for structure prediction, structure quality estimation and residue-residue contact prediction allows users to retrospectively select the best tool for a given scientific question. The Continuous Automated Model EvaluatiOn (CAMEO) platform has been running for over five years and has added innovative measures developed by the community and the CAMEO team. New categories requested by the community have been included over the years. Here, we present the latest progress on structural similarity of protein-protein interfaces and on superposition-free model confidence assessment. Several methods assessing the structural similarity of protein-protein interfaces have been developed in recent years (MM-align by S. Mukherjee; QS-score by M. Bertoni), which led us to start adding support for interface analyses in homomers. We focused on adding distance metrics developed in the context of protein-protein docking that are not restricted to binary interactions, since decomposing the comparison of assemblies into binary interactions can result in a factorial number of comparisons, and missing interfaces (e.g. when comparing a dimer to a tetramer) remain unaccounted for. Apart from new scores and categories, we have also added common subset selection to compare a range of servers on a common target set, modernized the web interface, and introduced speed improvements. |
Macromolecular Structure, Dynamics and Function |
|
G04 |
Sanchez-Garcia R*, Sorzano COS, Carazo JM, Segura J
*CNB/CSIC, Spain
Position-Specific Scoring Matrix (PSSM) profiles have been widely employed to characterize residues of protein structures and to predict a broad variety of protein properties. Although the computational cost of calculating a single PSSM profile is affordable, most studies require thousands of profiles, which leads to prohibitive computational costs. A substantial portion of these studies focus on properties derived from the three-dimensional structure of the proteins. However, in each of these studies, the PSSM profiles of the selected Protein Data Bank (PDB) proteins have had to be recalculated and mapped to the structure residues. In this work, we present a new database compiling PSSM profiles for the proteins of the PDB. Currently, the database contains 333,532 protein chain profiles involving 123,135 different PDB entries. Each profile has been mapped to the structure residues using the SIFTS service. As a result, profiles can be employed directly, speeding up the development of methods involving PSSMs and protein structures. A web application providing different methods of data access is freely available at http://3dcons.cnb.csic.es. |
Macromolecular Structure, Dynamics and Function |
|
G05 |
Benedetti F*, Racko D, Dorier J, Stasiak A, Burnier Y
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
The question of how self-interacting chromatin domains (TADs) in interphase chromosomes are structured and generated dominates current discussions of eukaryotic chromosomes. Numerical simulations using standard polymer models have been helpful in testing the validity of various models of chromosome organization: experimental contact maps can be compared with simulated contact maps to verify how reliable a model is. With the increasing resolution of experimental contact maps, it has become apparent that active processes need to be introduced into models to recapitulate the experimental data. Since transcribing RNA polymerases are very strong molecular motors that induce axial rotation of the transcribed DNA, we present here models that include such rotational motors. We also include in our models swivels and sites for intersegmental passages, which account for the passive action of DNA topoisomerases releasing torsional stress. Using these elements in our models, we show that supercoiling generation in regions of divergent transcription, together with supercoiling relaxation between these regions, is sufficient to explain the formation of self-interacting chromatin domains in chromosomes of fission yeast (S. pombe). |
Macromolecular Structure, Dynamics and Function |
|
G06 |
Behringer D*, Haas J, Roth S, Schwede T
*Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland
The application of macromolecular models in life science research projects has increased rapidly in the last 6-8 years thanks to the improved performance of prediction algorithms and to methods for estimating model quality (1,2). Archiving of macromolecular models is crucial for the interpretation and reproducibility of published results, yet the Protein Data Bank (PDB) no longer accepts theoretical models determined purely in silico (3). Following recommendations of community workshops (4), the macromolecular ModelArchive (MMA, ModelArchive.org) has been created, aiming to supply long-term archival of theoretical models for the scientific community. To date MMA contains 1440 searchable theoretical model depositions, each linked to a unique Digital Object Identifier (DOI).
MMA currently supports interactive deposition of theoretical 3D models originating from de novo to comparative protein structure modeling approaches. MMA thereby aims at adhering to the FAIR principles (Findable, Accessible, Interoperable, Re-usable), offering interactive searches and public data download.
MMA is working towards re-usable models, ensuring interoperability ("I") across the structural biology communities and culminating in a common data standard (based on an extended mmCIF format) developed in collaboration with the PDB. Re-usability ("R") requires model validation, which is performed automatically upon every model deposition.
The MMA deposition system is fully automated through a template-based dynamic web page design, which automatically extracts content from the extended mmCIF files uploaded by users, thereby alleviating manual data input.
1. A. Kryshtafovych et al., Proteins (2016) 84:349-369; 2. J. Haas et al., Database (2013) 2013:bat031; 3. H. Berman et al., Structure (2006) 14(8):1211-1217; 4. T. Schwede et al., Structure (2009) 17(2):151-159. |
Macromolecular Structure, Dynamics and Function |
|
G07 |
Bienert S*, Schwede T, Waterhouse A, Lepore A, Gumienny R, Tauriello G, Studer G
*Unibas, Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland
Three-dimensional protein structures inform a broad spectrum of questions in molecular life science research. However, compared to DNA sequencing, experimental structure determination is an expensive and laborious process. As a consequence, known protein sequences outnumber experimentally determined structures by orders of magnitude. Computational methods for generating 3D models of proteins from their amino acid sequences, based on information from evolutionarily related proteins, are therefore an attractive alternative.
We are developing SWISS-MODEL as a robust and fully automated computational protein structure modelling workflow, complemented by an interactive, user-friendly web-based graphical user interface. The latest developments include the ability to automatically model homo- and heteromeric assemblies of molecules by exploiting potential quaternary structure arrangements in the templates, and an ML-based scoring approach for biologically relevant interfaces. The implementation of a new modelling engine (ProMod3) based on a modern open-source software framework (OpenStructure) enables rapid development of new algorithmic approaches. Presenting a new form of user guidance, SWISS-MODEL automatically recognises antibody sequences and offers a redirection to PIGS, a service specialised in immunoglobulin modelling.
Realistic estimates of model quality are crucial to select the most accurate model within an ensemble of alternatives, and to judge whether a model is suitable for a specific question. To address this problem, we have developed QMEANDisCo to estimate the local quality of a model at the per-residue level.
All methods are implemented in our automated web-based interactive modelling system SWISS-MODEL (https://swissmodel.expasy.org). |
Macromolecular Structure, Dynamics and Function |
|
G08 |
Lang S*
*Friedrich Schiller University Jena, Germany
Molecular mimicry is the formation of specific molecules by parasites to avoid recognition and suppression by the immune system of the host. This is analogous to uniforms misused by villains. Several pathogenic Ascomycota and Zygomycota show such behaviour, deceiving, in particular, the innate immune system. For example, Candida albicans binds human regulators like complement factor H and thus hides from the complement system. Such camouflage can reach a point where the immune system can no longer clearly distinguish between self and non-self. This implies that a trade-off between attacking possible pathogens and attacking host cells has to be made, which can in turn lead to autoimmunity. Based on methods from signalling theory and protein-interaction modelling, we present here a model of molecular mimicry by C. albicans involving the human immune regulator factor H. The main questions are to what extent pathogenic microbes can deceive the host immune system and how the host can respond to enable a distinction between host cells and camouflaged microbes. The results predict three distinguishable regimes for molecular mimicry by pathogenic microbes, depending mainly on the concentration of pathogens in the blood. In the first regime, molecular mimicry is not effective and the host is able to clearly discriminate between self and non-self. In the second regime, molecular mimicry is successful and microbes can hide among the host cells in the blood. In the third regime, autoimmunity may occur. |
Macromolecular Structure, Dynamics and Function |
online |
G09 |
Ghosh S*, Sen S
*University of Basel - Biozentrum, Switzerland
The strength of binding between two molecules can be attributed to the type of interaction prevailing at their interface. A meticulous analysis of the interaction behavior in the docked interface of α-L-fucosidase with each of its bioisosterically modified inhibitors, together with binding strength analysis, prompted the synthesis of a small library of compounds, ultimately leading to the discovery of a pharmacologically potent and biologically relevant α-L-fucosidase inhibitor with anti-breast-cancer properties. The known α-L-fucosidase inhibitors A and B were bioisosterically modified, resulting in three new types of molecules, 4b, 5c and 6a, belonging to the furopyridinedione, thiohydantoin and hydantoin chemotypes. The new molecules, particularly 4b and 6a, exhibited stronger binding in a comparison of the binding strengths of A, B, 4b, 5c and 6a following molecular docking with α-L-fucosidase. Enzymatic profiling of the library based on these new scaffolds showed that one of the compounds, 4e, inhibits α-L-fucosidase with an IC50 of ~0.7 µM (RSC Adv., 2017, 7, 3563). Such bioisosteric modification of existing nonspecific drugs, followed by binding strength analysis, has tremendous potential to improve their specificity and potency. In a separate study attempting to uncover the stability of host-pathogen protein-protein complexes and to address the hypothesis that they outcompete host-host protein-protein complexes, a large-scale statistical analysis of differences in interface features was undertaken; it showed few noticeable differences, suggesting roughly similar interaction patterns and hence similar binding strengths. An SVM classifier based on such interface features, developed as an offshoot of this work, holds great promise for identifying true biological interactions. |
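As an illustration of the classifier idea mentioned above, the following is a minimal sketch (not the authors' code) of training an SVM on interface features; the feature meanings and all data here are synthetic placeholders.

```python
# Minimal sketch of an SVM over protein-protein interface features to
# separate biological interactions from non-biological contacts.
# Features and labels are random toy data for illustration.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy feature matrix: one row per interface, e.g. buried surface area,
# number of hydrogen bonds, fraction of hydrophobic contacts, ...
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 1 = biological, 0 = not

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(clf, X, y, cv=5).mean())  # cross-validated accuracy
```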
Macromolecular Structure, Dynamics and Function |
|
G10 |
Bashardanesh Z*, van der Spoel D
*Uppsala University, Sweden
Intracellular environments are densely packed with macromolecules such as nucleic acids, proteins and sugars. High macromolecular concentrations influence the thermodynamics, kinetics and dynamics of cellular processes. However, in vitro experiments are performed in dilute solutions, where steric effects and nonspecific interactions are small, so their contributions to biomolecular properties such as binding constants and diffusion coefficients are not correctly captured. Some modeling studies have tried to quantify and explain these effects, but always with simplified models or models without explicit water. In our work, we have performed molecular dynamics (MD) simulations of crowded systems in full atomistic detail and measured biomolecular properties as a function of water fraction. We performed systematic computational studies of biomolecules with different crowding agents, such as small proteins, RNA and DNA hairpins, or a DNA double helix. Regarding dynamical properties, our results show that at biologically relevant biomolecular mass densities the translational diffusion coefficient is reduced to about 10% of its dilute-solution value, and that the rotational diffusion coefficient is retarded even more. Further analysis and possible explanations will be presented. |
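For context, translational diffusion coefficients are typically extracted from MD trajectories via the Einstein relation, MSD(t) ≈ 6Dt. Below is a minimal, self-contained numpy sketch of that calculation under assumed units (nm, ps); it is an illustration, not the authors' analysis code.

```python
# Sketch of extracting a translational diffusion coefficient from an
# (assumed unwrapped) centre-of-mass trajectory `com` of shape
# (n_frames, 3) in nm, with frame spacing `dt` in ps.
import numpy as np

def diffusion_coefficient(com, dt, max_lag=500):
    lags = np.arange(1, max_lag)
    msd = np.array([np.mean(np.sum((com[lag:] - com[:-lag])**2, axis=1))
                    for lag in lags])
    # Linear fit of MSD vs time; slope = 6 D, so D is in nm^2/ps
    slope, _ = np.polyfit(lags * dt, msd, 1)
    return slope / 6.0

com = np.cumsum(np.random.randn(10000, 3) * 0.01, axis=0)  # toy random walk
print(diffusion_coefficient(com, dt=1.0))
```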
Macromolecular Structure, Dynamics and Function |
|
H01 |
Moretti S*, Martin O, Bridge A, Pagni M
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
MetaNetX (http://www.metanetx.org) is a web site that provides tools to create, analyse and compare genome-scale metabolic networks. It is built on top of a repository of genome-scale metabolic networks and biochemical pathways imported from major public resources into a common namespace of chemical compounds, reactions, cellular compartments and proteins.
The new release of the MNXref namespace brings improvements in the reconciliation algorithm, as well as new data sources:
* The use of molecular structures to reconcile chemical compounds was systematised
* Biochemical and transport reactions are now described with their metabolites placed into generic compartments
* Two new data sources have been added: SABIO-RK and SwissLipids
* Previous data sources have been updated: BiGG, ChEBI, enviPath, HMDB, KEGG, LipidMaps, MetaCyc, Reactome, Rhea and Model SEED
This project is developed in close collaboration with the RHEA and SwissLipids teams in Geneva and benefited from numerous exchanges with the external data source providers. |
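To illustrate the reconciliation idea, here is a hedged sketch of mapping source-database compound identifiers into a common namespace via a cross-reference table. The file name, column layout and identifier strings are assumptions for illustration, not the exact MNXref distribution format.

```python
# Hedged sketch of identifier reconciliation against a common namespace,
# in the spirit of MNXref. We assume a tab-separated cross-reference table
# with lines like "kegg:C00031<TAB>MNXM41" (assumed format).
def load_xref(path):
    xref = {}
    with open(path) as fh:
        for line in fh:
            if line.startswith("#"):
                continue
            source_id, mnx_id = line.rstrip("\n").split("\t")[:2]
            xref[source_id] = mnx_id
    return xref

def reconcile(compounds, xref):
    """Map e.g. ['kegg:C00031', 'chebi:17234'] to common identifiers."""
    return {c: xref.get(c, "UNMAPPED") for c in compounds}

# xref = load_xref("chem_xref.tsv")   # hypothetical file name
# print(reconcile(["kegg:C00031", "chebi:17234"], xref))
```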
Proteins, lipids & sugars |
|
H02 |
Hallal M*, Heller M, Bruggmann R, Allam R, Joncourt R, Oppliger Leibundgut E, Simillion C, Bonadies N
*University of Bern & SIB Swiss Institute of Bioinformatics, Switzerland
Myelodysplastic syndromes (MDS) are heterogeneous clonal haematopoietic disorders caused by the sequential accumulation of genetic lesions in haematopoietic stem cells (HSC); approximately 30% of MDS patients evolve towards secondary acute myeloid leukemia. The aim of this project was to build a bioinformatic pipeline that integrates phosphoproteomic data with genetic information in order to characterize the kinase activity of the oncogenic pathways involved. Here, we present data on the ongoing phosphoproteome characterization and kinase-activity enrichment analysis of five myeloid cell lines. K562, NB4, THP1, OCI-AML3 and MOLM-13 are myeloid cell lines with established driver oncogenes. They were cultured and analyzed in triplicate by reversed-phase nano liquid chromatography coupled to tandem mass spectrometry (nanoLC-MS2). Kinase enrichment analysis was performed using the R package SetRank with kinase-substrate datasets from five different databases. In the five cell lines, 15'698, 14'087, 13'969, 13'993 and 14'201 unique phosphopeptides, corresponding to 3'536, 3'363, 3'411, 3'410 and 3'403 unique phosphoproteins, were identified, respectively. Kinase enrichment led to the detection of 77 different kinases. Phenotypically related cell lines clustered together, and unique kinase activity patterns emerged for each cell line. In K562, a signal from the driver kinase ABL1 was detectable with two different databases, as were additional downstream kinases of ABL1. We could not enrich for the driver kinase FLT3 in MOLM-13, probably due to its lack of representation in the currently available substrate-kinase databases; however, downstream kinases of FLT3 were detected. We expect to further improve quantification and annotation by using heavy-labelled cell lines (SILAC) and kinase motif analysis, respectively.
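As background, the simplest form of kinase enrichment can be expressed as a hypergeometric over-representation test of known substrates among detected phosphoproteins. The sketch below illustrates that idea only; it is not the SetRank algorithm, and the substrate sets shown are invented.

```python
# Toy kinase enrichment: for each kinase, test whether its known substrates
# are over-represented among the detected phosphoproteins.
from scipy.stats import hypergeom

def kinase_enrichment(detected, kinase_substrates, background_size):
    """detected          -- set of phosphoproteins found in a cell line
    kinase_substrates -- dict kinase -> set of known substrate proteins
    background_size   -- number of proteins in the searchable background"""
    results = {}
    for kinase, subs in kinase_substrates.items():
        k = len(detected & subs)          # detected substrates of this kinase
        # P(X >= k) under the hypergeometric null
        p = hypergeom.sf(k - 1, background_size, len(subs), len(detected))
        results[kinase] = p
    return results

detected = {"CRKL", "STAT5A", "BCR", "GAB2"}                 # invented
subs = {"ABL1": {"CRKL", "BCR", "GAB2", "PXN"},              # invented
        "FLT3": {"STAT5A", "GAB2"}}
print(kinase_enrichment(detected, subs, background_size=3500))
```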
|
Proteins, lipids & sugars |
online |
H03 |
Lehmann F*
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
Hundreds to thousands of metabolites can be simultaneously monitored in biological matrices using untargeted LC-MS experiments. Unambiguous compound identification remains mandatory to draw relevant biological conclusions from the data. As several hits can match a molecular formula, a unique molecular identity can be difficult to obtain. Retention time constitutes essential information to complement HRMS and MSn spectra for the identification of positional and constitutional steroid isomers [1]. An automatic steroid annotation scheme based on retention time and HRMS has been developed, using the annotation levels established by the MSI/COSMOS initiatives [3]. The annotation is available through DynaStI (Dynamic Steroid Identification), an expert-curated database of endogenous steroids designed for LC-MS steroidomic studies. DynaStI collects experimental and in silico (Quantitative Structure Retention Relationship) linear solvent strength (LSS) parameters to dynamically predict the retention time of steroids under any gradient conditions. To date, the database contains 198 endogenous molecules, and each steroid entry includes key chemical information (IUPAC name, CAS number, a human-curated SMILES, the most abundant ion detected in HRMS) as well as links to major databases, i.e. HMDB, LipidMaps and SwissLipids. DynaStI was validated using a case study involving the H295R reference cell line incubated with forskolin, a compound known to stimulate steroidogenesis [2]. The database is publicly available at https://steroid.vital-it.ch. [1] Randazzo GM et al. Anal Chim Acta 2016 [2] Randazzo GM et al. J Chromatogr B Analyt Technol Biomed Life Sci. 2017 [3] Salek et al. GigaScience 2013
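For orientation, the LSS model relates a compound's retention factor k to the organic solvent fraction φ via log10 k = log10 kw − S·φ. The sketch below shows only the simple isocratic case (true gradient prediction integrates k over the programmed φ(t)); the parameter values are invented for illustration and are not DynaStI entries.

```python
# Minimal isocratic illustration of the linear solvent strength (LSS) model.
def isocratic_retention_time(log_kw, S, phi, t0):
    """log_kw -- extrapolated retention factor in pure water (log10)
    S      -- solvent strength parameter of the compound
    phi    -- organic modifier fraction (0..1)
    t0     -- column dead time in minutes"""
    k = 10 ** (log_kw - S * phi)   # retention factor at this composition
    return t0 * (1 + k)            # retention time in minutes

# Toy LSS parameters for a hypothetical steroid:
print(isocratic_retention_time(log_kw=3.2, S=4.5, phi=0.45, t0=1.0))
```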
|
Proteins, lipids & sugars |
|
H04 |
Galochkina T*, Nesterenko A, Zlenko D
*ICJ, University Claude Bernard Lyon 1 & Biological Faculty, Lomonosov Moscow State University, France
Resistance of Gram-negative bacteria to the action of external agents such as antibiotics is provided by a cell wall of complex structure, which comprises an additional outer membrane covering the peptidoglycan layer. The major components of the outer leaflet of the outer membrane are lipopolysaccharides (LPS). Each LPS molecule consists of a lipid part (lipid A) and a negatively charged core part, and may also carry a polysaccharide chain called the O-antigen, which covers the cell surface, forming an additional protective barrier. The high variability of O-antigen length and composition significantly hinders experimental investigation of their spatial arrangement, making molecular dynamics (MD) models an important tool for such studies.
We analyzed the conformational dynamics of the single O12-antigen of LPS from Salmonella typhimurium serotype B in two different force fields: OPLS-AA and GLYCAM. We demonstrated that the choice of force field has a crucial influence on the MD simulation results for long polysaccharide chains, even though the effect is not pronounced for shorter oligosaccharides.
The conformational dynamics of the O-antigens was also analyzed for two models of the outer membrane fragment: a pure LPS bilayer and a bilayer with incorporated OmpA barrels. In order to reproduce a close-to-native arrangement of the polysaccharide chains on the cell surface, we performed partial high-temperature MD simulations to extend the volume of conformational space sampled. The resulting model of the O-antigen layer allows us to draw important conclusions about the details of its 3D organization and the underlying mechanisms of O-antigen tangling. |
Proteins, lipids & sugars |
|
H05 |
Gastaldello A*, Alocci D, Mariethoz J, Lisacek F
*CUI - Battelle - bâtiment A & SIB Swiss Institute of Bioinformatics, Switzerland
The glycomics tab of the SIB Swiss Institute of Bioinformatics server (www.expasy.org/glycomics) was created in 2016 to centralise web-based glycoinformatics resources developed within an international network of glycoscientists. The philosophy of our toolbox is to be {glycoscientist AND protein scientist}-friendly, with the aim of popularising (a) the use of bioinformatics in glycobiology and (b) the relation between glycobiology and protein-oriented bioinformatics resources. Here, we introduce Glyconnect, a web application to explore associations between glycan structures, their binding characteristics (glyco-epitopes) and glycoproteins. Glyconnect is built on top of curated data on glycans and glycoproteins and, when available, their respective expression. These data can be searched by glycoprotein (name or ID) as well as by glycan composition, (sub)structure, or characteristic features (e.g., “fucosylated”). Search results highlight contextual information showing relations between proteins, glycosylation sites and glycans. They are shown in a series of interactive, cross-referencing charts and maps designed to ease and encourage knowledge exploration. Glyconnect is developed in a modular framework, which allows separate GUI components to be used as building blocks in the deployment of new tools. We chose the W3C Web Components standard, as implemented by Google's Polymer library, which makes GUI components easy to compose and reuse. |
Proteins, lipids & sugars |
|
H06 |
Hartler J, Triebl A, Zieg A, Trötzmüller M, Rechberger GN, Zeleznik OA, Zierler KA, Torta F, Cazenave-Gassiot A, Wenk MR, Fauland A, Wheelock CE, Armando AM, Quehenberger O, Zhang Q, Wakelam MJO, Haemmerle G, Spener F, Köfeler HC, Thallinger GG*
*Graz University of Technology, Austria
Appropriate methods for analyzing lipids in a high-throughput fashion are needed in fundamental and applied research to accelerate biomedical, clinical and nutritional research, including intervention studies. The method of choice for measuring quantitative changes in hundreds to thousands of lipids in complex mixtures is chromatography-coupled tandem mass spectrometry, e.g. LC-MS/MS. Simultaneous automated identification of lipids at the molecular species level (i.e., structural information such as the identification of constituent fatty acyl and ether chains and the determination of their sn-positions on the glycerol backbone) currently relies on spectral libraries. Yet variables such as the type of mass spectrometer, the collision energy applied, the type of adduct, and the charge state heavily influence the pattern of lipid MS/MS spectra.
To solve these problems, we have developed Lipid Data Analyzer 2 (LDA 2), enabling automated annotation of lipid species and of their molecular structures in high-throughput LC-MS/MS data. The software interprets spectra based on intuitive decision rule sets and flexibly accommodates changes in fragmentation behavior. Platform independence was proven in experiments with eight different mass spectrometric set-ups, comprising low- and high-resolution instruments at various collision energies and with several adduct ions. With LDA 2, the number of correctly identified lipid molecular species increased by 40%, and the reliability, i.e. the positive predictive value, increased from 58% to 92% compared to the present state of the art. Moreover, 111 novel lipid molecular species and 6 novel regioisomeric species were detected in samples from a well-investigated mouse model. These results demonstrate a substantial advance over current state-of-the-art spectral libraries.
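To make the decision-rule idea concrete, here is a hedged toy example of rule-based spectrum interpretation. The rule structure and the PE fragment value are invented for illustration and do not reflect LDA 2's actual rule language (184.0733 is the well-known phosphocholine head-group fragment).

```python
# Toy rule-based MS/MS interpretation: a lipid class is accepted when all
# of its mandatory fragments are found in the spectrum within tolerance.
def peak_present(spectrum_mz, target_mz, tol_da=0.01):
    return any(abs(mz - target_mz) <= tol_da for mz in spectrum_mz)

# Hypothetical rule set: class -> list of mandatory fragment m/z values
RULES = {
    "PC": [184.0733],   # phosphocholine head-group fragment
    "PE": [196.0380],   # illustrative value only
}

def classify(spectrum_mz):
    return [cls for cls, frags in RULES.items()
            if all(peak_present(spectrum_mz, mz) for mz in frags)]

print(classify([184.073, 503.34, 760.58]))   # -> ['PC']
```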
Support by the Austrian Science Fund (FWF Project Grant P26148) is gratefully acknowledged. |
Proteins, lipids & sugars |
|
I01 |
Crowell H*, Chevrier S, Zanotelli V, Engler S, Robinson M, Bodenmiller B
*ETHZ / UZH, Switzerland
Mass cytometry (CyTOF) overcomes the limit on the number of measurable fluorescence parameters imposed by instrumentation and spectral overlap by using metal-tagged antibodies, thereby enabling simultaneous analysis of over 50 proteins and their modifications at the single-cell level. While significantly less pronounced in CyTOF than in flow cytometry, spillover due to detection sensitivity, isotopic impurities, and oxide formation still exists and can impede data interpretability. We have developed a bead-based compensation workflow that enables correction for interference between channels, and we have demonstrated its utility in suspension and imaging mass cytometry. Our approach greatly simplifies the development of new antibody panels, increases the options for antibody-metal pairing, increases overall data quality, and facilitates the analysis of complex samples for which antigen abundances are unknown. The CATALYST R package (Cytometry dATa anALYSis Tools) developed in this study is available through Bioconductor and provides a complete pipeline for preprocessing of cytometry data, including file editing and concatenation, data normalization (Finck et al. 2013), an improved implementation of single-cell debarcoding (Zunder et al. 2015), and compensation. To date, these processing steps have only been available through different platforms, and compensation in mass cytometry had not been addressed. To make our pipeline accessible to novice users, a user-friendly browser-based graphical user interface is currently being finalized. |
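As a minimal illustration of what compensation does numerically (CATALYST itself is an R/Bioconductor package; the matrix values below are invented), observed counts can be modeled as true counts multiplied by a spillover matrix S, so compensation solves the linear system Y = X·S.

```python
# Toy numpy sketch of spillover compensation.
import numpy as np

# Hypothetical spillover matrix: row i gives the fraction of channel i's
# signal that spills into each channel (diagonal = 1).
S = np.array([[1.00, 0.03, 0.00],
              [0.01, 1.00, 0.02],
              [0.00, 0.02, 1.00]])

Y = np.array([[100.0, 8.0, 2.0],     # observed counts, one row per cell
              [10.0, 200.0, 6.0]])

X = np.linalg.solve(S.T, Y.T).T      # compensated counts (Y = X @ S)
print(np.round(X, 1))
```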
Reproducibility and robustness of large scale biological analyses |
online |
I02 |
Schlitt T*
*Novartis, Switzerland
Prior to publication of gene expression analyses using RNA sequencing, sequence information (FASTQ files) is usually shared via public repositories such as GEO or ArrayExpress, allowing others to re-examine the data using their own methods. For clinical studies, we are often not allowed to share sequence files publicly, since they contain privacy-relevant genetic information. Gene expression counts derived from the initial sequence data do not contain genetic information and can thus be shared; but how much does the lack of access to sequence files limit reproducibility and reuse of the data? Several benchmark papers have been published on various data processing methods, but it is not obvious what impact choosing different tools and data annotations has on downstream analysis. Can we combine gene count files resulting from different RNA-seq processing pipelines in one analysis, or is it necessary to rerun all FASTQ files through the same tool before reusing the data in a combined analysis? Here we report on our internal guidelines to ensure data reproducibility and on a comparison of our RNA sequencing pipelines to assess reproducibility across different versions. We suggest some best practices and list a number of open questions. |
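One simple way to probe whether count tables from two pipelines are interchangeable is per-sample rank correlation of gene counts; the sketch below illustrates that idea only and is not the internal guidelines described in the abstract.

```python
# Hedged sketch: per-sample Spearman correlation between count tables
# produced by two RNA-seq pipelines (toy data).
import pandas as pd

def pipeline_concordance(counts_a: pd.DataFrame, counts_b: pd.DataFrame):
    """Both tables: genes as rows, samples as columns, shared labels."""
    genes = counts_a.index.intersection(counts_b.index)
    a, b = counts_a.loc[genes], counts_b.loc[genes]
    return {s: a[s].corr(b[s], method="spearman") for s in a.columns}

a = pd.DataFrame({"s1": [10, 0, 250], "s2": [12, 1, 300]},
                 index=["g1", "g2", "g3"])
b = pd.DataFrame({"s1": [11, 0, 240], "s2": [13, 2, 280]},
                 index=["g1", "g2", "g3"])
print(pipeline_concordance(a, b))
```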
Reproducibility and robustness of large scale biological analyses |
|
I03 |
Hembach KM*, Souza V, Polymenidou M, Robinson MD
*UZH, Institute of Molecular Life Sciences & SIB Swiss Institute of Bioinformatics, Switzerland
Alternative splicing can create a large number of complex transcripts from a single gene. Splicing patterns change under different conditions, such as between healthy and diseased states or across stages of development. Microexons are alternatively spliced exons shorter than 27 nucleotides; they are mostly found in neural tissue, where they are important for neurogenesis, and, importantly, their misregulation has been linked to autism spectrum disorder.
The accurate identification of short (novel) exon events from RNA-seq data depends not only on the read mapping but also on the exon quantification. However, the most commonly used quantification tools apply the union count method, i.e., they simply count reads that overlap the feature of interest. In the case of overlapping exon annotations, this approach is prone to overestimating the true counts. We therefore developed a quantification method based on Salmon transcript abundance estimates that solves this problem.
Using real datasets as a guide, we simulated RNA-seq data to assess the influence of read mapping on exon quantification. In addition, we compare different quantification tools to find the best combination of tools and parameters for the discovery and quantification of microexons. In the mapping step, we compare STAR and TopHat2 with different parameters to find the optimal mapper for short exons. For exon quantification, we assess the performance of featureCounts, the Exon quantification pipeline quantification module (EQP-QM) and the tool developed in our lab against the true number of simulated reads for each exon. |
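The general idea of transcript-based exon quantification can be sketched as follows (an illustration of the concept, not the authors' code): an exon's count is derived by summing the estimated counts of all transcripts that contain it, which avoids the double counting inherent in union counting over overlapping annotations.

```python
# Toy exon quantification from transcript abundance estimates
# (e.g. the NumReads column of a Salmon quant.sf file; names assumed).
def exon_counts(transcript_counts, exon_to_transcripts):
    """transcript_counts   -- dict transcript -> estimated read count
    exon_to_transcripts -- dict exon -> transcripts containing it"""
    return {exon: sum(transcript_counts.get(tx, 0.0) for tx in txs)
            for exon, txs in exon_to_transcripts.items()}

tx_counts = {"tx1": 120.0, "tx2": 30.5}
exon_map = {"exon1": ["tx1", "tx2"], "microexon": ["tx2"]}
print(exon_counts(tx_counts, exon_map))
```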
Reproducibility and robustness of large scale biological analyses |
|
I04 |
Kanitz A*, Gypas F, Köberle J, Schurr J, Zavolan M
*Unibas, Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland
Analyzing the large volumes of biological data typically generated today in a reproducible and scalable manner has become a critical challenge. Faced with an overwhelming variety of protocols, research questions and tools to address them, many biologists and bioinformaticians still tackle their problems in an ad hoc manner. While choosing one-off solutions and “throwaway” custom scripts is tempting and understandable in the short term, this approach fails in the longer term, coming at the cost of poor reusability and reproducibility, and thus of efficiency and reliability.
We are currently developing Krini, a web-based data analysis server that allows massively parallel execution of curated workflows for the analysis of biological datasets. Using modern web technologies and user experience (UX) design, the service aims to reduce hands-on time by guiding users and allowing them to focus on project-specific aspects, while enforcing good practices with regard to reproducibility, documentation and the sharing of raw data, analyzed data and metadata, with very little or no time overhead.
At the core of the application lies the Common Workflow Language (CWL), an open-source specification for the implementation of component-based workflows. CWL-compliant workflows are containerized and executed on HPC clusters or cloud services. Fully described and documented workflows are then automatically rendered on a Django/REST/Angular-powered web service with a project-centric dashboard. As a pilot, a Krini instance at the Biozentrum will offer workflows for the analysis of the most common types of next-generation sequencing datasets, such as RNA-Seq, ChIP-Seq and CLIP-Seq. |
Reproducibility and robustness of large scale biological analyses |
|
I05 |
Roelli P*
*Unil, CIG & SIB Swiss Institute of Bioinformatics, Switzerland
With the falling cost of single-cell sequencing technologies, scRNA-seq is slowly replacing bulk RNA-seq for general-purpose experiments. To extract mRNA from thousands of cells at once, droplet-based protocols using cell barcodes and UMIs have proven to be among the best approaches for distinguishing unique cells and unique captured molecules. Although there is a wide variety of tools available for downstream analysis of scRNA-seq data, little effort has been put into data filtering, extraction and validation. Many challenges remain in distinguishing “real” cells from doublets and real UMIs from sequencing errors or protocol-generated biases. Although commercial protocol-specific pipelines exist (10x, inDrop), none of them is protocol-agnostic, and reproducibility is an issue. We propose here a first step towards a general-purpose tool for quality control, filtering, mapping and extraction of scRNA-seq data. We present dropSeqPipe, a flexible open-source pipeline based on Snakemake. It allows easy extraction of gene expression matrices from raw FASTQ files in a few command lines and produces quality-control checkpoints with automatic plot generation at each step of the process. Configuration is simple and based on only two files. It is compatible with any scRNA-seq protocol based on cell barcode and UMI structures similar to Drop-seq or 10x, and is actively developed to integrate Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq) single-cell data.
Github: https://github.com/Hoohm/dropSeqPipe/ |
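dropSeqPipe is built on Snakemake, whose rules are written in a Python-based DSL. The following is a hedged sketch of what such a rule looks like; the rule name, file paths, barcode layout and the `extract_barcodes` command are all invented placeholders for illustration (see the GitHub repository above for the real rules).

```python
# Hedged Snakemake-style sketch; every name below is a placeholder, not a
# real dropSeqPipe rule. The shell command stands in for the actual
# barcode-tagging tool invoked by the pipeline.
rule tag_cells_and_umis:
    input:
        r1="data/{sample}_R1.fastq.gz",   # cell barcode + UMI read
        r2="data/{sample}_R2.fastq.gz"    # cDNA read
    output:
        "results/{sample}_tagged.fastq.gz"
    shell:
        "extract_barcodes --r1 {input.r1} --r2 {input.r2} --out {output}"
```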
Reproducibility and robustness of large scale biological analyses |
online |
I06 |
Jeitziner R*
*EPFL SV IBI UPNAE & SIB Swiss Institute of Bioinformatics, Switzerland
I will present a new method for analysing biological data based on topological data analysis. I will introduce the mathematics underlying this new tool, illustrate its utility through examples, and describe theoretical aspects of its stability. The method provides a first approximation to the variability in a dataset, describing divergences from sample to sample. It comprises a visualization tool that distinguishes the various clusters, giving an easy-to-grasp presentation of the variation between samples as a colored graph. The method, which is based on the well-known Mapper algorithm, can be applied reliably to both small and large datasets, a clear advantage over standard statistical tools, which perform reliably only on datasets above a certain minimal size. All parameters are determined either in a data-driven manner or by choosing reliable, user-independent defaults. The tool should be of particular interest for differential analysis of small-sample datasets, but it is useful for a broad audience and works for different types of data. It is implemented as an R package and should be available for testing in September. It draws on a field of mathematics that has so far seen little applied use, and is hence innovative. |
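For readers unfamiliar with Mapper, here is a toy sketch of the underlying idea (illustrative only, not the presented R package): cover the range of a filter function with overlapping intervals, cluster the points falling into each interval, and connect clusters that share points.

```python
# Toy Mapper: filter cover -> per-interval clustering -> nerve graph.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

def mapper_graph(X, filt, n_intervals=5, overlap=0.3, t=1.0):
    lo, hi = filt.min(), filt.max()
    width = (hi - lo) / n_intervals
    nodes, edges = [], set()
    for i in range(n_intervals):
        a = lo + i * width - overlap * width
        b = lo + (i + 1) * width + overlap * width
        idx = np.where((filt >= a) & (filt <= b))[0]
        if len(idx) < 2:
            continue
        labels = fcluster(linkage(X[idx], method="single"), t, "distance")
        for lab in np.unique(labels):
            nodes.append(set(idx[labels == lab]))
    for i, a in enumerate(nodes):          # edge = clusters sharing points
        for j, b in enumerate(nodes[:i]):
            if a & b:
                edges.add((j, i))
    return nodes, edges

X = np.random.rand(100, 2)
nodes, edges = mapper_graph(X, filt=X[:, 0])
print(len(nodes), "nodes,", len(edges), "edges")
```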
Reproducibility and robustness of large scale biological analyses |
|
I07 |
Singer J*, Ruscheweyh H, Moore AL, Singer F, Beerenwinkel N
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
Next-generation sequencing is now a cost-efficient and widely used method in cancer genomics and is starting to enter daily routine in clinics. However, the analysis of the generated data is typically performed with lab-specific in-house solutions, and general standards for quality control, reproducibility and documentation are missing. Here we present NGS-pipe, a flexible, transparent, and easy-to-use platform for the analysis of whole-exome, whole-genome, and transcriptome sequencing data, designed to facilitate the harmonisation of genomic data analysis. NGS-pipe provides modules to analyse large-scale DNA and RNA sequencing experiments. Pre-configured workflows allow the detection of germline variants, somatic single-nucleotide variants, insertions and deletions (indels), and copy number events, as well as differential expression analyses. In this way, the final mutational information, together with quality-control measures and intermediate results, can be generated quickly, including by inexperienced users. NGS-pipe is a highly flexible and easily extendable framework that can be used on a single computer or in a cluster environment, where independent steps are executed in parallel. |
Reproducibility and robustness of large scale biological analyses |
|
J01 |
Deprez M*, Paquet A, Lebrigand K, Nottet N, Waldmann R, Barbry P
*IPMC CNRS, France
High-throughput single-cell gene expression profiling (scRNA-seq) is becoming an important component of the molecular biologist's toolkit, allowing the rapid generation of vast amounts of complex, high-dimensional data. Interpretation of the resulting datasets, however, requires well-validated frameworks. SCsim is a novel R package for simulating scRNA-seq data that matches the characteristics of real datasets. Its parameters allow the simulation of the following biological processes:
(i) Transcriptional bursting, modeled using a two-state model in which genes switch between ‘on’ and ‘off’ states with a certain probability and, when in the ‘active’ state, produce varying quantities of mRNA.
(ii) Technical and biological mRNA expression variability, due to sequencing depth, batch effects or cell-type-dependent mRNA levels, simulated based on a mixture of Gaussian distributions.
(iii) Within each cell population, differential gene expression can be simulated from multiple distribution patterns (unimodal and/or bimodal, with different means), with parameters that realistically mimic cellular progression along a dynamic process such as differentiation.
(iv) Dropout events, modeled by a normalized logistic function, based on the hypothesis that the probability of failing to detect a gene is inversely correlated with its expression level (see the toy sketch below).
A set of summary graphs describing the simulated dataset, together with a count table, is provided for each simulation. SCsim provides a comprehensive simulation of scRNA-seq data in which data complexity and descriptive QC metrics closely resemble those of real scRNA-seq datasets. SCsim should facilitate the development and testing of new scRNA-seq analysis methods.
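The following toy numpy sketch illustrates two of the ingredients above, a two-state bursting model and logistic dropout (illustrative only, not SCsim itself; all parameter values are invented):

```python
# Toy two-state bursting + logistic dropout.
import numpy as np

rng = np.random.default_rng(1)

def simulate_counts(n_cells, p_on=0.3, burst_mean=50.0):
    on = rng.random(n_cells) < p_on            # gene active in this cell?
    return np.where(on, rng.poisson(burst_mean, n_cells), 0)

def apply_dropout(counts, x0=2.0, k=1.5):
    # P(dropout) decreases logistically with log-expression
    logexp = np.log1p(counts)
    p_drop = 1.0 / (1.0 + np.exp(k * (logexp - x0)))
    keep = rng.random(counts.size) >= p_drop
    return counts * keep

counts = simulate_counts(1000)
observed = apply_dropout(counts)
print(counts.mean(), observed.mean())
```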
|
Stochasticity, heterogeneity, and single cells |
|
J02 |
Widmer L*, Stelling J
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
Spatiotemporal models of cells usually apply deterministic partial differential equations, the stochastic reaction-diffusion master equation (RDME), or stochastic point-particle reaction-diffusion frameworks. Our RDME simulation engine in C++ (RDMEcpp) enables the simulation of RDME models, which are a coarse-grained approximation of microscopic Smoluchowski dynamics. Until now, there has been no cross-platform solution for building and simulating RDME models on unstructured meshes. Inspired by the previously described URDME pipeline, we developed an equally high-performance, cross-platform solution in modern C++, with support for data analysis and visualization in MATLAB, Python and COMSOL Multiphysics. The simulation core allows for modular implementation of different types of solvers and diffusion matrix construction methods; it runs on Linux, Windows and OS X. Additionally, models that require mapping between real space and dual mesh elements during the simulation can be handled because, in contrast to other solvers, the simulation engine is coordinate-aware and can efficiently determine the subvolume in the dual mesh associated with a point in real space. This allows, for example, the simulation of arbitrarily oriented microtubules in arbitrary geometries. We demonstrate this capability by implementing an arbitrarily instantiable microtubule model in a yeast cell geometry. Finally, we are currently working on enabling hybrid deterministic/stochastic simulation, such that stochastic simulation, which requires the dominant computational effort, can be focused on the chemical species of interest. |
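To convey what an RDME simulation computes, here is a toy 1D diffusion-only example on a regular grid (a didactic illustration, not RDMEcpp, which handles unstructured meshes and reactions): molecules hop between voxels with rate d = D/h² per molecule per direction, and events are drawn with the Gillespie algorithm.

```python
# Toy 1D RDME: pure diffusion with reflective boundaries.
import numpy as np

rng = np.random.default_rng(0)

def rdme_diffusion(n_vox=20, n_mol=100, D=1.0, h=0.1, t_end=0.05):
    d = D / h**2                           # per-molecule hop rate
    x = np.zeros(n_vox, dtype=int)
    x[n_vox // 2] = n_mol                  # start everything in the middle
    t = 0.0
    while True:
        prop = d * x * 2.0                 # interior voxels: left or right
        prop[0], prop[-1] = d * x[0], d * x[-1]   # reflective boundaries
        a0 = prop.sum()
        t += rng.exponential(1.0 / a0)
        if t > t_end:
            return x
        i = rng.choice(n_vox, p=prop / a0)  # which voxel fires
        x[i] -= 1
        if i == 0:
            j = 1
        elif i == n_vox - 1:
            j = n_vox - 2
        else:
            j = i + rng.choice([-1, 1])     # hop direction
        x[j] += 1

print(rdme_diffusion())
```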
Stochasticity, heterogeneity, and single cells |
|
J03 |
Cepeda Humerez SA*, Granados AA, Pietsch JMJ, Tkačik G, Swain PS
*Institute of science and technology Austria, Austria
Studies of regulatory pathways at the single-cell level suggest that the information about environmental conditions is often encoded in the temporal dynamics of the relevant intracellular signals. Such encoding may involve not one, but several signals that respond to the environmental conditions jointly, and in turn co-regulate their downstream targets. A principled way of quantifying how much information the cells’ regulatory networks carry about the environment is provided by information theory, and information-theoretic approaches are rapidly gaining visibility in systems biology. Rigorously quantifying the information carried by multiple dynamic signals is, however, not an easy task: experimental constraints on the number of observed cells severely limit any attempt to estimate the mutual information directly. Here we propose a general method for information estimation from single or multiple dynamical signals when the space of environmental conditions probed is of relatively low dimension, as is usually the case in controlled experiments. Our estimator is based on the application of support vector machine classifiers and, in addition to providing information estimates, directly addresses several biologically-relevant questions: How well can different environments be discriminated by the cell? Which temporal features of the intracellular signals are informative about the environment? What is the contribution of individual signals towards encoding of the environment? We demonstrate the applicability of the method on the newly acquired single-cell dynamic data on nuclear localization responses of five regulatory proteins of Saccharomyces cerevisiae exposed to different stress conditions. |
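To give a feel for the approach, the following hedged sketch illustrates classifier-based information estimation in general (it is not the authors' estimator): a classifier predicts the environment from time-series features, and the mutual information computed from the resulting confusion matrix serves as a lower-bound-style estimate of the information the signals carry about the environment.

```python
# Classifier-based information estimate on toy "time series".
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

def mi_from_confusion(C):
    P = C / C.sum()                                 # joint (true, predicted)
    px, py = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log2(P[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
envs = rng.integers(0, 3, 300)                      # 3 environments
# toy signals: 20 time points whose mean depends on the environment
X = rng.normal(loc=envs[:, None], scale=1.0, size=(300, 20))

pred = cross_val_predict(SVC(), X, envs, cv=5)
print(mi_from_confusion(confusion_matrix(envs, pred)), "bits (estimate)")
```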
Stochasticity, heterogeneity, and single cells |
|
J04 |
Dirmeier S*, Beerenwinkel N
*ETHZ, D-BSSE & SIB Swiss Institute of Bioinformatics, Switzerland
Extrinsic and intrinsic factors are considered to determine heterogeneity in cell subpopulations. Cell-to-cell variability has, for instance, been observed in gene expression and in pathogen infection using RNA interference experiments. In these screens, small interfering RNAs potentially induce cell-to-cell variability, such as differences in cell size or in the density of a cell's local neighborhood. Cell-to-cell variability might also affect a pathogen's capacity for infection. Here, we analyze to what extent pathogen infection depends on subpopulation factors using the framework of probabilistic graphical models. We represent the joint probability distribution of the multivariate single-cell feature vectors from an RNAi screen as a hybrid Bayesian network. Hybrid networks are elegant graphical models that combine the distributions of continuous and discrete random variables. Conditional dependencies between single-cell features are encoded as edges defined over a graph whose nodes are random variables. We estimate the parameters of the local probability distributions and the conditional dependencies of the network's random variables using fluorescence data from genome-wide image-based RNAi experiments on a group of different bacterial and viral pathogens. The probabilistic model allows us to answer queries on posterior distributions of single variables, which can provide novel insights into the interplay of pathogen infection and single-cell variability. |
Stochasticity, heterogeneity, and single cells |
|
J05 |
Sankar M*, Faget J, Xenarios I, Meylan E, Guex N, Garcia M
*SIB Swiss Institute of Bioinformatics, Switzerland
Technological advances facilitate multiple concurrent measurements of single-cell features, at the cost of an ever-larger parameter space. This raises the difficulty of traditional manual analysis through supervised gating, leading to reproducibility issues. MEGACLUST (megaclust.vital-it.ch) proposes a solution for analyzing the flow and mass cytometry data produced by technological platforms. The core of MEGACLUST implements a high-performance unsupervised classification algorithm allowing fast and robust prediction of cell-type populations. To date, it is the only fully deterministic and reproducible algorithm able to cope with large amounts of data without subsampling or dimensionality reduction. In addition, MEGACLUST's unique features are (i) quality control (QC) of data acquisition through enhanced visualization, (ii) automatic and systematic evaluation of the population predictions, (iii) a personalized analytical service (i.e., prediction post-processing, statistics and visualization) provided to technology platforms and researchers, and (iv) an innovative visualization chart, the dreamcatcher plot, displaying inherently large, compound results in a single summarized view. All these features are embedded in a modular pipeline ensuring the reproducibility of parameter settings and clustering runs. Importantly, each of the above-mentioned modules can be run independently by any flow cytometry facility. The poster will cover a description of MEGACLUST and its features, a comparison with the most widely used public and commercial software, and typical use cases of MEGACLUST in various single-cell studies at the flow cytometry core facility of EPFL (FCCF). |
Stochasticity, heterogeneity, and single cells |
|
K01 |
Prytuliak R*, Volkmer M, Meier M, Habermann B
*Max Planck Institute of Biochemistry, Germany
Short linear motifs (SLiMs) in proteins are self-sufficient functional sequences that specify interaction sites for other molecules and thus mediate a multitude of functions. Computational as well as experimental biological research would benefit significantly if SLiMs in proteins could be correctly predicted de novo with high sensitivity. However, de novo SLiM prediction is a difficult computational task. When considering recall and precision, the performance of published methods indicates remaining challenges in SLiM discovery. We have developed HH-MOTiF, a web-based method for SLiM discovery in sets of mainly unrelated proteins. HH-MOTiF makes use of evolutionary information by creating hidden Markov models (HMMs) for each input sequence and its closely related orthologs. The HMMs are compared against each other to retrieve short stretches of homology that represent potential SLiMs. These are transformed into hierarchical structures, which we refer to as motif trees, for further processing and evaluation. Our approach allows us to identify degenerate SLiMs while still maintaining a reasonably high precision. When considering a balanced measure of recall and precision, HH-MOTiF performs better on test data than other SLiM discovery methods. HH-MOTiF is freely available as a web server at http://hh-motif.biochem.mpg.de. |
Technology track: Software and technology, Demos and tutorials |
online |
K02 |
Segura J*
*CNB-CSIC, Spain
With the advent of next-generation sequencing methods, the amount of proteomic and genomic information is growing faster than ever. Several projects have been undertaken to annotate the genomes of the most important organisms, including human. For example, the GENCODE project seeks to annotate all human genes, including protein-coding loci with alternatively spliced variants, non-coding loci and pseudogenes. Another example is the 1000 Genomes Project, a repository of human genetic variation, including SNPs and structural variants, and their haplotype contexts. These projects feed the most relevant biological databases, such as UniProt and Ensembl, extending the amount of annotation available for genes and proteins.
Genomic and proteomic annotations are a valuable contribution to the study of protein and gene functions. However, structural information is an essential key to a deeper understanding of the molecular properties that allow proteins and genes to perform specific tasks. Therefore, depicting genomic and proteomic information on structural data offers a very complete picture for understanding how proteins and genes behave in the different cellular processes. In this work we present the second version of a web platform, 3DBIONOTES, that aims to merge the different levels of molecular biology information, including genomics, proteomics and interactomics data, into a unique graphical environment. The current development offers a unified view of several of the most relevant databases (UniProt, PDB, EMDB and Ensembl), onto which other sources of biological annotation are also mapped, such as PhosphoSitePlus, the Immune Epitope Database, BioMuta and dSysMap.
|
Technology track: Software and technology, Demos and tutorials |
|
K03 |
Schmeing S*, Robinson M
*UZH, Institute of Molecular Life Sciences & SIB Swiss Institute of Bioinformatics, Switzerland
Currently, comparison studies, e.g., for error correction, assembly or variant calling, face the problem that synthetic datasets resemble the real output of high-throughput sequencers only in very limited ways, resulting in much better estimated performance of programs run on simulated data than on real data. Therefore, comparison studies are often based on real data. However, this approach has its own difficulties, since the ground truth is unknown and can only be estimated from available reference or variant files, which often contain a noticeable number of errors for non-model organisms. Even for model organisms with perfect references, variations between the sequenced individual and the reference are counted as errors and reduce the estimated performance of the tested tools. ReSequenceR fills the gap between evaluations on simulated and real data by reproducing key statistics of real data. When these characteristics are translated into new synthetic computational experiments (i.e., simulated data), performance can be estimated more accurately. Our simulator therefore gives developers better information about which features of real data their methods struggle with.
The poster will present several important features of real Illumina data, highlighting that existing simulators fail to capture many important attributes. We show that ReSequenceR captures a wider variety of features of real sequencing datasets to a higher degree. |
Technology track: Software and technology, Demos and tutorials |
online |
K04 |
Ricart Altimiras E*, Chevalier M, Pupin M, Leclere V, Flahaut C, Lisacek F
*CUI - Battelle - bâtiment A & SIB Swiss Institute of Bioinformatics, Switzerland
Nonribosomal peptides (NRPs) are natural compounds enzymatically synthesized by microorganisms such as bacteria and fungi. These peptides show a wide range of biological activities, such as antibiotic, antitumor or immunosuppressant properties, and are of great importance to the pharmaceutical and agricultural industries. Due to its high sensitivity and accuracy, mass spectrometry (MS) is crucial for the identification of these biomolecules. However, the unusual chemical structures of NRPs (cyclic, polycyclic, branched…) and the presence of highly modified non-proteinogenic amino acids complicate the interpretation of MS/MS spectra. Tools for the identification of some simple NRPs already exist, but they do not cover all NRP specificities and lack flexibility, efficient scoring, statistical validation and user-friendliness. Here we present a new bioinformatics tool to match predicted MS/MS spectra against their experimental counterparts, either exactly or in a modification-tolerant way. Our software is a web application developed in JavaScript, CSS and HTML on the client side and Java on the server side. It provides a highly interactive interface and performs a configurable and complete computational fragmentation of NRPs, including those with complex structures containing multiple cycles and several branches. Preliminary tests with experimental MS/MS data show positive results: the tool matches all high-intensity peaks. Furthermore, this is the first NRP fragmentation tool to include modification-tolerant searches, which will be very useful for the identification of new peptides.
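The core matching task can be sketched as tolerance-based peak matching between a predicted fragment list and an experimental spectrum (an illustration of the general task, not the authors' scoring scheme; all m/z values below are invented):

```python
# Toy ppm-tolerance peak matching.
def match_peaks(predicted_mz, observed, ppm=10.0):
    """predicted_mz -- theoretical fragment m/z values
    observed     -- list of (mz, intensity) experimental peaks"""
    matches = []
    for pred in predicted_mz:
        tol = pred * ppm * 1e-6
        hits = [(mz, inten) for mz, inten in observed
                if abs(mz - pred) <= tol]
        if hits:                       # keep the most intense hit per fragment
            matches.append((pred, max(hits, key=lambda h: h[1])))
    return matches

spectrum = [(300.159, 1200.0), (401.207, 800.0), (529.266, 150.0)]
print(match_peaks([300.16, 401.21, 650.33], spectrum))
```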
|
Technology track: Software and technology, Demos and tutorials |
|
K05 |
Marchand A*, Anastasakis D, Jossinet F, Stathopoulos C
*Institut de Biologie Moléculaire et Cellulaire, France
Well known for their essential role in protein synthesis, tRNAs are also the origin of a wide and varied population of fragments, generally less than 50 nucleotides long. These so-called tRFs (short for tRNA fragments) have long been considered products of random degradation, but recent studies indicate that they play significant regulatory roles in numerous biological processes, including the development of many cancers. Our project is to decipher the idiosyncrasies of the tRF population of non-small cell lung cancer (NSCLC), the most common form of lung cancer. More specifically, our aim is to identify tRFs with high potential as biomarkers for the early detection of the disease. For this purpose, we generated small RNA-Seq data from both NSCLC and normal tissues and initiated an in-depth study of the results. Since the tRF research field is still in its infancy, suitable RNA-Seq data analysis solutions are largely lacking. To fill this gap, we conceived tREFL, a computer program dedicated to both discovering new tRFs from RNA-Seq data and analysing their differential expression between experimental conditions (NSCLC and normal tissues in our dataset). tREFL features numerous functionalities that make it a high-value solution for researchers interested in exploring the tRF population of their small RNA-Seq data, notably a validation method using trusted public databases, a classification system based on definitions from the recent literature, and the possibility to seamlessly correlate tRF with tRNA data. |
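One common classification criterion in the tRF literature assigns a fragment a class based on where it maps on the mature tRNA; the sketch below illustrates that idea only (not tREFL's classification system), and the margin cutoff is an assumption.

```python
# Toy positional tRF classification from alignment coordinates (1-based).
def classify_trf(frag_start, frag_end, trna_len, margin=2):
    if frag_start <= margin and frag_end >= trna_len - margin:
        return "full-length tRNA"
    if frag_start <= margin:
        return "5'-tRF"
    if frag_end >= trna_len - margin:
        return "3'-tRF"
    return "internal tRF (i-tRF)"

print(classify_trf(1, 18, 76))    # -> 5'-tRF
print(classify_trf(58, 76, 76))   # -> 3'-tRF
```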
Technology track: Software and technology, Demos and tutorials |
|
K06 |
Dohmen E*, Kremer LPM, Bornberg-Bauer E, Kemena C
*University of Münster, Germany
Improvements in NGS technologies and the automatization of sequence assembly and genome annotation result in huge amounts of data that can differ enormously in quality. Since the quality of assemblies and annotations is crucial for all downstream analyses, special attention should be paid to these steps. We developed DOGMA, a program for fast and easy quality assessment of transcriptome and proteome data based on conserved protein domains. Protein domains are functional and structural building blocks of proteins that can be recombined to form various domain arrangements. Other tools, such as BUSCO, are usually based on whole genes. Because genes often evolve modularly in pieces that can be reconstructed via protein domains, it is reasonable to take a domain-based approach. Additionally, protein domains are highly conserved sequence motifs that are well annotated and combine high sensitivity and selectivity, making them a good basis for quality assessment. The only time-consuming step in quality assessment with DOGMA is the domain annotation. For this purpose we developed RADIANT (RApid DomaIn ANnoTation), a program to rapidly annotate Pfam domains in sequence data. Our tests show that annotation with RADIANT gives results similar to the original PfamScan and offers more information than the fast annotation tool UProC, while requiring less main memory. DOGMA introduces an efficient quality assessment that is qualitatively comparable to other methods and, when combined with RADIANT, greatly outperforms other programs in terms of speed. |
Technology track: Software and technology, Demos and tutorials |
|
K08 |
Sehnal D*
*CEITEC, Czech Republic
Recent advances in 3D structure determination techniques such as cryo-EM have facilitated the study of large macromolecular machines, leading to a rapid increase in the number, size, and complexity of the biomacromolecular structures available in the Protein Data Bank (PDB). As a result, the online archives face a major challenge in enabling access to this diverse and rich data in informative and intuitive ways for the more than 250 million users who view PDB data each year.
To address this challenge, we have developed the LiteMol suite, a comprehensive open-source solution for the fast delivery and interactive 3D visualization of large-scale structures, experimental data, and biological context annotations from resources such as Pfam or UniProt. The solution includes a next-generation, web-browser-based 3D molecular viewer (LiteMol Viewer), supported by the CoordinateServer and DensityServer services for near-instant delivery of model and experimental data using the newly developed BinaryCIF format. The format is compatible with the existing standards used by the PDB and the wider community while substantially reducing file size. Our innovative approach works in all modern web browsers and on mobile devices, and is up to orders of magnitude faster than its competitors. The LiteMol suite is integrated into the Protein Data Bank in Europe (PDBe), with thousands of daily users. In parallel, the LiteMol suite has also become part of SIB and CNRS services, and its integration into other key life science web applications is planned. |
Technology track: Software and technology, Demos and tutorials |
|
K09 |
Moerman T*, Aibar S, Bravo González-Blas C, Aerts J, Aerts S
*KU Leuven, Belgium
Inferring gene regulatory networks (GRNs) from high-throughput expression data remains an obstinate challenge in computational biology. A variety of methods to infer GRNs from gene expression data have been assessed in the context of the DREAM network challenges. We favoured the simplicity and effectiveness of GENIE3, a strong contender in the DREAM4 and DREAM5 challenges. However, we found that its practicality suffers as single-cell RNA-seq data sets grow in size. This problem ultimately motivated the conception of a new system: GRNBoost.
GENIE3 breaks the inference of a regulatory network into a number of tree-based ensemble regressions (using Random Forest or Extra-Trees) equal to the number of genes in the data set, as a function of a predefined set of transcription factors (TFs). From the regression models, the transcription factors with the highest importance, together with their target genes, are aggregated into a putative regulatory network.
GRNBoost replaces the regression algorithm with XGBoost, a high-performance, state-of-the-art learner based on gradient-boosted tree ensembles. Additionally, it leverages the so-called "embarrassingly parallel" nature of the multiple-regression approach by casting it into a MapReduce pipeline, allowing regressions to be distributed across multiple compute nodes.
In summary, we propose GRNBoost: a scalable GRN inference system using XGBoost for candidate regulator inference and Apache Spark for the distributed computation capability. GRNBoost achieves promising GRN inference quality and computational performance results. It is written in the Scala programming language and available at https://github.com/aertslab/GRNBoost/. |
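The per-target regression idea described above can be sketched on a single node as follows (a hedged illustration using scikit-learn's gradient boosting in place of XGBoost and Spark; GRNBoost itself is written in Scala, and all data here are synthetic):

```python
# One regression per target gene: TF expression -> target expression;
# feature importances become candidate regulatory links.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_cells, tf_names = 500, ["TF1", "TF2", "TF3"]
tf_expr = rng.normal(size=(n_cells, len(tf_names)))
target_expr = 2.0 * tf_expr[:, 0] + rng.normal(scale=0.5, size=n_cells)

model = GradientBoostingRegressor(n_estimators=100).fit(tf_expr, target_expr)
links = sorted(zip(tf_names, model.feature_importances_),
               key=lambda kv: -kv[1])
print(links)   # TF1 should dominate
```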
Technology track: Software and technology, Demos and tutorials |
|
K10 |
Spagnuolo J*, de Libero G
*University of Basel, Switzerland
Analysis of fluorescence-activated cell sorting (FACS) data has long been a painstakingly slow process requiring a high level of experience and prior knowledge to identify interesting cell populations. The introduction of sophisticated machine learning algorithms has enhanced analytical pipelines, enabling hypothesis-driven, bespoke data exploration, previously an impractically slow process. However, these tools still require a high level of experience in one or more programming languages, and fitting such independent analyses into a cohesive workflow remains challenging. FACSkit integrates several state-of-the-art algorithms and aims to enable "wet-bench" scientists to perform semi-supervised analysis of FACS datasets. The combination of machine learning (tSNE, SOM) and clustering algorithms powers data exploration and processing, allowing the identification of unique subpopulations in complex experimental designs. A standardised workflow simplifies the steps required to process data and to define the parameters of the dimensionality reduction and clustering algorithms. First, summary statistics are used to scale and transform the data prior to dimensionality reduction and clustering. Second, data exploration is enabled by visualisation of the low-dimensional space and detailed analysis of cluster quality. Finally, linear models can be fitted to populations of interest, allowing statistical differences to be determined. A key advantage of this toolkit over existing ones is that it has been purpose-designed for flexibility and transparency, allowing fine-tuning of algorithms without over-complicating the process. Additionally, it provides enhanced visualisation of multidimensional data to effectively communicate and summarise findings. Examples are provided, with particular emphasis on analyses of cells from clinical samples. |
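The generic scale-embed-cluster workflow described above can be sketched in a few lines (a hedged illustration with synthetic data, not FACSkit's code, which is aimed at R-based cytometry analysis):

```python
# Scale per-channel intensities, embed with t-SNE, then cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Toy "cytometry" matrix: 300 cells x 8 channels, two shifted populations
X = np.vstack([rng.normal(0, 1, (150, 8)), rng.normal(3, 1, (150, 8))])

Xs = StandardScaler().fit_transform(X)          # scale/transform
emb = TSNE(n_components=2, perplexity=30).fit_transform(Xs)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(emb)
print(np.bincount(labels))                      # population sizes
```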
Technology track: Software and technology, Demos and tutorials |
|
K11 |
Satagopam V, Becker R, Gerloff D*, Ostaszewski M, Gu W, Krause R, May P, Tréfois C, Schneider R
*Foundation for Applied Molecular Evolution, United States of America
The lack of suitable data management systems that collect and integrate the various types of data remains a major hurdle to using biomedical data effectively. Fragmentation and a lack of standards lead to poor interoperability between platforms and projects. The ELIXIR intergovernmental initiative currently unites more than 20 national nodes in its aim to bring together life science resources from across Europe. ELIXIR-LU aims to facilitate long-term access to translational medical data by integrating clinical features with molecular and cellular data, thus creating a large resource for biomedical research. While ELIXIR-LU operates independently of any particular disease a priori, a major focus of our activities centres on Parkinson's disease (PD) and other neurodegenerative diseases.
How multiple genetic variants combinatorially contribute to the PD phenotype remains unexplored. Large sets of genetic data with accurate clinical characterisation are needed to identify significant genotype-phenotype correlations. To develop a better understanding of the disease, it is key that we map datasets onto repositories of domain knowledge and explore clinical and genetic data in the context of disease-related pathways. Our approaches aim to lower the technical barrier to this step, for example by enabling easy visual overlay of genetic data on the PD map knowledge resource (http://pdmap.uni.lu), the largest manually curated resource of PD-related pathways. We are able to identify corresponding drug targets, chemicals and miRNAs known to affect the function of the visualised genes and their variants. These tools have been instrumental in formulating new hypotheses and in mapping research results from the literature.
|
Technology track: Software and technology, Demos and tutorials |
|
K12 |
Spies D*, Renz P, Beyer TA, Ciaudo C
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland
RNA sequencing (RNA-seq) has become a standard procedure for investigating transcriptional changes between conditions and is routinely used in research and the clinic. While standard differential expression analysis between two conditions has been extensively studied and improved over the last decades, RNA-seq time course (TC) differential expression analysis algorithms are still in their early stages. In this study, we compare, for the first time, existing TC RNA-seq tools on an extensive simulated data set and validate the best-performing tools on published data. Surprisingly, TC tools were outperformed by the classical pairwise comparison approach on short time series (fewer than eight time points) in terms of overall performance and robustness to noise, mostly due to a high number of false positives. Overlapping the candidate lists of different tools mitigated this shortcoming, as the majority of false-positive, but not true-positive, candidates were unique to each method. On longer time series, the pairwise approach was less efficient in overall performance than splineTC and maSigPro, which did not identify any false-positive candidates. |
Technology track: Software and technology, Demos and tutorials |
|
K13 |
Berger S, Omidi S, Pachkov M, Arnold P, Kelley N, Salatino S, Krämer A*, van Nimwegen E
*Biozentrum & SIB Swiss Institute of Bioinformatics, Switzerland
With the growing availability of high-throughput data, researchers experience problems extracting concrete, reliable and biologically meaningful results from large data volumes. To help the scientific community, we developed CRUNCH, a completely automated pipeline for the analysis of ChIP-seq data. The pipeline provides rigorous standardization of all steps of ChIP-seq analysis, from quality control through read mapping, fragment length estimation and peak identification to automated regulatory motif discovery and annotation at the peak locations. Peak detection itself is based on a Bayesian mixture model: enriched regions are detected by fitting a noise model to the read distribution, followed by a Gaussian mixture model that fits the read distribution inside each region to find individual binding peaks. Having identified the peak locations, CRUNCH uses a combination of de novo motif finding and binding-site prediction for already-known regulatory motifs to model the observed signal in terms of novel and known regulatory motifs. Each motif's contribution to the peak signal is given by a score, which quantifies the importance of that sequence for explaining the data.
To make this pipeline available to everyone dealing with sequencing data, CRUNCH is implemented as a ready-to-use web tool (crunch.unibas.ch) that only requires users to upload raw sequencing files. Currently, CRUNCH supports analyses of mouse, human and Drosophila datasets. The results are available in an integrated web interface as well as in downloadable flat form. |
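The Gaussian-mixture step can be illustrated with a toy fit (this sketch shows the general idea only, not CRUNCH's Bayesian model; the read positions are synthetic):

```python
# Toy peak localization inside an enriched region: fit a mixture to read
# center positions and read off the component means as candidate peaks.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# read centers from two nearby binding events plus uniform background
reads = np.concatenate([rng.normal(1000, 30, 400),
                        rng.normal(1450, 30, 250),
                        rng.uniform(800, 1700, 60)])[:, None]

gmm = GaussianMixture(n_components=3).fit(reads)  # 2 peaks + background-ish
print(np.sort(gmm.means_.ravel()))
```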
Technology track: Software and technology, Demos and tutorials |
|
K14 |
Peña-Reyes C*, Mungloo-Dilmohamud Z, Jaufeerally-Fakim Y
*HEIG-VD, Institute for Information and Communication Technology & SIB Swiss Institute of Bioinformatics, Switzerland
Microarray technologies produce very large amounts of data that need to be classified for interpretation. Large data volumes combined with small sample sizes make it challenging for researchers to extract useful information; therefore, a lot of effort goes into the design and testing of feature selection (FS) tools. This work critically analyses selected reviews in terms of how they classify FS methods. The set of all classification criteria used by a review thus constitutes its taxonomy. Both implicit and explicit taxonomies were considered. A taxonomy was derived for each paper and, based on these, an extended taxonomy for categorizing feature selection techniques was proposed. The proposed taxonomy is based on six top-level criteria identified in the review papers: selection management, type of evaluation, training approach, class dimensionality, model linearity and additional knowledge required. Selection management refers to how the different methods interact with the full set of features in order to perform selection. Type of evaluation refers to the way in which features are marked for retention or elimination. Training approach refers to whether selection is driven by the presence or absence of class information. Class dimensionality refers to the number of classes being dealt with. Model linearity refers to whether the model used is linear or not. Knowledge use refers to whether or not prior information is available to guide the selection. The proposed taxonomy was then used to classify the main FS methods presented in the selected reviews. |
Technology track: Software and technology, Demos and tutorials |
|
K15 |
Simao Neto FA*
*CMU & SIB Swiss Institute of Bioinformatics, Switzerland
The advent of high-throughput genomics has brought about a veritable paradigm shift in biological research. Due to rising demands and increasing volumes of data, technologies and downstream analysis tools have been rapidly evolving. This makes thorough quality control of the ‘products’ of sequencing, e.g. genomes, genes, or transcriptomes, essential. Addressing this need, the Benchmarking Universal Single-Copy Orthologues (BUSCO) assessment tool provides intuitive quantitative measures of genomic data completeness in terms of expected gene content (Simão et al., 2015, PMID:26059717, http://busco.ezlab.org). BUSCO assessments identify complete, duplicated, fragmented, and missing genes and enable like-for-like quality comparisons of different datasets. These features mean that BUSCO has rapidly become established as an essential genomics tool, using up-to-date data from many species and offering broader utility than the popular but now discontinued Core Eukaryotic Genes Mapping Approach (Parra et al., 2007, PMID:17332020). Selected from major species clades (among prokaryotes and eukaryotes) of the OrthoDB catalog of orthologs, 44 clade-specific datasets can be used with BUSCO v3, permitting analyses based on large numbers of highly specific single-copy genes across all domains of life. Here we present a summary of the latest BUSCO features along with a variety of scenarios highlighting the wide range of uses of BUSCO assessments, designed primarily for (i) performing genomics data quality control, but also applicable for (ii) building robust training sets for gene predictors, (iii) selecting high-quality reference strains or species for comparative analyses, (iv) identifying reliable markers for large-scale phylogenomics studies, and (v) separating haplotypes in highly heterozygous assemblies.
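As a small practical illustration, the sketch below parses the familiar one-line BUSCO summary notation (C: complete [S: single-copy, D: duplicated], F: fragmented, M: missing, n: total BUSCOs searched) into numbers for downstream comparisons. The regex assumes the "C:..%[S:..%,D:..%],F:..%,M:..%,n:.." string as commonly reported and may need adapting to other BUSCO versions' output formats.

import re

def parse_busco_summary(line):
    """Extract the C/S/D/F/M percentages and the BUSCO count n from a
    one-line BUSCO-style summary string."""
    pattern = (r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
               r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)")
    m = re.search(pattern, line)
    if m is None:
        raise ValueError("not a recognised BUSCO summary line")
    result = {key: float(val) for key, val in m.groupdict().items()}
    result["n"] = int(result["n"])
    return result

print(parse_busco_summary("C:95.2%[S:90.1%,D:5.1%],F:2.0%,M:2.8%,n:978"))
|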
Technology track: Software and technology, Demos and tutorials |
|
K16 |
Upton A*, EnhanceR Project Consortium
*Swiss Federal Institute of Technology Zürich (ETH Zurich), Switzerland
Research presents unique IT challenges that make supporting it demanding. The need to support researchers in meeting these challenges has led to the creation of specialist units at Swiss research organisations that provide Research IT support, allowing researchers to concentrate on their core tasks and accelerating time to results. Through the EnhanceR project, these units are federated into a national cooperative eScience and Research IT support community that provides support, transfers knowledge, and builds skills and capacities in the Swiss academic research sector.
EnhanceR is the follow-up to the successful eSCT project, which assisted a large number of researchers across Switzerland through the delivery of over 50 support projects. These ranged from the development of a scalable neuroscience imaging analysis pipeline, to the benchmarking of sequence alignment tools, to the creation of a cross-institute resource-sharing platform that now has over 4,000 users.
In this poster, we outline the services that EnhanceR offers to researchers and detail two examples of previous support projects. In the first example, we describe the development of a scalable imaging analysis pipeline that directly aided researchers at the Laboratory of Neural Circuit Dynamics at the Brain Institute of the University of Zurich. In the second, we present the development of a technology stack for running the same workflow on different Swiss HPC clusters, aimed at projects with sensitive and confidential data where the code has to move to the data. Finally, we present an overview of the support process, providing details on how interested researchers can access support. |
Technology track: Software and technology, Demos and tutorials |
online |
K17 |
Dylus D*, Solovyev A, Trafford J, Duvaud S, Artimo P, Ioannidis V, Stockinger H, Dessimoz C
*Unil, Department of Ecology and Evolution & SIB Swiss Institute of Bioinformatics, Switzerland
Phylogenetic trees make it possible to study the evolutionary history of species and can be inferred using many different approaches. Unfortunately, in many cases the resulting topology is affected by the choice of method and/or data. To spot topological differences, we previously developed Phylo.io, an interactive tree visualization and visual comparison tool. Recently, in collaboration with the SIB technology group, we made several key improvements. Using web workers, we sped up the computation 10-fold, and users can now work with trees instantly while the tree comparison is computed in the background. Moreover, we improved the UI to provide a more tree-centred visualization and a cleaner representation. We added several key functionalities, such as the computation of ladderized trees, a mirrored representation in compare mode that makes differences easier to spot, and the pruning of branches. Finally, we added the ability to compute global tree distance metrics, i.e. the Robinson-Foulds distance, the Euclidean distance, and an approximation of the SPR distance, making Phylo.io a comprehensive tree comparison tool; a sketch of the first of these metrics follows below. Phylo.io is freely accessible at http://phylo.io and can easily be embedded in other web servers (e.g. http://swisstree.vital-it.ch/). The code for the associated JavaScript library is available at https://github.com/DessimozLab/phylo-io under an MIT open source license.
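To make the comparison metric concrete, here is a from-scratch sketch of the Robinson-Foulds idea on small rooted trees encoded as nested tuples of leaf names: count the clades found in one tree but not the other. Phylo.io's own implementation (and the standard unrooted, bipartition-based definition) may differ in detail; this only illustrates the principle.

def clades(tree, acc=None):
    """Collect the non-trivial clades of a nested-tuple tree into acc,
    returning the leaf set of the current subtree."""
    if acc is None:
        acc = set()
    if isinstance(tree, str):  # a leaf contributes no non-trivial clade
        return frozenset([tree])
    leaves = frozenset().union(*(clades(child, acc) for child in tree))
    acc.add(leaves)
    return leaves

def robinson_foulds(t1, t2):
    """Size of the symmetric difference between the two clade sets."""
    c1, c2 = set(), set()
    clades(t1, c1), clades(t2, c2)
    return len(c1 ^ c2)

t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(t1, t2))  # 2: the clades {A,B} and {A,C} differ
|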
Technology track: Software and technology, Demos and tutorials |
|
K18 |
Tackmann J*, Matias Rodrigues JF, von Mering C
*University of Zurich & SIB Swiss Institute of Bioinformatics, Switzerland
The recent explosion of metagenomic sequencing data makes tools for rapid computational analysis essential. While such software is becoming increasingly available for the mapping and clustering of Operational Taxonomic Units (OTUs), the prediction of microbial interactions based on co-occurrence is still lagging behind. Simple correlation-based tools that scale to large numbers of OTUs and samples exist, but they do not distinguish between direct and indirect interactions, resulting in high numbers of false positives. Approaches with better resolution, on the other hand, are so far highly limited in the size of the data sets they can process. Furthermore, environmental factors, although they are important modulators of microbial interactions, are usually not considered by available software. Finally, we observe that traditional approaches produce large numbers of false positives owing to double-zero inflation caused by environmental niches in composite datasets. We adopt a machine-learning framework based on Probabilistic Graphical Models and inspired by causal theory to infer highly resolved microbial interactions from large data sets, with seamless integration of environmental variables and optional adjustment for sub-niches. The method is highly optimised for speed, scaling to tens of thousands of OTUs and samples and surpassing state-of-the-art methods in runtime by two to three orders of magnitude. In benchmarks on eight synthetic data sets, it provides accuracy comparable to or surpassing that of current methods. We apply this approach to a massive meta-dataset of publicly available human gut samples (>46,000 samples), resulting in the largest and most diverse survey of microbial interactions in the human gut to date.
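The abstract does not spell out the model, but a toy example shows why conditional rather than marginal dependence helps separate direct from indirect interactions: two OTUs that both respond to an environmental factor are marginally correlated, yet their partial correlation given that factor vanishes. The sketch below, using simulated data and a simple precision-matrix estimate, illustrates this principle only and is not the authors' method.

import numpy as np

rng = np.random.default_rng(1)
n = 2000
env = rng.normal(size=n)         # environmental factor (e.g. pH)
otu1 = env + rng.normal(size=n)  # both OTUs respond to env only:
otu2 = env + rng.normal(size=n)  # there is no direct interaction
X = np.column_stack([otu1, otu2, env])

def partial_corr(X):
    """Partial correlations from the precision (inverse covariance)
    matrix; entry [i, j] conditions on all remaining columns."""
    P = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.diag(P))
    return -P / np.outer(d, d)  # off-diagonal entries are partial corrs

marginal = np.corrcoef(X, rowvar=False)[0, 1]
conditional = partial_corr(X)[0, 1]
print(f"marginal corr(otu1, otu2)      = {marginal:.2f}")     # ~0.5
print(f"partial corr(otu1, otu2 | env) = {conditional:.2f}")  # ~0.0
|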
Technology track: Software and technology, Demos and tutorials |
|