Search | VHL Regional Portal

1.

Towards reproducible computational drug discovery.

Schaduangrat, Nalini; Lampa, Samuel; Simeon, Saw; Gleeson, Matthew Paul; Spjuth, Ola; Nantasenamat, Chanin.

J Cheminform ; 12(1): 9, 2020 Jan 28.

Article in English | MEDLINE | ID: mdl-33430992

ABSTRACT

The reproducibility of experiments has been a long-standing impediment to further scientific progress. Computational methods have been instrumental in drug discovery efforts owing to their multifaceted utilization for data collection, pre-processing, analysis and inference. This article provides in-depth coverage of the reproducibility of computational drug discovery. This review explores the following topics: (1) the current state-of-the-art on reproducible research, (2) research documentation (e.g. electronic laboratory notebook, Jupyter notebook, etc.), (3) science of reproducible research (i.e. comparison and contrast with related concepts such as replicability, reusability and reliability), (4) model development in computational drug discovery, (5) computational issues on model development and deployment, (6) use case scenarios for streamlining the computational drug discovery protocol. In computational disciplines, it has become common practice to share data and programming code used for numerical calculations, not only to facilitate reproducibility but also to foster collaboration (i.e. to drive the project further by introducing new ideas, growing the data, augmenting the code, etc.). It is therefore inevitable that the field of computational drug design would adopt an open approach towards the collection, curation and sharing of data/code.

2.

Software engineering for scientific big data analysis.

Grüning, Björn A; Lampa, Samuel; Vaudel, Marc; Blankenberg, Daniel.

Gigascience ; 8(5), 2019 May 01.

Article in English | MEDLINE | ID: mdl-31121028

ABSTRACT

The increasing complexity of data and analysis methods has created an environment where scientists, who may not have formal training, are finding themselves playing the impromptu role of software engineer. While several resources are available for introducing scientists to the basics of programming, researchers have been left with little guidance on approaches needed to advance to the next level for the development of robust, large-scale data analysis tools that are amenable to integration into workflow management systems, tools, and frameworks. The integration into such workflow systems necessitates additional requirements on computational tools, such as adherence to standard conventions for robustness, data input, output, logging, and flow control. Here we provide a set of 10 guidelines to steer the creation of command-line computational tools that are usable, reliable, extensible, and in line with standards of modern coding practices.

Subject(s)

Big Data , Practice Guidelines as Topic , Software/standards , Biomedical Research/methods

3.

SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines.

Lampa, Samuel; Dahlö, Martin; Alvarsson, Jonathan; Spjuth, Ola.

Gigascience ; 8(5), 2019 May 01.

Article in English | MEDLINE | ID: mdl-31029061

ABSTRACT

BACKGROUND: The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as with nested loops, dynamic scheduling, and parametrization, which is common in, e.g., machine learning. FINDINGS: SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics, and a transcriptomics pipeline. CONCLUSIONS: SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.

Subject(s)

Computational Biology , Genomics , Software , Gene Library , Machine Learning , Programming Languages , Workflow

4.

PhenoMeNal: processing and analysis of metabolomics data in the cloud.

Peters, Kristian; Bradbury, James; Bergmann, Sven; Capuccini, Marco; Cascante, Marta; de Atauri, Pedro; Ebbels, Timothy M D; Foguet, Carles; Glen, Robert; Gonzalez-Beltran, Alejandra; Günther, Ulrich L; Handakas, Evangelos; Hankemeier, Thomas; Haug, Kenneth; Herman, Stephanie; Holub, Petr; Izzo, Massimiliano; Jacob, Daniel; Johnson, David; Jourdan, Fabien; Kale, Namrata; Karaman, Ibrahim; Khalili, Bita; Emami Khonsari, Payam; Kultima, Kim; Lampa, Samuel; Larsson, Anders; Ludwig, Christian; Moreno, Pablo; Neumann, Steffen; Novella, Jon Ander; O'Donovan, Claire; Pearce, Jake T M; Peluso, Alina; Piras, Marco Enrico; Pireddu, Luca; Reed, Michelle A C; Rocca-Serra, Philippe; Roger, Pierrick; Rosato, Antonio; Rueedi, Rico; Ruttkies, Christoph; Sadawi, Noureddin; Salek, Reza M; Sansone, Susanna-Assunta; Selivanov, Vitaly; Spjuth, Ola; Schober, Daniel; Thévenot, Etienne A; Tomasoni, Mattia.

Gigascience ; 8(2), 2019 Feb 01.

Article in English | MEDLINE | ID: mdl-30535405

ABSTRACT

BACKGROUND: Metabolomics is the comprehensive study of a multitude of small molecules to gain insight into an organism's metabolism. The research field is dynamic and expanding with applications across biomedical, biotechnological, and many other applied biological domains. Its computationally intensive nature has driven requirements for open data formats, data repositories, and data analysis tools. However, the rapid progress has resulted in a mosaic of independent, and sometimes incompatible, analysis methods that are difficult to connect into a useful and complete data analysis solution. FINDINGS: PhenoMeNal (Phenome and Metabolome aNalysis) is an advanced and complete solution to set up Infrastructure-as-a-Service (IaaS) that brings workflow-oriented, interoperable metabolomics data analysis platforms into the cloud. PhenoMeNal seamlessly integrates a wide array of existing open-source tools that are tested and packaged as Docker containers through the project's continuous integration process and deployed based on a Kubernetes orchestration framework. It also provides a number of standardized, automated, and published analysis workflows in the user interfaces Galaxy, Jupyter, Luigi, and Pachyderm. CONCLUSIONS: PhenoMeNal constitutes a keystone solution in cloud e-infrastructures available for metabolomics. PhenoMeNal is a unique and complete solution for setting up cloud e-infrastructures through easy-to-use web interfaces that can be scaled to any custom public and private cloud environment. By harmonizing and automating software installation and configuration and through ready-to-use scientific workflow user interfaces, PhenoMeNal has succeeded in providing scientists with workflow-driven, reproducible, and shareable metabolomics data analysis platforms that are interfaced through standard data formats, representative datasets, versioned, and have been tested for reproducibility and interoperability. The elastic implementation of PhenoMeNal further allows easy adaptation of the infrastructure to other application areas and 'omics research domains.

Subject(s)

Metabolomics/methods , Software , Cloud Computing , Humans , Workflow

5.

Predicting Off-Target Binding Profiles With Confidence Using Conformal Prediction.

Lampa, Samuel; Alvarsson, Jonathan; Arvidsson Mc Shane, Staffan; Berg, Arvid; Ahlberg, Ernst; Spjuth, Ola.

Front Pharmacol ; 9: 1256, 2018.

Article in English | MEDLINE | ID: mdl-30459617

ABSTRACT

Ligand-based models can be used in drug discovery to obtain an early indication of potential off-target interactions that could be linked to adverse effects. Another application is to combine such models into a panel, making it possible to compare and search for compounds with similar profiles. Most contemporary methods and implementations, however, lack valid measures of confidence in their predictions and only provide point predictions. We here describe a methodology that uses Conformal Prediction for predicting off-target interactions, with models trained on data from 31 targets in the ExCAPE-DB dataset selected for their utility in broad early hazard assessment. Chemicals were represented by the signature molecular descriptor and support vector machines were used as the underlying machine learning method. By using conformal prediction, the results from predictions come in the form of confidence p-values for each class. The full pre-processing and model training process is openly available as scientific workflows on GitHub, rendering it fully reproducible. We illustrate the usefulness of the developed methodology on a set of compounds extracted from DrugBank. The resulting models are published online and are available via a graphical web interface and an OpenAPI interface for programmatic access.
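
The per-class p-values mentioned in this abstract can be illustrated with a minimal sketch of class-conditional inductive conformal prediction. This is not the authors' published workflow: the function names, the nonconformity scores, and the two class labels are all hypothetical and chosen only to show the arithmetic.

```python
def conformal_p_values(calibration_scores, test_scores):
    """Return one p-value per class for a single test compound.

    calibration_scores: dict mapping class label -> nonconformity scores of
        held-out calibration examples that truly belong to that class.
    test_scores: dict mapping class label -> nonconformity score of the test
        compound under the hypothesis that it belongs to that class.
    """
    p_values = {}
    for label, test_score in test_scores.items():
        scores = calibration_scores[label]
        # Count calibration examples at least as nonconforming as the test one.
        at_least_as_strange = sum(1 for s in scores if s >= test_score)
        # The +1 terms count the test example itself.
        p_values[label] = (at_least_as_strange + 1) / (len(scores) + 1)
    return p_values


def prediction_set(p_values, confidence):
    """Keep every class whose p-value exceeds the significance level."""
    significance = 1.0 - confidence
    return {label for label, p in p_values.items() if p > significance}


# Hypothetical nonconformity scores (for example, negated SVM decision values).
calibration = {
    "active": [0.1, 0.2, 0.4, 0.7],
    "inactive": [0.05, 0.3, 0.5, 0.6, 0.9],
}
test = {"active": 0.3, "inactive": 0.95}

p = conformal_p_values(calibration, test)
labels_at_80 = prediction_set(p, 0.80)
labels_at_90 = prediction_set(p, 0.90)
```

With these made-up scores the prediction set at 80% confidence contains only one label, while at 90% confidence it contains both: asking for higher confidence yields a less specific prediction.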

6.

A confidence predictor for logD using conformal regression and a support-vector machine.

Lapins, Maris; Arvidsson, Staffan; Lampa, Samuel; Berg, Arvid; Schaal, Wesley; Alvarsson, Jonathan; Spjuth, Ola.

J Cheminform ; 10(1): 17, 2018 Apr 03.

Article in English | MEDLINE | ID: mdl-29616425

ABSTRACT

Lipophilicity is a major determinant of ADMET properties and overall suitability of drug candidates. We have developed large-scale models to predict the water-octanol distribution coefficient (logD) for chemical compounds, aiding drug discovery projects. Using ACD/logD data for 1.6 million compounds from the ChEMBL database, models are created and evaluated by a support-vector machine with a linear kernel using conformal prediction methodology, outputting prediction intervals at a specified confidence level. The resulting model shows a predictive ability of [Formula: see text], with the best performing nonconformity measure having a median prediction interval of [Formula: see text] log units at 80% confidence and [Formula: see text] log units at 90% confidence. The model is available as an online service via an OpenAPI interface and a web page with a molecular editor, and we also publish predictive values at 90% confidence level for 91 M PubChem structures in RDF format for download and as a URI resolver service.
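
How a prediction interval follows from a chosen confidence level can be shown with a minimal sketch of split (inductive) conformal regression. It assumes plain absolute residuals as the nonconformity measure, which may differ from the measures compared in the paper; the function name and all numbers are hypothetical.

```python
import math


def conformal_interval(calibration_residuals, point_prediction, confidence):
    """Return (lower, upper) bounds around a point prediction.

    calibration_residuals: absolute errors |y - y_hat| on held-out data.
    """
    n = len(calibration_residuals)
    # Rank of the calibration residual that sets the interval half-width.
    # round() guards against floating-point noise before taking the ceiling.
    rank = math.ceil(round((n + 1) * confidence, 9))
    if rank > n:
        # Too few calibration examples to support this confidence level.
        return (-math.inf, math.inf)
    half_width = sorted(calibration_residuals)[rank - 1]
    return (point_prediction - half_width, point_prediction + half_width)


# Hypothetical absolute residuals in log units and a hypothetical prediction.
residuals = [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.8, 1.0]
low80, high80 = conformal_interval(residuals, 2.0, 0.80)
low90, high90 = conformal_interval(residuals, 2.0, 0.90)
```

In this toy case the 90% interval is wider than the 80% interval, which mirrors why the abstract reports separate interval widths for the two confidence levels.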

7.

RDFIO: extending Semantic MediaWiki for interoperable biomedical data management.

Lampa, Samuel; Willighagen, Egon; Kohonen, Pekka; King, Ali; Vrandecic, Denny; Grafström, Roland; Spjuth, Ola.

J Biomed Semantics ; 8(1): 35, 2017 Sep 04.

Article in English | MEDLINE | ID: mdl-28870259

ABSTRACT

BACKGROUND: Biological sciences are characterised not only by an increasing amount but also the extreme complexity of their data. This stresses the need for efficient ways of integrating these data in a coherent description of biological systems. In many cases, biological data need organization before integration. This is often a collaborative effort, and it is thus important that tools for data integration support a collaborative way of working. Wiki systems with support for structured semantic data authoring, such as Semantic MediaWiki, provide a powerful solution for collaborative editing of data combined with machine-readability, so that data can be handled in an automated fashion in any downstream analyses. Semantic MediaWiki lacks a built-in data import function though, which hinders efficient round-tripping of data between interoperable Semantic Web formats such as RDF and the internal wiki format. RESULTS: To solve this deficiency, the RDFIO suite of tools is presented, which supports importing of RDF data into Semantic MediaWiki, with metadata needed to export it again in the same RDF format, or ontology. Additionally, the new functionality enables mash-ups of automated data imports combined with manually created data presentations. The application of the suite of tools is demonstrated by importing drug discovery related data about rare diseases from Orphanet and acid dissociation constants from Wikidata. The RDFIO suite of tools is freely available for download via pharmb.io/project/rdfio. CONCLUSIONS: Through a set of biomedical demonstrators, it is demonstrated how the new functionality enables a number of usage scenarios where the interoperability of SMW and the wider Semantic Web is leveraged for biomedical data sets, to create an easy to use and flexible platform for exploring and working with biomedical data.

Subject(s)

Information Storage and Retrieval/methods , Software , Humans , Internet , Intersectoral Collaboration , Metabolomics , Rare Diseases/genetics , User-Computer Interface

8.

SweGen: a whole-genome data resource of genetic variability in a cross-section of the Swedish population.

Ameur, Adam; Dahlberg, Johan; Olason, Pall; Vezzi, Francesco; Karlsson, Robert; Martin, Marcel; Viklund, Johan; Kähäri, Andreas Kusalananda; Lundin, Pär; Che, Huiwen; Thutkawkorapin, Jessada; Eisfeldt, Jesper; Lampa, Samuel; Dahlberg, Mats; Hagberg, Jonas; Jareborg, Niclas; Liljedahl, Ulrika; Jonasson, Inger; Johansson, Åsa; Feuk, Lars; Lundeberg, Joakim; Syvänen, Ann-Christine; Lundin, Sverker; Nilsson, Daniel; Nystedt, Björn; Magnusson, Patrik Ke; Gyllensten, Ulf.

Eur J Hum Genet ; 25(11): 1253-1260, 2017 Nov.

Article in English | MEDLINE | ID: mdl-28832569

ABSTRACT

Here we describe the SweGen data set, a comprehensive map of genetic variation in the Swedish population. These data represent a basic resource for clinical genetics laboratories as well as for sequencing-based association studies by providing information on genetic variant frequencies in a cohort that is well matched to national patient cohorts. To select samples for this study, we first examined the genetic structure of the Swedish population using high-density SNP-array data from a nation-wide cohort of over 10 000 Swedish-born individuals included in the Swedish Twin Registry. A total of 1000 individuals, reflecting a cross-section of the population and capturing the main genetic structure, were selected for whole-genome sequencing. Analysis pipelines were developed for automated alignment, variant calling and quality control of the sequencing data. This resulted in a genome-wide collection of aggregated variant frequencies in the Swedish population that we have made available to the scientific community through the website https://swefreq.nbis.se. A total of 29.2 million single-nucleotide variants and 3.8 million indels were detected in the 1000 samples, with 9.9 million of these variants not present in current databases. Each sample contributed with an average of 7199 individual-specific variants. In addition, an average of 8645 larger structural variants (SVs) were detected per individual, and we demonstrate that the population frequencies of these SVs can be used for efficient filtering analyses. Finally, our results show that the genetic diversity within Sweden is substantial compared with the diversity among continental European populations, underscoring the relevance of establishing a local reference data set.

Subject(s)

Genome, Human , Polymorphism, Single Nucleotide , Registries , Datasets as Topic , Genome-Wide Association Study , Humans , Sweden , Twins/genetics

9.

Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles.

Lampa, Samuel; Alvarsson, Jonathan; Spjuth, Ola.

J Cheminform ; 8: 67, 2016.

Article in English | MEDLINE | ID: mdl-27942268

ABSTRACT

Predictive modelling in drug discovery is challenging to automate as it often contains multiple analysis steps and might involve cross-validation and parameter tuning that create complex dependencies between tasks. With large-scale data or when using computationally demanding modelling methods, e-infrastructures such as high-performance or cloud computing are required, adding to the existing challenges of fault-tolerant automation. Workflow management systems can aid in many of these challenges, but the currently available systems are lacking in the functionality needed to enable agile and flexible predictive modelling. We here present an approach inspired by elements of the flow-based programming paradigm, implemented as an extension of the Luigi system which we name SciLuigi. We also discuss the experiences from using the approach when modelling a large set of biochemical interactions using a shared computer cluster.

10.

Large-scale ligand-based predictive modelling using support vector machines.

Alvarsson, Jonathan; Lampa, Samuel; Schaal, Wesley; Andersson, Claes; Wikberg, Jarl E S; Spjuth, Ola.

J Cheminform ; 8: 39, 2016.

Article in English | MEDLINE | ID: mdl-27516811

ABSTRACT

The increasing size of datasets in drug discovery makes it challenging to build robust and accurate predictive models within a reasonable amount of time. In order to investigate the effect of dataset sizes on predictive performance and modelling time, ligand-based regression models were trained on open datasets of varying sizes of up to 1.2 million chemical structures. For modelling, two implementations of support vector machines (SVM) were used. Chemical structures were described by the signatures molecular descriptor. Results showed that for the larger datasets, the LIBLINEAR SVM implementation performed on par with the well-established libsvm with a radial basis function kernel, but with dramatically less time for model building even on modest computer resources. Using a non-linear kernel proved to be infeasible for large data sizes, even with substantial computational resources on a computer cluster. To deploy the resulting models, we extended the Bioclipse decision support framework to support models from LIBLINEAR and made our models of logD and solubility available from within Bioclipse.

11.

Experiences with workflows for automating data-intensive bioinformatics.

Spjuth, Ola; Bongcam-Rudloff, Erik; Hernández, Guillermo Carrasco; Forer, Lukas; Giovacchini, Mario; Guimera, Roman Valls; Kallio, Aleksi; Korpelainen, Eija; Kandula, Maciej M; Krachunov, Milko; Kreil, David P; Kulev, Ognyan; Labaj, Pawel P; Lampa, Samuel; Pireddu, Luca; Schönherr, Sebastian; Siretskiy, Alexey; Vassilev, Dimitar.

Biol Direct ; 10: 43, 2015 Aug 19.

Article in English | MEDLINE | ID: mdl-26282399

ABSTRACT

High-throughput technologies, such as next-generation sequencing, have turned molecular biology into a data-intensive discipline, requiring bioinformaticians to use high-performance computing resources and carry out data management and analysis tasks on a large scale. Workflow systems can be useful to simplify construction of analysis pipelines that automate tasks, support reproducibility and provide measures for fault-tolerance. However, workflow systems can incur significant development and administration overhead, so bioinformatics pipelines are often still built without them. We present the experiences with workflows and workflow systems within the bioinformatics community participating in a series of hackathons and workshops of the EU COST action SeqAhead. The organizations are working on similar problems, but we have addressed them with different strategies and solutions. This fragmentation of efforts is inefficient and leads to redundant and incompatible solutions. Based on our experiences we define a set of recommendations for future systems to enable efficient yet simple bioinformatics workflow construction and execution.

Subject(s)

Computational Biology/methods , Electronic Data Processing/methods , Workflow , High-Throughput Nucleotide Sequencing , Reproducibility of Results

12.

Lessons learned from implementing a national infrastructure in Sweden for storage and analysis of next-generation sequencing data.

Lampa, Samuel; Dahlö, Martin; Olason, Pall I; Hagberg, Jonas; Spjuth, Ola.

Gigascience ; 2(1): 9, 2013 Jun 25.

Article in English | MEDLINE | ID: mdl-23800020

ABSTRACT

Analyzing and storing data and results from next-generation sequencing (NGS) experiments is a challenging task, hampered by ever-increasing data volumes and frequent updates of analysis methods and tools. Storage and computation have grown beyond the capacity of personal computers and there is a need for suitable e-infrastructures for processing. Here we describe UPPNEX, an implementation of such an infrastructure, tailored to the needs of data storage and analysis of NGS data in Sweden serving various labs and multiple instruments from the major sequencing technology platforms. UPPNEX comprises resources for high-performance computing, large-scale and high-availability storage, an extensive bioinformatics software suite, up-to-date reference genomes and annotations, a support function with system and application experts as well as a web portal and support ticket system. UPPNEX applications are numerous and diverse, and include whole genome-, de novo- and exome sequencing, targeted resequencing, SNP discovery, RNASeq, and methylation analysis. There are over 300 projects that utilize UPPNEX and include large undertakings such as the sequencing of the flycatcher and Norwegian spruce. We describe the strategic decisions made when investing in hardware, setting up maintenance and support, allocating resources, and illustrate major challenges such as managing data growth. We conclude with summarizing our experiences and observations with UPPNEX to date, providing insights into the successful and less successful decisions made.

13.

Linking the Resource Description Framework to cheminformatics and proteochemometrics.

Willighagen, Egon L; Alvarsson, Jonathan; Andersson, Annsofie; Eklund, Martin; Lampa, Samuel; Lapins, Maris; Spjuth, Ola; Wikberg, Jarl E S.

J Biomed Semantics ; 2 Suppl 1: S6, 2011 Mar 07.

Article in English | MEDLINE | ID: mdl-21388575

ABSTRACT

BACKGROUND: Semantic web technologies are finding their way into the life sciences. Ontologies and semantic markup have already been used for more than a decade in molecular sciences, but have not found widespread use yet. The semantic web technology Resource Description Framework (RDF) and related methods appear to be sufficiently versatile to change that situation. RESULTS: The work presented here focuses on linking RDF approaches to existing molecular chemometrics fields, including cheminformatics, QSAR modeling and proteochemometrics. Applications are presented that link RDF technologies to methods from statistics and cheminformatics, including data aggregation, visualization, chemical identification, and property prediction. They demonstrate how this can be done using various existing RDF standards and cheminformatics libraries. For example, we show how IC50 and Ki values are modeled for a number of biological targets using data from the ChEMBL database. CONCLUSIONS: We have shown that existing RDF standards can suitably be integrated into existing molecular chemometrics methods. Platforms that unite these technologies, like Bioclipse, make this even simpler and more transparent. Being able to create and share workflows that integrate data aggregation and analysis (visual and statistical) is beneficial to interoperability and reproducibility. The current work shows that RDF approaches are sufficiently powerful to support molecular chemometrics workflows.
