Search | VHL Regional Portal

1.

Scalable de novo classification of antibiotic resistance of Mycobacterium tuberculosis.

Serajian, Mohammadali; Marini, Simone; Alanko, Jarno N; Noyes, Noelle R; Prosperi, Mattia; Boucher, Christina.

Bioinformatics ; 40(Supplement_1): i39-i47, 2024 Jun 28.

Article in English | MEDLINE | ID: mdl-38940175

ABSTRACT

MOTIVATION: The World Health Organization estimates that there were over 10 million cases of tuberculosis (TB) worldwide in 2019, resulting in over 1.4 million deaths, with a worrisome increasing trend yearly. The disease is caused by Mycobacterium tuberculosis (MTB) through airborne transmission. Treatment of TB is estimated to be 85% successful; however, this drops to 57% if MTB exhibits multiple antimicrobial resistance (AMR), for which fewer treatment options are available. RESULTS: We develop a robust machine-learning classifier using both linear and nonlinear models (i.e. LASSO logistic regression (LR) and random forests (RF)) to predict the phenotypic resistance of MTB for a broad range of antibiotic drugs. To train our classifier, we use data from the CRyPTIC consortium, which consist of whole genome sequencing and antibiotic susceptibility testing (AST) phenotypic data for 13 different antibiotics. To train our model, we assemble the sequence data into genomic contigs, identify all unique 31-mers in the set of contigs, and build a feature matrix M, where M[i, j] is equal to the number of times the ith 31-mer occurs in the jth genome. Due to the size of this feature matrix (over 350 million unique 31-mers), we build and use a sparse matrix representation. Our method, which we refer to as MTB++, leverages compact data structures and iterative methods to allow for the screening of all the 31-mers in the development of both LASSO LR and RF. MTB++ is able to achieve high discrimination (F-1 >80%) for the first-line antibiotics. Moreover, MTB++ had the highest F-1 score in all but three classes and was the most comprehensive since it had an F-1 score >75% in all but four (rare) antibiotic drugs. We use our feature selection to contextualize the 31-mers that are used for the prediction of phenotypic resistance, leading to some insights about sequence similarity to genes in MEGARes. Lastly, we give an estimate of the amount of data that is needed in order to provide accurate predictions. AVAILABILITY: The models and source code are publicly available on GitHub at https://github.com/M-Serajian/MTB-Pipeline.

Subject(s)

Machine Learning , Mycobacterium tuberculosis , Mycobacterium tuberculosis/genetics , Mycobacterium tuberculosis/drug effects , Bacterial Drug Resistance/genetics , Microbial Sensitivity Tests , Antibacterial Agents/pharmacology , Whole Genome Sequencing/methods , Bacterial Genome , Humans
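
The feature matrix described in the abstract above (M[i, j] = number of occurrences of the ith k-mer in the jth genome, stored sparsely) can be illustrated with a minimal sketch. This is not the MTB++ implementation: the function names `kmer_counts` and `build_sparse_matrix`, the dictionary-of-keys layout, and the toy value k = 3 (the paper uses 31) are all assumptions made for readability.

```python
from collections import Counter

def kmer_counts(contigs, k):
    """Count every k-mer across the contigs of one genome."""
    counts = Counter()
    for contig in contigs:
        for start in range(len(contig) - k + 1):
            counts[contig[start:start + k]] += 1
    return counts

def build_sparse_matrix(genomes, k):
    """Return (kmer_index, entries) where entries[(i, j)] is the number of
    times k-mer i occurs in genome j. Only nonzero counts are stored."""
    kmer_index = {}  # k-mer string -> row number i
    entries = {}     # (i, j) -> count
    for j, contigs in enumerate(genomes):
        for kmer, count in kmer_counts(contigs, k).items():
            i = kmer_index.setdefault(kmer, len(kmer_index))
            entries[(i, j)] = count
    return kmer_index, entries

# Toy input (hypothetical): two "genomes", each a list of contigs.
genomes = [["ACGTACG"], ["ACGA", "TACG"]]
kmer_index, M = build_sparse_matrix(genomes, k=3)
```

A zero entry is simply absent from `M`, so memory grows with the number of (k-mer, genome) pairs that actually occur rather than with the full matrix size.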

2.

An Experimental Performance Assessment of Galileo OSNMA.

Hammarberg, Toni; García, José M Vallet; Alanko, Jarno N; Bhuiyan, M Zahidul H.

Sensors (Basel) ; 24(2), 2024 Jan 09.

Article in English | MEDLINE | ID: mdl-38257496

ABSTRACT

We present Galileo Open Service Navigation Message Authentication (OSNMA) observed operational information and key performance indicators (KPIs) from the analysis of a ten-day-long dataset collected in static open-sky conditions in southern Finland and using our in-house-developed OSNMA implementation. In particular, we present a timeline with authentication-related events, such as authentication status and type, dropped navigation pages, and failed cyclic redundancy checks. We also report other KPIs, such as the number of simultaneously authenticated satellites over time, time to first authenticated fix, and percentage of authenticated fixes, and we evaluate the accuracy of the authenticated position solution. We also study how satellite visibility affects these figures. Finally, we analyze situations where it was not possible to reach an authenticated fix, and offer our findings on the observed patterns.

3.

Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time.

Schmidt, Sebastian; Alanko, Jarno N.

Algorithms Mol Biol ; 18(1): 5, 2023 Jul 04.

Article in English | MEDLINE | ID: mdl-37403080

ABSTRACT

A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses an Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.

4.

Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes.

Alanko, Jarno N; Vuohtoniemi, Jaakko; Mäklin, Tommi; Puglisi, Simon J.

Bioinformatics ; 39(39 Suppl 1): i260-i269, 2023 Jun 30.

Article in English | MEDLINE | ID: mdl-37387143

ABSTRACT

MOTIVATION: Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. To make efficient use of these datasets, indexing data structures that are both scalable and provide rapid query throughput are paramount. RESULTS: Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. AVAILABILITY AND IMPLEMENTATION: Themisto is available and documented as a C++ package at https://github.com/algbio/themisto under the GPLv2 license.

Subject(s)

Bacterial Genome , Nanopores , Genomics , Metagenomics
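
As a rough illustration of what a colored k-mer index and pseudoalignment involve, the sketch below maps each k-mer to the set of reference genomes ("colors") containing it, and pseudoaligns a read by intersecting the color sets of its k-mers. It is not Themisto's data structure or API: the plain dictionary, the function names, the rule of skipping k-mers absent from the index, and the omission of reverse complements are all assumptions of this example.

```python
def build_colored_index(references, k):
    """Map each k-mer to the set of reference ids (colors) that contain it."""
    index = {}
    for color, seq in enumerate(references):
        for start in range(len(seq) - k + 1):
            index.setdefault(seq[start:start + k], set()).add(color)
    return index

def pseudoalign(read, index, k):
    """Intersect the color sets of the read's k-mers.

    Assumption: k-mers missing from the index are ignored, and a read with
    no indexed k-mer pseudoaligns to nothing.
    """
    result = None
    for start in range(len(read) - k + 1):
        colors = index.get(read[start:start + k])
        if colors is None:
            continue  # k-mer not present in any reference
        result = set(colors) if result is None else result & colors
    return result if result is not None else set()

# Toy references (hypothetical); their positions in the list serve as colors.
references = ["ACGTACGT", "ACGTTTTT", "GGGGACGT"]
index = build_colored_index(references, k=4)
hits = pseudoalign("ACGTAC", index, k=4)
```

Here `hits` contains only reference 0, because the k-mers CGTA and GTAC occur in no other reference.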

5.

Matchtigs: minimum plain text representation of k-mer sets.

Schmidt, Sebastian; Khan, Shahbaz; Alanko, Jarno N; Pibiri, Giulio E; Tomescu, Alexandru I.

Genome Biol ; 24(1): 136, 2023 Jun 09.

Article in English | MEDLINE | ID: mdl-37296461

ABSTRACT

We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.

Subject(s)

Algorithms , Software , DNA Sequence Analysis , Bacteria

6.

HaploBlocks: Efficient Detection of Positive Selection in Large Population Genomic Datasets.

Kirsch-Gerweck, Benedikt; Bohnenkämper, Leonard; Henrichs, Michel T; Alanko, Jarno N; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Burger, Joachim; Stoye, Jens; Diekmann, Yoan.

Mol Biol Evol ; 40(3), 2023 Mar 04.

Article in English | MEDLINE | ID: mdl-36790822

ABSTRACT

Genomic regions under positive selection harbor variation linked for example to adaptation. Most tools for detecting positively selected variants have computational resource requirements rendering them impractical on population genomic datasets with hundreds of thousands of individuals or more. We have developed and implemented an efficient haplotype-based approach able to scan large datasets and accurately detect positive selection. We achieve this by combining a pattern matching approach based on the positional Burrows-Wheeler transform with model-based inference which only requires the evaluation of closed-form expressions. We evaluate our approach with simulations, and find it to be both sensitive and specific. The computational resource requirements quantified using UK Biobank data indicate that our implementation is scalable to population genomic datasets with millions of individuals. Our approach may serve as an algorithmic blueprint for the era of "big data" genomics: a combinatorial core coupled with statistical inference in closed form.

Subject(s)

Population Genetics , Metagenomics , Genomics , Genome , Haplotypes

7.

Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time.

Schmidt, Sebastian; Alanko, Jarno N.

Res Sq ; 2023 Feb 16.

Article in English | MEDLINE | ID: mdl-36824947

ABSTRACT

A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses an Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.

8.

Syotti: scalable bait design for DNA enrichment.

Alanko, Jarno N; Slizovskiy, Ilya B; Lokshtanov, Daniel; Gagie, Travis; Noyes, Noelle R; Boucher, Christina.

Bioinformatics ; 38(Suppl 1): i177-i184, 2022 Jun 24.

Article in English | MEDLINE | ID: mdl-35758776

ABSTRACT

MOTIVATION: Bait enrichment is a protocol that is becoming increasingly ubiquitous as it has been shown to successfully amplify regions of interest in metagenomic samples. In this method, a set of synthetic probes ('baits') are designed, manufactured and applied to fragmented metagenomic DNA. The probes bind to the fragmented DNA and any unbound DNA is rinsed away, leaving the bound fragments to be amplified for sequencing. Metsky et al. demonstrated that bait-enrichment is capable of detecting a large number of human viral pathogens within metagenomic samples. RESULTS: We formalize the problem of designing baits by defining the Minimum Bait Cover problem, show that the problem is NP-hard even under very restrictive assumptions, and design an efficient heuristic that takes advantage of succinct data structures. We refer to our method as Syotti. The running time of Syotti shows linear scaling in practice, running at least an order of magnitude faster than state-of-the-art methods, including the method of Metsky et al. At the same time, our method produces bait sets that are smaller than the ones produced by the competing methods, while also leaving fewer positions uncovered. Lastly, we show that Syotti requires only 25 min to design baits for a dataset comprised of 3 billion nucleotides from 1000 related bacterial substrains, whereas the method of Metsky et al. shows clearly super-linear running time and fails to process even a subset of 17% of the data in 72 h. AVAILABILITY AND IMPLEMENTATION: https://github.com/jnalanko/syotti. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Software , DNA , Humans , Metagenomics/methods , DNA Sequence Analysis/methods
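
To make the bait-cover idea from the abstract above concrete, here is a small greedy sketch: scan the sequences, and whenever a position is not yet covered, cut a new bait there and mark every window that the bait matches within a mismatch budget. This is a brute-force illustration, not Syotti itself. The function names, the Hamming-distance matching rule, the parameters `bait_len` and `max_mismatches`, and the omission of reverse complements are assumptions of this example, and plain quadratic scanning stands in for the succinct data structures mentioned in the abstract.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def greedy_bait_cover(sequences, bait_len, max_mismatches):
    """Return (baits, covered) where covered[s][p] is True for every position."""
    if any(len(seq) < bait_len for seq in sequences):
        raise ValueError("every sequence must be at least bait_len long")
    covered = [[False] * len(seq) for seq in sequences]
    baits = []
    for si, seq in enumerate(sequences):
        for pos in range(len(seq)):
            if covered[si][pos]:
                continue
            # Cut the bait at the uncovered position, shifted left near the end.
            start = min(pos, len(seq) - bait_len)
            bait = seq[start:start + bait_len]
            baits.append(bait)
            # Mark every window, in every sequence, that the bait matches.
            for ti, target in enumerate(sequences):
                for w in range(len(target) - bait_len + 1):
                    if hamming(bait, target[w:w + bait_len]) <= max_mismatches:
                        for p in range(w, w + bait_len):
                            covered[ti][p] = True
    return baits, covered

# Toy input (hypothetical): two near-identical sequences.
sequences = ["AAAAAAAA", "AAAATAAA"]
baits_loose, covered_loose = greedy_bait_cover(sequences, bait_len=4, max_mismatches=1)
baits_strict, covered_strict = greedy_bait_cover(sequences, bait_len=4, max_mismatches=0)
```

Allowing one mismatch lets a single bait cover both sequences, while exact matching needs a second bait for the region around the T.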

9.

Bacterial genomic epidemiology with mixed samples.

Mäklin, Tommi; Kallonen, Teemu; Alanko, Jarno; Samuelsen, Ørjan; Hegstad, Kristin; Mäkinen, Veli; Corander, Jukka; Heinz, Eva; Honkela, Antti.

Microb Genom ; 7(11), 2021 Nov.

Article in English | MEDLINE | ID: mdl-34779765

ABSTRACT

Genomic epidemiology is a tool for tracing transmission of pathogens based on whole-genome sequencing. We introduce the mGEMS pipeline for genomic epidemiology with plate sweeps representing mixed samples of a target pathogen, opening the possibility to sequence all colonies on selective plates with a single DNA extraction and sequencing step. The pipeline includes the novel mGEMS read binner for probabilistic assignments of sequencing reads, and the scalable pseudoaligner Themisto. We demonstrate the effectiveness of our approach using closely related samples in a nosocomial setting, obtaining results that are comparable to those based on single-colony picks. Our results lend firm support to more widespread consideration of genomic epidemiology with mixed infection samples.

Subject(s)

Bacterial Genome , Genomics , Sequence Analysis , Whole Genome Sequencing

10.

Buffering updates enables efficient dynamic de Bruijn graphs.

Alanko, Jarno; Alipanahi, Bahar; Settle, Jonathen; Boucher, Christina; Gagie, Travis.

Comput Struct Biotechnol J ; 19: 4067-4078, 2021.

Article in English | MEDLINE | ID: mdl-34377371

ABSTRACT

MOTIVATION: The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space and time efficient manner. RESULTS: With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al., 2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a) relies on dynamic bit vectors, which are slow in theory and practice. To address this shortcoming, we present a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method, which we refer to as BufBOSS, and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude.
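
The buffering idea in the abstract above can be sketched on a much simpler structure than a compressed de Bruijn graph. In the example below, a sorted list stands in for the static structure, pending additions and deletions are kept in two small buffers, and the static part is rebuilt only when the buffers fill up. The class name `BufferedKmerSet`, the `capacity` parameter and the sorted-list stand-in are assumptions of this illustration and do not reflect how BufBOSS is implemented.

```python
import bisect

class BufferedKmerSet:
    """A static sorted k-mer list plus buffers of pending updates."""

    def __init__(self, kmers=(), capacity=4):
        self._static = sorted(set(kmers))  # stand-in for the static structure
        self._added = set()                # pending insertions
        self._deleted = set()              # pending deletions
        self._capacity = capacity

    def _in_static(self, kmer):
        i = bisect.bisect_left(self._static, kmer)
        return i < len(self._static) and self._static[i] == kmer

    def __contains__(self, kmer):
        # Buffers take precedence over the static part.
        if kmer in self._added:
            return True
        if kmer in self._deleted:
            return False
        return self._in_static(kmer)

    def add(self, kmer):
        self._deleted.discard(kmer)
        if not self._in_static(kmer):
            self._added.add(kmer)
        self._flush_if_full()

    def remove(self, kmer):
        self._added.discard(kmer)
        if self._in_static(kmer):
            self._deleted.add(kmer)
        self._flush_if_full()

    def pending(self):
        """Number of buffered updates not yet merged."""
        return len(self._added) + len(self._deleted)

    def _flush_if_full(self):
        if self.pending() >= self._capacity:
            self.flush()

    def flush(self):
        """Rebuild the static part once, applying all buffered updates."""
        merged = (set(self._static) | self._added) - self._deleted
        self._static = sorted(merged)
        self._added.clear()
        self._deleted.clear()

kmers = BufferedKmerSet(["ACG", "CGT"], capacity=3)
kmers.add("GTA")
kmers.remove("ACG")
pending_before_flush = kmers.pending()
kmers.add("TAC")  # third buffered update triggers the merge
```

The design choice illustrated is that each update touches only a small buffer, and the cost of rebuilding is paid once per `capacity` updates.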

11.

Finding all maximal perfect haplotype blocks in linear time.

Alanko, Jarno; Bannai, Hideo; Cazaux, Bastien; Peterlongo, Pierre; Stoye, Jens.

Algorithms Mol Biol ; 15: 2, 2020.

Article in English | MEDLINE | ID: mdl-32055252

ABSTRACT

Recent large-scale community sequencing efforts allow at an unprecedented level of detail the identification of genomic regions that show signatures of natural selection. Traditional methods for identifying such regions from individuals' haplotype data, however, require excessive computing times and therefore are not applicable to current datasets. In 2019, Cunha et al. (Advances in bioinformatics and computational biology: 11th Brazilian symposium on bioinformatics, BSB 2018, Niterói, Brazil, October 30 - November 1, 2018, Proceedings, 2018. 10.1007/978-3-030-01722-4_3) suggested the maximal perfect haplotype block as a very simple combinatorial pattern, forming the basis of a new method to perform rapid genome-wide selection scans. The algorithm they presented for identifying these blocks, however, had a worst-case running time quadratic in the genome length. It was posed as an open problem whether an optimal, linear-time algorithm exists. In this paper we give two algorithms that achieve this time bound, one conceptually very simple one using suffix trees and a second one using the positional Burrows-Wheeler Transform, that is very efficient also in practice.
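
For readers unfamiliar with the pattern named in the abstract above, the following brute-force sketch enumerates maximal perfect haplotype blocks under the usual definition: a set of at least two haplotypes that agree on columns i..j, cannot be extended to the left or right without a disagreement, and includes every haplotype sharing that segment. It is deliberately not one of the linear-time algorithms of the paper; the function name and the 0-based (rows, i, j) output format are assumptions of this example.

```python
from collections import defaultdict

def maximal_perfect_blocks(haplotypes):
    """Return all blocks as (rows, i, j) with inclusive 0-based column bounds."""
    n = len(haplotypes[0])
    blocks = []
    for i in range(n):
        for j in range(i, n):
            # Group rows by their segment i..j; each group is row-maximal.
            groups = defaultdict(list)
            for row, hap in enumerate(haplotypes):
                groups[hap[i:j + 1]].append(row)
            for rows in groups.values():
                if len(rows) < 2:
                    continue
                # Left-maximal: at the border, or rows disagree in column i-1.
                left_max = i == 0 or len({haplotypes[r][i - 1] for r in rows}) > 1
                # Right-maximal: at the border, or rows disagree in column j+1.
                right_max = j == n - 1 or len({haplotypes[r][j + 1] for r in rows}) > 1
                if left_max and right_max:
                    blocks.append((tuple(rows), i, j))
    return blocks

# Toy input (hypothetical): three haplotypes over four biallelic sites.
haplotypes = ["0010", "0011", "1011"]
blocks = maximal_perfect_blocks(haplotypes)
```

On this input the blocks are rows (0, 1) on columns 0..2, rows (0, 1, 2) on columns 1..2, and rows (1, 2) on columns 1..3.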

12.

A framework for space-efficient variable-order Markov models.

Cunial, Fabio; Alanko, Jarno; Belazzougui, Djamal.

Bioinformatics ; 35(22): 4607-4616, 2019 Nov 01.

Article in English | MEDLINE | ID: mdl-31004473

ABSTRACT

MOTIVATION: Markov models with contexts of variable length are widely used in bioinformatics for representing sets of sequences with similar biological properties. When models contain many long contexts, existing implementations are either unable to handle genome-scale training datasets within typical memory budgets, or they are optimized for specific model variants and are thus inflexible. RESULTS: We provide practical, versatile representations of variable-order Markov models and of interpolated Markov models, that support a large number of context-selection criteria, scoring functions, probability smoothing methods, and interpolations, and that take up to four times less space than previous implementations based on the suffix array, regardless of the number and length of contexts, and up to ten times less space than previous trie-based representations, or more, while matching the size of related, state-of-the-art data structures from Natural Language Processing. We describe how to further compress our indexes to a quantity related to the redundancy of the training data, saving up to 90% of their space on very repetitive datasets, and making them become up to 60 times smaller than previous implementations based on the suffix array. Finally, we show how to exploit constraints on the length and frequency of contexts to further shrink our compressed indexes to half of their size or more, achieving data structures that are a hundred times smaller than previous implementations based on the suffix array, or more. This allows variable-order Markov models to be used with bigger datasets and with longer contexts on the same hardware, thus possibly enabling new applications. AVAILABILITY AND IMPLEMENTATION: https://github.com/jnalanko/VOMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Algorithms , Software , Genome , Probability
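
A variable-order Markov model, as discussed in the abstract above, predicts the next symbol from the longest context that was observed often enough in training. The sketch below shows that idea with plain dictionaries and no smoothing or interpolation. It is unrelated to the space-efficient representations of the paper; the class name `SimpleVOMM`, the `min_count` context-selection rule and the fallback to shorter contexts are assumptions chosen for illustration.

```python
from collections import defaultdict

class SimpleVOMM:
    """Dictionary-based variable-order Markov model (no smoothing)."""

    def __init__(self, max_order, min_count=1):
        self.max_order = max_order
        self.min_count = min_count
        # counts[context][symbol] = times symbol followed context in training
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, text):
        for pos, symbol in enumerate(text):
            for order in range(min(self.max_order, pos) + 1):
                context = text[pos - order:pos]
                self.counts[context][symbol] += 1

    def probability(self, symbol, history):
        """P(symbol | longest suffix of history seen at least min_count times)."""
        for order in range(min(self.max_order, len(history)), -1, -1):
            context = history[len(history) - order:]
            followers = self.counts.get(context)
            if followers:
                total = sum(followers.values())
                if total >= self.min_count:
                    return followers.get(symbol, 0) / total
        return 0.0

model = SimpleVOMM(max_order=2)
model.train("ABABAC")
p_c_after_ba = model.probability("C", "ABA")  # context "BA" was followed by B and C
p_a_after_ab = model.probability("A", "AB")   # context "AB" was always followed by A
p_b_unseen = model.probability("B", "C")      # context "C" unseen, falls back to order 0
```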

13.

A framework for space-efficient read clustering in metagenomic samples.

Alanko, Jarno; Cunial, Fabio; Belazzougui, Djamal; Mäkinen, Veli.

BMC Bioinformatics ; 18(Suppl 3): 59, 2017 Mar 14.

Article in English | MEDLINE | ID: mdl-28361710

ABSTRACT

BACKGROUND: A metagenomic sample is a set of DNA fragments, randomly extracted from multiple cells in an environment, belonging to distinct, often unknown species. Unsupervised metagenomic clustering aims at partitioning a metagenomic sample into sets that approximate taxonomic units, without using reference genomes. Since samples are large and steadily growing, space-efficient clustering algorithms are strongly needed. RESULTS: We design and implement a space-efficient algorithmic framework that solves a number of core primitives in unsupervised metagenomic clustering using just the bidirectional Burrows-Wheeler index and a union-find data structure on the set of reads. When run on a sample of total length n, with m reads of maximum length ℓ each, on an alphabet of total size σ, our algorithms take O(n(t + log σ)) time and just 2n + o(n) + O(max{ℓσ log n, K log m}) bits of space in addition to the index and to the union-find data structure, where K is a measure of the redundancy of the sample and t is the query time of the union-find data structure. CONCLUSIONS: Our experimental results show that our algorithms are practical, they can exploit multiple cores by a parallel traversal of the suffix-link tree, and they are competitive both in space and in time with the state of the art.

Subject(s)

Algorithms , Computational Biology/methods , DNA Fragmentation , Metagenomics , Cluster Analysis , Theoretical Models , DNA Sequence Analysis
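
The role of the union-find data structure in read clustering can be shown with a minimal sketch: reads that share a k-mer are merged into the same cluster. This only illustrates the union-find part. A plain dictionary of k-mers replaces the bidirectional Burrows-Wheeler index used in the paper, and the names `UnionFind` and `cluster_reads`, as well as the "share at least one k-mer" criterion, are assumptions of this example.

```python
class UnionFind:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a
        self.size[root_a] += self.size[root_b]

def cluster_reads(reads, k):
    """Group read indices so that reads sharing any k-mer end up together."""
    uf = UnionFind(len(reads))
    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            owner = first_seen.setdefault(read[start:start + k], idx)
            uf.union(owner, idx)
    clusters = {}
    for idx in range(len(reads)):
        clusters.setdefault(uf.find(idx), []).append(idx)
    return sorted(clusters.values())

# Toy reads (hypothetical): reads 0 and 1 share GTAC, reads 2 and 3 share TTTT.
reads = ["ACGTAC", "GTACGG", "TTTTTT", "CTTTTA"]
clusters = cluster_reads(reads, k=4)
```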
