Results 1 - 20 of 41
1.
Bioinformatics ; 40(7), 2024 Jul 01.
Article in English | MEDLINE | ID: mdl-38984796

ABSTRACT

MOTIVATION: The introduction of DeepMind's AlphaFold 2 enabled the prediction of protein structures at an unprecedented scale. The AlphaFold Protein Structure Database and the ESM Metagenomic Atlas contain hundreds of millions of structures stored in CIF and/or PDB formats. When compressed with a general-purpose utility like gzip, this translates to tens of terabytes of data, which hinders the effective use of predicted structures in large-scale analyses. RESULTS: Here, we present ProteStAr, a compressor dedicated to CIF/PDB, as well as supplementary PAE files. Its main contribution is a novel approach to predicting atom coordinates on the basis of the previously analyzed atoms. This allows efficient encoding of the coordinates, the largest component of the protein structure files. The compression is lossless by default, though a lossy mode with a controlled maximum error of coordinate reconstruction is also available. Compared to the competing packages, i.e. BinaryCIF, Foldcomp, PDC, our approach offers a superior compression ratio at established reconstruction accuracy. By the efficient use of threads at both compression and decompression stages, the algorithm takes advantage of the multicore architecture of current central processing units and operates at speeds of about 1 GB/s. The availability of Python and C++ APIs further increases the usability of the presented method. AVAILABILITY AND IMPLEMENTATION: The source code of ProteStAr is available at https://github.com/refresh-bio/protestar.


Subjects
Algorithms , Protein Databases , Proteins , Software , Proteins/chemistry , Protein Conformation , Data Compression/methods , Computational Biology/methods
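
Note on entry 1: the idea of predicting each coordinate from already processed values, combined with a lossy mode whose reconstruction error is bounded, can be illustrated with a toy sketch. The predictor used here is simply "the previous quantized value", which is an assumption made for brevity and not ProteStAr's actual predictor; the function names `encode` and `decode` are hypothetical.

```python
def encode(coords, max_err):
    """Quantize coordinates and store residuals against a simple predictor."""
    step = 2 * max_err               # grid spacing: rounding error is at most step / 2
    residuals, prev_q = [], 0
    for x in coords:
        q = round(x / step)          # index of the nearest grid point
        residuals.append(q - prev_q) # prediction = previous grid index (assumption)
        prev_q = q
    return residuals


def decode(residuals, max_err):
    """Rebuild the quantized coordinates from the residual stream."""
    step = 2 * max_err
    coords, q = [], 0
    for r in residuals:
        q += r
        coords.append(q * step)
    return coords


xs = [12.345, 13.101, 14.520, 14.498]   # made-up coordinates
res = encode(xs, max_err=0.01)
back = decode(res, max_err=0.01)
```

The residuals are small integers when consecutive values are close, which is what makes them cheap to entropy-code; a better predictor would shrink them further.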
2.
bioRxiv ; 2024 May 25.
Article in English | MEDLINE | ID: mdl-38826472

ABSTRACT

Most plant genomes and their regulation remain unknown. We used SPLASH, a new reference-genome-free algorithm for detecting sequence variation, to analyze transcriptional and post-transcriptional regulation from RNA-seq data. We discovered differential homolog expression during maize pollen development, and imbibition-dependent cryptic splicing in Arabidopsis seeds. SPLASH enables discovery of novel regulatory mechanisms, including differential regulation of genes from hybrid parental haplotypes, without the use of alignment to a reference genome.

3.
Sci Rep ; 14(1): 2232, 2024 01 26.
Article in English | MEDLINE | ID: mdl-38278837

ABSTRACT

The paper focuses on the correction of Illumina WGS sequencing reads. We provide an extensive evaluation of the existing correctors. To this end, we measure the impact of the correction on variant calling (VC) as well as de novo assembly. The evaluation shows that, in selected cases, read correction improves the quality of VC results. We also examine the algorithms' behaviour when processing Illumina NovaSeq reads, whose quality characteristics differ from those of older sequencers. We show that most of the algorithms are ready to cope with such reads. Finally, we introduce a new version of RECKONER, our read corrector, by optimizing it and equipping it with a new correction strategy. Currently, RECKONER corrects high-coverage human reads in less than 2.5 h, copes with two types of read errors (indels and substitutions), and uses a new correction verification technique based on two lengths of oligomers.


Subjects
Algorithms , High-Throughput Nucleotide Sequencing , Humans , Aged , DNA Sequence Analysis/methods , High-Throughput Nucleotide Sequencing/methods , INDEL Mutation
4.
bioRxiv ; 2024 Mar 30.
Article in English | MEDLINE | ID: mdl-36993432

ABSTRACT

SPLASH is an unsupervised, reference-free, and unifying algorithm that discovers regulated sequence variation through statistical analysis of k-mer composition, subsuming many application-specific methods. Here, we introduce SPLASH2, a fast, scalable implementation of SPLASH based on an efficient k-mer counting approach. SPLASH2 enables rapid analysis of massive datasets from a wide range of sequencing technologies and biological contexts, delivering unparalleled scale and speed. The SPLASH2 algorithm unveils new biology (without tuning) in single-cell RNA-sequencing data from human muscle cells, as well as bulk RNA-seq from the entire Cancer Cell Line Encyclopedia (CCLE), including substantial unannotated alternative splicing in cancer transcriptome. The same untuned SPLASH2 algorithm recovers the BCR-ABL gene fusion, and detects circRNA sensitively and specifically, underscoring SPLASH2's unmatched precision and scalability across diverse RNA-seq detection tasks.

5.
Curr Opin Struct Biol ; 80: 102577, 2023 06.
Article in English | MEDLINE | ID: mdl-37012200

ABSTRACT

Large-scale genomics requires highly scalable and accurate multiple sequence alignment methods. Results collected over the last decade suggest accuracy loss when scaling up beyond a few thousand sequences. This issue has been actively addressed with a number of innovative algorithmic solutions that combine low-level hardware optimization with novel higher-level heuristics. This review provides an extensive critical overview of these recent methods. Using established reference datasets, we conclude that although significant progress has been achieved, a unified framework able to consistently and efficiently produce high-accuracy large-scale multiple alignments is still lacking.


Subjects
Algorithms , Genomics , Genomics/methods , Amino Acid Sequence , Sequence Alignment , Software
6.
Bioinformatics ; 39(3), 2023 03 01.
Article in English | MEDLINE | ID: mdl-36864624

ABSTRACT

MOTIVATION: High-quality sequence assembly is the ultimate representation of complete genetic information of an individual. Several ongoing pangenome projects are producing collections of high-quality assemblies of various species. Each project has already generated assemblies of hundreds of gigabytes on disk, greatly impeding the distribution of and access to such rich datasets. RESULTS: Here, we show how to reduce the size of the sequenced genomes by 2-3 orders of magnitude. Our tool, AGC, compresses the genomes significantly better than the existing programs and is much faster. Moreover, its unique feature is the ability to access any contig (or its part) in a fraction of a second and easily append new samples to the compressed collections. Thanks to this, AGC could be useful not only for backup or transfer purposes but also for routine analysis of pangenome sequences in common pipelines. With the rapidly falling cost and improving accuracy of sequencing technologies, we anticipate more comprehensive pangenome projects with much larger sample sizes. AGC is likely to become a foundation tool to store, distribute and access pangenome data. AVAILABILITY AND IMPLEMENTATION: The source code of AGC is available at https://github.com/refresh-bio/agc. The package can be installed via Bioconda at https://anaconda.org/bioconda/agc. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Genome , Software , DNA Sequence Analysis , High-Throughput Nucleotide Sequencing
7.
Genome Biol ; 23(1): 190, 2022 09 08.
Article in English | MEDLINE | ID: mdl-36076275

ABSTRACT

The de Bruijn graph is a key data structure in modern computational genomics, and construction of its compacted variant resides upstream of many genomic analyses. As the quantity of genomic data grows rapidly, this often forms a computational bottleneck. We present Cuttlefish 2, significantly advancing the state-of-the-art for this problem. On a commodity server, it reduces the graph construction time for 661K bacterial genomes, of size 2.58Tbp, from 4.5 days to 17-23 h; and it constructs the graph for 1.52Tbp white spruce reads in approximately 10 h, while the closest competitor requires 54-58 h, using considerably more memory.


Subjects
Algorithms , Decapodiformes , Animals , Bacterial Genome , Genomics , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis , Software
8.
Bioinformatics ; 38(18): 4423-4425, 2022 09 15.
Article in English | MEDLINE | ID: mdl-35904548

ABSTRACT

SUMMARY: Bioinformatics applications increasingly rely on ad hoc disk storage of k-mer sets, e.g. for de Bruijn graphs or alignment indexes. Here, we introduce the K-mer File Format as a general lossless framework for storing and manipulating k-mer sets, realizing space savings of 3-5× compared to other formats, and bringing interoperability across tools. AVAILABILITY AND IMPLEMENTATION: Format specification, C++/Rust API, tools: https://github.com/Kmer-File-Format/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , DNA Sequence Analysis , Compact Disks
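
Note on entry 8: one general reason a dedicated k-mer container can save space is that overlapping k-mers share most of their bases, so a run of them can be stored as one packed sequence instead of many separate records. The sketch below only illustrates that arithmetic with a 2-bit nucleotide encoding. The bit assignment, the helper names, and the example sequence are assumptions, and the actual on-disk layout of the K-mer File Format is defined by its specification, not by this code.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # 2 bits per base (assumed mapping)


def pack(seq):
    """Pack a nucleotide string into bytes, four bases per byte."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    n_bytes = (2 * len(seq) + 7) // 8     # round up to whole bytes
    return value.to_bytes(n_bytes, "big")


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq = "ACGTTGCAAC"                         # made-up sequence with six 5-mers
k = 5
separate = sum(len(pack(km)) for km in kmers(seq, k))  # each k-mer packed alone
together = len(pack(seq))                               # one packed run
```

Here the six 5-mers take 12 bytes when packed one by one and 3 bytes as a single run; the gap widens as runs get longer.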
9.
Nat Methods ; 19(4): 441-444, 2022 04.
Article in English | MEDLINE | ID: mdl-35347321

ABSTRACT

The cost of maintaining exabytes of data produced by sequencing experiments every year has become a major issue in today's genomic research. In spite of the increasing popularity of third-generation sequencing, the existing algorithms for compressing long reads exhibit a minor advantage over the general-purpose gzip. We present CoLoRd, an algorithm able to reduce the size of third-generation sequencing data by an order of magnitude without affecting the accuracy of downstream analyses.


Subjects
Genomics , High-Throughput Nucleotide Sequencing , Algorithms , Genome , DNA Sequence Analysis , Software
10.
Bioinformatics ; 38(5): 1447-1449, 2022 02 07.
Article in English | MEDLINE | ID: mdl-34904625

ABSTRACT

SUMMARY: Phage-Host Interaction Search Tool (PHIST) predicts prokaryotic hosts of viruses based on exact matches between viral and host genomes. It improves host prediction accuracy at species level over current alignment-based tools (on average by 3 percentage points) as well as alignment-free and CRISPR-based tools (by 14-20 percentage points). PHIST is also two orders of magnitude faster than alignment-based tools making it suitable for metagenomics studies. AVAILABILITY AND IMPLEMENTATION: GNU-licensed C++ code wrapped in Python API available at: https://github.com/refresh-bio/phist. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Bacteriophages , Viruses , Bacteriophages/genetics , Metagenomics , Viruses/genetics , Metagenome , Software
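
Note on entry 10: host prediction from exact matches between viral and host genomes can be sketched as counting shared k-mers and choosing the host with the highest count. This is a simplified illustration only: the value of k, the scoring rule, the omission of reverse complements, and all names and sequences below are assumptions, not PHIST's actual procedure.

```python
def kmer_set(seq, k):
    """Set of all k-mers occurring in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def predict_host(virus, hosts, k):
    """Return the host sharing the most exact k-mers with the virus, plus all scores."""
    v = kmer_set(virus, k)
    scores = {name: len(v & kmer_set(genome, k)) for name, genome in hosts.items()}
    return max(scores, key=scores.get), scores


hosts = {                                   # made-up toy "genomes"
    "host_A": "TTGACCGGATACCGTTAGC",
    "host_B": "GGGCCCAAATTTGGGCCCA",
}
virus = "AACCGGATACCGTAA"
best, scores = predict_host(virus, hosts, k=6)
```

With these toy inputs the virus shares seven 6-mers with `host_A` and none with `host_B`, so `host_A` is reported.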
11.
Bioinformatics ; 37(19): 3358-3360, 2021 Oct 11.
Article in English | MEDLINE | ID: mdl-33787870

ABSTRACT

SUMMARY: Variant Call Format (VCF) files with results of sequencing projects take a lot of space. We propose VCFShark, which is able to compress VCF files up to an order of magnitude better than the de facto standards (gzipped VCF and BCF). The advantage over competitors is the greatest when compressing VCF files containing large amounts of genotype data. Processing speeds of up to 100 MB/s and main memory requirements below 30 GB allow our tool to be used on typical workstations even for large datasets. AVAILABILITY AND IMPLEMENTATION: https://github.com/refresh-bio/vcfshark. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

12.
Sci Rep ; 10(1): 3460, 2020 Feb 21.
Article in English | MEDLINE | ID: mdl-32081952

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

13.
Sci Rep ; 10(1): 578, 2020 01 17.
Article in English | MEDLINE | ID: mdl-31953467

ABSTRACT

The amount of data produced by modern sequencing instruments that needs to be stored is huge. Therefore, it is not surprising that a lot of work has been done in the field of specialized compression of FASTQ files. The existing algorithms are, however, still imperfect, and the best tools produce quite large archives. We present FQSqueezer, a novel compression algorithm for sequencing data able to process single- and paired-end reads of variable lengths. It is based on ideas from the well-known prediction by partial matching and dynamic Markov coder algorithms from the world of general-purpose compressors. The compression ratios are often tens of percent better than those offered by the state-of-the-art tools. The drawbacks of the proposed method are large memory and time requirements.
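
Note on entry 13: the family of context-model compressors the abstract refers to predicts each symbol from the symbols that precede it and spends fewer bits on well-predicted symbols. The sketch below estimates the ideal code length of a read under a single adaptive order-k model. It is a deliberately reduced illustration: add-one smoothing stands in for the escape mechanism of prediction by partial matching, no arithmetic coder is included, and the function name and parameters are hypothetical rather than taken from FQSqueezer.

```python
import math
from collections import defaultdict


def code_length_bits(seq, order, alphabet="ACGT"):
    """Ideal code length of seq under an adaptive order-`order` context model."""
    counts = defaultdict(lambda: defaultdict(int))
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]       # up to `order` preceding symbols
        seen = counts[ctx]
        total = sum(seen.values())
        # Add-one smoothing replaces PPM's escape symbols (a simplification).
        p = (seen[sym] + 1) / (total + len(alphabet))
        bits += -math.log2(p)
        seen[sym] += 1                        # update the model after coding
    return bits


read = "ACGT" * 50                            # made-up, highly repetitive read
bits = code_length_bits(read, order=2)
baseline = 2 * len(read)                      # plain 2 bits per base
```

On this repetitive input the model quickly learns each context's successor, so `bits` falls far below the 2-bits-per-base baseline; keeping such statistics for long contexts is also where the memory and time costs mentioned in the abstract come from.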

14.
Biol Direct ; 14(1): 20, 2019 11 13.
Article in English | MEDLINE | ID: mdl-31722729

ABSTRACT

BACKGROUND: Nowadays, not only are single genomes commonly analyzed, but also metagenomes, which are sets of DNA fragments (reads) derived from microbes living in a given environment. Metagenome analysis is aimed at extracting crucial information on the organisms that have left their traces in an investigated environmental sample. In this study we focus on the MetaSUB Forensics Challenge (organized within the CAMDA 2018 conference), which consists in predicting the geographical origin of metagenomic samples. Contrary to the existing methods for environmental classification that are based on taxonomic or functional classification, we rely on the similarity between a sample and the reference database computed at the read level. RESULTS: We report the results of our extensive experimental study to investigate the behavior of our method and its sensitivity to different parameters. In our tests, we have followed the protocol of the MetaSUB Challenge, which allowed us to compare the obtained results with the solutions based on taxonomic and functional classification. CONCLUSIONS: The results reported in the paper indicate that our method is competitive with those based on taxonomic classification. Importantly, by measuring the similarity at the read level, we avoid the necessity of using large databases with annotated gene sequences. Hence our main finding is that environmental classification of metagenomic data can be performed without the large databases required for taxonomic or functional classification. REVIEWERS: This article was reviewed by Eran Elhaik, Alexandra Bettina Graf, Chengsheng Zhu, and Andre Kahles.


Subjects
DNA Fingerprinting , Metagenome , Metagenomics/methods , Microbiota
15.
Bioinformatics ; 35(22): 4791-4793, 2019 11 01.
Article in English | MEDLINE | ID: mdl-31225861

ABSTRACT

SUMMARY: Nowadays, large sequencing projects handle tens of thousands of individuals. The huge files summarizing the findings definitely require compression. We propose a tool able to compress large collections of genotypes almost 30% better than the best tool to date, i.e. squeezing a human genotype to less than 62 KB. Moreover, it can also compress single samples in reference to the existing database, achieving comparable results. AVAILABILITY AND IMPLEMENTATION: https://github.com/refresh-bio/GTShark. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression , Genomics , Genotype , Humans , Software
16.
Bioinformatics ; 35(1): 133-136, 2019 01 01.
Article in English | MEDLINE | ID: mdl-29986074

ABSTRACT

Summary: Kmer-db is a new tool for estimating evolutionary relationships on the basis of k-mers extracted from genomes or sequencing reads. Thanks to an efficient data structure and parallel implementation, our software estimates distances between 40 715 pathogens in <7 min (on a modern workstation), 26 times faster than Mash, its main competitor. Availability and implementation: https://github.com/refresh-bio/kmer-db and http://sun.aei.polsl.pl/REFRESH/kmer-db. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Biological Evolution , Computational Biology , Software , Genome
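
Note on entry 16: a common way to turn shared k-mers into a distance is to compute the Jaccard index of the two k-mer sets and convert it with the Mash distance formula, D = -(1/k) ln(2j / (1 + j)). The sketch below does this on exact k-mer sets for two made-up sequences. Which measures Kmer-db itself reports, and how it computes them, is not stated in the abstract, so the functions here are illustrative assumptions.

```python
import math


def kmer_set(seq, k):
    """Set of all k-mers occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard index of two non-empty sets."""
    return len(a & b) / len(a | b)


def mash_distance(j, k):
    """Mash-style distance for a Jaccard index j and k-mer length k."""
    if j == 0:
        return 1.0                      # no shared k-mers: maximal distance
    return -math.log(2 * j / (1 + j)) / k


g1 = "ACGTACGTGA"                       # made-up sequences differing in one base
g2 = "ACGTACGTCA"
k = 4
j = jaccard(kmer_set(g1, k), kmer_set(g2, k))
d = mash_distance(j, k)
```

For these inputs four of the eight distinct 4-mers are shared, so the Jaccard index is 0.5; identical sequences give a distance of zero.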
17.
Bioinformatics ; 35(2): 227-234, 2019 01 15.
Article in English | MEDLINE | ID: mdl-30010777

ABSTRACT

Motivation: Bioinformatics databases grow rapidly and reach sizes hard to imagine a decade ago. Among the numerous bioinformatics processes generating hundreds of GB is the multiple sequence alignment of protein families. The largest database of such alignments, i.e. Pfam, consumes 40-230 GB, depending on the variant. Storage and transfer of such massive data has become a challenge. Results: We propose a novel compression algorithm, CoMSA, designed especially for aligned data. It is based on a generalization of the positional Burrows-Wheeler transform for non-binary alphabets. CoMSA handles FASTA, as well as Stockholm files. It offers up to six times better compression ratio than other commonly used compressors, i.e. gzip. The experiments also yielded an analysis of how protein family size influences the compression ratio. Availability and implementation: CoMSA is available for free at https://github.com/refresh-bio/comsa and http://sun.aei.polsl.pl/REFRESH/comsa. Supplementary material: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression , Protein Databases , Genomics , Sequence Alignment , Algorithms , Computational Biology , DNA Sequence Analysis
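
Note on entry 17: a positional Burrows-Wheeler transform processes an alignment column by column, emitting each column in an order that groups rows with similar preceding columns. For a non-binary alphabet, the reordering step can be done with a stable sort by symbol. The sketch below shows that transform and its inverse on a toy alignment. It is a minimal textbook-style version under that assumption; CoMSA's actual transform and its subsequent coding stages are not described in the abstract and are not reproduced here.

```python
def pbwt_forward(rows):
    """Emit each column in the order induced by the preceding columns."""
    order = list(range(len(rows)))
    out = []
    for j in range(len(rows[0])):
        col = [rows[i][j] for i in order]
        out.append("".join(col))
        # Stable sort by the current symbol keeps earlier ordering as tie-breaker.
        order = [i for _, i in sorted(zip(col, order), key=lambda p: p[0])]
    return out


def pbwt_inverse(cols):
    """Recover the original rows by replaying the same reordering."""
    n = len(cols[0])
    order = list(range(n))
    rows = [[] for _ in range(n)]
    for col in cols:
        for sym, i in zip(col, order):
            rows[i].append(sym)
        order = [i for _, i in sorted(zip(col, order), key=lambda p: p[0])]
    return ["".join(r) for r in rows]


alignment = ["ACDA", "ACEA", "GCDA", "ACDG"]   # made-up aligned sequences
cols = pbwt_forward(alignment)
restored = pbwt_inverse(cols)
```

Because the decoder can rebuild the row order from the columns it has already read, no permutation needs to be stored alongside the transformed columns.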
18.
Bioinformatics ; 35(12): 2043-2050, 2019 06 01.
Article in English | MEDLINE | ID: mdl-30407485

ABSTRACT

MOTIVATION: Mapping reads to a reference genome is often the first step in a sequencing data analysis pipeline. The reduction of sequencing costs implies a need for algorithms able to process increasing amounts of generated data in reasonable time. RESULTS: We present Whisper, an accurate and high-performance mapping tool, based on the idea of sorting reads and then mapping them against suffix arrays for the reference genome and its reverse complement. Employing task and data parallelism as well as storing temporary data on disk result in superior time efficiency at reasonable memory requirements. Whisper excels at large NGS read collections, in particular Illumina reads with typical WGS coverage. The experiments with real data indicate that our solution works in about 15% of the time needed by the well-known BWA-MEM and Bowtie2 tools at a comparable accuracy, validated in a variant calling pipeline. AVAILABILITY AND IMPLEMENTATION: Whisper is available for free from https://github.com/refresh-bio/Whisper or http://sun.aei.polsl.pl/REFRESH/Whisper/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms , Software , Base Sequence , Genome , High-Throughput Nucleotide Sequencing , DNA Sequence Analysis
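
Note on entry 18: the idea of sorting reads and looking them up in a suffix array of the reference can be illustrated with exact matching only. The sketch below builds a suffix array naively, sorts the reads, and locates each one by binary search. It handles neither mismatches, indels, nor the reverse-complement strand, and the function names and data are hypothetical, so it should be read as a toy illustration of the lookup step rather than Whisper's algorithm.

```python
def build_suffix_array(text):
    """Naive construction by sorting suffixes; adequate for a toy example only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, sa, pattern):
    """Return sorted start positions of exact occurrences of pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                        # first suffix whose m-prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                        # first suffix whose m-prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


reference = "GATTACAGATTACA"              # made-up reference
reads = ["TTAC", "GATT", "ACAG", "CCCC"]
sa = build_suffix_array(reference)
# Processing reads in sorted order makes consecutive lookups touch nearby
# regions of the suffix array.
hits = {read: find_exact(reference, sa, read) for read in sorted(reads)}
```

Reads that do not occur in the reference, such as `CCCC` here, simply map to an empty list.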
19.
Bioinformatics ; 34(16): 2748-2756, 2018 08 15.
Article in English | MEDLINE | ID: mdl-29617939

ABSTRACT

Motivation: The affordability of DNA sequencing has led to the generation of unprecedented volumes of raw sequencing data. These data must be stored, processed and transmitted, which poses significant challenges. To facilitate this effort, we introduce FaStore, a specialized compressor for FASTQ files. FaStore does not use any reference sequences for compression and permits the user to choose from several lossy modes to improve the overall compression ratio, depending on the specific needs. Results: FaStore in the lossless mode achieves a significant improvement in compression ratio with respect to previously proposed algorithms. We perform an analysis on the effect that the different lossy modes have on variant calling, the most widely used application for clinical decision making, especially important in the era of precision medicine. We show that lossy compression can offer significant compression gains, while preserving the essential genomic information and without affecting the variant calling performance. Availability and implementation: FaStore can be downloaded from https://github.com/refresh-bio/FaStore. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression/methods , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , DNA Sequence Analysis/methods , Software , Algorithms , Humans
20.
Bioinformatics ; 34(11): 1834-1840, 2018 06 01.
Article in English | MEDLINE | ID: mdl-29351600

ABSTRACT

Motivation: Nowadays, genome sequencing is frequently used in many research centers. In projects such as the Haplotype Reference Consortium or the Exome Aggregation Consortium, huge databases of genotypes in large populations are determined. Together with the increasing size of these collections, the need for fast and memory-frugal ways of representing and searching them becomes crucial. Results: We present GTC (GenoType Compressor), a novel compressed data structure for representation of huge collections of genetic variation data. It significantly outperforms existing solutions in terms of compression ratio and time of answering various types of queries. We show that the largest publicly available database of about 60 000 haplotypes at about 40 million SNPs can be stored in <4 GB, while the queries related to variants are answered in a fraction of a second. Availability and implementation: GTC can be downloaded from https://github.com/refresh-bio/GTC or http://sun.aei.polsl.pl/REFRESH/gtc. Contact: sebastian.deorowicz@polsl.pl. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Data Compression , Genomics/methods , Genotype , Single Nucleotide Polymorphism , DNA Sequence Analysis/methods , Software , Algorithms , Genetic Databases , Genotyping Techniques/methods , Haplotypes , Humans