Pesquisa | Portal Regional da BVS

1.

Concentration of inverted repeats along human DNA.

Bastos, Carlos A C; Afreixo, Vera; Rodrigues, João M O S; Pinho, Armando J.

J Integr Bioinform ; 20(2)2023 Jun 01.

Artigo em Inglês | MEDLINE | ID: mdl-37486620

RESUMO

This work aims to describe the observed enrichment of inverted repeats in the human genome; and to identify and describe, with detailed length profiles, the regions with significant and relevant enriched occurrence of inverted repeats. The enrichment is assessed and tested with a recently proposed measure (z-scores based measure). We simulate a genome using an order 7 Markov model trained with the data from the real genome. The simulated genome is used to establish the critical values which are used as decision thresholds to identify the regions with significant enriched concentrations. Several human genome regions are highly enriched in the occurrence of inverted repeats. This is observed in all the human chromosomes. The distribution of inverted repeat lengths varies along the genome. The majority of the regions with severely exaggerated enrichment contain mainly short length inverted repeats. There are also regions with regular peaks along the inverted repeats lengths distribution (periodic regularities) and other regions with exaggerated enrichment for long lengths (less frequent). However, adjacent regions tend to have similar distributions.

2.

AlcoR: alignment-free simulation, mapping, and visualization of low-complexity regions in biological data.

Silva, Jorge M; Qi, Weihong; Pinho, Armando J; Pratas, Diogo.

Gigascience ; 122022 Dec 28.

Artigo em Inglês | MEDLINE | ID: mdl-38091509

RESUMO

BACKGROUND: Low-complexity data analysis is the area that addresses the search and quantification of regions in sequences of elements that contain low-complexity or repetitive elements. For example, these can be tandem repeats, inverted repeats, homopolymer tails, GC-biased regions, similar genes, and hairpins, among many others. Identifying these regions is crucial because of their association with regulatory and structural characteristics. Moreover, their identification provides positional and quantity information where standard assembly methodologies face significant difficulties because of substantial higher depth coverage (mountains), ambiguous read mapping, or where sequencing or reconstruction defects may occur. However, the capability to distinguish low-complexity regions (LCRs) in genomic and proteomic sequences is a challenge that depends on the model's ability to find them automatically. Low-complexity patterns can be implicit through specific or combined sources, such as algorithmic or probabilistic, and recurring to different spatial distances-namely, local, medium, or distant associations. FINDINGS: This article addresses the challenge of automatically modeling and distinguishing LCRs, providing a new method and tool (AlcoR) for efficient and accurate segmentation and visualization of these regions in genomic and proteomic sequences. The method enables the use of models with different memories, providing the ability to distinguish local from distant low-complexity patterns. The method is reference and alignment free, providing additional methodologies for testing, including a highly flexible simulation method for generating biological sequences (DNA or protein) with different complexity levels, sequence masking, and a visualization tool for automatic computation of the LCR maps into an ideogram style. We provide illustrative demonstrations using synthetic, nearly synthetic, and natural sequences showing the high efficiency and accuracy of AlcoR. As large-scale results, we use AlcoR to unprecedentedly provide a whole-chromosome low-complexity map of a recent complete human genome and the haplotype-resolved chromosome pairs of a heterozygous diploid African cassava cultivar. CONCLUSIONS: The AlcoR method provides the ability of fast sequence characterization through data complexity analysis, ideally for scenarios entangling the presence of new or unknown sequences. AlcoR is implemented in C language using multithreading to increase the computational speed, is flexible for multiple applications, and does not contain external dependencies. The tool accepts any sequence in FASTA format. The source code is freely provided at https://github.com/cobilab/alcor.

3.

On the Impact of the Data Acquisition Protocol on ECG Biometric Identification.

Ramos, Mariana S; Carvalho, João M; Pinho, Armando J; Brás, Susana.

Sensors (Basel) ; 21(14)2021 Jul 07.

Artigo em Inglês | MEDLINE | ID: mdl-34300385

RESUMO

Electrocardiographic (ECG) signals have been used for clinical purposes for a long time. Notwithstanding, they may also be used as the input for a biometric identification system. Several studies, as well as some prototypes, are already based on this principle. One of the methods already used for biometric identification relies on a measure of similarity based on the Kolmogorov Complexity, called the Normalized Relative Compression (NRC)-this approach evaluates the similarity between two ECG segments without the need to delineate the signal wave. This methodology is the basis of the present work. We have collected a dataset of ECG signals from twenty participants on two different sessions, making use of three different kits simultaneously-one of them using dry electrodes, placed on their fingers; the other two using wet sensors placed on their wrists and chests. The aim of this work was to study the influence of the ECG protocol collection, regarding the biometric identification system's performance. Several variables in the data acquisition are not controllable, so some of them will be inspected to understand their influence in the system. Movement, data collection point, time interval between train and test datasets and ECG segment duration are examples of variables that may affect the system, and they are studied in this paper. Through this study, it was concluded that this biometric identification system needs at least 10 s of data to guarantee that the system learns the essential information. It was also observed that "off-the-person" data acquisition led to a better performance over time, when compared to "on-the-person" places.

Assuntos

Identificação Biométrica , Compressão de Dados , Algoritmos , Eletrocardiografia , Dedos , Humanos , Processamento de Sinais Assistido por Computador

4.

AC2: An Efficient Protein Sequence Compression Tool Using Artificial Neural Networks and Cache-Hash Models.

Silva, Milton; Pratas, Diogo; Pinho, Armando J.

Entropy (Basel) ; 23(5)2021 Apr 26.

Artigo em Inglês | MEDLINE | ID: mdl-33925812

RESUMO

Recently, the scientific community has witnessed a substantial increase in the generation of protein sequence data, triggering emergent challenges of increasing importance, namely efficient storage and improved data analysis. For both applications, data compression is a straightforward solution. However, in the literature, the number of specific protein sequence compressors is relatively low. Moreover, these specialized compressors marginally improve the compression ratio over the best general-purpose compressors. In this paper, we present AC2, a new lossless data compressor for protein (or amino acid) sequences. AC2 uses a neural network to mix experts with a stacked generalization approach and individual cache-hash memory models to the highest-context orders. Compared to the previous compressor (AC), we show gains of 2-9% and 6-7% in reference-free and reference-based modes, respectively. These gains come at the cost of three times slower computations. AC2 also improves memory usage against AC, with requirements about seven times lower, without being affected by the sequences' input size. As an analysis application, we use AC2 to measure the similarity between each SARS-CoV-2 protein sequence with each viral protein sequence from the whole UniProt database. The results consistently show higher similarity to the pangolin coronavirus, followed by the bat and human coronaviruses, contributing with critical results to a current controversial subject. AC2 is available for free download under GPLv3 license.

5.

Efficient DNA sequence compression with neural networks.

Silva, Milton; Pratas, Diogo; Pinho, Armando J.

Gigascience ; 9(11)2020 11 11.

Artigo em Inglês | MEDLINE | ID: mdl-33179040

RESUMO

BACKGROUND: The increasing production of genomic data has led to an intensified need for models that can cope efficiently with the lossless compression of DNA sequences. Important applications include long-term storage and compression-based data analysis. In the literature, only a few recent articles propose the use of neural networks for DNA sequence compression. However, they fall short when compared with specific DNA compression tools, such as GeCo2. This limitation is due to the absence of models specifically designed for DNA sequences. In this work, we combine the power of neural networks with specific DNA models. For this purpose, we created GeCo3, a new genomic sequence compressor that uses neural networks for mixing multiple context and substitution-tolerant context models. FINDINGS: We benchmark GeCo3 as a reference-free DNA compressor in 5 datasets, including a balanced and comprehensive dataset of DNA sequences, the Y-chromosome and human mitogenome, 2 compilations of archaeal and virus genomes, 4 whole genomes, and 2 collections of FASTQ data of a human virome and ancient DNA. GeCo3 achieves a solid improvement in compression over the previous version (GeCo2) of $2.4\%$, $7.1\%$, $6.1\%$, $5.8\%$, and $6.0\%$, respectively. To test its performance as a reference-based DNA compressor, we benchmark GeCo3 in 4 datasets constituted by the pairwise compression of the chromosomes of the genomes of several primates. GeCo3 improves the compression in $12.4\%$, $11.7\%$, $10.8\%$, and $10.1\%$ over the state of the art. The cost of this compression improvement is some additional computational time (1.7-3 times slower than GeCo2). The RAM use is constant, and the tool scales efficiently, independently of the sequence size. Overall, these values outperform the state of the art. CONCLUSIONS: GeCo3 is a genomic sequence compressor with a neural network mixing approach that provides additional gains over top specific genomic compressors. The proposed mixing method is portable, requiring only the probabilities of the models as inputs, providing easy adaptation to other data compressors or compression-based data analysis tools. GeCo3 is released under GPLv3 and is available for free download at https://github.com/cobilab/geco3.

Assuntos

Sequenciamento de Nucleotídeos em Larga Escala , Software , Algoritmos , Sequência de Bases , Redes Neurais de Computação , Análise de Sequência de DNA

6.

Multimodal Emotion Evaluation: A Physiological Model for Cost-Effective Emotion Classification.

Pinto, Gisela; Carvalho, João M; Barros, Filipa; Soares, Sandra C; Pinho, Armando J; Brás, Susana.

Sensors (Basel) ; 20(12)2020 Jun 21.

Artigo em Inglês | MEDLINE | ID: mdl-32575894

RESUMO

Emotional responses are associated with distinct body alterations and are crucial to foster adaptive responses, well-being, and survival. Emotion identification may improve peoples' emotion regulation strategies and interaction with multiple life contexts. Several studies have investigated emotion classification systems, but most of them are based on the analysis of only one, a few, or isolated physiological signals. Understanding how informative the individual signals are and how their combination works would allow to develop more cost-effective, informative, and objective systems for emotion detection, processing, and interpretation. In the present work, electrocardiogram, electromyogram, and electrodermal activity were processed in order to find a physiological model of emotions. Both a unimodal and a multimodal approach were used to analyze what signal, or combination of signals, may better describe an emotional response, using a sample of 55 healthy subjects. The method was divided in: (1) signal preprocessing; (2) feature extraction; (3) classification using random forest and neural networks. Results suggest that the electrocardiogram (ECG) signal is the most effective for emotion classification. Yet, the combination of all signals provides the best emotion identification performance, with all signals providing crucial information for the system. This physiological model of emotions has important research and clinical implications, by providing valuable information about the value and weight of physiological signals for emotional classification, which can critically drive effective evaluation, monitoring and intervention, regarding emotional processing and regulation, considering multiple contexts.

Assuntos

Emoções/fisiologia , Modelos Biológicos , Redes Neurais de Computação , Análise Custo-Benefício , Eletrocardiografia , Eletromiografia , Humanos

7.

Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements.

Hosseini, Morteza; Pratas, Diogo; Morgenstern, Burkhard; Pinho, Armando J.

Gigascience ; 9(5)2020 05 01.

Artigo em Inglês | MEDLINE | ID: mdl-32432328

RESUMO

BACKGROUND: The development of high-throughput sequencing technologies and, as its result, the production of huge volumes of genomic data, has accelerated biological and medical research and discovery. Study on genomic rearrangements is crucial owing to their role in chromosomal evolution, genetic disorders, and cancer. RESULTS: We present Smash++, an alignment-free and memory-efficient tool to find and visualize small- and large-scale genomic rearrangements between 2 DNA sequences. This computational solution extracts information contents of the 2 sequences, exploiting a data compression technique to find rearrangements. We also present Smash++ visualizer, a tool that allows the visualization of the detected rearrangements along with their self- and relative complexity, by generating an SVG (Scalable Vector Graphics) image. CONCLUSIONS: Tested on several synthetic and real DNA sequences from bacteria, fungi, Aves, and Mammalia, the proposed tool was able to accurately find genomic rearrangements. The detected regions were in accordance with previous studies, which took alignment-based approaches or performed FISH (fluorescence in situ hybridization) analysis. The maximum peak memory usage among all experiments was â¼1 GB, which makes Smash++ feasible to run on present-day standard computers.

Assuntos

Biologia Computacional/métodos , Genômica/métodos , Software , Algoritmos , Rearranjo Gênico , Genoma , Sequenciamento de Nucleotídeos em Larga Escala , Análise de Sequência de DNA/métodos

8.

Distribution of Distances Between Symmetric Words in the Human Genome: Analysis of Regular Peaks.

Bastos, Carlos A C; Afreixo, Vera; Rodrigues, João M O S; Pinho, Armando J; Silva, Raquel M.

Interdiscip Sci ; 11(3): 367-372, 2019 Sep.

Artigo em Inglês | MEDLINE | ID: mdl-30911903

RESUMO

Finding DNA sites with high potential for the formation of hairpin/cruciform structures is an important task. Previous works studied the distances between adjacent reversed complement words (symmetric word pairs) and also for non-adjacent words. It was observed that for some words a few distances were favoured (peaks) and that in some distributions there was strong peak regularity. The present work extends previous studies, by improving the detection and characterization of peak regularities in the symmetric word pairs distance distributions of the human genome. This work also analyzes the location of the sequences that originate the observed strong peak periodicity in the distance distribution. The results obtained in this work may indicate genomic sites with potential for the formation of hairpin/cruciform structures.

Assuntos

DNA/química , Genoma Humano , Algoritmos , Cromossomos Humanos , Bases de Dados Genéticas , Genômica , Humanos , Modelos Genéticos , Conformação de Ácido Nucleico , Análise de Sequência de DNA/métodos , Software

9.

AC: A Compression Tool for Amino Acid Sequences.

Hosseini, Morteza; Pratas, Diogo; Pinho, Armando J.

Interdiscip Sci ; 11(1): 68-76, 2019 Mar.

Artigo em Inglês | MEDLINE | ID: mdl-30721401

RESUMO

Advancement of protein sequencing technologies has led to the production of a huge volume of data that needs to be stored and transmitted. This challenge can be tackled by compression. In this paper, we propose AC, a state-of-the-art method for lossless compression of amino acid sequences. The proposed method works based on the cooperation between finite-context models and substitutional tolerant Markov models. Compared to several general-purpose and specific-purpose protein compressors, AC provides the best bit-rates. This method can also compress the sequences nine times faster than its competitor, paq8l. In addition, employing AC, we analyze the compressibility of a large number of sequences from different domains. The results show that viruses are the most difficult sequences to be compressed. Archaea and bacteria are the second most difficult ones, and eukaryota are the easiest sequences to be compressed.

Assuntos

Algoritmos , Compressão de Dados , Sequenciamento de Nucleotídeos em Larga Escala/métodos , Software , Sequência de Aminoácidos , Cadeias de Markov

10.

Cryfa: a secure encryption tool for genomic data.

Hosseini, Morteza; Pratas, Diogo; Pinho, Armando J.

Bioinformatics ; 35(1): 146-148, 2019 01 01.

Artigo em Inglês | MEDLINE | ID: mdl-30020420

RESUMO

Summary: The ever-increasing growth of high-throughput sequencing technologies has led to a great acceleration of medical and biological research and discovery. As these platforms advance, the amount of information for diverse genomes increases at unprecedented rates. Confidentiality, integrity and authenticity of such genomic information should be ensured due to its extremely sensitive nature. In this paper, we propose Cryfa, a fast secure encryption tool for genomic data, namely in Fasta, Fastq, VCF, SAM and BAM formats, which is also capable of reducing the storage size of Fasta and Fastq files. Cryfa uses advanced encryption standard (AES) encryption combined with a shuffling mechanism, which leads to a substantial enhancement of the security against low data complexity attacks. Compared to AES Crypt, a general-purpose encryption tool, Cryfa is an industry-oriented tool, which is able to provide confidentiality, integrity and authenticity of data at four times more speed; in addition, it can reduce the file sizes to 1/3. Due to the absence of a method similar to Cryfa, we have simulated its behavior with a combination of encryption and compression tools, for comparison purpose. For instance, our tool is nine times faster than its fastest competitor in Fasta files. Also, Cryfa has a very low memory usage (only a few megabytes), which makes it feasible to run on any computer. Availability and implementation: Source codes and binaries are available, under GPLv3, at https://github.com/pratas/cryfa. Supplementary information: Supplementary data are available at Bioinformatics online.

Assuntos

Compressão de Dados , Genômica , Sequenciamento de Nucleotídeos em Larga Escala , Software , Biologia Computacional

11.

Metagenomic Composition Analysis of an Ancient Sequenced Polar Bear Jawbone from Svalbard.

Pratas, Diogo; Hosseini, Morteza; Grilo, Gonçalo; Pinho, Armando J; Silva, Raquel M; Caetano, Tânia; Carneiro, João; Pereira, Filipe.

Genes (Basel) ; 9(9)2018 Sep 06.

Artigo em Inglês | MEDLINE | ID: mdl-30200636

RESUMO

The sequencing of ancient DNA samples provides a novel way to find, characterize, and distinguish exogenous genomes of endogenous targets. After sequencing, computational composition analysis enables filtering of undesired sources in the focal organism, with the purpose of improving the quality of assemblies and subsequent data analysis. More importantly, such analysis allows extinct and extant species to be identified without requiring a specific or new sequencing run. However, the identification of exogenous organisms is a complex task, given the nature and degradation of the samples, and the evident necessity of using efficient computational tools, which rely on algorithms that are both fast and highly sensitive. In this work, we relied on a fast and highly sensitive tool, FALCON-meta, which measures similarity against whole-genome reference databases, to analyse the metagenomic composition of an ancient polar bear (Ursus maritimus) jawbone fossil. The fossil was collected in Svalbard, Norway, and has an estimated age of 110,000 to 130,000 years. The FASTQ samples contained 349 GB of nonamplified shotgun sequencing data. We identified and localized, relative to the FASTQ samples, the genomes with significant similarities to reference microbial genomes, including those of viruses, bacteria, and archaea, and to fungal, mitochondrial, and plastidial sequences. Among other striking features, we found significant similarities between modern-human, some bacterial and viral sequences (contamination) and the organelle sequences of wild carrot and tomato relative to the whole samples. For each exogenous candidate, we ran a damage pattern analysis, which in addition to revealing shallow levels of damage in the plant candidates, identified the source as contamination.

12.

Biometric and Emotion Identification: An ECG Compression Based Method.

Brás, Susana; Ferreira, Jacqueline H T; Soares, Sandra C; Pinho, Armando J.

Front Psychol ; 9: 467, 2018.

Artigo em Inglês | MEDLINE | ID: mdl-29670564

RESUMO

We present an innovative and robust solution to both biometric and emotion identification using the electrocardiogram (ECG). The ECG represents the electrical signal that comes from the contraction of the heart muscles, indirectly representing the flow of blood inside the heart, it is known to convey a key that allows biometric identification. Moreover, due to its relationship with the nervous system, it also varies as a function of the emotional state. The use of information-theoretic data models, associated with data compression algorithms, allowed to effectively compare ECG records and infer the person identity, as well as emotional state at the time of data collection. The proposed method does not require ECG wave delineation or alignment, which reduces preprocessing error. The method is divided into three steps: (1) conversion of the real-valued ECG record into a symbolic time-series, using a quantization process; (2) conditional compression of the symbolic representation of the ECG, using the symbolic ECG records stored in the database as reference; (3) identification of the ECG record class, using a 1-NN (nearest neighbor) classifier. We obtained over 98% of accuracy in biometric identification, whereas in emotion recognition we attained over 90%. Therefore, the method adequately identify the person, and his/her emotion. Also, the proposed method is flexible and may be adapted to different problems, by the alteration of the templates for training the model.

13.

Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes.

Pratas, Diogo; Silva, Raquel M; Pinho, Armando J.

Entropy (Basel) ; 20(6)2018 May 23.

Artigo em Inglês | MEDLINE | ID: mdl-33265483

RESUMO

An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we compare directly two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions; the NCD measures how similar both strings are (in terms of information content) and the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other one. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. For computing both, we use a state of the art DNA sequence compressor that we benchmark with some top compressors in different compression modes. Then, we apply the compressor on DNA sequences with different scales and natures, first using synthetic sequences and then on real DNA sequences. The last include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely, the observation and confirmation across the whole genomes of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA.

14.

DNA word analysis based on the distribution of the distances between symmetric words.

Tavares, Ana H M P; Pinho, Armando J; Silva, Raquel M; Rodrigues, João M O S; Bastos, Carlos A C; Ferreira, Paulo J S G; Afreixo, Vera.

Sci Rep ; 7(1): 728, 2017 04 07.

Artigo em Inglês | MEDLINE | ID: mdl-28389642

RESUMO

We address the problem of discovering pairs of symmetric genomic words (i.e., words and the corresponding reversed complements) occurring at distances that are overrepresented. For this purpose, we developed new procedures to identify symmetric word pairs with uncommon empirical distance distribution and with clusters of overrepresented short distances. We speculate that patterns of overrepresentation of short distances between symmetric word pairs may allow the occurrence of non-standard DNA conformations, such as hairpin/cruciform structures. We focused on the human genome, and analysed both the complete genome as well as a version with known repetitive sequences masked out. We reported several well-defined features in the distributions of distances, which can be classified into three different profiles, showing enrichment in distinct distance ranges. We analysed in greater detail certain pairs of symmetric words of length seven, found by our procedure, characterised by the surprising fact that they occur at single distances more frequently than expected.

Assuntos

DNA , Genoma Humano , Genômica , Análise de Sequência de DNA , Algoritmos , Cromossomos Humanos , DNA/química , DNA/genética , Bases de Dados Genéticas , Genômica/métodos , Humanos , Cadeias de Markov , Modelos Genéticos , Conformação de Ácido Nucleico , Análise de Sequência de DNA/métodos , Relação Estrutura-Atividade

15.

Analysis-Driven Lossy Compression of DNA Microarray Images.

Hernández-Cabronero, Miguel; Blanes, Ian; Pinho, Armando J; Marcellin, Michael W; Serra-Sagristà, Joan.

IEEE Trans Med Imaging ; 35(2): 654-64, 2016 Feb.

Artigo em Inglês | MEDLINE | ID: mdl-26462084

RESUMO

DNA microarrays are one of the fastest-growing new technologies in the field of genetic research, and DNA microarray images continue to grow in number and size. Since analysis techniques are under active and ongoing development, storage, transmission and sharing of DNA microarray images need be addressed, with compression playing a significant role. However, existing lossless coding algorithms yield only limited compression performance (compression ratios below 2:1), whereas lossy coding methods may introduce unacceptable distortions in the analysis process. This work introduces a novel Relative Quantizer (RQ), which employs non-uniform quantization intervals designed for improved compression while bounding the impact on the DNA microarray analysis. This quantizer constrains the maximum relative error introduced into quantized imagery, devoting higher precision to pixels critical to the analysis process. For suitable parameter choices, the resulting variations in the DNA microarray analysis are less than half of those inherent to the experimental variability. Experimental results reveal that appropriate analysis can still be performed for average compression ratios exceeding 4.5:1.

Assuntos

Compressão de Dados/métodos , Processamento de Imagem Assistida por Computador/métodos , Análise de Sequência com Séries de Oligonucleotídeos/instrumentação , Análise de Sequência com Séries de Oligonucleotídeos/métodos , DNA/análise , DNA/química , DNA/genética , Desenho de Equipamento

16.

An alignment-free method to find and visualise rearrangements between pairs of DNA sequences.

Pratas, Diogo; Silva, Raquel M; Pinho, Armando J; Ferreira, Paulo J S G.

Sci Rep ; 5: 10203, 2015 May 18.

Artigo em Inglês | MEDLINE | ID: mdl-25984837

RESUMO

Species evolution is indirectly registered in their genomic structure. The emergence and advances in sequencing technology provided a way to access genome information, namely to identify and study evolutionary macro-events, as well as chromosome alterations for clinical purposes. This paper describes a completely alignment-free computational method, based on a blind unsupervised approach, to detect large-scale and small-scale genomic rearrangements between pairs of DNA sequences. To illustrate the power and usefulness of the method we give complete chromosomal information maps for the pairs human-chimpanzee and human-orangutan. The tool by means of which these results were obtained has been made publicly available and is described in detail.

Assuntos

Biologia Computacional/métodos , Rearranjo Gênico , Genômica/métodos , Algoritmos , Animais , Humanos , Navegador

17.

Three minimal sequences found in Ebola virus genomes and absent from human DNA.

Silva, Raquel M; Pratas, Diogo; Castro, Luísa; Pinho, Armando J; Ferreira, Paulo J S G.

Bioinformatics ; 31(15): 2421-5, 2015 Aug 01.

Artigo em Inglês | MEDLINE | ID: mdl-25840045

RESUMO

MOTIVATION: Ebola virus causes high mortality hemorrhagic fevers, with more than 25 000 cases and 10 000 deaths in the current outbreak. Only experimental therapies are available, thus, novel diagnosis tools and druggable targets are needed. RESULTS: Analysis of Ebola virus genomes from the current outbreak reveals the presence of short DNA sequences that appear nowhere in the human genome. We identify the shortest such sequences with lengths between 12 and 14. Only three absent sequences of length 12 exist and they consistently appear at the same location on two of the Ebola virus proteins, in all Ebola virus genomes, but nowhere in the human genome. The alignment-free method used is able to identify pathogen-specific signatures for quick and precise action against infectious agents, of which the current Ebola virus outbreak provides a compelling example.

Assuntos

DNA Viral/química , Ebolavirus/genética , Surtos de Doenças , Genoma Humano , Genoma Viral , Doença pelo Vírus Ebola/epidemiologia , Doença pelo Vírus Ebola/virologia , Humanos , Análise de Sequência de DNA , Proteínas Virais/genética

18.

MAFCO: a compression tool for MAF files.

Matos, Luís M O; Neves, António J R; Pratas, Diogo; Pinho, Armando J.

PLoS One ; 10(3): e0116082, 2015.

Artigo em Inglês | MEDLINE | ID: mdl-25816229

RESUMO

In the last decade, the cost of genomic sequencing has been decreasing so much that researchers all over the world accumulate huge amounts of data for present and future use. These genomic data need to be efficiently stored, because storage cost is not decreasing as fast as the cost of sequencing. In order to overcome this problem, the most popular general-purpose compression tool, gzip, is usually used. However, these tools were not specifically designed to compress this kind of data, and often fall short when the intention is to reduce the data size as much as possible. There are several compression algorithms available, even for genomic data, but very few have been designed to deal with Whole Genome Alignments, containing alignments between entire genomes of several species. In this paper, we present a lossless compression tool, MAFCO, specifically designed to compress MAF (Multiple Alignment Format) files. Compared to gzip, the proposed tool attains a compression gain from 34% to 57%, depending on the data set. When compared to a recent dedicated method, which is not compatible with some data sets, the compression gain of MAFCO is about 9%. Both source-code and binaries for several operating systems are freely available for non-commercial use at: http://bioinformatics.ua.pt/software/mafco.

Assuntos

Compressão de Dados/métodos , Genômica/métodos , Alinhamento de Sequência , Fatores de Tempo

19.

ECG biometric identification: A compression based approach.

Bras, Susana; Pinho, Armando J.

Annu Int Conf IEEE Eng Med Biol Soc ; 2015: 5838-41, 2015 Aug.

Artigo em Inglês | MEDLINE | ID: mdl-26737619

RESUMO

Using the electrocardiogram signal (ECG) to identify and/or authenticate persons are problems still lacking satisfactory solutions. Yet, ECG possesses characteristics that are unique or difficult to get from other signals used in biometrics: (1) it requires contact and liveliness for acquisition (2) it changes under stress, rendering it potentially useless if acquired under threatening. Our main objective is to present an innovative and robust solution to the above-mentioned problem. To successfully conduct this goal, we rely on information-theoretic data models for data compression and on similarity metrics related to the approximation of the Kolmogorov complexity. The proposed measure allows the comparison of two (or more) ECG segments, without having to follow traditional approaches that require heartbeat segmentation (described as highly influenced by external or internal interferences). As a first approach, the method was able to cluster the data in three groups: identical record, same participant, different participant, by the stratification of the proposed measure with values near 0 for the same participant and closer to 1 for different participants. A leave-one-out strategy was implemented in order to identify the participant in the database based on his/her ECG. A 1NN classifier was implemented, using as distance measure the method proposed in this work. The classifier was able to identify correctly almost all participants, with an accuracy of 99% in the database used.

Assuntos

Eletrocardiografia , Algoritmos , Identificação Biométrica , Compressão de Dados , Processamento de Sinais Assistido por Computador

20.

XS: a FASTQ read simulator.

Pratas, Diogo; Pinho, Armando J; Rodrigues, João M O S.

BMC Res Notes ; 7: 40, 2014 Jan 16.

Artigo em Inglês | MEDLINE | ID: mdl-24433564

RESUMO

BACKGROUND: The emerging next-generation sequencing (NGS) is bringing, besides the natural huge amounts of data, an avalanche of new specialized tools (for analysis, compression, alignment, among others) and large public and private network infrastructures. Therefore, a direct necessity of specific simulation tools for testing and benchmarking is rising, such as a flexible and portable FASTQ read simulator, without the need of a reference sequence, yet correctly prepared for producing approximately the same characteristics as real data. FINDINGS: We present XS, a skilled FASTQ read simulation tool, flexible, portable (does not need a reference sequence) and tunable in terms of sequence complexity. It has several running modes, depending on the time and memory available, and is aimed at testing computing infrastructures, namely cloud computing of large-scale projects, and testing FASTQ compression algorithms. Moreover, XS offers the possibility of simulating the three main FASTQ components individually (headers, DNA sequences and quality-scores). CONCLUSIONS: XS provides an efficient and convenient method for fast simulation of FASTQ files, such as those from Ion Torrent (currently uncovered by other simulators), Roche-454, Illumina and ABI-SOLiD sequencing machines. This tool is publicly available at http://bioinformatics.ua.pt/software/xs/.

Assuntos

Biologia Computacional/métodos , Sequenciamento de Nucleotídeos em Larga Escala/estatística & dados numéricos , Análise de Sequência de DNA/métodos , Algoritmos , Reprodutibilidade dos Testes , Software

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

RESUMO

RESUMO

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

RESUMO

Assuntos

ENVIAR RESULTADO:

SELEÇÃO DE REFERÊNCIAS

DETALHE DA PESQUISA