Results 1 - 5 of 5
1.
Bioinformatics ; 38(19): 4466-4473, 2022 09 30.
Article in English | MEDLINE | ID: mdl-35929780

ABSTRACT

MOTIVATION: Whole-genome sequencing has revolutionized biosciences by providing tools for constructing complete DNA sequences of individuals. With entire genomes at hand, scientists can pinpoint DNA fragments responsible for oncogenesis and predict patient responses to cancer treatments. Machine learning plays a paramount role in this process. However, the sheer volume of whole-genome data makes it difficult to encode the characteristics of genomic variants as features for learning algorithms. RESULTS: In this article, we propose three feature extraction methods that facilitate classifier learning from sets of genomic variants. The core contributions of this work include: (i) strategies for determining features using variant length binning, clustering and density estimation; (ii) a programming library for automating distribution-based feature extraction in machine learning pipelines. The proposed methods have been validated on five real-world datasets using four different classification algorithms and a clustering approach. Experiments on genomes of 219 ovarian, 61 lung and 929 breast cancer patients show that the proposed approaches automatically identify genomic biomarkers associated with cancer subtypes and clinical response to oncological treatment. Finally, we show that the extracted features can be used alongside unsupervised learning methods to analyze genomic samples. AVAILABILITY AND IMPLEMENTATION: The source code of the presented algorithms and reproducible experimental scripts are available on GitHub at https://github.com/MNMdiagnostics/dbfe. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome , Software , Humans , Genomics/methods , Algorithms , Machine Learning
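
As an illustration of the variant length binning mentioned in the abstract above, the following is a minimal sketch that turns one sample's variant lengths into a fixed-length vector of per-bin counts. It is not the API of the dbfe library; the function name `length_bin_features`, the bin edges and the example lengths are all hypothetical and chosen only for demonstration.

```python
from bisect import bisect_right


def length_bin_features(variant_lengths, bin_edges):
    """Count variant lengths per half-open bin [edge_i, edge_{i+1}).

    Hypothetical helper for illustration only; not part of dbfe.
    bin_edges must be sorted in ascending order.
    """
    n_bins = len(bin_edges) - 1
    counts = [0] * n_bins
    for length in variant_lengths:
        # Index of the bin whose left edge is the largest edge <= length.
        idx = bisect_right(bin_edges, length) - 1
        # Lengths below the first edge or at/above the last edge are skipped.
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


# Illustrative (made-up) variant lengths, in base pairs, for one sample.
sample_lengths = [5, 50, 500, 7, 20000]
# Assumed logarithmically spaced bin edges.
edges = [1, 10, 100, 1000, 10000]
features = length_bin_features(sample_lengths, edges)
```

Each sample yields a vector with one entry per bin, so sets of variants of different sizes become rows of equal width that a standard classifier can consume. The choice of bin edges here is arbitrary.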
2.
BMC Bioinformatics ; 22(1): 59, 2021 Feb 09.
Article in English | MEDLINE | ID: mdl-33563213

ABSTRACT

BACKGROUND: Long noncoding RNAs represent a large class of transcripts with two common features: they exceed an arbitrary length threshold of 200 nt and are assumed to not encode proteins. Although a growing body of evidence indicates that the vast majority of lncRNAs are potentially nonfunctional, hundreds of them have already been revealed to perform essential gene regulatory functions or to be linked to a number of cellular processes, including those associated with the etiology of human diseases. To better understand the biology of lncRNAs, it is essential to perform a more in-depth study of their evolution. In contrast to protein-encoding transcripts, however, they do not show the strong sequence conservation that usually results from purifying selection; therefore, software that is typically used to resolve the evolutionary relationships of protein-encoding genes and transcripts is not applicable to the study of lncRNAs. RESULTS: To tackle this issue, we developed lncEvo, a computational pipeline that consists of three modules: (1) transcriptome assembly from RNA-Seq data, (2) prediction of lncRNAs, and (3) conservation study: a genome-wide comparison of lncRNA transcriptomes between two species of interest, including search for orthologs. Importantly, one can choose to apply lncEvo solely for transcriptome assembly or lncRNA prediction, without calling the conservation-related part. CONCLUSIONS: lncEvo is an all-in-one tool built with the Nextflow framework, utilizing state-of-the-art software and algorithms with customizable trade-offs between speed and sensitivity, ease of use and built-in reporting functionalities. The source code of the pipeline is freely available for academic and nonacademic use under the MIT license at https://gitlab.com/spirit678/lncrna_conservation_nf.


Subject(s)
Algorithms , Computational Biology , RNA, Long Noncoding , Software , Computational Biology/methods , Conserved Sequence , Genome , Humans , RNA, Long Noncoding/genetics , Transcriptome
3.
Nucleic Acids Res ; 48(D1): D238-D245, 2020 01 08.
Article in English | MEDLINE | ID: mdl-31728519

ABSTRACT

SyntDB (http://syntdb.amu.edu.pl/) is a collection of data on long noncoding RNAs (lncRNAs) and their evolutionary relationships in twelve primate species, including humans. This is the first database dedicated to primate lncRNAs, thousands of which are uniquely stored in SyntDB. The lncRNAs were predicted with our computational pipeline using publicly available RNA-Seq data spanning diverse tissues and organs. Most of the species included in SyntDB still lack lncRNA annotations in public resources. In addition to providing users with unique sets of lncRNAs and their characteristics, SyntDB provides data on orthology relationships between the lncRNAs of humans and other primates, which are not available on this scale elsewhere. Keeping in mind that only a small fraction of currently known human lncRNAs have been functionally characterized and that lncRNA conservation is frequently used to identify the most relevant lncRNAs for functional studies, we believe that SyntDB will contribute to ongoing research aimed at deciphering the biological roles of lncRNAs.


Subject(s)
Databases, Nucleic Acid , Primates/genetics , RNA, Long Noncoding/metabolism , Animals , Humans , RNA, Long Noncoding/chemistry , RNA-Seq
4.
Methods Mol Biol ; 1933: 415-429, 2019.
Article in English | MEDLINE | ID: mdl-30945201

ABSTRACT

Long non-coding RNAs (lncRNAs) are a class of potent regulators of gene expression that are found in a wide array of eukaryotes; however, our knowledge about these molecules in plants is very limited. In particular, a number of plant species with important roles in biotechnology, agriculture and basic research still lack comprehensively identified and annotated sets of lncRNAs. To address these shortcomings, we previously created a database of lncRNAs in 10 model species, called CANTATAdb, and now we are expanding this online resource to encompass 39 species, including three algae. The lncRNAs were identified computationally using publicly available RNA sequencing (RNA-Seq) data. Expression values, coding potential calculations and other types of information were used to provide annotations for the identified lncRNAs. The data are freely available for searching, browsing and downloading from an online database called CANTATAdb 2.0 (http://cantata.amu.edu.pl, http://yeti.amu.edu.pl/CANTATA/).


Subject(s)
Computational Biology/methods , Databases, Nucleic Acid , Genome, Plant , Molecular Sequence Annotation , Plants/genetics , RNA, Long Noncoding/genetics , RNA, Plant/genetics , High-Throughput Nucleotide Sequencing/methods , Search Engine , Sequence Analysis, RNA/methods
5.
Acta Biochim Pol ; 63(4): 825-833, 2016.
Article in English | MEDLINE | ID: mdl-27801428

ABSTRACT

Long non-coding RNAs (lncRNAs) are a class of intensely studied, yet enigmatic molecules that make up a substantial portion of the human transcriptome. In this work, we link the origins and functions of some lncRNAs to retroposition, a process resulting in the creation of intronless copies (retrocopies) of the so-called parental genes. We found 35 human retrocopies transcribed in antisense and giving rise to 58 lncRNA transcripts. These lncRNAs share sequence similarity with the corresponding parental genes but in the sense/antisense orientation, meaning they have the potential to interact with each other and to form RNA:RNA duplexes. We took a closer look at these duplexes and found that 10 of the lncRNAs might regulate parental gene expression and processing at the pre-mRNA and mRNA levels. Further analysis of the co-expression and expression correlation provided support for the existence of functional coupling between lncRNAs and their mate parental gene transcripts.


Subject(s)
RNA, Long Noncoding/genetics , Animals , Base Sequence , Conserved Sequence , DNA, Antisense/genetics , Humans , Mice , Molecular Sequence Annotation , Pan troglodytes , RNA Interference , Retroelements