1.
Database (Oxford) ; 2020, 2020 01 01.
Article in English | MEDLINE | ID: mdl-32621602

ABSTRACT

Online biological databases housing genomics, genetic and breeding data can be constructed using the Tripal toolkit. Tripal is an open-source, internationally developed framework that implements FAIR data principles and is meant to ease the burden of constructing such websites for research communities. Use of a common, open framework improves the sustainability and manageability of such a site. Site developers can create extensions for their site and in turn share those extensions with others. One challenge that community databases often face is the need to provide tools for their users that analyze increasingly large datasets using multiple software tools strung together in a scientific workflow on complicated computational resources. The Tripal Galaxy module, a 'plug-in' for Tripal, meets this need through integration of Tripal with the Galaxy Project workflow management system. Site developers can create workflows appropriate to the needs of their community using Galaxy and then share those for execution on their Tripal sites via automatically constructed, but configurable, web forms or using an application programming interface to power web-based analytical applications. The Tripal Galaxy module helps reduce duplication of effort by allowing site developers to spend time constructing workflows and building their applications rather than rebuilding infrastructure for job management of multi-step applications.


Subject(s)
Database Management Systems , Databases, Genetic , Internet , Software , Computational Biology
2.
Evol Appl ; 13(1): 228-241, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31892954

ABSTRACT

Sequencing technologies and bioinformatic approaches are now available to resolve the challenges associated with complex and heterozygous genomes. Increased access to less expensive and more effective instrumentation will contribute to a wealth of high-quality plant genomes in the next few years. In the meantime, more than 370 tree species are associated with public projects in primary repositories that are interrogating expression profiles, identifying variants, or analyzing targeted capture without a high-quality reference genome. Genomic data from these projects generates sequences that represent intermediate assemblies for transcriptomes and genomes. These data contribute to forest tree biology, but the associated sequence remains trapped in supplemental files that are poorly integrated in plant community databases and comparative genomic platforms. Successful implementation of life science cyberinfrastructure is improving data standards, ontologies, analytic workflows, and integrated database platforms for both model and non-model plant species. Unique to forest trees with large populations that are long-lived, outcrossing, and genetically diverse, the phenotypic and environmental metrics associated with georeferenced populations are just as important as the genomic data sampled for each individual. To address questions related to forest health and productivity, cyberinfrastructure must keep pace with the magnitude of genomic and phenomic sampling of larger populations. This review examines the current landscape of cyberinfrastructure, with an emphasis on best practices and resources to align community data with the Findable, Accessible, Interoperable, and Reusable (FAIR) guidelines.

3.
Front Plant Sci ; 10: 813, 2019.
Article in English | MEDLINE | ID: mdl-31293610

ABSTRACT

Despite tremendous advancements in high throughput sequencing, the vast majority of tree genomes, and in particular, forest trees, remain elusive. Although primary databases store genetic resources for just over 2,000 forest tree species, these are largely focused on sequence storage, basic genome assemblies, and functional assignment through existing pipelines. The tree databases reviewed here serve as secondary repositories for community data. They vary in their focal species, the data they curate, and the analytics provided, but they are united in moving toward a goal of centralizing both data access and analysis. They provide frameworks to view and update annotations for complex genomes, interrogate systems level expression profiles, curate data for comparative genomics, and perform real-time analysis with genotype and phenotype data. The organism databases of today are no longer simply catalogs or containers of genetic information. These repositories represent integrated cyberinfrastructure that supports cross-site queries and analysis in web-based environments. These resources are striving to integrate across diverse experimental designs, sequence types, and related measures through ontologies, community standards, and web services. Efficient, simple, and robust platforms that enhance the data generated by the research community contribute to improving forest health and productivity.

5.
Database (Oxford) ; 2018: 1-11, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30239664

ABSTRACT

Forest trees are valued sources of pulp, timber and biofuels, and serve a role in carbon sequestration, biodiversity maintenance and watershed stability. Examining the relationships among genetic, phenotypic and environmental factors for these species provides insight on the areas of concern for breeders and researchers alike. The TreeGenes database is a web-based repository that is home to 1790 tree species and over 1500 registered users. The database provides a curated archive for high-throughput genomics, including reference genomes, transcriptomes, genetic maps and variant data. These resources are paired with extensive phenotypic information and environmental layers. TreeGenes recently migrated to Tripal, an integrated and open-source database schema and content management system. This migration enabled developments focused on data exchange, data transfer and improved analytical capacity, as well as providing TreeGenes the opportunity to communicate with the following partner databases: Hardwood Genomics Web, Genome Database for Rosaceae, and the Citrus Genome Database. Recent development in TreeGenes has focused on coordinating information for georeferenced accessions, including metadata acquisition and ontological frameworks, to improve integration across studies combining genetic, phenotypic and environmental data. This focus was paired with the development of tools to enable comparative genomics and data visualization. By combining advanced data importers, relevant metadata standards and integrated analytical frameworks, TreeGenes provides a platform for researchers to store, submit and analyze forest tree data.


Subject(s)
Databases, Genetic , Forests , Genomics , Data Mining , Gene Ontology , Phenotype , Phylogeny , Search Engine , Software , Trees/genetics , Trees/growth & development
6.
Database (Oxford) ; 2018, 2018 01 01.
Article in English | MEDLINE | ID: mdl-30239679

ABSTRACT

The future of agricultural research depends on data. The sheer volume of agricultural biological data being produced today makes excellent data management essential. Governmental agencies, publishers and science funders require data management plans for publicly funded research. Furthermore, the value of data increases exponentially when they are properly stored, described, integrated and shared, so that they can be easily utilized in future analyses. AgBioData (https://www.agbiodata.org) is a consortium of people working at agricultural biological databases, data archives and knowledgebases who strive to identify common issues in database development, curation and management, with the goal of creating database products that are more Findable, Accessible, Interoperable and Reusable. We strive to promote authentic, detailed, accurate and explicit communication between all parties involved in scientific data. As a step toward this goal, we present the current state of biocuration, ontologies, metadata and persistence, database platforms, programmatic (machine) access to data, communication and sustainability with regard to data curation. Each section describes challenges and opportunities for these topics, along with recommendations and best practices.


Subject(s)
Agriculture , Databases, Genetic , Genomics , Breeding , Gene Ontology , Metadata , Surveys and Questionnaires
7.
J Asia Pac Entomol ; 21(3): 852-863, 2018 Sep.
Article in English | MEDLINE | ID: mdl-34316264

ABSTRACT

The lone star tick, Amblyomma americanum, is an obligatory ectoparasite of many vertebrates and the primary vector of Ehrlichia chaffeensis, the causative agent of human monocytic ehrlichiosis. This study aimed to investigate the comparative transcriptomes of A. americanum underlying the processes of pathogen acquisition and of immunity towards the pathogen. Differential expression of the whole body transcripts was compared across six different treatments: females and males that were E. chaffeensis non-exposed, E. chaffeensis-exposed/uninfected, and E. chaffeensis-exposed/infected. The Trinity assembly pipeline produced 140,574 transcripts from trimmed and filtered total raw sequence reads (approximately 117M reads). The gold transcript set of the transcriptome data was established to minimize noise by retaining only transcripts homologous to official peptide sets of Ixodes scapularis and A. americanum ESTs and transcripts covered with high enough frequency from the raw data. Comparison of the gene ontology term enrichment analyses for the six groups tested here revealed an up-regulation of genes for defense responses against the pathogen and for the supply of intracellular Ca++ for pathogen proliferation in the pathogen-exposed ticks. Analyses of differential expression, focused on functional subcategories including immune, sialome, neuropeptides, and G protein-coupled receptor, revealed that E. chaffeensis-exposed ticks exhibited an upregulation of transcripts involved in the immune deficiency (IMD) pathway, antimicrobial peptides, Kunitz, an insulin-like peptide, and bursicon receptor over unexposed ones, while transcripts for metalloprotease were down-regulated in general. This study found that ticks exhibit enhanced expression of genes responsible for defense against E. chaffeensis.

8.
IEEE Trans Nanobioscience ; 15(2): 84-92, 2016 03.
Article in English | MEDLINE | ID: mdl-26863669

ABSTRACT

Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional informative feature vectors can be crucial for the performance of the algorithms. In prior work, we have proposed the use of a community detection approach to construct low dimensional feature sets for nucleotide sequence classification. Our approach used the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and subsequently used community detection to identify groups of k-mers that appear frequently in a set of sequences. Whereas this approach worked well for nucleotide sequence classification, it could not be directly used for protein sequences, as the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extended our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
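
The network-construction step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice to link k-mers whose Hamming distance is at most `max_dist`, and the default threshold of 1 are all assumptions made for illustration, and the community detection step is not shown.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def kmer_network(sequences, k, max_dist=1):
    """Build an undirected k-mer network.

    Nodes are the distinct k-mers found in the sequences. Two k-mers are
    linked when their Hamming distance is at most max_dist (an assumed
    linking rule, chosen only for this illustration).
    """
    nodes = set()
    for seq in sequences:
        nodes |= kmers(seq, k)
    edges = {
        (a, b)
        for a, b in combinations(sorted(nodes), 2)
        if hamming(a, b) <= max_dist
    }
    return nodes, edges
```

Calling `kmer_network(["ACGT", "ACGA"], 3)` yields the nodes `ACG`, `CGA` and `CGT`, with a single edge between `CGA` and `CGT`, which differ at one position.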


Subject(s)
Algorithms , Computational Biology/methods , Proteins/chemistry , Proteins/classification , Supervised Machine Learning
9.
IEEE Trans Nanobioscience ; 15(2): 75-83, 2016 03.
Article in English | MEDLINE | ID: mdl-26849871

ABSTRACT

Supervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use the abundant labeled data from a source domain, the limited labeled data and optionally the unlabeled data from the target domain to train classifiers in a domain adaptation setting. We propose two such classifiers, based on logistic regression, and evaluate them for the task of splice site prediction, a difficult and essential step in gene prediction. Our classifiers achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.


Subject(s)
Algorithms , Computational Biology/methods , High-Throughput Nucleotide Sequencing/methods , Logistic Models , RNA Splicing/genetics , Sequence Analysis, DNA/methods , Animals , Area Under Curve , Models, Statistical , ROC Curve
10.
BMC Genomics ; 16: 734, 2015 Sep 29.
Article in English | MEDLINE | ID: mdl-26416786

ABSTRACT

BACKGROUND: Genome assembly remains an unsolved problem. Assembly projects face a range of hurdles that confound assembly. Thus a variety of tools and approaches are needed to improve draft genomes. RESULTS: We used a custom assembly workflow to optimize consensus genome map assembly, resulting in an assembly equal to the estimated length of the Tribolium castaneum genome and with an N50 of more than 1 Mb. We used this map for super scaffolding the T. castaneum sequence assembly, more than tripling its N50 with the program Stitch. CONCLUSIONS: In this article we present software that leverages consensus genome maps assembled from extremely long single molecule maps to increase the contiguity of sequence assemblies. We report the results of applying these tools to validate and improve a 7x Sanger draft of the T. castaneum genome.
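
For readers unfamiliar with the N50 statistic quoted in this abstract, the following is a minimal sketch of the commonly used definition: the largest length L such that pieces of length at least L together cover at least half of the total assembly length. The function name and the example lengths are illustrative assumptions and do not come from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("N50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down, accumulating covered length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first piece that brings coverage to half the total.
        if 2 * running >= total:
            return length
```

For lengths 2 through 10 (total 54), the three longest pieces sum to 27, so `n50([2, 3, 4, 5, 6, 7, 8, 9, 10])` returns 8.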


Subject(s)
Genome , Software , Tribolium/genetics , Animals , Genomics/methods , Sequence Analysis, DNA