Results 1 - 20 of 131,005
1.
BMC Bioinformatics ; 25(1): 204, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38824535

ABSTRACT

BACKGROUND: Protein solubility is a critically important physicochemical property closely related to protein expression. For example, it is one of the main factors considered in the design and production of antibody drugs and a prerequisite for realizing various protein functions. Although several solubility prediction models have emerged in recent years, many of them are limited to information embedded in one-dimensional amino acid sequences, resulting in unsatisfactory predictive performance. RESULTS: In this study, we introduce GATSol, a novel Graph Attention network-based protein Solubility model that represents the 3D structure of a protein as a protein graph. In addition to node features of amino acids extracted by a state-of-the-art protein large language model, GATSol utilizes amino acid distance maps generated using the latest AlphaFold technology. Rigorous testing on the independent eSOL and Saccharomyces cerevisiae test datasets has shown that GATSol outperforms most recently introduced models, especially with respect to the coefficient of determination (R²), which reaches 0.517 and 0.424, respectively. It outperforms the current state-of-the-art GraphSol by 18.4% on the S. cerevisiae_test set. CONCLUSIONS: GATSol captures 3D structural features of proteins by building protein graphs, which significantly improves the accuracy of protein solubility prediction. Recent advances in protein structure modeling allow our method to incorporate spatial structure features extracted from predicted structures while requiring only protein sequences as input, which simplifies the entire graph neural network prediction process, making it more user-friendly and efficient. As a result, GATSol may help prioritize highly soluble proteins, ultimately reducing the cost and effort of experimental work. The source code and data of the GATSol model are freely available at https://github.com/binbinbinv/GATSol.
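
The graph construction step described here is easy to picture: a predicted residue-residue distance map becomes an adjacency matrix once a contact cutoff is applied. The sketch below illustrates that idea; the 10 Å cutoff and toy distance map are hypothetical choices for illustration, not GATSol's actual settings.

```python
import numpy as np

def distance_map_to_graph(dist_map: np.ndarray, cutoff: float = 10.0) -> np.ndarray:
    """Threshold a residue-residue distance map (in angstroms) into an
    undirected adjacency matrix, excluding self-loops.
    The cutoff value is illustrative, not the one GATSol uses."""
    adj = (dist_map < cutoff).astype(np.int8)
    np.fill_diagonal(adj, 0)  # no self-edges
    return adj

# Toy 4-residue protein with a hypothetical symmetric distance map.
dist = np.array([
    [0.0,  5.2, 12.1, 18.4],
    [5.2,  0.0,  6.3, 13.0],
    [12.1, 6.3,  0.0,  7.8],
    [18.4, 13.0, 7.8,  0.0],
])
print(distance_map_to_graph(dist))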


Subject(s)
Proteins , Solubility , Proteins/chemistry , Proteins/metabolism , Protein Conformation , Databases, Protein , Computational Biology/methods , Software , Saccharomyces cerevisiae/metabolism , Saccharomyces cerevisiae/chemistry , Algorithms , Models, Molecular , Amino Acid Sequence
2.
Philos Trans A Math Phys Eng Sci ; 382(2274): 20230257, 2024 Jul 09.
Article in English | MEDLINE | ID: mdl-38826050

ABSTRACT

The OpenFlexure Microscope is an accessible, three-dimensional-printed robotic microscope, with sufficient image quality to resolve diagnostic features including parasites and cancerous cells. As access to lab-grade microscopes is a major challenge in global healthcare, the OpenFlexure Microscope has been developed to be manufactured, maintained and used in remote environments, supporting point-of-care diagnosis. The steps taken in transforming the hardware and software from an academic prototype towards an accepted medical device include addressing technical and social challenges, and are key for any innovation targeting improved effectiveness in low-resource healthcare. This article is part of the Theo Murphy meeting issue 'Open, reproducible hardware for microscopy'.


Subject(s)
Microscopy , Microscopy/instrumentation , Microscopy/methods , Humans , Robotics/instrumentation , Robotics/trends , Robotics/statistics & numerical data , Equipment Design , Printing, Three-Dimensional/instrumentation , Delivery of Health Care , Software , Point-of-Care Systems
3.
J Vis Exp ; (207), 2024 May 17.
Article in English | MEDLINE | ID: mdl-38829110

ABSTRACT

PyDesigner is a Python-based software package built on the original Diffusion parameter EStImation with Gibbs and NoisE Removal (DESIGNER) pipeline (Dv1) for dMRI preprocessing and tensor estimation. This software is openly provided for non-commercial research and may not be used for clinical care. PyDesigner combines tools from FSL and MRtrix3 to perform denoising, Gibbs ringing correction, eddy current motion correction, brain masking, image smoothing, and Rician bias correction to optimize the estimation of multiple diffusion measures. It can be used across platforms on Windows, Mac, and Linux to accurately derive commonly used metrics from DKI, DTI, WMTI, FBI, and FBWM datasets, as well as tractography ODFs and .fib files. It is also file-format agnostic, accepting inputs in .nii, .nii.gz, .mif, and DICOM formats. User-friendly and easy to install, the software also outputs quality control metrics illustrating signal-to-noise ratio graphs, outlier voxels, and head motion to help evaluate data integrity. Additionally, this dMRI processing pipeline supports multiple echo-time dataset processing and features pipeline customization, allowing the user to specify which processes are employed and which outputs are produced to meet a variety of user needs.


Subject(s)
Diffusion Magnetic Resonance Imaging , Software , Humans , Diffusion Magnetic Resonance Imaging/methods , Image Processing, Computer-Assisted/methods , Brain/diagnostic imaging
4.
Genome Biol ; 25(1): 145, 2024 Jun 03.
Article in English | MEDLINE | ID: mdl-38831386

ABSTRACT

BACKGROUND: Single-cell RNA sequencing (scRNA-seq) and spatially resolved transcriptomics (SRT) have led to groundbreaking advancements in the life sciences. To develop bioinformatics tools for scRNA-seq and SRT data and perform unbiased benchmarks, data simulation has been widely adopted, since it provides explicit ground truth and generates customized datasets. However, the performance of simulation methods under multiple scenarios has not been comprehensively assessed, making it challenging to choose suitable methods without practical guidelines. RESULTS: We systematically evaluated 49 simulation methods developed for scRNA-seq and/or SRT data in terms of accuracy, functionality, scalability, and usability, using 152 reference datasets derived from 24 platforms. SRTsim, scDesign3, ZINB-WaVE, and scDesign2 have the best accuracy across various platforms. Unexpectedly, some methods tailored to scRNA-seq data are potentially compatible with simulating SRT data. Lun, SPARSim, and scDesign3-tree outperform other methods under their corresponding simulation scenarios. Phenopath, Lun, Simple, and MFA yield high scalability scores but cannot generate realistic simulated data. Users should consider the trade-offs between method accuracy and scalability (or functionality) when making decisions. Additionally, execution errors are mainly caused by failed parameter estimations and the appearance of missing or infinite values in calculations. We provide practical guidelines for method selection, a standard pipeline, Simpipe (https://github.com/duohongrui/simpipe; https://doi.org/10.5281/zenodo.11178409), and an online tool, Simsite (https://www.ciblab.net/software/simshiny/), for data simulation. CONCLUSIONS: No method performs best on all criteria; thus, a good-yet-not-the-best method is recommended if it solves problems effectively and reasonably. Our comprehensive work provides crucial insights for developers on modeling gene expression data and facilitates the simulation process for users.
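
For intuition about what these simulators generate, the sketch below draws a cells-by-genes count matrix from a zero-inflated negative binomial, the model family behind tools such as ZINB-WaVE; all parameter values here are illustrative rather than fitted to any reference dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_zinb_counts(n_cells, n_genes, mean=2.0, dispersion=0.5, dropout=0.3):
    """Draw a cells-by-genes count matrix from a zero-inflated negative
    binomial. Parameters are illustrative placeholders, not estimates."""
    # NB parameterized by mean and dispersion: r = 1/dispersion, p = r/(r + mean)
    r = 1.0 / dispersion
    p = r / (r + mean)
    counts = rng.negative_binomial(r, p, size=(n_cells, n_genes))
    zeros = rng.random((n_cells, n_genes)) < dropout  # excess-zero (dropout) mask
    counts[zeros] = 0
    return counts

print(simulate_zinb_counts(5, 8))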


Subject(s)
Gene Expression Profiling , Single-Cell Analysis , Single-Cell Analysis/methods , Gene Expression Profiling/methods , Humans , Software , Computer Simulation , Transcriptome , Computational Biology/methods , Sequence Analysis, RNA/methods , RNA-Seq/methods , RNA-Seq/standards
5.
BMC Med Inform Decis Mak ; 24(1): 152, 2024 Jun 04.
Article in English | MEDLINE | ID: mdl-38831432

ABSTRACT

BACKGROUND: Machine learning (ML) has emerged as the predominant computational paradigm for analyzing large-scale datasets across diverse domains. The assessment of dataset quality stands as a pivotal precursor to the successful deployment of ML models. In this study, we introduce DREAMER (Data REAdiness for MachinE learning Research), an algorithmic framework leveraging supervised and unsupervised machine learning techniques to autonomously evaluate the suitability of tabular datasets for ML model development. DREAMER is openly accessible as a tool on GitHub and Docker, facilitating its adoption and further refinement within the research community. RESULTS: The proposed model was applied to three distinct tabular datasets, resulting in notable enhancements in their quality with respect to readiness for ML tasks, as assessed through established data quality metrics. Our findings demonstrate the efficacy of the framework in substantially augmenting the original dataset quality, achieved through the elimination of extraneous features and rows. This refinement yielded improved accuracy across both supervised and unsupervised learning methodologies. CONCLUSION: Our software presents an automated framework for data readiness, aimed at enhancing the integrity of raw datasets to facilitate robust utilization within ML pipelines. Through the proposed framework, we streamline the original dataset, resulting in enhanced accuracy and efficiency within the associated ML algorithms.
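
As a rough illustration of the kind of filtering such a framework performs, the sketch below drops mostly-missing rows and near-constant numeric columns with pandas; the function name and thresholds are hypothetical, not DREAMER's.

```python
import pandas as pd

def basic_readiness_filter(df: pd.DataFrame,
                           max_row_missing: float = 0.5,
                           min_col_variance: float = 1e-6) -> pd.DataFrame:
    """Illustrative stand-in for one aspect of data-readiness filtering:
    remove rows that are mostly missing and numeric columns with near-zero
    variance. Thresholds here are placeholders, not DREAMER's values."""
    # Keep rows where at most half the fields are missing.
    df = df.loc[df.isna().mean(axis=1) <= max_row_missing]
    # Drop numeric columns that carry essentially no information.
    numeric = df.select_dtypes("number")
    dead_cols = numeric.columns[numeric.var() < min_col_variance]
    return df.drop(columns=dead_cols)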


Subject(s)
Machine Learning , Humans , Datasets as Topic , Unsupervised Machine Learning , Algorithms , Supervised Machine Learning , Software
6.
Brief Bioinform ; 25(4), 2024 May 23.
Article in English | MEDLINE | ID: mdl-38828640

ABSTRACT

Cell hashing, a nucleotide barcode-based method that allows users to pool multiple samples and demultiplex in downstream analysis, has gained widespread popularity in single-cell sequencing due to its compatibility, simplicity, and cost-effectiveness. Despite these advantages, the performance of this method remains unsatisfactory under certain circumstances, especially in experiments that have imbalanced sample sizes or use many hashtag antibodies. Here, we introduce a hybrid demultiplexing strategy that increases accuracy and cell recovery in multi-sample single-cell experiments. This approach correlates the results of cell hashing and genetic variant clustering, enabling precise and efficient cell identity determination without additional experimental costs or efforts. In addition, we developed HTOreader, a demultiplexing tool for cell hashing that improves the accuracy of cut-off calling by avoiding the dominance of negative signals in experiments with many hashtags or imbalanced sample sizes. When compared to existing methods using real-world datasets, this hybrid approach and HTOreader consistently generate reliable results with increased accuracy and cell recovery.
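
One simple way to picture cut-off calling on hashtag counts is a two-center split of the log-transformed count distribution for each hashtag, with the midpoint of the centers as the positive/negative threshold. The sketch below is an illustrative version of that idea only; it is not HTOreader's published algorithm.

```python
import numpy as np

def hashtag_cutoff(counts: np.ndarray, n_iter: int = 50) -> float:
    """Split one hashtag's counts into background vs. signal with a
    two-center 1-D k-means on log counts; return the midpoint of the
    centers on the original count scale. Illustrative sketch only."""
    x = np.log1p(counts.astype(float))
    lo, hi = float(x.min()), float(x.max())
    for _ in range(n_iter):
        near_lo = np.abs(x - lo) < np.abs(x - hi)
        if near_lo.all() or (~near_lo).all():
            break  # degenerate split; keep current centers
        lo, hi = x[near_lo].mean(), x[~near_lo].mean()
    return float(np.expm1((lo + hi) / 2.0))

# Toy mixture: background noise around ~5 counts, true signal around ~200.
rng = np.random.default_rng(0)
counts = np.concatenate([rng.poisson(5, 500), rng.poisson(200, 100)])
print(hashtag_cutoff(counts))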


Subject(s)
Single-Cell Analysis , Single-Cell Analysis/methods , Humans , Algorithms , Software , High-Throughput Nucleotide Sequencing/methods , Computational Biology/methods
7.
PLoS One ; 19(6): e0300664, 2024.
Article in English | MEDLINE | ID: mdl-38829847

ABSTRACT

Acoustic surveys of bat echolocation calls are an important management tool for determining presence and probable absence of threatened and endangered bat species. In the northeastern United States, software programs such as Bat Call Identification (BCID), Kaleidoscope Pro (KPro), and Sonobat can automatically classify ultrasonic detector sound files, yet the programs' accuracy in correctly classifying calls to species has not been independently assessed. We used 1,500 full-spectrum reference calls with known identities for nine northeastern United States bat species to test the accuracy of these programs using calculations of Positive Predictive Value (PPV), Negative Predictive Value (NPV), Sensitivity (SN), Specificity (SP), Overall Accuracy, and No Information Rate. We found that BCID performed less accurately than other programs, likely because it only operates on zero-crossing data and may be less accurate for recordings converted from full-spectrum to zero-crossing. NPV and SP values were high across all species categories for SonoBat and KPro, indicating these programs' success at avoiding false positives. However, PPV and SN values were relatively low, particularly for individual Myotis species, indicating these programs are prone to false negatives. SonoBat and KPro performed better when distinguishing Myotis species from non-Myotis species. We expect less accuracy from these programs for acoustic recordings collected under normal working conditions, and caution that a bat acoustic expert should verify automatically classified files when making species-specific regulatory or conservation decisions.
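
The reported metrics follow directly from a per-species confusion matrix; the sketch below computes them, with toy counts chosen to show the low-sensitivity/high-specificity pattern the authors describe.

```python
def classifier_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard confusion-matrix metrics used in the evaluation:
    PPV = TP/(TP+FP), NPV = TN/(TN+FN),
    sensitivity SN = TP/(TP+FN), specificity SP = TN/(TN+FP)."""
    return {
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "SN": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical species with many false negatives: low SN despite high SP.
print(classifier_metrics(tp=40, fp=5, tn=900, fn=55))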


Subject(s)
Chiroptera , Echolocation , Chiroptera/physiology , Chiroptera/classification , Animals , Echolocation/physiology , New England , Vocalization, Animal/physiology , Software , Species Specificity , Acoustics
8.
Commun Biol ; 7(1): 675, 2024 Jun 01.
Article in English | MEDLINE | ID: mdl-38824179

ABSTRACT

The three-dimensional (3D) organization of the genome is fundamental to cell biology. To explore the 3D genome, emerging high-throughput approaches have produced billions of sequencing reads, which are challenging and time-consuming to analyze. Here we present Microcket, a package for mapping and extracting interacting pairs from 3D genomics data, including Hi-C, Micro-C, and derivative protocols. Microcket utilizes a unique read-stitch strategy that takes advantage of the long read cycles in modern DNA sequencers; benchmark evaluations reveal that Microcket runs much faster than current tools while improving mapping efficiency, and thus shows high potential for accelerating and enhancing biological investigations into the 3D genome. Microcket is freely available at https://github.com/hellosunking/Microcket.
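
A read-stitch strategy merges a read pair into one longer fragment wherever the 3' end of read 1 overlaps the (reverse-complemented) start of read 2. The sketch below shows the idea with naive exact matching; Microcket's actual strategy is more sophisticated (e.g., mismatch-tolerant), so treat this as a conceptual illustration only.

```python
def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def stitch(r1: str, r2: str, min_overlap: int = 10) -> str | None:
    """Merge a read pair into one fragment when the 3' end of R1 exactly
    overlaps the reverse-complemented start of R2; return None otherwise.
    A naive sketch of the general idea, not Microcket's implementation."""
    r2 = revcomp(r2)
    # Try the longest possible overlap first, down to min_overlap.
    for olen in range(min(len(r1), len(r2)), min_overlap - 1, -1):
        if r1[-olen:] == r2[:olen]:
            return r1 + r2[olen:]
    return None

print(stitch("ACGTACGTACGTTTTTGGGGA", "TCCCCAAAAACGTACG"))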


Subject(s)
Genomics , Software , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Humans , Sequence Analysis, DNA/methods , Data Analysis
9.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38832466

ABSTRACT

BACKGROUND: Due to human error, sample swapping in large cohort studies with heterogeneous data types (e.g., a mix of Oxford Nanopore Technologies, Pacific Biosciences, and Illumina data) remains a common issue plaguing large-scale studies. At present, all sample-swap detection methods require costly and sometimes unnecessary (e.g., if data are only used for genome assembly) alignment, positional sorting, and indexing of the data before samples can be compared. As studies include more samples and new sequencing data types, robust quality control tools will become increasingly important. FINDINGS: The similarity between samples can be determined using indexed k-mer sequence variants. To increase statistical power, we use coverage information on variant sites, calculating similarity with a likelihood ratio-based test. Per-sample error rate and coverage bias (i.e., missing sites) can also be estimated from this information, which can be used to determine whether a spatially indexed principal component analysis (PCA)-based prescreening method can be applied; this greatly speeds up analysis by preventing exhaustive all-to-all comparisons. CONCLUSIONS: Because this tool processes raw data, is faster than alignment, and can be used on very low-coverage data, it can save an immense amount of computational resources in standard quality control (QC) pipelines. It is robust enough to be used on different sequencing data types, which is important in studies that leverage the strengths of different sequencing technologies. In addition to its primary use case of sample-swap detection, this method also provides information useful in QC, such as error rate and coverage bias, as well as population-level PCA ancestry analysis visualization.
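
The core of a likelihood-ratio test for sample identity can be sketched as follows: each shared variant site contributes a log-ratio comparing the probability of an allele match if the two samples are the same (roughly one minus the error rate) against a background match rate for unrelated samples. The rates below are placeholders for illustration, not the paper's estimates, and the real method also weights sites by coverage.

```python
import math

def match_llr(calls_a, calls_b, error_rate=0.01, bg_match=0.5):
    """Log-likelihood ratio that two allele-call sets come from the same
    sample. Simplified sketch: one call per shared site, fixed rates."""
    llr = 0.0
    for a, b in zip(calls_a, calls_b):
        if a == b:
            llr += math.log((1 - error_rate) / bg_match)
        else:
            llr += math.log(error_rate / (1 - bg_match))
    return llr  # strongly positive => likely the same sample

print(match_llr("ACGTTACA", "ACGTTACA"))  # identical calls: high LLR
print(match_llr("ACGTTACA", "TCGATTGA"))  # many mismatches: negative LLR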


Subject(s)
High-Throughput Nucleotide Sequencing , Sequence Analysis, DNA , Humans , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Software , Principal Component Analysis , Computational Biology/methods , Algorithms
10.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38832465

ABSTRACT

BACKGROUND: As the number of genome-wide association studies (GWAS) and quantitative trait locus (QTL) mappings in rice continues to grow, so does the already long list of genomic loci associated with important agronomic traits. Typically, loci implicated by GWAS/QTL analysis contain tens to thousands of single-nucleotide polymorphisms (SNPs)/genes, not all of which are causal and many of which lie in noncoding regions. Unraveling the biological mechanisms that tie the GWAS regions and QTLs to the trait of interest is challenging, especially since it requires collating functional genomics information about the loci from multiple, disparate data sources. RESULTS: We present RicePilaf, a web app for post-GWAS/QTL analysis that performs a slew of novel bioinformatics analyses to cross-reference GWAS results and QTL mappings with a host of publicly available rice databases. In particular, it integrates (i) pangenomic information from high-quality genome builds of multiple rice varieties, (ii) coexpression information from genome-scale coexpression networks, (iii) ontology and pathway information, (iv) regulatory information from rice transcription factor databases, (v) epigenomic information from multiple high-throughput epigenetic experiments, and (vi) text-mining information extracted from scientific abstracts linking genes and traits. We demonstrate the utility of RicePilaf by applying it to analyze GWAS peaks of preharvest sprouting and genes underlying yield-under-drought QTLs. CONCLUSIONS: RicePilaf enables rice scientists and breeders to shed functional light on their GWAS regions and QTLs, and it provides them with a means to prioritize SNPs/genes for further experiments. The source code, a Docker image, and a demo version of RicePilaf are publicly available at https://github.com/bioinfodlsu/rice-pilaf.


Subject(s)
Data Mining , Genome-Wide Association Study , Oryza , Quantitative Trait Loci , Oryza/genetics , Software , Epigenomics/methods , Computational Biology/methods , Polymorphism, Single Nucleotide , Genomics/methods , Genome, Plant , Chromosome Mapping , Databases, Genetic
11.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38832467

ABSTRACT

BACKGROUND: Modern sequencing technologies offer extraordinary opportunities for virus discovery and virome analysis. Annotation of viral sequences from metagenomic data requires a complex series of steps to ensure accurate annotation of individual reads and assembled contigs. In addition, varying study designs will require project-specific statistical analyses. FINDINGS: Here we introduce Hecatomb, a bioinformatic platform coordinating commonly used tasks required for virome analysis. Hecatomb means "a great sacrifice." In this setting, Hecatomb is "sacrificing" false-positive viral annotations using extensive quality control and tiered-database searches. Hecatomb processes metagenomic data obtained from both short- and long-read sequencing technologies, providing annotations to individual sequences and assembled contigs. Results are provided in commonly used data formats useful for downstream analysis. Here we demonstrate the functionality of Hecatomb through the reanalysis of a primate enteric and a novel coral reef virome. CONCLUSION: Hecatomb provides an integrated platform to manage many commonly used steps for virome characterization, including rigorous quality control, host removal, and both read- and contig-based analysis. Each step is managed using the Snakemake workflow manager with dependency management using Conda. Hecatomb outputs several tables properly formatted for immediate use within popular data analysis and visualization tools, enabling effective data interpretation for a variety of study designs. Hecatomb is hosted on GitHub (github.com/shandley/hecatomb) and is available for installation from Bioconda and PyPI.


Subject(s)
Metagenomics , Software , Metagenomics/methods , Virome/genetics , Viruses/genetics , Viruses/classification , Animals , Computational Biology/methods , Genome, Viral , Metagenome
12.
Microb Genom ; 10(6), 2024 Jun.
Article in English | MEDLINE | ID: mdl-38833287

ABSTRACT

It is now possible to assemble near-perfect bacterial genomes using Oxford Nanopore Technologies (ONT) long reads, but short-read polishing is usually required for perfection. However, the effect of short-read depth on polishing performance is not well understood. Here, we introduce Pypolca (with default and careful parameters) and Polypolish v0.6.0 (with a new careful parameter). We then show that: (1) all polishers other than Pypolca-careful, Polypolish-default and Polypolish-careful commonly introduce false-positive errors at low read depth; (2) most of the benefit of short-read polishing occurs by 25× depth; (3) Polypolish-careful almost never introduces false-positive errors at any depth; and (4) Pypolca-careful is the single most effective polisher. Overall, we recommend the following polishing strategies: Polypolish-careful alone when depth is very low (<5×), Polypolish-careful and Pypolca-careful when depth is low (5-25×), and Polypolish-default and Pypolca-careful when depth is sufficient (>25×).
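
The recommendations above translate directly into a small helper; the depth thresholds come straight from the text, while the function name is ours.

```python
def recommend_polishers(short_read_depth: float) -> list[str]:
    """Encode the paper's depth-based polishing recommendations."""
    if short_read_depth < 5:       # very low depth
        return ["Polypolish-careful"]
    if short_read_depth <= 25:     # low depth
        return ["Polypolish-careful", "Pypolca-careful"]
    return ["Polypolish-default", "Pypolca-careful"]  # sufficient depth

print(recommend_polishers(3))   # ['Polypolish-careful']
print(recommend_polishers(40))  # ['Polypolish-default', 'Pypolca-careful']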


Subject(s)
Genome, Bacterial , Nanopores , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Nanopore Sequencing/methods , Bacteria/genetics , Bacteria/classification , Software , Genomics/methods
13.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38695119

ABSTRACT

Sequence similarity is of paramount importance in biology, as similar sequences tend to have similar functions and share common ancestry. Scoring matrices, such as PAM or BLOSUM, play a crucial role in all bioinformatics algorithms for identifying similarities, but they have the drawback of being fixed, independent of context. We propose a new scoring method for amino acid similarity that remedies this weakness by being contextually dependent. It relies on recent advances in deep learning architectures that employ self-supervised learning to leverage enormous amounts of unlabelled data and generate contextual embeddings: vector representations for words. These ideas have been applied to protein sequences, producing embedding vectors for protein residues. We propose the E-score between two residues as the cosine similarity between their embedding vector representations. Thorough testing on a wide variety of reference multiple sequence alignments indicates that the alignments produced using the new E-score method, especially the ProtT5-score, are significantly better than those obtained using BLOSUM matrices. The new method proposes to change the way alignments are computed, with far-reaching implications in all areas of textual data that use sequence similarity. The program to compute alignments based on the various E-scores is available as a web server at e-score.csd.uwo.ca. The source code is freely available for download from github.com/lucian-ilie/E-score.
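
The E-score itself is simple to compute once per-residue embeddings are available, as the sketch below shows; the 1024-dimensional random vectors merely stand in for real embeddings such as ProtT5's.

```python
import numpy as np

def e_score(u: np.ndarray, v: np.ndarray) -> float:
    """E-score of two residues: cosine similarity of their contextual
    embedding vectors, as defined in the abstract."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Random vectors as placeholders for per-residue embeddings.
rng = np.random.default_rng(1)
u, v = rng.normal(size=1024), rng.normal(size=1024)  # 1024 dims: illustrative
print(e_score(u, v))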


Subject(s)
Algorithms , Computational Biology , Sequence Alignment , Sequence Alignment/methods , Computational Biology/methods , Software , Sequence Analysis, Protein/methods , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Deep Learning , Databases, Protein
14.
Elife ; 13, 2024 May 02.
Article in English | MEDLINE | ID: mdl-38696239

ABSTRACT

The reconstruction of complete microbial metabolic pathways using 'omics data from environmental samples remains challenging. Computational pipelines for pathway reconstruction that utilize machine learning methods to predict the presence or absence of KEGG modules in incomplete genomes are lacking. Here, we present MetaPathPredict, a software tool that incorporates machine learning models to predict the presence of complete KEGG modules within bacterial genomic datasets. Using gene annotation data and information from the KEGG module database, MetaPathPredict employs deep learning models to predict the presence of KEGG modules in a genome. MetaPathPredict can be used as a command line tool or as a Python module, and both options are designed to be run locally or on a compute cluster. Benchmarks show that MetaPathPredict makes robust predictions of KEGG module presence within highly incomplete genomes.


Subject(s)
Genome, Bacterial , Metabolic Networks and Pathways , Software , Metabolic Networks and Pathways/genetics , Computational Biology/methods , Machine Learning , Bacteria/genetics , Bacteria/metabolism , Bacteria/classification
15.
Am J Clin Nutr ; 119(5): 1301-1308, 2024 May.
Article in English | MEDLINE | ID: mdl-38702110

ABSTRACT

BACKGROUND: There are few resources available for researchers aiming to conduct 24-h dietary record and recall analysis using R. OBJECTIVES: We aimed to develop DietDiveR, a toolkit of functions written in R for the analysis of recall or record data collected with the Automated Self-Administered 24-h Dietary Assessment Tool or 2-d 24-h dietary recall data from the National Health and Nutrition Examination Survey (NHANES). The R functions are intended for food and nutrition researchers who are not computational experts. METHODS: DietDiveR provides users with functions to 1) clean dietary data, 2) analyze 24-h dietary intakes in relation to other study-specific metadata variables, 3) visualize percentages of energy intake from macronutrients, 4) perform principal component analysis or k-means clustering to group participants by similar data-driven dietary patterns, 5) generate foodtrees based on the hierarchical food group information for the food items consumed, 6) perform principal coordinate analysis taking food grouping information into account, and 7) calculate diversity metrics for the overall diet and specific food groups. DietDiveR includes a self-paced tutorial on its website (https://computational-nutrition-lab.github.io/DietDiveR/). As a demonstration, we applied DietDiveR to an example dataset and data from NHANES 2015-2016 to derive a dietary diversity measure of nuts, seeds, and legumes consumption. RESULTS: Adult participants in the NHANES 2015-2016 cycle were grouped by the diversity of their mean consumption of nuts, seeds, and legumes. The group with the highest diversity in nuts, seeds, and legumes consumption had a 3.8 cm lower waist circumference (95% confidence interval: 1.0, 6.5) than those who did not consume nuts, seeds, and legumes. CONCLUSIONS: DietDiveR enables users to visualize dietary data and conduct data-driven dietary pattern analyses in R to answer research questions regarding diet. As a demonstration of the toolkit, we explored the diversity of nuts, seeds, and legumes consumption to highlight some of the ways DietDiveR can be used for analyses of dietary diversity.


Subject(s)
Diet , Nutrition Surveys , Humans , Cross-Sectional Studies , Diet Records , Female , Male , Adult , Feeding Behavior , Nutrition Assessment , Middle Aged , Software , Dietary Patterns
16.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701412

ABSTRACT

Trajectory inference is a crucial task in single-cell RNA-sequencing downstream analysis that can reveal the dynamic processes of biological development, including cell differentiation. Dimensionality reduction is an important step in the trajectory inference process. However, most existing trajectory methods rely on cell features derived from traditional dimensionality reduction methods, such as principal component analysis and uniform manifold approximation and projection. These methods are not specifically designed for trajectory inference and fail to fully leverage prior information from upstream analysis, limiting their performance. Here, we introduce scCRT, a novel dimensionality reduction model for trajectory inference. To utilize prior information to learn accurate cell representations, scCRT integrates two feature learning components: a cell-level pairwise module and a cluster-level contrastive module. The cell-level module focuses on learning accurate cell representations in a reduced-dimensionality space while maintaining the cell-cell positional relationships of the original space. The cluster-level contrastive module uses prior cell-state information to aggregate similar cells, preventing excessive dispersion in the low-dimensional space. Experimental findings from 54 real and 81 synthetic datasets (135 in total) highlight the superior performance of scCRT compared with commonly used trajectory inference methods. Additionally, an ablation study revealed that both the cell-level and cluster-level modules enhance the model's ability to learn accurate cell features, facilitating cell lineage inference. The source code of scCRT is available at https://github.com/yuchen21-web/scCRT-for-scRNA-seq.
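
One plausible reading of the cell-level pairwise objective is a penalty on the distortion of cell-cell distances between the original and reduced spaces. The sketch below is a conceptual illustration under that assumption, not scCRT's actual loss function.

```python
import numpy as np

def pairwise_preservation_loss(X_high: np.ndarray, X_low: np.ndarray) -> float:
    """Penalize distortion of cell-cell distances between the original
    (high-dimensional) and reduced (low-dimensional) spaces.
    A conceptual sketch only; scCRT's real objective may differ."""
    def pdist(X):
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(-1))
    d_high, d_low = pdist(X_high), pdist(X_low)
    # Normalize so the two spaces are comparable in scale.
    d_high /= max(d_high.max(), 1e-12)
    d_low /= max(d_low.max(), 1e-12)
    return float(((d_high - d_low) ** 2).mean())

rng = np.random.default_rng(0)
cells = rng.normal(size=(20, 50))        # 20 cells, 50 genes
reduced = cells[:, :2]                   # a crude 2-D "embedding"
print(pairwise_preservation_loss(cells, reduced))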


Subject(s)
Algorithms , Single-Cell Analysis , Single-Cell Analysis/methods , Humans , RNA-Seq/methods , Computational Biology/methods , Software , Sequence Analysis, RNA/methods , Animals , Single-Cell Gene Expression Analysis
17.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701416

ABSTRACT

Predicting protein function is crucial for understanding biological life processes, preventing diseases, and developing new drug targets. In recent years, methods based on sequence, structure, and biological networks for protein function annotation have been extensively researched. Although obtaining a protein's three-dimensional structure through experimental or computational methods enhances the accuracy of function prediction, the sheer volume of proteins sequenced by high-throughput technologies presents a significant challenge. To address this issue, we introduce DeepSS2GO (Secondary Structure to Gene Ontology), a deep neural network model and predictor that incorporates secondary structure features along with primary sequence and homology information. The algorithm combines the speed of sequence-based information with the accuracy of structure-based features, while streamlining the redundant data in primary sequences and bypassing the time-consuming challenges of tertiary structure analysis. The results show that its prediction performance surpasses that of state-of-the-art algorithms. It is able to predict key functions by effectively utilizing secondary structure information, rather than broadly predicting general Gene Ontology terms. Additionally, DeepSS2GO predicts five times faster than advanced algorithms, making it highly applicable to massive sequencing data. The source code and trained models are available at https://github.com/orca233/DeepSS2GO.


Subject(s)
Algorithms , Computational Biology , Neural Networks, Computer , Protein Structure, Secondary , Proteins , Proteins/chemistry , Proteins/metabolism , Proteins/genetics , Computational Biology/methods , Databases, Protein , Gene Ontology , Sequence Analysis, Protein/methods , Software
18.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701415

ABSTRACT

N4-acetylcytidine (ac4C) is a disease-related modification found in ribonucleic acid (RNA). Expensive and labor-intensive methods have hindered the exploration of ac4C mechanisms and the development of specific anti-ac4C drugs. Therefore, an advanced prediction model for ac4C in RNA is urgently needed. Despite the construction of various prediction models, several limitations exist: (1) insufficient base-level resolution for ac4C sites; (2) lack of information on species other than Homo sapiens; (3) lack of information on RNA types other than mRNA; and (4) lack of interpretation for each prediction. In light of these limitations, we have reconstructed the previous benchmark dataset and introduced a new dataset including balanced RNA sequences from multiple species and RNA types, while also providing base-level resolution for ac4C sites. Additionally, we have proposed a novel transformer-based architecture and pipeline for predicting ac4C sites, allowing for highly accurate predictions, visually interpretable results, and no restrictions on the length of input RNA sequences. Statistically, our work has improved the accuracy of predicting specific ac4C sites in multiple species from less than 40% to around 85%, achieving a high AUC > 0.9. These results significantly surpass the performance of all existing models.


Subject(s)
Cytidine , Cytidine/analogs & derivatives , RNA , Cytidine/genetics , RNA/genetics , RNA/chemistry , Humans , Computational Biology/methods , Animals , Software , Algorithms
19.
Brief Bioinform ; 25(3), 2024 Mar 27.
Article in English | MEDLINE | ID: mdl-38701418

ABSTRACT

Coverage quantification is required for many sequencing datasets in genomics research. However, most existing tools fail to provide comprehensive statistical results and exhibit limited performance gains from multithreading. Here, we present PanDepth, an ultra-fast and efficient tool for calculating coverage and depth from sequencing alignments. PanDepth outperforms other tools in computation time and memory efficiency for both BAM- and CRAM-format alignment files, regardless of read length. It employs per-chromosome parallel computation and optimized data structures, resulting in ultra-fast computation and memory efficiency. It accepts sorted or unsorted BAM- and CRAM-format alignment files as well as GTF-, GFF-, and BED-formatted interval files or a specific window size. When provided with a reference genome sequence and the option to enable GC content calculation, PanDepth includes GC content statistics, enhancing the accuracy and reliability of copy number variation analysis. Overall, PanDepth is a powerful tool that accelerates scientific discovery in genomics research.
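
Conceptually, per-interval coverage, depth, and GC statistics reduce to a few array operations, as in the sketch below; the function and output field names are illustrative, not PanDepth's exact report format.

```python
import numpy as np

def interval_stats(per_base_depth: np.ndarray, seq: str | None = None) -> dict:
    """Summaries of the kind a coverage tool reports for one interval:
    breadth of coverage (fraction of bases with depth > 0), mean depth,
    and optional GC content. Field names are illustrative placeholders."""
    stats = {
        "coverage": float((per_base_depth > 0).mean()),
        "mean_depth": float(per_base_depth.mean()),
    }
    if seq is not None:
        gc = sum(seq.upper().count(b) for b in "GC")
        stats["gc_content"] = gc / len(seq)
    return stats

# Toy 6-bp interval: 4/6 bases covered, mean depth 2.5, GC = 4/6.
print(interval_stats(np.array([0, 3, 5, 5, 2, 0]), "ACGGCT"))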


Subject(s)
Genomics , Software , Genomics/methods , Humans , Sequence Analysis, DNA/methods , High-Throughput Nucleotide Sequencing/methods , Base Composition , DNA Copy Number Variations , Computational Biology/methods , Algorithms , Sequence Alignment/methods
20.
Protein Sci ; 33(6): e4985, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38717278

ABSTRACT

Inteins are proteins that excise themselves out of host proteins and ligate the flanking polypeptides in an auto-catalytic process called protein splicing. In nature, inteins are either contiguous or split. In the case of split inteins, the two fragments must first form a complex for splicing to occur. Contiguous inteins have previously been artificially split into two fragments because split inteins enable applications that contiguous ones do not. Even naturally split inteins have been split at unnatural split sites to obtain fragments with reduced affinity for one another, which are useful for creating conditional inteins or for studying protein-protein interactions. So far, split sites in inteins have been identified heuristically. We developed Int&in, a web server freely available for academic research (https://intein.biologie.uni-freiburg.de) that runs a machine learning model using logistic regression to predict active and inactive split sites in inteins with high accuracy. The model was trained on a dataset of 126 split sites generated using the gp41-1, Npu DnaE, and CL inteins and validated using 97 split sites extracted from the literature. Despite the limited data size, the model, which uses various protein structural features as well as sequence conservation information, achieves an accuracy of 0.79 and 0.78 on the training and testing sets, respectively. We envision that Int&in will facilitate the engineering of novel split inteins for applications in synthetic and cell biology.
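
A minimal logistic-regression setup of the kind the server runs might look like the sketch below; the feature matrix, feature meanings, and labels are invented placeholders standing in for the structural and conservation features described above, not Int&in's training data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-site features: each row is a candidate split site,
# e.g. [solvent accessibility, in-loop flag, conservation score].
# These feature choices and values are illustrative, not Int&in's.
X_train = np.array([[0.8, 1, 0.2], [0.1, 0, 0.9],
                    [0.7, 1, 0.4], [0.2, 0, 0.8]])
y_train = np.array([1, 0, 1, 0])  # 1 = active split site, 0 = inactive

clf = LogisticRegression().fit(X_train, y_train)
print(clf.predict_proba([[0.6, 1, 0.3]])[:, 1])  # P(active) for a new site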


Subject(s)
Inteins , Internet , Machine Learning , Protein Splicing , Software , Catalytic Domain