1.
Comput Struct Biotechnol J ; 23: 2289-2303, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38840832

ABSTRACT

The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
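As a rough illustration of the two ideas this abstract names, k-mer counting and absent sequences (nullomers), a minimal sketch follows. The function names and the toy sequence are hypothetical and are not taken from the review. The sketch also assumes a single strand and a plain DNA alphabet, whereas real pipelines often count reverse complements as well.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count every overlapping k-mer in a sequence (single strand only)."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def find_nullomers(sequence, k, alphabet="ACGT"):
    """Return all k-mers over the alphabet that never occur in the sequence."""
    observed = count_kmers(sequence, k)
    candidates = ("".join(letters) for letters in product(alphabet, repeat=k))
    return sorted(kmer for kmer in candidates if kmer not in observed)


# Toy example (hypothetical data, not from any real genome).
toy_genome = "ACGTACGTTAGC"
counts = count_kmers(toy_genome, 2)
absent = find_nullomers(toy_genome, 2)
print(counts.most_common(3))
print(absent)
```

Enumerating all candidate k-mers grows as 4^k for DNA, so this brute-force approach only suits small k; it is meant to show the definition, not an efficient method.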

2.
Bioessays ; : e2300210, 2024 May 08.
Article in English | MEDLINE | ID: mdl-38715516

ABSTRACT

Understanding the influence of cis-regulatory elements on gene regulation poses numerous challenges given complexities stemming from variations in transcription factor (TF) binding, chromatin accessibility, structural constraints, and cell-type differences. This review discusses the role of gene regulatory networks in enhancing understanding of transcriptional regulation and covers construction methods ranging from expression-based approaches to supervised machine learning. Additionally, key experimental methods, including MPRAs and CRISPR-Cas9-based screening, which have significantly contributed to understanding TF binding preferences and cis-regulatory element functions, are explored. Lastly, the potential of machine learning and artificial intelligence to unravel cis-regulatory logic is analyzed. These computational advances have far-reaching implications for precision medicine, therapeutic target discovery, and the study of genetic variations in health and disease.

3.
Comput Struct Biotechnol J ; 23: 1919-1928, 2024 Dec.
Article in English | MEDLINE | ID: mdl-38711760

ABSTRACT

The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.

4.
NAR Genom Bioinform ; 6(2): lqae029, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38584871

ABSTRACT

The prevalence of nucleic and peptide short sequences across organismal genomes and proteomes has not been thoroughly investigated. We examined 45 785 reference genomes and 21 871 reference proteomes, spanning archaea, bacteria, eukaryotes and viruses to calculate the rarity of short sequences in them. To capture this, we developed a metric of the rarity of each sequence in nature, the rarity index. We find that the frequency of certain dipeptides in rare oligopeptide sequences is hundreds of times lower than expected, which is not the case for any dinucleotides. We also generate predictive regression models that infer the rarity of nucleic and proteomic sequences across nature or within each domain of life and viruses separately. When examining each of the three domains of life and viruses separately, the R² performance of the model predicting rarity for 5-mer peptides from mono- and dipeptides ranged between 0.814 and 0.932. A separate model predicting rarity for 10-mer oligonucleotides from mono- and dinucleotides achieved R² performance between 0.408 and 0.606. Our results indicate that the mono- and dinucleotide composition of nucleic sequences and the mono- and dipeptide composition of peptide sequences can explain a significant proportion of the variance in their frequencies in nature.

5.
Eur J Cancer ; 196: 113421, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37952501

ABSTRACT

Early diagnosis of cancer can significantly improve survival of cancer patients; however, sensitive and highly specific biomarkers for cancer detection are currently lacking for most cancer types. Nullpeptides are short peptides that are absent from the human proteome. Here, we examined the emergence of nullpeptides during cancer development. We analyzed 3,600,964 somatic mutations across 10,064 whole exome sequencing tumor samples spanning 32 cancer types. We analyzed RNA-seq data from primary tumor samples to identify the subset of nullpeptides that emerge in highly expressed genes. We show that nullpeptides, and particularly the subset that is highly recurrent across cancer patients, can be identified in tumor biopsy samples. We find that cancer genes show an excess of nullpeptides and detect nullpeptide hotspots in specific loci of oncogenes and tumor suppressors. We also observe that recurrent nullpeptides are more likely to be found in neoantigens, which have been shown to be effective targets for immunotherapy, suggesting that they can be used to prioritize candidates. Our findings provide evidence for the utility of nullpeptides as cancer detection and therapeutic biomarkers.


Subject(s)
Neoplasms, Humans, Neoplasms/therapy, Oncogenes, Peptides/genetics, Immunotherapy, Biomarkers, Mutation, Antigens, Neoplasm
6.
BMC Genomics ; 24(1): 768, 2023 Dec 12.
Article in English | MEDLINE | ID: mdl-38087204

ABSTRACT

Early detection of human disease is associated with improved clinical outcomes. However, many diseases are detected at an advanced, symptomatic stage, when patients are past the window for efficacious treatment, which can result in less favorable outcomes. Therefore, methods that can accurately detect human disease at a presymptomatic stage are urgently needed. Here, we introduce "frequentmers": short sequences that are specific and recurrently observed in either patient or healthy control samples, but not in both. We showcase the utility of frequentmers for the detection of liver cirrhosis using metagenomic Next Generation Sequencing data from stool samples of patients and controls. We develop classification models for the detection of liver cirrhosis and achieve an AUC score of 0.91 using ten-fold cross-validation. A small subset of 200 frequentmers can achieve comparable results in detecting liver cirrhosis. Finally, we identify the microbial organisms in liver cirrhosis samples that are associated with the most predictive frequentmer biomarkers.
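The frequentmer definition in this abstract (recurrent in one group, absent from the other) can be sketched as a set operation. This is only an illustrative reading of the definition: the function names, the `min_samples` recurrence threshold, and the toy reads are assumptions, and the abstract does not state how the authors set their thresholds or built their classifiers.

```python
from collections import Counter


def kmer_set(sequence, k):
    """Distinct k-mers present in one sample's sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def recurrent_kmers(samples, k, min_samples):
    """k-mers observed in at least min_samples samples (threshold is assumed)."""
    seen = Counter()
    for sample in samples:
        seen.update(kmer_set(sample, k))
    return {kmer for kmer, n in seen.items() if n >= min_samples}


def all_kmers(samples, k):
    """Every k-mer observed in any sample of a group."""
    return set().union(*(kmer_set(sample, k) for sample in samples))


def frequentmer_candidates(patients, controls, k, min_samples):
    """Recurrent in one group and never seen in the other group."""
    patient_specific = recurrent_kmers(patients, k, min_samples) - all_kmers(controls, k)
    control_specific = recurrent_kmers(controls, k, min_samples) - all_kmers(patients, k)
    return patient_specific, control_specific


# Toy sequences (hypothetical, one string per sample).
patients = ["AAGCC", "TAAGC", "AGCCT"]
controls = ["CCGTA", "GCCGT"]
patient_specific, control_specific = frequentmer_candidates(
    patients, controls, k=3, min_samples=2
)
print(sorted(patient_specific), sorted(control_specific))
```

In this toy data, `GCC` recurs among the patient samples but also appears in a control sample, so it is excluded from the patient-specific set.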


Subject(s)
High-Throughput Nucleotide Sequencing, Liver Cirrhosis, Humans, Liver Cirrhosis/diagnosis, Liver Cirrhosis/genetics, Health Status, Metagenome, Metagenomics, Sensitivity and Specificity
7.
NAR Genom Bioinform ; 5(2): lqad039, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37101657

ABSTRACT

Determining the organisms present in a biosample has many important applications in agriculture, wildlife conservation, and healthcare. Here, we develop a universal fingerprint based on the identification of short peptides that are unique to a specific organism. We define quasi-prime peptides as sequences that are found in only one species, and we analyzed proteomes from 21 875 species, from viruses to humans, and annotated the smallest peptide kmer sequences that are unique to a species and absent from all other proteomes. We also perform simulations across all reference proteomes and observe a lower than expected number of peptide kmers across species and taxonomies, indicating an enrichment for nullpeptides, sequences absent from a proteome. For humans, we find that quasi-primes are found in genes enriched for specific gene ontology terms, including proteasome and ATP and GTP catalysis. We also provide a set of quasi-prime peptides for a number of human pathogens and model organisms and further showcase its utility via two case studies for Mycobacterium tuberculosis and Vibrio cholerae, where we identify quasi-prime peptides in two transmembrane and extracellular proteins with relevance for pathogen detection. Our catalog of quasi-prime peptides provides the smallest unit of information that is specific to a single organism at the protein level, providing a versatile tool for species identification.
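The quasi-prime definition given here (a peptide k-mer found in exactly one species) can be illustrated with a small sketch. The names, the toy proteomes, and the fixed k are assumptions for illustration; the abstract describes annotating the smallest unique k-mer length per species, which this sketch does not attempt.

```python
from collections import defaultdict


def peptide_kmers(proteome, k):
    """Distinct peptide k-mers across all proteins of one proteome."""
    return {
        protein[i:i + k]
        for protein in proteome
        for i in range(len(protein) - k + 1)
    }


def quasi_primes(proteomes, k):
    """Map each species to the k-mers that occur in no other species."""
    owners = defaultdict(set)
    for species, proteome in proteomes.items():
        for kmer in peptide_kmers(proteome, k):
            owners[kmer].add(species)

    unique = {species: set() for species in proteomes}
    for kmer, species_set in owners.items():
        if len(species_set) == 1:
            (only_species,) = species_set
            unique[only_species].add(kmer)
    return unique


# Toy proteomes (hypothetical sequences, not real proteins).
proteomes = {
    "species_a": ["MKTAY", "KTAYL"],
    "species_b": ["MKTWV"],
    "species_c": ["TWVAY"],
}
result = quasi_primes(proteomes, k=3)
for species, kmers in result.items():
    print(species, sorted(kmers))
```

Here `MKT` is shared by two species and `TWV` by two others, so neither qualifies; only k-mers with a single owning species are reported.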
