Results 1 - 18 of 18
1.
Plants (Basel) ; 12(20)2023 Oct 14.
Article in English | MEDLINE | ID: mdl-37896036

ABSTRACT

The exact identification of promoter sequences remains a serious problem in computational biology, as the promoter prediction algorithms under development continue to produce false-positive results. Therefore, to fully assess the validity of predicted sequences, it is necessary to perform a comprehensive test of their properties, such as the presence of downstream transcribed DNA regions behind them, or chromatin accessibility for transcription factor binding. In this paper, we examined the promoter sequences of chromosome 1 of the rice Oryza sativa genome from the Database of Potential Promoter Sequences predicted using a mathematical algorithm based on the derivation and calculation of statistically significant promoter classes. In this paper TATA motifs and cis-regulatory elements were identified in the predicted promoter sequences. We also verified the presence of potential transcription start sites near the predicted promoters by analyzing CAGE-seq data. We searched for unannotated transcripts behind the predicted sequences by de novo assembling transcripts from RNA-seq data. We also examined chromatin accessibility in the region of the predicted promoters by analyzing ATAC-seq data. As a result of this work, we identified the predicted sequences that are most likely to be promoters for further experimental validation in an in vivo or in vitro system.

2.
Int J Mol Sci ; 23(7)2022 Mar 29.
Article in English | MEDLINE | ID: mdl-35409125

ABSTRACT

The aim of this work was to compare the multiple alignment methods MAHDS, T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK in their ability to align highly divergent amino acid sequences. To accomplish this, we created test amino acid sequences with an average number of substitutions per amino acid (x) from 0.6 to 5.6, a total of 81 sets. Comparison of the performance of sequence alignments constructed by MAHDS and previously developed algorithms using the CS and Z score criteria and the benchmark alignment database (BAliBASE) indicated that, although the quality of the alignments built with MAHDS was somewhat lower than that of the other algorithms, it was compensated by greater statistical significance. MAHDS could construct statistically significant alignments of artificial sequences with x ≤ 4.8, whereas the other algorithms (T-Coffee, MUSCLE, Clustal Omega, Kalign, MAFFT, and PRANK) could not perform that at x > 2.4. The application of MAHDS to align 21 families of highly diverged proteins (identity < 20%) from Pfam and HOMSTRAD databases showed that it could calculate statistically significant alignments in cases when the other methods failed. Thus, MAHDS could be used to construct statistically significant multiple alignments of highly divergent protein sequences, which accumulated multiple mutations during evolution.


Subject(s)
Algorithms , Coffee , Amino Acid Sequence , Proteins/chemistry , Proteins/genetics , Sequence Alignment , Software
3.
Genes (Basel) ; 12(4)2021 03 25.
Article in English | MEDLINE | ID: mdl-33806152

ABSTRACT

Currently, there is a lack of bioinformatics approaches to identify highly divergent tandem repeats (TRs) in eukaryotic genomes. Here, we developed a new mathematical method to search for TRs, which uses a novel algorithm for constructing multiple alignments based on the generation of random position weight matrices (RPWMs), and applied it to detect TRs of 2 to 50 nucleotides long in the rice genome. The RPWM method could find highly divergent TRs in the presence of insertions or deletions. Comparison of the RPWM algorithm with the other methods of TR identification showed that RPWM could detect TRs in which the average number of base substitutions per nucleotide (x) was between 1.5 and 3.2, whereas T-REKS and TRF methods could not detect divergent TRs with x > 1.5. Applied to the search of TRs in the rice genome, the RPWM method revealed that TRs occupied 5% of the genome and that most of them were 2 and 3 bases long. Using RPWM, we also revealed the correlation of TRs with dispersed repeats and transposons, suggesting that some transposons originated from TRs. Thus, the novel RPWM algorithm is an effective tool to search for highly divergent TRs in the genomes.


Subject(s)
Chromosome Mapping/methods , Chromosomes, Plant/genetics , Genome, Plant , Oryza/genetics , Tandem Repeat Sequences/genetics , Phylogeny
4.
BMC Bioinformatics ; 22(1): 42, 2021 Feb 02.
Article in English | MEDLINE | ID: mdl-33530928

ABSTRACT

BACKGROUND: Transposable elements (TEs) constitute a significant part of eukaryotic genomes. Short interspersed nuclear elements (SINEs) are non-autonomous TEs, which are widely represented in mammalian genomes and also found in plants. After insertion in a new position in the genome, TEs quickly accumulate mutations, which complicate their identification and annotation by modern bioinformatics methods. In this study, we searched for highly divergent SINE copies in the genome of rice (Oryza sativa subsp. japonica) using the Highly Divergent Repeat Search Method (HDRSM). RESULTS: The HDRSM considers correlations of neighboring symbols to construct position weight matrix (PWM) for a SINE family, which is then used to perform a search for new copies. In order to evaluate the accuracy of the method and compare it with the RepeatMasker program, we generated a set of SINE copies containing nucleotide substitutions and indels and inserted them into an artificial chromosome for analysis. The HDRSM showed better results both in terms of the number of identified inserted repeats and the accuracy of determining their boundaries. A search for the copies of 39 SINE families in the rice genome produced 14,030 hits; among them, 5704 were not detected by RepeatMasker. CONCLUSIONS: The HDRSM could find divergent SINE copies, correctly determine their boundaries, and offer a high level of statistical significance. We also found that RepeatMasker is able to find relatively short copies of the SINE families with a higher level of similarity, while HDRSM is able to find more diverged copies. To obtain a comprehensive profile of SINE distribution in the genome, combined application of the HDRSM and RepeatMasker is recommended.


Subject(s)
DNA Transposable Elements , Oryza , Short Interspersed Nucleotide Elements , Animals , DNA Transposable Elements/genetics , Evolution, Molecular , Humans , Oryza/genetics , Phylogeny , Position-Specific Scoring Matrices , Short Interspersed Nucleotide Elements/genetics
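
The abstract above describes building a position weight matrix (PWM) for a repeat family and using it to search for new copies. The following is a minimal sketch of that general idea only. It builds a plain log-odds PWM from ungapped aligned copies and scans a sequence for the best-scoring window. It does not model the correlations of neighboring symbols or the indels that HDRSM is reported to handle, and the names `build_pwm` and `best_hit`, the pseudocount, and the uniform background are illustrative assumptions, not part of the published method.

```python
import math

BASES = "ACGT"

def build_pwm(aligned, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM from equal-length, ungapped sequences (assumed input)."""
    length = len(aligned[0])
    pwm = []
    for pos in range(length):
        column = [s[pos] for s in aligned]
        total = len(column) + pseudocount * len(BASES)
        # Log2 ratio of the smoothed column frequency to a uniform background.
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm

def best_hit(pwm, sequence):
    """Return (score, start) of the highest-scoring window; sequence must use A, C, G, T only."""
    width = len(pwm)
    scores = [
        (sum(pwm[j][sequence[i + j]] for j in range(width)), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)

pwm = build_pwm(["ACGT", "ACGT", "ACGA"])
score, start = best_hit(pwm, "TTTACGTTT")
print(start, round(score, 2))
```

A real search for diverged copies would also need a significance estimate for each hit, for example by scoring shuffled sequences, which this sketch leaves out.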
5.
Genes (Basel) ; 12(2)2021 01 21.
Article in English | MEDLINE | ID: mdl-33494278

ABSTRACT

In this study, we developed a new mathematical method for performing multiple alignment of highly divergent sequences (MAHDS), i.e., sequences that have on average more than 2.5 substitutions per position (x). We generated sets of artificial DNA sequences with x ranging from 0 to 4.4 and applied MAHDS as well as currently used multiple sequence alignment algorithms, including ClustalW, MAFFT, T-Coffee, Kalign, and Muscle to these sets. The results indicated that most of the existing methods could produce statistically significant alignments only for the sets with x < 2.5, whereas MAHDS could operate on sequences with x = 4.4. We also used MAHDS to analyze a set of promoter sequences from the Arabidopsis thaliana genome and discovered many conserved regions upstream of the transcription initiation site (from -499 to +1 bp); a part of the downstream region (from +1 to +70 bp) also significantly contributed to the obtained alignments. The possibilities of applying the newly developed method for the identification of promoter sequences in any genome are discussed. A server for multiple alignment of nucleotide sequences has been created.


Subject(s)
Arabidopsis/genetics , Computational Biology , Genome, Plant , Genomics , Promoter Regions, Genetic , Sequence Analysis, DNA/methods , Algorithms , Computational Biology/methods , Genomics/methods
6.
J Comput Biol ; 26(11): 1253-1261, 2019 11.
Article in English | MEDLINE | ID: mdl-31211597

ABSTRACT

Gene fusion is known to be one of the mechanisms of a new gene formation. Most bioinformatics methods for studying fused genes are based on the sequence similarity search. However, if the ancestral sequences were lost during evolution or changed too much, it is impossible to detect the fusion. Previously, we have developed a method of searching for triplet periodicity (TP) change points in protein-coding sequences (CDS) and showed the possible relation of this phenomenon with gene formation as a result of fusion. In this study, we improved the TP change point detection method and studied the genes of six eukaryotic genomes. At the level of 2%-3% of the probability of type I error, TP change points were found in 20%-40% of genes. Further analysis showed that about 30% of the TP change points can be explained by amino acid repeats. Another 30% can be potentially fused genes, alignment for which was detected by the BLAST program. We believe that the rest of the results can be fused genes, the ancestral sequences for which have been lost. The method is more sensitive to TP changes and allowed us to find up to two to three times more cases of significant TP change points than our previous method.


Subject(s)
Computational Biology/methods , Genome/genetics , Open Reading Frames/genetics , Repetitive Sequences, Amino Acid/genetics , Animals , Eukaryota/genetics , Humans , Sequence Alignment/methods
7.
Biomed Res Int ; 2017: 7949287, 2017.
Article in English | MEDLINE | ID: mdl-28182099

ABSTRACT

Summary. We analyzed several prokaryotic and eukaryotic genomes for the presence of periodic sequences using a new mathematical method. The method uses random position weight matrices and dynamic programming. Insertions and deletions were allowed inside periodicities, which adds novelty to the results we obtained. The period length, one of the key features of a periodicity, varied from 2 to 50 nt. In total, over 60,000 periodic sequences were found in 15 genomes, including some chromosomes of the H. sapiens (partial), C. elegans, D. melanogaster, and A. thaliana genomes.


Subject(s)
Genome , INDEL Mutation/genetics , Sequence Analysis, DNA , Animals , Arabidopsis/genetics , Caenorhabditis elegans/genetics , Chromosomes/genetics , Drosophila melanogaster/genetics , Humans , Models, Theoretical , Prokaryotic Cells
8.
Stat Appl Genet Mol Biol ; 14(2): 113-23, 2015 Apr.
Article in English | MEDLINE | ID: mdl-25719343

ABSTRACT

Triplet periodicity (TP) is a distinctive feature of the protein coding sequences of both prokaryotic and eukaryotic genomes. In this work, we explored the TP difference inside and between 45 prokaryotic genomes. We constructed two hypotheses of TP distribution on a set of coding sequences and generated artificial datasets that correspond to the hypotheses. We found that TP is more similar inside a genome than between genomes and that TP distribution inside a real genome dataset corresponds to the hypothesis which implies that a common TP pattern exists for the majority of sequences inside a genome. Additionally, we performed gene classification based on TP matrixes. This classification showed that TP allows identification of the genome to which a given gene belongs with more than 85% accuracy.


Subject(s)
Genome/genetics , Algorithms , Databases, Genetic , Open Reading Frames/genetics , Periodicity , Prokaryotic Cells/physiology
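
Several abstracts in this list rely on a triplet periodicity (TP) matrix. As a rough illustration, and assuming the common 3 x 4 formulation (base counts at each of the three codon positions), a TP matrix and a simple comparison between two matrices might be sketched as follows. The function names and the distance measure are hypothetical; the papers use their own similarity measures, which are not given in the abstracts.

```python
from collections import Counter

BASES = "ACGT"

def triplet_periodicity_matrix(seq):
    """Count each base at codon positions 0, 1, 2; returns 3 rows x 4 columns (A, C, G, T)."""
    counts = [Counter() for _ in range(3)]
    for i, base in enumerate(seq.upper()):
        if base in BASES:  # skip ambiguity codes such as N
            counts[i % 3][base] += 1
    return [[counts[pos][b] for b in BASES] for pos in range(3)]

def matrix_distance(m1, m2):
    """Sum of absolute differences between row-normalised matrices (illustrative measure only)."""
    total = 0.0
    for r1, r2 in zip(m1, m2):
        s1, s2 = sum(r1) or 1, sum(r2) or 1
        total += sum(abs(a / s1 - b / s2) for a, b in zip(r1, r2))
    return total

in_frame = triplet_periodicity_matrix("ATGATGATG")
shifted = triplet_periodicity_matrix("GATGATGAT")
print(in_frame)
print(matrix_distance(in_frame, shifted))
```

The second sequence is the first one shifted by one base, so the two matrices differ at every codon position, which is the kind of signal a phase-shift or change-point search would look for.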
9.
Adv Bioinformatics ; 2015: 635437, 2015.
Article in English | MEDLINE | ID: mdl-26770195

ABSTRACT

In recent years, a great number of bacterial genomes have been sequenced. One of the most important challenges of computational genomics is now the functional annotation of nucleic acid sequences. In this study, we present a computational method and an annotation system for predicting biological functions using phylogenetic profiles. The phylogenetic profile of a gene was created by searching for similarities between the nucleotide sequence of the gene and 1204 reference genomes, followed by estimation of the statistical significance of the similarities found. The profiles of genes with known functions were used to predict possible functions and functional groups for new genes. We conducted the functional annotation for genes from 104 bacterial genomes and compared the functions predicted by our system with the already known functions. For the genes that have already been annotated, the known function matched the function we predicted 63% of the time, and 86% of the time the known function was found within the top five predicted functions. In addition, our system increased the share of annotated genes by 19%. The developed system may be used as an alternative or complementary system to the current annotation systems.
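
To make the idea of profile-based function prediction concrete, here is a small sketch under simplifying assumptions: each phylogenetic profile is a tuple of 0/1 flags (one per reference genome, 1 meaning a significant similarity was found), profiles are compared with the Jaccard index, and functions are ranked by their best-matching annotated gene. The data, the similarity measure, and the names `jaccard` and `predict_functions` are illustrative and are not taken from the paper.

```python
def jaccard(p, q):
    """Jaccard similarity between two equal-length 0/1 profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

def predict_functions(query, annotated, top_n=5):
    """Rank functions by the best profile similarity among annotated genes.

    annotated is a list of (profile, function_name) pairs.
    """
    best = {}
    for profile, function in annotated:
        score = jaccard(query, profile)
        if score > best.get(function, -1.0):
            best[function] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_n]]

# Toy example with four hypothetical reference genomes.
annotated = [
    ((1, 1, 0, 0), "kinase"),
    ((0, 0, 1, 1), "transporter"),
    ((1, 1, 1, 0), "ligase"),
]
print(predict_functions((1, 1, 0, 0), annotated))
```

Returning a ranked list rather than a single label mirrors the abstract's "top five predicted functions" evaluation.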

10.
Comput Biol Chem ; 53 Pt A: 43-8, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25218218

ABSTRACT

To determine the periodicity of a DNA sequence, different spectral approaches are applied: discrete Fourier transform (DFT), autocorrelation (CORR), information decomposition (ID), hybrid method (HYB), concept of spectral envelope for spectral analysis (SE), normalized autocorrelation (CORR_N), and profile analysis (PA). In this work, we investigated the possibility of finding the true period length, depending on the average number of accumulated changes in DNA bases (PM), for the methods stated above. The results show that for periods with short length (≤4 b.p.), it is possible to use the hybrid method (HYB), which combines properties of autocorrelation, Fourier transform, and information decomposition (ID). For larger period lengths (>4) with values of point mutation (PM) equal to 1.0 or more per nucleotide, it is preferable to use the information decomposition (ID) method, as the other spectral approaches cannot correctly determine the period length present in the analyzed sequence.


Subject(s)
Caenorhabditis elegans/genetics , DNA, Helminth/genetics , Models, Statistical , Periodicity , Sequence Analysis, DNA/statistics & numerical data , Animals , Fourier Analysis , Nucleotides , Point Mutation
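
As a simple illustration of period detection by autocorrelation, one of the approaches compared in the abstract above, the sketch below counts how often a symbol matches the symbol `lag` positions later and reports the smallest lag with the highest match fraction. This is a textbook-style symbol autocorrelation written for illustration; it is not the CORR, CORR_N, or ID implementation evaluated in the paper, and the function names are assumptions.

```python
def match_fraction(seq, lag):
    """Fraction of positions i where seq[i] == seq[i + lag]."""
    pairs = len(seq) - lag
    if pairs <= 0:
        return 0.0
    hits = sum(1 for i in range(pairs) if seq[i] == seq[i + lag])
    return hits / pairs

def estimate_period(seq, max_period=50):
    """Return the smallest lag (1..max_period) with the highest match fraction."""
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two symbols")
    lags = range(1, min(max_period, len(seq) - 1) + 1)
    # max() keeps the first maximum, so multiples of the true period do not win ties.
    return max(lags, key=lambda k: match_fraction(seq, k))

print(estimate_period("ACGT" * 10))
print(estimate_period("AAC" * 12))
```

On heavily mutated sequences the match fraction at the true period sinks toward the background level, which is consistent with the abstract's observation that simple spectral measures lose the period at high mutation rates.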
11.
Article in English | MEDLINE | ID: mdl-26356866

ABSTRACT

It is known that nucleotide sequences are not totally homogeneous and this heterogeneity could not be due to random fluctuations only. Such heterogeneity poses a problem of making sequence segmentation into a set of homogeneous parts divided by the points called "change points". In this work we investigated a special case of change points: paired change points (PCP). We used a well-known property of coding sequences: triplet periodicity (TP). The sequences that we are especially interested in consist of three successive parts: the first and the last parts have similar TP while the middle part has different TP type. We aimed to find the genes with PCP and provide explanation for this phenomenon. We developed a mathematical method for the PCP detection based on the new measure of similarity between TP matrices. We investigated 66,936 bacterial genes from 17 bacterial genomes and revealed 2,700 genes with PCP and 6,459 genes with single change point (SCP). We developed a mathematical approach to visualize the PCP cases. We suppose that PCP could be associated with double fusion or insertion events. The results of investigating the sequences with artificial insertions/fusions and distribution of TP inside the genome support the idea that the real number of genes formed by insertion/fusion events could be 5-7 times greater than the number of genes revealed in the present work.


Subject(s)
Algorithms , Genes, Bacterial/genetics , Sequence Alignment/methods , Sequence Analysis, DNA/methods , Gene Fusion/genetics , Mutagenesis, Insertional/genetics
12.
Gene ; 491(1): 58-64, 2012 Jan 01.
Article in English | MEDLINE | ID: mdl-21982972

ABSTRACT

Triplet periodicity (TP) is a distinctive property of protein coding sequences. There are complex genes with more than one TP type along their sequence. We say that these genes contain a triplet periodicity change point. The aim of this work is to find all genes that contain a TP change point and to compare the positions of the change points in genes with known biological data. We have developed a mathematical method to identify triplet periodicity changes along a sequence. We have found 311,221 genes with a TP change point in the KEGG/Genes database (version 48). This is about 8% of the total database volume (4,013,150). We showed that repetitive sequences are not the only cause of such events. We suppose that the TP change point may indicate a fusion of genes or domains. We performed BLAST analysis to find potential ancestral genes for the parts of genes with a TP change point. As a result, we found that in 131,323 cases sequences with a TP change point have proper similarities for one or both parts. The relationship between the TP change point and fusion events in genes is discussed. A software implementation of the method is available from the authors on request.


Subject(s)
Open Reading Frames , Algorithms , Base Sequence , Computational Biology , Databases, Genetic , Molecular Sequence Data , Periodicity , Repetitive Sequences, Nucleic Acid
13.
Genomics Proteomics Bioinformatics ; 9(4-5): 158-70, 2011 Oct.
Article in English | MEDLINE | ID: mdl-22196359

ABSTRACT

The concept of the phase shift of triplet periodicity (TP) was used for searching potential DNA insertions in genes from 17 bacterial genomes. A mathematical algorithm for detection of these insertions has been developed. This approach can detect potential insertions and deletions with lengths that are not multiples of three bases, especially insertions of relatively large DNA fragments (>100 bases). New similarity measure between triplet matrixes was employed to improve the sensitivity for detecting the TP phase shift. Sequences of 17,220 bacterial genes with each consisting of more than 1,200 bases were analyzed, and the presence of a TP phase shift has been shown in ∼16% of analysed genes (2,809 genes), which is about 4 times more than that detected in our previous work. We propose that shifts of the TP phase may indicate the shifts of reading frame in genes after insertions of the DNA fragments with lengths that are not multiples of three bases. A relationship between the phase shifts of TP and the frame shifts in genes is discussed.


Subject(s)
Algorithms , Computational Biology/methods , DNA Transposable Elements/genetics , Genes, Bacterial/genetics , Base Sequence , Periodicity , Reading Frames/genetics , Sequence Homology, Amino Acid
14.
J Integr Bioinform ; 7(3)2010 Mar 25.
Article in English | MEDLINE | ID: mdl-20375465

ABSTRACT

The definition of a phase shift of triplet periodicity (TP) is introduced. A mathematical algorithm for detecting TP phase shifts in nucleotide sequences has been developed. Gene sequences from the Kegg-46 data bank were analyzed to search for genes with a TP phase shift. The presence of a phase shift of triplet periodicity has been shown for 318,329 genes (approximately 10% of the genes in Kegg-46). We suppose that shifts of the TP phase may indicate shifts of the reading frame (RF) in genes. A relationship between the phase shifts of TP and the frame shifts in genes is discussed.


Subject(s)
Genes, Bacterial/genetics , Periodicity , Algorithms , Bacteria/genetics , Base Sequence , Databases, Genetic , Open Reading Frames/genetics , Sequence Homology, Amino Acid
15.
J Proteome Res ; 6(2): 862-8, 2007 Feb.
Article in English | MEDLINE | ID: mdl-17269743

ABSTRACT

Latent amino acid repeats seem to be widespread in genetic sequences and to reflect their structure, function, and evolution. We have recently identified latent periodicity in more than 150 protein families including protein kinases and various nucleotide-binding proteins. The latent repeats in these families were correlated to their structure and evolution. However, a majority of known protein families were not identified with our latent periodicity search algorithm. The main presumable reason for this was the inability of our techniques to identify periodicities interspersed with insertions and deletions. We designed the new latent periodicity search algorithm, which is capable of taking into account insertions and deletions. As a result, we identified many novel cases of latent periodicity peculiar to protein families. Possible origins of the periodic structure of these families are discussed. Summarizing, we presume that latent periodicity is present in a substantial portion of known protein families. The latent periodicity matrices and the results of Swiss-Prot scans are available from http://bioinf.narod.ru/del/.


Subject(s)
Algorithms , Amino Acid Sequence , Proteins/chemistry , Adenosine Triphosphatases/chemistry , Chaperonin 60/chemistry , Endoribonucleases/chemistry , Gene Products, gag/chemistry , Models, Theoretical , Molecular Sequence Data , Nucleotidyltransferases/chemistry , Periodicity
16.
J Comput Biol ; 13(4): 946-64, 2006 May.
Article in English | MEDLINE | ID: mdl-16761920

ABSTRACT

Here, we have applied information decomposition, cyclic profile alignment, and noise decomposition techniques to search for latent repeats within protein families of various functions. We have identified 94 protein families with a family-specific periodicity. In each case, the periodic element was found in greater than 70% of family members. Latent periodicity profiles with specific length and signature were obtained in each case. The possible relationship between the periodic elements thus identified and the evolutionary development of the protein families are discussed with specific reference to the possibility that there is a correlation between the periodic elements and protein function.


Subject(s)
Computational Biology , Multigene Family , Sequence Analysis, Protein , Algorithms , Amino Acid Sequence , Sequence Alignment
17.
Comput Biol Chem ; 29(3): 229-43, 2005 Jun.
Article in English | MEDLINE | ID: mdl-15979043

ABSTRACT

We identified latent periodicity in catalytic domains of approximately 85% of annotated serine-threonine and tyrosine protein kinases. Similar results were obtained for other 22 protein families and domains. We also designed the method of noise decomposition, which is aimed to distinguish between different periodicity types of the same period length. The method is to be used in conjunction with the method of cyclic profile alignment, and this combination is able to reveal structure-related or function-related patterns of latent periodicity. Possible origins of the periodic structure of protein kinase active sites are discussed. Summarizing, we presume that latent periodicity is the common property of many catalytic protein domains.


Subject(s)
Protein Serine-Threonine Kinases/chemistry , Protein-Tyrosine Kinases/chemistry , Algorithms , Amino Acid Motifs , Amino Acid Sequence , Catalytic Domain , Molecular Sequence Data
18.
Gene ; 335: 57-71, 2004 Jun 23.
Article in English | MEDLINE | ID: mdl-15194190

ABSTRACT

Transfer RNA (tRNA)-like sequences were searched for in the nine basic taxonomic divisions of GenBank-121 (viruses, phages, bacteria, plants, invertebrates, vertebrates, rodents, mammals, and primates) with an original program package that implements a dynamic profile alignment approach for the analysis of genetic texts, using 22 profiles of tRNAs of different isotypes. In total, 175,901 previously unknown tRNA-like sequences were revealed. The locations of the tRNA-like sequences were considered over the regions whose functional meaning is described by standard Feature Keys in GenBank. Many regions containing the tRNA-like sequences were recognized as known repeats. A mode of distribution of the tRNA-like sequences in a genome was proposed as expansion in a content of the various transposable elements. An analysis of the integrity of RNA polymerase III inner promoters in the tRNA-like sequences over the GenBank divisions has shown a high possibility of generating new copies of short interspersed nuclear element (SINE) repeats in all divisions except primates. The numerous tRNA-like sequences found in the regions of RNA polymerase II promoters suggest an adaptation of the RNA polymerase III promoter to binding of RNA polymerase II.


Subject(s)
Algorithms , DNA/genetics , Evolution, Molecular , Genome , Animals , Base Composition , Base Sequence , DNA/chemistry , DNA Transposable Elements/genetics , Databases, Nucleic Acid , Eukaryotic Cells/metabolism , Genetic Variation , Humans , Molecular Sequence Data , Nucleic Acid Conformation , Prokaryotic Cells/metabolism , RNA, Transfer/chemistry , RNA, Transfer/genetics , Regulatory Sequences, Nucleic Acid/genetics , Repetitive Sequences, Nucleic Acid/genetics , Sequence Alignment/methods