Evolutionary insights from suffix array-based genome sequence analysis.

Poddar, Anindya; Chandra, Nagasuma; Ganapathiraju, Madhavi; Sekar, K; Klein-Seetharaman, Judith; Reddy, Raj; Balakrishnan, N

J Biosci; 2007 Aug; 32(5): 871-81

Article in English | IMSEAR | ID: sea-110954

ABSTRACT

Gene and protein sequence analyses, central components of studies in modern biology, are easily amenable to string matching and pattern recognition algorithms. The growing need to analyse whole genome sequences more efficiently and thoroughly has led to the emergence of new computational methods. Suffix trees and suffix arrays are data structures that are well known in many other areas and are highly suited for sequence analysis too. Here we report an improvement to the design of suffix array construction. The enhancement in versatility and scalability enabled by this approach is demonstrated through the use of real-life examples. The scalability of the algorithm to whole genomes renders it suitable for addressing many biologically interesting problems. One example is the evolutionary insight gained by analysing unigrams, bi-grams and higher n-grams, indicating that the genetic code has a direct influence on the overall composition of the genome. Further, different proteomes have been analysed for their coverage of the possible peptide space; the analysis indicates that as much as a quarter of the total space at the tetra-peptide level is left un-sampled in prokaryotic organisms, although almost all tri-peptides can be seen in one protein or another in a proteome. Besides, distinct patterns begin to emerge for the counts of particular tetra- and higher peptides, indicative of a 'meaning' for tetra- and higher n-grams. The toolkit has also been used to demonstrate the usefulness of efficiently identifying repeats in whole proteomes. As an example, 16 members of one COG, coded by the genome of Mycobacterium tuberculosis H37Rv, have been found to contain a repeating sequence of 300 amino acids.
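
The three analyses named in the abstract (n-gram counting, peptide-space coverage and repeat detection) can all be read off a sorted list of suffixes. The following Python sketch is only an illustration of that idea and is not the improved construction reported in the paper: it builds the suffix array by naive sorting, which does not scale to whole genomes, and all function names are hypothetical. It also treats the input as one sequence; a real proteome would need separators so that n-grams and repeats do not span protein boundaries.

```python
from itertools import groupby


def build_suffix_array(text):
    """Naive construction (an assumption, not the paper's method):
    sort suffix start positions by the suffix text itself."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def ngram_counts(text, sa, n):
    """Count every n-gram by walking the suffix array.

    Suffixes are in lexicographic order, so suffixes sharing the same
    first n characters sit next to each other and can be grouped."""
    prefixes = (text[i:i + n] for i in sa if i + n <= len(text))
    return {gram: sum(1 for _ in group) for gram, group in groupby(prefixes)}


def coverage(counts, alphabet_size, n):
    """Fraction of the alphabet_size**n possible n-grams that occur."""
    return len(counts) / alphabet_size ** n


def longest_repeat(text, sa):
    """Longest substring occurring at least twice, found as the longest
    common prefix between neighbouring suffixes in the suffix array."""
    best = ""
    for a, b in zip(sa, sa[1:]):
        k = 0
        while (a + k < len(text) and b + k < len(text)
               and text[a + k] == text[b + k]):
            k += 1
        if k > len(best):
            best = text[a:a + k]
    return best


# Toy input standing in for a sequence; real use would pass DNA or protein text.
text = "banana"
sa = build_suffix_array(text)            # [5, 3, 1, 0, 4, 2]
bigrams = ngram_counts(text, sa, 2)      # {'an': 2, 'ba': 1, 'na': 2}
fraction_seen = coverage(bigrams, 3, 2)  # 3 of the 9 possible bi-grams
repeat = longest_repeat(text, sa)        # 'ana'
```

For peptides the alphabet size would be 20, so the tetra-peptide space holds 20**4 = 160,000 possible n-grams, against which the observed count is compared.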

Subject(s)

Algorithms; Animals; Computational Biology; Evolution, Molecular; Genome; Mycobacterium tuberculosis/genetics; Oligonucleotide Array Sequence Analysis; Oligopeptides/genetics; Protein Array Analysis; Sequence Analysis, DNA; Sequence Analysis, Protein; Software

Full text: Available | Index: IMSEAR (South-East Asia) | Main subject: Oligopeptides / Algorithms / Software / Genome / Sequence Analysis, DNA / Evolution, Molecular / Computational Biology / Oligonucleotide Array Sequence Analysis / Sequence Analysis, Protein / Protein Array Analysis | Language: English | Journal: J Biosci | Year: 2007 | Document type: Article
