Search | VHL Regional Portal

Detecting atypical examples of known domain types by sequence similarity searching: the SBASE domain library approach.

Dhir, Somdutta; Pacurar, Mircea; Franklin, Dino; Gáspári, Zoltán; Kertész-Farkas, Attila; Kocsor, András; Eisenhaber, Frank; Pongor, Sándor.

Curr Protein Pept Sci ; 11(7): 538-49, 2010 Nov.

Article in English | MEDLINE | ID: mdl-20887262

ABSTRACT

SBASE is a project initiated to detect known domain types and predict domain architectures using sequence similarity searching (Simon et al., Protein Seq Data Anal, 5: 39-42, 1992, Pongor et al, Nucl. Acids. Res. 21:3111-3115, 1992). The current approach uses a curated collection of domain sequences - the SBASE domain library - and standard similarity search algorithms, followed by postprocessing based on simple statistics of the domain similarity network (http://hydra.icgeb.trieste.it/sbase/). It is especially useful in detecting rare, atypical examples of known domain types, which are sometimes missed even by more sophisticated methodologies. This approach does not require multiple alignment or machine learning techniques, and can be a useful complement to other domain detection methodologies. This article gives an overview of the project history as well as of the concepts and principles developed within the project.

Subject(s)

Data Mining , Databases, Protein , Proteins/chemistry , Algorithms , Humans , Neural Networks, Computer , Online Systems , Protein Structure, Tertiary , Proteins/classification , ROC Curve , Sequence Homology, Amino Acid

Protein classification based on propagation of unrooted binary trees.

Kocsor, András; Busa-Fekete, Róbert; Pongor, Sándor.

Protein Pept Lett ; 15(5): 428-34, 2008.

Article in English | MEDLINE | ID: mdl-18537730

ABSTRACT

We present two efficient network propagation algorithms that operate on a binary tree, i.e., a sparse-edged substitute of an entire similarity network. TreeProp-N is based on passing increments between nodes while TreeProp-E employs propagation to the edges of the tree. Both algorithms improve protein classification efficiency.

Subject(s)

Algorithms , Computational Biology/methods , Proteins/classification , Databases, Protein , Proteins/chemistry

ROC analysis: applications to the classification of biological sequences and 3D structures.

Sonego, Paolo; Kocsor, András; Pongor, Sándor.

Brief Bioinform ; 9(3): 198-209, 2008 May.

Article in English | MEDLINE | ID: mdl-18192302

ABSTRACT

ROC ('receiver operator characteristics') analysis is a visual as well as numerical method used for assessing the performance of classification algorithms, such as those used for predicting structures and functions from sequence data. This review summarizes the fundamental concepts of ROC analysis and the interpretation of results using examples of sequence and structure comparison. We overview the available programs and provide evaluation guidelines for genomic/proteomic data, with particular regard to applications to large and heterogeneous databases used in bioinformatics.
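
As a minimal sketch of the numerical side of ROC analysis, the area under the ROC curve (AUC) can be computed as the fraction of positive/negative pairs that a score ranks correctly. The function name `roc_auc` and the toy scores below are illustrative assumptions and are not taken from the review.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) identity.

    scores: higher means "more likely positive".
    labels: 1 for positive, 0 for negative.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count correctly ordered positive/negative pairs; ties count as half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up similarity scores for six database entries).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
```

An AUC of 1.0 corresponds to a perfect ranking and 0.5 to a random one. The pairwise form is quadratic in the number of items, so it suits small illustrations rather than large databases.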

Subject(s)

Algorithms , Models, Chemical , Models, Molecular , ROC Curve , Sequence Alignment/methods , Sequence Analysis/methods , Software , Molecular Conformation

Benchmarking protein classification algorithms via supervised cross-validation.

Kertész-Farkas, Attila; Dhir, Somdutta; Sonego, Paolo; Pacurar, Mircea; Netoteia, Sergiu; Nijveen, Harm; Kuzniar, Arnold; Leunissen, Jack A M; Kocsor, András; Pongor, Sándor.

J Biochem Biophys Methods ; 70(6): 1215-23, 2008 Apr 24.

Article in English | MEDLINE | ID: mdl-17604112

ABSTRACT

Development and testing of protein classification algorithms are hampered by the fact that the protein universe is characterized by groups vastly different in the number of members, in average protein size, similarity within group, etc. Datasets based on traditional cross-validation (k-fold, leave-one-out, etc.) may not give reliable estimates on how an algorithm will generalize to novel, distantly related subtypes of the known protein classes. Supervised cross-validation, i.e., selection of test and train sets according to the known subtypes within a database has been successfully used earlier in conjunction with the SCOP database. Our goal was to extend this principle to other databases and to design standardized benchmark datasets for protein classification. Hierarchical classification trees of protein categories provide a simple and general framework for designing supervised cross-validation strategies for protein classification. Benchmark datasets can be designed at various levels of the concept hierarchy using a simple graph-theoretic distance. A combination of supervised and random sampling was selected to construct reduced size model datasets, suitable for algorithm comparison. Over 3000 new classification tasks were added to our recently established protein classification benchmark collection that currently includes protein sequence (including protein domains and entire proteins), protein structure and reading frame DNA sequence data. We carried out an extensive evaluation based on various machine-learning algorithms such as nearest neighbor, support vector machines, artificial neural networks, random forests and logistic regression, used in conjunction with comparison algorithms, BLAST, Smith-Waterman, Needleman-Wunsch, as well as 3D comparison methods DALI and PRIDE. The resulting datasets provide lower, and in our opinion more realistic estimates of the classifier performance than do random cross-validation schemes. 
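
The idea of supervised cross-validation can be sketched as follows, assuming each member of a positive class carries a known subtype label (for example a family within a superfamily). Each split holds out one whole subtype as the test set. The function name `supervised_splits` and the identifiers in the example are hypothetical, and the sketch covers the positive set only, not the benchmark's actual construction procedure.

```python
from collections import defaultdict


def supervised_splits(items):
    """Yield (held_out_subtype, train, test) splits for one positive class.

    items: list of (protein_id, subtype) pairs.
    Each split keeps one whole subtype out of training, so the test
    members are novel at the subtype level.
    """
    by_subtype = defaultdict(list)
    for protein_id, subtype in items:
        by_subtype[subtype].append(protein_id)
    for held_out in sorted(by_subtype):
        test = list(by_subtype[held_out])
        train = [p for s in sorted(by_subtype) if s != held_out
                 for p in by_subtype[s]]
        yield held_out, train, test


# Hypothetical superfamily with three families (subtypes).
members = [("d1", "famA"), ("d2", "famA"), ("d3", "famB"),
           ("d4", "famC"), ("d5", "famC")]
splits = list(supervised_splits(members))
```

In contrast to random k-fold splitting, no test member shares a subtype with any training member, which is why such splits tend to give lower performance estimates.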

Subject(s)

Algorithms , Proteins/analysis , Proteins/classification , Proteins/chemistry , Sequence Analysis, Protein

Balanced ROC analysis (BAROC) protocol for the evaluation of protein similarities.

Busa-Fekete, Róbert; Kertész-Farkas, Attila; Kocsor, András; Pongor, Sándor.

J Biochem Biophys Methods ; 70(6): 1210-4, 2008 Apr 24.

Article in English | MEDLINE | ID: mdl-17689617

ABSTRACT

Identification of problematic protein classes (domain types, protein families) that are difficult to predict from sequence is a key issue in genome annotation. ROC (Receiver Operating Characteristic) analysis is routinely used for the evaluation of protein similarities; however, its results, the area under curve (AUC) values, are differentially biased for the various protein classes, which differ greatly in size. We show that the bias can be compensated for by adjusting the length of the top list in a class-dependent fashion, so that the number of negatives within the top list is equal (or proportional) to the size of the positive class. Using this balanced protocol, the problematic classes can be identified by their AUC values, or by a scatter diagram in which the AUC values are plotted against the positive/negative ratio of the top list. With likelihood-ratio scoring (Kaján et al, Bioinformatics, 22, 2865-2869, 2007), the bias caused by class imbalance can be further decreased.
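
One plausible reading of the balanced protocol is a truncated AUC in which the ranked list is cut once the number of negatives seen equals (or is proportional to) the size of the positive class. The sketch below follows that reading. The function name `balanced_auc`, the `ratio` parameter and the normalisation are assumptions for illustration and may differ from the published BAROC procedure; ties between scores are not treated specially.

```python
def balanced_auc(scores, labels, ratio=1.0):
    """Truncated AUC: stop after ratio * (number of positives) negatives.

    scores: higher means "more similar to the positive class".
    labels: 1 for positive, 0 for negative.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(y for _, y in ranked)
    n_neg = len(ranked) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    # Class-dependent cut-off on the number of negatives in the top list.
    max_neg = min(n_neg, max(1, round(ratio * n_pos)))
    pos_seen = neg_seen = area = 0
    for _, y in ranked:
        if y == 1:
            pos_seen += 1
        else:
            neg_seen += 1
            area += pos_seen  # positives ranked above this negative
            if neg_seen == max_neg:
                break
    return area / (n_pos * max_neg)


# Toy class with 2 positives among 8 entries (made-up scores).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0, 0, 0, 0]
value = balanced_auc(scores, labels)
```

In this toy case the full-list AUC is 11/12, while the balanced value is 0.75: the long tail of low-scoring negatives no longer inflates the result for a small positive class.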

Subject(s)

Proteins/analysis , ROC Curve , Algorithms

A Protein Classification Benchmark collection for machine learning.

Sonego, Paolo; Pacurar, Mircea; Dhir, Somdutta; Kertész-Farkas, Attila; Kocsor, András; Gáspári, Zoltán; Leunissen, Jack A M; Pongor, Sándor.

Nucleic Acids Res ; 35(Database issue): D232-6, 2007 Jan.

Article in English | MEDLINE | ID: mdl-17142240

ABSTRACT

Protein classification by machine learning algorithms is now widely used in structural and functional annotation of proteins. The Protein Classification Benchmark collection (http://hydra.icgeb.trieste.it/benchmark) was created in order to provide standard datasets on which the performance of machine learning methods can be compared. It is primarily meant for method developers and users interested in comparing methods under standardized conditions. The collection contains datasets of sequences and structures, and each set is subdivided into positive/negative, training/test sets in several ways. There is a total of 6405 classification tasks, 3297 on protein sequences, 3095 on protein structures and 10 on protein coding regions in DNA. Typical tasks include the classification of structural domains in the SCOP and CATH databases based on their sequences or structures, as well as various functional and taxonomic classification problems. In the case of hierarchical classification schemes, the classification tasks can be defined at various levels of the hierarchy (such as classes, folds, superfamilies, etc.). For each dataset there are distance matrices available that contain all vs. all comparison of the data, based on various sequence or structure comparison methods, as well as a set of classification performance measures computed with various classifier algorithms.

Subject(s)

Artificial Intelligence , Databases, Protein , Proteins/classification , Algorithms , Internet , Protein Structure, Tertiary , Proteins/chemistry , Reproducibility of Results , Sequence Analysis, Protein , User-Computer Interface

Application of a simple likelihood ratio approximant to protein sequence classification.

Kaján, László; Kertész-Farkas, Attila; Franklin, Dino; Ivanova, Neli; Kocsor, András; Pongor, Sándor.

Bioinformatics ; 22(23): 2865-9, 2006 Dec 01.

Article in English | MEDLINE | ID: mdl-17090576

ABSTRACT

MOTIVATION: Likelihood ratio approximants (LRA) have been widely used for model comparison in statistics. The present study was undertaken in order to explore their utility as a scoring (ranking) function in the classification of protein sequences. RESULTS: We used a simple LRA based on the maximal similarity (or minimal distance) scores of the two top ranking sequence classes. The scoring methods (Smith-Waterman, BLAST, local alignment kernel and compression based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.
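
A minimal sketch of ratio-based scoring, assuming positive similarity scores (such as alignment scores) between one query and labelled database sequences: the best score within the class of interest is divided by the best score in any rival class. The function name `lra_score` and this exact form of the ratio are illustrative assumptions; the abstract does not give the formula.

```python
def lra_score(hits, target_class):
    """Ratio of the best similarity in target_class to the best rival class.

    hits: list of (class_label, similarity) pairs for a single query,
          with similarity > 0 (higher means more similar).
    """
    best = {}
    for label, sim in hits:
        best[label] = max(sim, best.get(label, float("-inf")))
    if target_class not in best or len(best) < 2:
        raise ValueError("need the target class and at least one rival class")
    rival = max(v for k, v in best.items() if k != target_class)
    if rival <= 0:
        raise ValueError("similarities must be positive")
    return best[target_class] / rival


# Made-up similarity scores of one query against three classes.
hits = [("kinase", 120.0), ("kinase", 95.0),
        ("globin", 40.0), ("protease", 60.0)]
ratio = lra_score(hits, "kinase")
```

A ratio well above 1 indicates that the query resembles the target class much more than any rival class. With distances instead of similarities, the analogous ratio would use minimal distances and be inverted.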

Subject(s)

Algorithms , Proteins/chemistry , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Computer Simulation , Likelihood Functions , Models, Chemical , Models, Molecular , Models, Statistical , Molecular Sequence Data , Sequence Homology, Amino Acid

Kalman filtering for disease-state estimation from microarray data.

Kelemen, János Z; Kertész-Farkas, Attila; Kocsor, András; Puskás, László G.

Bioinformatics ; 22(24): 3047-53, 2006 Dec 15.

Article in English | MEDLINE | ID: mdl-17065158

ABSTRACT

MOTIVATION: In this paper, we propose using the Kalman filter (KF) as a pre-processing step in microarray-based molecular diagnosis. Incorporating the expression covariance between genes is important in such classification problems, since this represents the functional relationships that govern tissue state. Failing to fulfil such requirements may result in biologically implausible class prediction models. Here, we show that employing the KF to remove noise (while retaining meaningful covariance and thus being able to estimate the underlying biological state from microarray measurements) yields linearly separable data suitable for most classification algorithms. RESULTS: We demonstrate the utility and performance of the KF as a robust disease-state estimator on publicly available binary and multi-class microarray datasets in combination with the most widely used classification methods to date. Moreover, using popular graphical representation schemes we show that our filtered datasets also have an improved visualization capability.

Subject(s)

Biomarkers, Tumor/analysis , Diagnosis, Computer-Assisted/methods , Gene Expression Profiling/methods , Neoplasm Proteins/analysis , Neoplasms/diagnosis , Neoplasms/metabolism , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Humans , Reproducibility of Results , Sensitivity and Specificity , Systems Theory

Application of compression-based distance measures to protein sequence classification: a methodological study.

Kocsor, András; Kertész-Farkas, Attila; Kaján, László; Pongor, Sándor.

Bioinformatics ; 22(4): 407-12, 2006 Feb 15.

Article in English | MEDLINE | ID: mdl-16317070

ABSTRACT

MOTIVATION: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. RESULTS: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith-Waterman algorithm and two hidden Markov model-based algorithms.
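
To illustrate what a compression-based distance looks like, the sketch below computes the widely used normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It uses Python's `zlib` (DEFLATE) as a stand-in compressor, which is an assumption: the study used Lempel-Ziv and PPMZ compressors, and its exact distance formula may differ. The sequences are synthetic, not real proteins.

```python
import random
import zlib

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic sequences: a, a point-mutated copy of a, and an unrelated b.
rng = random.Random(0)
a = "".join(rng.choices(AMINO_ACIDS, k=300))
b = "".join(rng.choices(AMINO_ACIDS, k=300))
# Replace every 25th residue by the next letter of the alphabet.
a_mut = "".join(
    AMINO_ACIDS[(AMINO_ACIDS.index(c) + 1) % 20] if i % 25 == 0 else c
    for i, c in enumerate(a)
)

d_same = ncd(a, a)
d_related = ncd(a, a_mut)
d_unrelated = ncd(a, b)
```

Related sequences share long substrings that the compressor can reuse, so their distance is smaller than that of unrelated sequences. For short sequences the fixed overhead of the compressor becomes noticeable, which is consistent with the length dependence mentioned in the abstract.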

Subject(s)

Algorithms , Data Compression/methods , Proteins/chemistry , Proteins/classification , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Amino Acid Sequence , Molecular Sequence Data , Proteins/analysis
