Results 1-20 of 24
1.
Neural Netw ; 169: 191-204, 2024 Jan.
Article in English | MEDLINE | ID: mdl-37898051

ABSTRACT

This paper analyzes diverse features extracted from spoken language to select the most discriminative ones for dementia detection. We present a two-step feature selection (FS) approach: Step 1 utilizes filter methods to pre-screen features, and Step 2 uses a novel feature ranking (FR) method, referred to as dual dropout ranking (DDR), to rank the screened features and select spoken language biomarkers. The proposed DDR is based on a dual-net architecture that separates FS and dementia detection into two neural networks (namely, the operator and selector). The operator is trained on features obtained from the selector to reduce classification or regression loss. The selector is optimized to predict the operator's performance based on automatic regularization. Results show that the approach significantly reduces feature dimensionality while identifying small feature subsets that achieve performance comparable or superior to that of the full, default feature set. The Python code is available at https://github.com/kexquan/dual-dropout-ranking.


Subjects
Dementia; Neural Networks, Computer; Humans; Biomarkers; Dementia/diagnosis; Language
2.
Front Neurosci ; 17: 1351848, 2023.
Article in English | MEDLINE | ID: mdl-38292896

ABSTRACT

Introduction: Speaker diarization is an essential preprocessing step for diagnosing cognitive impairments from speech-based Montreal cognitive assessments (MoCA). Methods: This paper proposes three enhancements to conventional speaker diarization methods for such assessments. The enhancements tackle the challenges of diarizing MoCA recordings on two fronts. First, multi-scale channel interdependence speaker embedding is used as the front-end speaker representation for overcoming the acoustic mismatch caused by far-field microphones. Specifically, a squeeze-and-excitation (SE) unit and channel-dependent attention are added to Res2Net blocks for multi-scale feature aggregation. Second, a sequence comparison approach with a holistic view of the whole conversation is applied to measure the similarity of short speech segments in the conversation, which results in a speaker-turn aware scoring matrix for the subsequent clustering step. Third, to further enhance the diarization performance, we propose incorporating a pairwise similarity measure so that the speaker-turn aware scoring matrix contains both local and global information across the segments. Results: Evaluations on an interactive MoCA dataset show that the proposed enhancements lead to a diarization system that outperforms the conventional x-vector/PLDA systems under language-, age-, and microphone-mismatch scenarios. Discussion: The results also show that the proposed enhancements can help hypothesize the speaker-turn timestamps, making the diarization method amenable to datasets without timestamp information.

3.
IEEE Trans Neural Netw Learn Syst ; 33(5): 2236-2245, 2022 05.
Article in English | MEDLINE | ID: mdl-33373306

ABSTRACT

Domain adaptation aims to reduce the mismatch between the source and target domains. A domain adversarial network (DAN) has been recently proposed to incorporate adversarial learning into deep neural networks to create a domain-invariant space. However, DAN's major drawback is that it is difficult to find the domain-invariant space by using a single feature extractor. In this article, we propose to split the feature extractor into two contrastive branches, with one branch responsible for class dependence in the latent space and the other focusing on domain invariance. The feature extractor achieves these contrastive goals by sharing the first and last hidden layers but possessing decoupled branches in the middle hidden layers. To encourage the feature extractor to produce class-discriminative embedded features, the label predictor is adversarially trained to produce equal posterior probabilities across all of the outputs instead of producing one-hot outputs. We refer to the resulting domain adaptation network as "contrastive adversarial domain adaptation network (CADAN)." We evaluated the embedded features' domain invariance via a series of speaker identification experiments under both clean and noisy conditions. Results demonstrate that the embedded features produced by CADAN lead to a 33% improvement in speaker identification accuracy compared with the conventional DAN.


Subjects
Neural Networks, Computer; Recognition, Psychology; Learning
4.
IEEE Trans Neural Netw Learn Syst ; 33(1): 172-184, 2022 01.
Article in English | MEDLINE | ID: mdl-33035171

ABSTRACT

When training data are scarce, it is challenging to train a deep neural network without causing the overfitting problem. To overcome this challenge, this article proposes a new data augmentation network, namely the adversarial data augmentation network (ADAN), based on generative adversarial networks (GANs). The ADAN consists of a GAN, an autoencoder, and an auxiliary classifier. These networks are trained adversarially to synthesize class-dependent feature vectors in both the latent space and the original feature space, which can be added to the real training data for training classifiers. Instead of using the conventional cross-entropy loss for adversarial training, the Wasserstein divergence is used in an attempt to produce high-quality synthetic samples. The proposed networks were applied to speech emotion recognition using EmoDB and IEMOCAP as the evaluation data sets. It was found that by forcing the synthetic latent vectors and the real latent vectors to share a common representation, the gradient vanishing problem can be largely alleviated. Also, results show that the augmented data generated by the proposed networks are rich in emotion information. Thus, the resulting emotion classifiers are competitive with state-of-the-art speech emotion recognition systems.


Subjects
Machine Learning; Neural Networks, Computer; Emotions; Entropy; Speech
5.
IEEE J Biomed Health Inform ; 24(3): 717-727, 2020 03.
Article in English | MEDLINE | ID: mdl-31150349

ABSTRACT

Automatic classification of electrocardiogram (ECG) signals is important for diagnosing heart arrhythmias. A big challenge in automatic ECG classification is the variation in the waveforms and characteristics of ECG signals among different patients. To address this issue, this paper proposes adapting a patient-independent deep neural network (DNN) using the information in the patient-dependent identity vectors (i-vectors). The adapted networks, namely i-vector adapted patient-specific DNNs (iAP-DNNs), are tuned toward the ECG characteristics of individual patients. For each patient, his/her ECG waveforms are compressed into an i-vector using a factor analysis model. Then, this i-vector is injected into the middle hidden layer of the patient-independent DNN. Stochastic gradient descent is then applied to fine-tune the whole network to form a patient-specific classifier. As a result, the adaptation makes use of not only the raw ECG waveforms from the specific patient but also the compact representation of his/her ECG characteristics through the i-vector. Analysis of the hidden-layer activations shows that by leveraging the information in the i-vectors, the iAP-DNNs are more capable of discriminating normal heartbeats from arrhythmic heartbeats than the networks that use the patient-specific ECG only for the adaptation. Experimental results based on the MIT-BIH database suggest that the iAP-DNNs perform better than existing patient-specific classifiers in terms of various performance measures. In particular, the sensitivity and specificity of the existing methods are all under the receiver operating characteristic curves of the iAP-DNNs.


Subjects
Arrhythmias, Cardiac/diagnosis; Electrocardiography/methods; Heart Rate/physiology; Neural Networks, Computer; Algorithms; Electrocardiography/classification; Humans; Signal Processing, Computer-Assisted
6.
IEEE J Biomed Health Inform ; 23(4): 1574-1584, 2019 07.
Article in English | MEDLINE | ID: mdl-30235153

ABSTRACT

This paper proposes deep learning methods with signal alignment that facilitate the end-to-end classification of raw electrocardiogram (ECG) signals into heartbeat types, i.e., normal beat or different types of arrhythmias. Time-domain sample points are extracted from raw ECG signals, and consecutive vectors are extracted from a sliding time-window covering these sample points. Each of these vectors comprises the consecutive sample points of a complete heartbeat cycle, which includes not only the QRS complex but also the P and T waves. Unlike existing heartbeat classification methods in which medical doctors extract handcrafted features from raw ECG signals, the proposed end-to-end method leverages a deep neural network for both feature extraction and classification based on aligned heartbeats. This strategy not only obviates the need to handcraft the features but also produces optimized ECG representation for heartbeat classification. Evaluations on the MIT-BIH arrhythmia database show that at the same specificity, the proposed patient-independent classifier can detect supraventricular- and ventricular-ectopic beats at a sensitivity that is at least 10% higher than current state-of-the-art methods. More importantly, there is a wide range of operating points in which both the sensitivity and specificity of the proposed classifier are higher than those achieved by state-of-the-art classifiers. The proposed classifier can also perform comparably to patient-specific classifiers, but at the same time enjoys the advantage of patient independence.
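
The windowing step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_aligned_beats`, the assumption that R-peak positions are already known, and the window lengths `before` and `after` are all hypothetical.

```python
def extract_aligned_beats(signal, r_peaks, before, after):
    """Cut one fixed-length window per R-peak so every beat is aligned on its peak.

    `before` and `after` are the numbers of samples kept on each side of the
    peak; they are illustrative values, not taken from the paper.
    """
    beats = []
    for peak in r_peaks:
        start, end = peak - before, peak + after
        if start < 0 or end > len(signal):
            continue  # skip beats truncated by the record boundaries
        beats.append(signal[start:end])
    return beats


# Toy signal with two "beats"; the large values stand in for R-peaks.
signal = [0, 1, 0, 0, 9, 0, 1, 0, 0, 8, 0, 1]
beats = extract_aligned_beats(signal, r_peaks=[4, 9], before=3, after=3)
```

Each returned window has the same length and places the peak at the same index, which is what allows a network to consume the raw samples directly.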


Subjects
Deep Learning; Electrocardiography/methods; Signal Processing, Computer-Assisted; Adult; Aged; Aged, 80 and over; Arrhythmias, Cardiac/diagnosis; Female; Heart Rate/physiology; Humans; Male; Middle Aged; Young Adult
7.
Article in English | MEDLINE | ID: mdl-26887009

ABSTRACT

Predicting the localization of chloroplast proteins at the sub-subcellular level is an essential yet challenging step to elucidate their functions. Most of the existing subchloroplast localization predictors are limited to predicting single-location proteins and ignore the multi-location chloroplast proteins. While recent studies have led to some multi-location chloroplast predictors, they usually perform poorly. This paper proposes an ensemble transductive learning method to tackle this multi-label classification problem. Specifically, given a protein in a dataset, its composition-based sequence information and profile-based evolutionary information are respectively extracted. These two kinds of features are respectively compared with those of other proteins in the dataset. The comparisons lead to two similarity vectors which are weighted-combined to constitute an ensemble feature vector. A transductive learning model based on the least squares and nearest neighbor algorithms is proposed to process the ensemble features. We refer to the resulting predictor as EnTrans-Chlo. Experimental results on a stringent benchmark dataset and a novel dataset demonstrate that EnTrans-Chlo significantly outperforms state-of-the-art predictors and particularly gains more than 4% (absolute) improvement on the overall actual accuracy. For readers' convenience, EnTrans-Chlo is freely available online at http://bioinfo.eie.polyu.edu.hk/EnTransChloServer/.

8.
Bioinformatics ; 33(5): 749-750, 2017 03 01.
Article in English | MEDLINE | ID: mdl-28011780

ABSTRACT

Although many web-servers for predicting protein subcellular localization have been developed, they often have the following drawbacks: (i) lack of interpretability or interpreting results with heterogeneous information which may confuse users; (ii) ignoring multi-location proteins and (iii) focusing only on specific organisms. To tackle these problems, we present an interpretable and efficient web-server, namely FUEL-mLoc, using Feature-Unified prediction and Explanation of multi-Localization of cellular proteins in multiple organisms. Compared to conventional localization predictors, FUEL-mLoc has the following advantages: (i) using unified features (i.e. essential GO terms) to interpret why a prediction is made; (ii) being capable of predicting both single- and multi-location proteins and (iii) being able to handle proteins of multiple organisms, including Eukaryota, Homo sapiens, Viridiplantae, Gram-positive Bacteria, Gram-negative Bacteria and Virus. Experimental results demonstrate that FUEL-mLoc outperforms state-of-the-art subcellular-localization predictors. Availability and Implementation: http://bioinfo.eie.polyu.edu.hk/FUEL-mLoc/. Contacts: shibiao.wan@princeton.edu or enmwmak@polyu.edu.hk. Supplementary information: Supplementary data are available at Bioinformatics online.


Subjects
Computational Biology/methods; Proteins/metabolism; Software; Bacteria/metabolism; Cell Compartmentation; Eukaryota/metabolism; Humans; Protein Transport; Viruses/metabolism
9.
J Proteome Res ; 15(12): 4755-4762, 2016 12 02.
Article in English | MEDLINE | ID: mdl-27766879

ABSTRACT

In the postgenomic era, the number of unreviewed protein sequences is remarkably larger and grows tremendously faster than that of reviewed ones. However, existing methods for protein subchloroplast localization often ignore the information from these unlabeled proteins. This paper proposes a multi-label predictor based on ensemble linear neighborhood propagation (LNP), namely, LNP-Chlo, which leverages hybrid sequence-based feature information from both labeled and unlabeled proteins for predicting localization of both single- and multi-label chloroplast proteins. Experimental results on a stringent benchmark dataset and a novel independent dataset suggest that LNP-Chlo performs at least 6% (absolute) better than state-of-the-art predictors. This paper also demonstrates that ensemble LNP significantly outperforms LNP based on individual features. For readers' convenience, the online Web server LNP-Chlo is freely available at http://bioinfo.eie.polyu.edu.hk/LNPChloServer/.


Subjects
Chloroplast Proteins/metabolism; Chloroplasts/metabolism; Subcellular Fractions/chemistry; Chloroplasts/chemistry; Computational Biology/methods; Databases, Protein
10.
Data Brief ; 8: 105-7, 2016 Sep.
Article in English | MEDLINE | ID: mdl-27294176

ABSTRACT

Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. In this article, we provide data that are used for training and testing Mem-ADSVM (Wan et al., 2016. "Mem-ADSVM: a two-layer multi-label predictor for identifying multi-functional types of membrane proteins" [1]), a two-layer multi-label predictor for predicting multi-functional types of membrane proteins.

11.
J Theor Biol ; 398: 32-42, 2016 06 07.
Article in English | MEDLINE | ID: mdl-27000774

ABSTRACT

Identifying membrane proteins and their multi-functional types is an indispensable yet challenging topic in proteomics and bioinformatics. However, most of the existing membrane-protein predictors have the following problems: (1) they do not predict whether a given protein is a membrane protein or not; (2) they are limited to predicting membrane proteins with single-label functional types but ignore those with multi-functional types; and (3) there is still much room for improvement in their performance. To address these problems, this paper proposes a two-layer multi-label predictor, namely Mem-ADSVM, which can identify membrane proteins (Layer I) and their multi-functional types (Layer II). Specifically, given a query protein, its associated gene ontology (GO) information is retrieved by searching a compact GO-term database with its homologous accession number. Subsequently, the GO information is classified by a binary support vector machine (SVM) classifier to determine whether it is a membrane protein or not. If yes, it will be further classified by a multi-label multi-class SVM classifier equipped with an adaptive-decision (AD) scheme to determine to which functional type(s) it belongs. Experimental results show that Mem-ADSVM significantly outperforms state-of-the-art predictors in terms of identifying both membrane proteins and their multi-functional types. This paper also suggests that the two-layer prediction architecture outperforms the one-layer architecture. For readers' convenience, the Mem-ADSVM server is available online at http://bioinfo.eie.polyu.edu.hk/MemADSVMServer/.


Subjects
Membrane Proteins/analysis; Software; Algorithms; Databases, Protein; Decision Making; Gene Ontology; Reproducibility of Results
12.
BMC Bioinformatics ; 17: 97, 2016 Feb 24.
Article in English | MEDLINE | ID: mdl-26911432

ABSTRACT

BACKGROUND: Predicting protein subcellular localization is indispensable for inferring protein functions. Recent studies have been focusing on predicting not only single-location proteins, but also multi-location proteins. Almost all of the high performing predictors proposed recently use gene ontology (GO) terms to construct feature vectors for classification. Despite their high performance, their prediction decisions are difficult to interpret because of the large number of GO terms involved. RESULTS: This paper proposes using sparse regressions to exploit GO information for both predicting and interpreting subcellular localization of single- and multi-location proteins. Specifically, we compared two multi-label sparse regression algorithms, namely multi-label LASSO (mLASSO) and multi-label elastic net (mEN), for large-scale predictions of protein subcellular localization. Both algorithms can yield sparse and interpretable solutions. By using the one-vs-rest strategy, mLASSO and mEN identified 87 and 429 out of more than 8,000 GO terms, respectively, which play essential roles in determining subcellular localization. More interestingly, many of the GO terms selected by mEN are from the biological process and molecular function categories, suggesting that the GO terms of these categories also play vital roles in the prediction. With these essential GO terms, not only where a protein is located can be decided, but also why it resides there can be revealed. CONCLUSIONS: Experimental results show that the outputs of both mEN and mLASSO are interpretable and that they perform significantly better than existing state-of-the-art predictors. Moreover, mEN selects more features and performs better than mLASSO on a stringent human benchmark dataset. For readers' convenience, an online server called SpaPredictor for both mLASSO and mEN is available at http://bioinfo.eie.polyu.edu.hk/SpaPredictorServer/.
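
To illustrate how a one-vs-rest sparse regression can single out a small set of features, here is a minimal sketch using plain coordinate-descent LASSO. It is not the mLASSO or mEN implementation from the paper: the function names, the toy data, and the penalty value are assumptions chosen for illustration, and the objective minimized is (1/2n)·||y − Xw||² + penalty·||w||₁ without an intercept.

```python
def soft_threshold(value, penalty):
    """Shrink `value` toward zero by `penalty`; small values become exactly 0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, penalty, sweeps=50):
    """Fit LASSO weights by cycling through the features one at a time."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(sweeps):
        for j in range(d):
            rho, norm = 0.0, 0.0
            for i in range(n):
                # Prediction from every feature except j.
                partial = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, penalty) / (norm / n) if norm else 0.0
    return w


def select_features(X, Y, penalty):
    """One-vs-rest: fit one LASSO per label and pool the nonzero features."""
    selected = set()
    for label_targets in Y:
        weights = lasso_coordinate_descent(X, label_targets, penalty)
        selected.update(j for j, wj in enumerate(weights) if wj != 0.0)
    return sorted(selected)


# Toy data: feature 0 determines both labels, feature 1 is uninformative.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
Y = [[1, -1, 1, -1], [-1, 1, -1, 1]]  # one +1/-1 target list per label
weights = lasso_coordinate_descent(X, Y[0], penalty=0.1)
essential = select_features(X, Y, penalty=0.1)
```

In this toy case the uninformative feature receives a weight of exactly zero, which mirrors how the sparse solutions in the abstract reduce thousands of GO terms to a short, inspectable list.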


Subjects
Computational Biology/methods; Proteins/metabolism; Biological Phenomena; Humans; Protein Transport
13.
Article in English | MEDLINE | ID: mdl-26336143

ABSTRACT

Membrane proteins play important roles in various biological processes within organisms. Predicting the functional types of membrane proteins is indispensable to the characterization of membrane proteins. Recent studies have extended to predicting single- and multi-type membrane proteins. However, existing predictors perform poorly and, more importantly, they often lack interpretability. To address these problems, this paper proposes an efficient predictor, namely Mem-mEN, which can produce sparse and interpretable solutions for predicting membrane proteins with single- and multi-label functional types. Given a query membrane protein, its associated gene ontology (GO) information is retrieved by searching a compact GO-term database with its homologous accession number, which is subsequently classified by a multi-label elastic net (EN) classifier. Experimental results show that Mem-mEN significantly outperforms existing state-of-the-art membrane-protein predictors. Moreover, by using Mem-mEN, 338 out of more than 7,900 GO terms are found to play more essential roles in determining the functional types. Based on these 338 essential GO terms, Mem-mEN can not only predict the functional type of a membrane protein, but also explain why it belongs to that type. For readers' convenience, the Mem-mEN server is available online at http://bioinfo.eie.polyu.edu.hk/MemmENServer/.


Subjects
Computational Biology/methods; Gene Ontology; Membrane Proteins/genetics; Membrane Proteins/physiology; Models, Statistical; Databases, Protein; Membrane Proteins/chemistry; Membrane Proteins/metabolism; Neural Networks, Computer
14.
J Theor Biol ; 382: 223-34, 2015 Oct 07.
Article in English | MEDLINE | ID: mdl-26164062

ABSTRACT

Knowing the subcellular compartments of human proteins is essential to shed light on the mechanisms of a broad range of human diseases. In computational methods for protein subcellular localization, knowledge-based methods (especially gene ontology (GO) based methods) are known to perform better than sequence-based methods. However, existing GO-based predictors often lack interpretability and suffer from overfitting due to the high dimensionality of feature vectors. To address these problems, this paper proposes an interpretable multi-label predictor, namely mLASSO-Hum, which can yield sparse and interpretable solutions for large-scale prediction of human protein subcellular localization. By using the one-vs-rest LASSO-based classifiers, 87 out of more than 8000 GO terms are found to play more significant roles in determining the subcellular localization. Based on these 87 essential GO terms, we can decide not only where a protein resides within a cell, but also why it is located there. To further exploit information from the remaining GO terms, a method based on the GO hierarchical information derived from the depth distance of GO terms is proposed. Experimental results show that mLASSO-Hum performs significantly better than state-of-the-art predictors. We also found that in addition to the GO terms from the cellular component category, GO terms from the other two categories also play important roles in the final classification decisions. For readers' convenience, the mLASSO-Hum server is available online at http://bioinfo.eie.polyu.edu.hk/mLASSOHumServer/.


Subjects
Computational Biology/methods; Proteins/metabolism; Software; Databases, Protein; Gene Ontology; Gene Regulatory Networks; Humans; Reproducibility of Results; Statistics as Topic; Subcellular Fractions/metabolism
15.
Anal Biochem ; 473: 14-27, 2015 Mar 15.
Article in English | MEDLINE | ID: mdl-25449328

ABSTRACT

Proteins located in appropriate cellular compartments are of paramount importance to exert their biological functions. Prediction of protein subcellular localization by computational methods is required in the post-genomic era. Recent studies have been focusing on predicting not only single-location proteins but also multi-location proteins. However, most of the existing predictors are far from effective for tackling the challenges of multi-label proteins. This article proposes an efficient multi-label predictor, namely mPLR-Loc, based on penalized logistic regression and adaptive decisions for predicting both single- and multi-location proteins. Specifically, for each query protein, mPLR-Loc exploits the information from the Gene Ontology (GO) database by using its accession number (AC) or the ACs of its homologs obtained via BLAST. The frequencies of GO occurrences are used to construct feature vectors, which are then classified by an adaptive decision-based multi-label penalized logistic regression classifier. Experimental results based on two recent stringent benchmark datasets (virus and plant) show that mPLR-Loc remarkably outperforms existing state-of-the-art multi-label predictors. In addition to being able to rapidly and accurately predict subcellular localization of single- and multi-label proteins, mPLR-Loc can also provide probabilistic confidence scores for the prediction decisions. For readers' convenience, the mPLR-Loc server is available online (http://bioinfo.eie.polyu.edu.hk/mPLRLocServer).
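
The abstract does not spell out the adaptive decision rule. As an illustration only, one simple form of adaptive multi-label decision keeps every location whose score is within a fixed fraction of the top score, so that a protein always receives at least one label. The rule itself, the name `adaptive_decision`, the parameter `theta`, and the example scores below are assumptions, not details taken from the paper.

```python
def adaptive_decision(scores, theta=0.5):
    """Return the labels whose score is at least `theta` times the best score.

    `scores` maps each candidate location to a positive confidence score,
    for example a per-class probability from a logistic regression model.
    """
    best = max(scores.values())
    return sorted(label for label, s in scores.items() if s >= theta * best)


# Hypothetical per-location scores for one query protein.
scores = {"nucleus": 0.9, "cytoplasm": 0.6, "membrane": 0.1}
multi_label = adaptive_decision(scores, theta=0.5)
single_label = adaptive_decision(scores, theta=1.0)
```

Because the threshold scales with the top score, the top-scoring location is always returned, and raising `theta` toward 1.0 makes the predictor behave like a single-label classifier.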


Subjects
Computational Biology/methods; Intracellular Space/metabolism; Plant Proteins/metabolism; Gene Ontology; Logistic Models; Plant Proteins/genetics; Protein Transport; Viridiplantae/cytology
16.
J Theor Biol ; 360: 34-45, 2014 Nov 07.
Article in English | MEDLINE | ID: mdl-24997236

ABSTRACT

Locating proteins within cellular contexts is of paramount significance in elucidating their biological functions. Computational methods based on knowledge databases (such as gene ontology annotation (GOA) database) are known to be more efficient than sequence-based methods. However, the predominant scenarios of knowledge-based methods are that (1) knowledge databases typically have enormous size and are growing exponentially, (2) knowledge databases contain redundant information, and (3) the number of extracted features from knowledge databases is much larger than the number of data samples with ground-truth labels. These properties render the extracted features liable to redundant or irrelevant information, causing the prediction systems to suffer from overfitting. To address these problems, this paper proposes an efficient multi-label predictor, namely R3P-Loc, which uses two compact databases for feature extraction and applies random projection (RP) to reduce the feature dimensions of an ensemble ridge regression (RR) classifier. Two new compact databases are created from Swiss-Prot and GOA databases. These databases possess almost the same amount of information as their full-size counterparts but with much smaller size. Experimental results on two recent datasets (eukaryote and plant) suggest that R3P-Loc can reduce the dimensions sevenfold and significantly outperforms state-of-the-art predictors. This paper also demonstrates that the compact databases reduce the memory consumption by 39 times without causing degradation in prediction accuracy. For readers' convenience, the R3P-Loc server is available online at http://bioinfo.eie.polyu.edu.hk/R3PLocServer/.
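
As a rough illustration of the random projection step mentioned above, the sketch below multiplies a feature vector by a random Gaussian matrix to map it to a lower-dimensional space. The Gaussian construction, the function names, the seed handling, and the toy dimensions (70 to 10, chosen only to echo the sevenfold reduction) are assumptions; the abstract does not state which projection matrix R3P-Loc uses.

```python
import random


def random_projection_matrix(original_dim, reduced_dim, seed=0):
    """Build a reduced_dim x original_dim matrix of scaled Gaussian entries."""
    rng = random.Random(seed)
    scale = 1.0 / reduced_dim ** 0.5  # keeps expected squared norms comparable
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(original_dim)]
            for _ in range(reduced_dim)]


def project(vector, matrix):
    """Project a feature vector into the reduced space (matrix times vector)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# Toy 70-dimensional feature vector reduced to 10 dimensions.
features = [float(i % 3) for i in range(70)]
matrix = random_projection_matrix(len(features), 10, seed=1)
reduced = project(features, matrix)
```

An ensemble could be formed by repeating this with several seeds and training one classifier per projected space, although how the paper combines the ensemble members is not described in the abstract.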


Subjects
Databases, Genetic; Intracellular Space/metabolism; Models, Biological; Proteins/metabolism; Software; Internet
17.
PLoS One ; 9(3): e89545, 2014.
Article in English | MEDLINE | ID: mdl-24647341

ABSTRACT

Protein subcellular localization prediction, as an essential step to elucidate the functions in vivo of proteins and identify drug targets, has been extensively studied in previous decades. Instead of only determining subcellular localization of single-label proteins, recent studies have focused on predicting both single- and multi-location proteins. Computational methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. However, existing GO-based methods focus on the occurrences of GO terms and disregard their relationships. This paper proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, that leverages not only the GO term occurrences but also the inter-term relationships. This is achieved by hybridizing the GO frequencies of occurrences and the semantic similarity between GO terms. Given a protein, a set of GO terms is retrieved by searching against the gene ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequency of GO occurrences and semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive-decision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid-feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors. For readers' convenience, the HybridGO-Loc server, which is for predicting virus or plant proteins, is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.


Subjects
Gene Ontology/statistics & numerical data; Plant Proteins/analysis; Software; Support Vector Machine; Viral Proteins/analysis; Computational Biology; Databases, Protein; Molecular Sequence Annotation; Plant Cells/chemistry; Plant Cells/virology; Plants/chemistry; Sequence Analysis, Protein; Viruses/chemistry; Vocabulary, Controlled
18.
J Theor Biol ; 323: 40-8, 2013 Apr 21.
Article in English | MEDLINE | ID: mdl-23376577

ABSTRACT

Prediction of protein subcellular localization is an important yet challenging problem. Recently, several computational methods based on Gene Ontology (GO) have been proposed to tackle this problem and have demonstrated superiority over methods based on other features. Existing GO-based methods, however, do not fully use the GO information. This paper proposes an efficient GO method called GOASVM that exploits the information from the GO term frequencies and distant homologs to represent a protein in the general form of Chou's pseudo-amino acid composition. The method first selects a subset of relevant GO terms to form a GO vector space. Then for each protein, the method uses the accession number (AC) of the protein or the ACs of its homologs to find the number of occurrences of the selected GO terms in the Gene Ontology annotation (GOA) database as a means to construct GO vectors for support vector machines (SVMs) classification. With the advantages of GO term frequencies and a new strategy to incorporate useful homologous information, GOASVM can achieve a prediction accuracy of 72.2% on a new independent test set comprising novel proteins that were added to Swiss-Prot six years later than the creation date of the training set. GOASVM and Supplementary materials are available online at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/GOASVM.html.


Subjects
Amino Acids/metabolism; Computational Biology/methods; Molecular Sequence Annotation; Software; Animals; Databases, Protein; Humans; Reproducibility of Results; Subcellular Fractions/metabolism; Support Vector Machine
19.
BMC Bioinformatics ; 13: 290, 2012 Nov 06.
Article in English | MEDLINE | ID: mdl-23130999

ABSTRACT

BACKGROUND: Although many computational methods have been developed to predict protein subcellular localization, most of the methods are limited to the prediction of single-location proteins. Multi-location proteins are either not considered or assumed not to exist. However, proteins with multiple locations are particularly interesting because they may have special biological functions, which are essential to both basic research and drug discovery. RESULTS: This paper proposes an efficient multi-label predictor, namely mGOASVM, for predicting the subcellular localization of multi-location proteins. Given a protein, the accession numbers of its homologs are obtained via BLAST search. Then, the original accession number and the homologous accession numbers of the protein are used as keys to search against the Gene Ontology (GO) annotation database to obtain a set of GO terms. Given a set of training proteins, a set of T relevant GO terms is obtained by finding all of the GO terms in the GO annotation database that are relevant to the training proteins. These relevant GO terms then form the basis of a T-dimensional Euclidean space on which the GO vectors lie. A support vector machine (SVM) classifier with a new decision scheme is proposed to classify the multi-label GO vectors. The mGOASVM predictor has the following advantages: (1) it uses the frequency of occurrences of GO terms for feature representation; (2) it selects the relevant GO subspace which can substantially speed up the prediction without compromising performance; and (3) it adopts an efficient multi-label SVM classifier which significantly outperforms other predictors. Briefly, on two recently published virus and plant datasets, mGOASVM achieves an actual accuracy of 88.9% and 87.4%, respectively, which are significantly higher than those achieved by the state-of-the-art predictors such as iLoc-Virus (74.8%) and iLoc-Plant (68.1%). CONCLUSIONS: mGOASVM can efficiently predict the subcellular locations of multi-label proteins. The mGOASVM predictor is available online at http://bioinfo.eie.polyu.edu.hk/mGoaSvmServer/mGOASVM.html.
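
The construction of T-dimensional GO frequency vectors described in this abstract can be sketched as follows. This is an illustration under stated assumptions, not the mGOASVM code: the function names, the protein identifiers, and the GO-term assignments are hypothetical, and the database lookup and BLAST search are replaced by hard-coded lists.

```python
from collections import Counter


def build_go_vocabulary(training_annotations):
    """Collect the sorted set of GO terms seen across all training proteins."""
    terms = set()
    for go_terms in training_annotations.values():
        terms.update(go_terms)
    return sorted(terms)


def go_frequency_vector(go_terms, vocabulary):
    """Map a protein's retrieved GO terms to a T-dimensional count vector."""
    counts = Counter(go_terms)
    return [counts[term] for term in vocabulary]


# Hypothetical annotations: protein id -> GO terms retrieved for the protein
# and its homologs (repeats are kept so that frequencies are meaningful).
training_annotations = {
    "protein_A": ["GO:0005634", "GO:0005634", "GO:0005737"],
    "protein_B": ["GO:0005739", "GO:0005737"],
}
vocabulary = build_go_vocabulary(training_annotations)  # T = 3 here
query_vector = go_frequency_vector(
    ["GO:0005737", "GO:0005737", "GO:0009507"], vocabulary
)
```

Terms that never occur among the training proteins (here `GO:0009507`) fall outside the relevant subspace and are ignored, which keeps the vector dimension at T.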


Subjects
Computational Biology/methods; Intracellular Space/metabolism; Proteins/genetics; Proteins/metabolism; Software; Support Vector Machine; Algorithms; Databases, Protein; Genes
20.
Proteome Sci ; 9 Suppl 1: S8, 2011 Oct 14.
Article in English | MEDLINE | ID: mdl-22166017

ABSTRACT

BACKGROUND: The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. RESULTS: This paper proposes mitigating the computation burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localizations are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on the perturbational discriminant analysis (PDA). CONCLUSIONS: Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best property of signal- and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without significant reduction in subcellular localization accuracy. It was found that PDA enjoys a short training time as compared to the conventional SVM. We advocate that the method will be important for biologists to conduct large-scale protein annotation or for bioinformaticians to perform preliminary investigations on new algorithms that involve pairwise alignments.
