Search | VHL Regional Portal

1.

Classification of Parkinson's disease from smartphone recording data using time-frequency analysis and convolutional neural network.

Worasawate, Denchai; Asawaponwiput, Warisara; Yoshimura, Natsue; Intarapanich, Apichart; Surangsrirat, Decho.

Technol Health Care ; 31(2): 705-718, 2023.

Article in English | MEDLINE | ID: mdl-36155539

ABSTRACT

BACKGROUND: Parkinson's disease (PD) is a long-term neurodegenerative disease of the central nervous system. The current diagnosis is dependent on clinical observation and the abilities and experience of a trained specialist. One of the symptoms that affects most patients is voice impairment. OBJECTIVE: Voice samples are non-invasive data that can be collected remotely for diagnosis and disease progression monitoring. In this study, we analyzed voice recording data from a smartphone as a possible medical self-diagnosis tool by using only a one-second voice recording. The data from one of the largest mobile PD studies, the mPower study, were used. METHODS: A total of 29,798 ten-second smartphone voice recordings from 4,051 participants were used for the analysis. The voice recordings were from sustained phonation by participants saying /aa/ for ten seconds into an iPhone microphone. A dataset comprising 385,143 short one-second audio samples was generated from the original ten-second voice recordings. The samples were converted to spectrograms using a short-time Fourier transform. CNN models were then applied to classify the samples. RESULTS: Classification accuracies of the proposed method with LeNet-5, ResNet-50, and VGGNet-16 are 97.7 ± 0.1%, 98.6 ± 0.2%, and 99.3 ± 0.1%, respectively. CONCLUSIONS: We achieve a respectable classification performance using a generalized approach on a dataset with a large number of samples. The result emphasizes that an analysis based on a one-second clip recorded on a smartphone could be a promising non-invasive and remotely available PD biomarker.
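
The preprocessing described in the abstract (cutting a recording into one-second clips and converting each clip to a spectrogram with a short-time Fourier transform) can be sketched as follows. This is an illustration only, not the authors' pipeline: the non-overlapping clip cutting, the Hann window, the frame length, the hop size and the synthetic test tone are all assumptions, and the function names `split_into_clips` and `stft_magnitude` are hypothetical. A practical implementation would more likely rely on a numerical library than on this plain-Python discrete Fourier transform.

```python
import cmath
import math


def split_into_clips(samples, sample_rate, clip_seconds=1.0):
    """Cut a recording into consecutive, non-overlapping clips (assumed scheme).

    Any trailing samples that do not fill a whole clip are dropped.
    """
    clip_len = int(sample_rate * clip_seconds)
    return [samples[i:i + clip_len]
            for i in range(0, len(samples) - clip_len + 1, clip_len)]


def stft_magnitude(clip, frame_len=64, hop=32):
    """Magnitude spectrogram: one row per time frame, one column per frequency bin."""
    # Hann window (an assumed choice) to reduce spectral leakage in each frame.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    spectrogram = []
    for start in range(0, len(clip) - frame_len + 1, hop):
        frame = [clip[start + n] * window[n] for n in range(frame_len)]
        row = []
        # Only the non-negative frequencies are needed for a real-valued signal.
        for k in range(frame_len // 2 + 1):
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            row.append(abs(coeff))
        spectrogram.append(row)
    return spectrogram


# Synthetic stand-in for a ten-second recording: a pure 100 Hz tone at 800 Hz sampling.
sample_rate = 800
tone_hz = 100
recording = [math.sin(2 * math.pi * tone_hz * t / sample_rate)
             for t in range(10 * sample_rate)]

clips = split_into_clips(recording, sample_rate)
spec = stft_magnitude(clips[0])
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
```

With these assumed settings the bin spacing is 800 / 64 = 12.5 Hz, so the 100 Hz tone falls in bin 8. Each resulting spectrogram would then be the image-like input passed to a CNN classifier.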

Subject(s)

Neurodegenerative Diseases , Parkinson Disease , Voice , Humans , Parkinson Disease/diagnosis , Smartphone , Neural Networks, Computer

2.

Host-Guest Interactions of Plumbagin with β-Cyclodextrin, Dimethyl-β-Cyclodextrin and Hydroxypropyl-β-Cyclodextrin: Semi-Empirical Quantum Mechanical PM6 and PM7 Methods.

Srihakulung, Ornin; Maezono, Ryo; Toochinda, Pisanu; Kongprawechnon, Waree; Intarapanich, Apichart; Lawtrakul, Luckhana.

Sci Pharm ; 86(2), 2018 May 15.

Article in English | MEDLINE | ID: mdl-29762548

ABSTRACT

Molecular interactions of plumbagin inclusion complexes with β-cyclodextrin (BCD), dimethyl-β-cyclodextrin (MBCD), and hydroxypropyl-β-cyclodextrin (HPBCD) were investigated by the semi-empirical Parameterization Method 6 and 7 (PM6 and PM7) in the aqueous phase using polarizable continuum calculations. The results revealed two different binding modes of the plumbagin molecule inside the BCD cavity with a negative value of the complexation energy. In conformation-I, the hydroxyl phenolic group of plumbagin was placed in the BCD cavity near the narrow side of the host molecule. In the other model, conformation-II, the methyl quinone group of plumbagin was placed in the cavity of BCD near the narrow side of the host molecule. The more negative the complexation energy, the more favorable the pathway of inclusion-complex formation.

3.

ToNER: A tool for identifying nucleotide enrichment signals in feature-enriched RNA-seq data.

Promworn, Yuttachon; Kaewprommal, Pavita; Shaw, Philip J; Intarapanich, Apichart; Tongsima, Sissades; Piriyapongsa, Jittima.

PLoS One ; 12(5): e0178483, 2017.

Article in English | MEDLINE | ID: mdl-28542466

ABSTRACT

BACKGROUND: Biochemical methods are available for enriching 5' ends of RNAs in prokaryotes, which are employed in the differential RNA-seq (dRNA-seq) and the more recent Cappable-seq protocols. Computational methods are needed to locate RNA 5' ends from these data by statistical analysis of the enrichment. Although statistical-based analysis methods have been developed for dRNA-seq, they may not be suitable for Cappable-seq data. The more efficient enrichment method employed in Cappable-seq compared with dRNA-seq could affect data distribution and thus algorithm performance. RESULTS: We present Transformation of Nucleotide Enrichment Ratios (ToNER), a tool for statistical modeling of enrichment from RNA-seq data obtained from enriched and unenriched libraries. The tool calculates nucleotide enrichment scores and determines the global transformation for fitting to the normal distribution using the Box-Cox procedure. From the transformed distribution, sites of significant enrichment are identified. To increase the power of detection, meta-analysis across experimental replicates is offered. We tested the tool on Cappable-seq and dRNA-seq data for identifying Escherichia coli transcript 5' ends and compared the results with those from the TSSAR tool, which is designed for analyzing dRNA-seq data. When combining results across Cappable-seq replicates, ToNER detects more known transcript 5' ends than TSSAR. In general, the transcript 5' ends detected by ToNER but not TSSAR occur in regions which cannot be locally modeled by TSSAR. CONCLUSION: ToNER uses a simple yet robust statistical modeling approach, which can be used for detecting RNA 5' ends from Cappable-seq data, in particular when combining information from experimental replicates. The ToNER tool could potentially be applied for analyzing other RNA-seq datasets in which enrichment for other structural features of RNA is employed. The program is freely available for download at the ToNER webpage (http://www4a.biotec.or.th/GI/tools/toner) and the GitHub repository (https://github.com/PavitaKae/ToNER).
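
The idea of scoring enrichment per nucleotide and applying a Box-Cox transformation so that the scores are approximately normal can be sketched as below. This is not the ToNER code: the pseudocount in the ratio, the grid search over lambda, the z-score step and every function name here are assumptions made for illustration, and the read counts are invented.

```python
import math


def enrichment_ratios(enriched, unenriched, pseudocount=1.0):
    """Per-site ratio of enriched to unenriched read counts (pseudocount is assumed)."""
    return [(e + pseudocount) / (u + pseudocount)
            for e, u in zip(enriched, unenriched)]


def box_cox(values, lam):
    """Box-Cox transform of positive values; lam == 0 is the log transform."""
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def box_cox_log_likelihood(values, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    transformed = box_cox(values, lam)
    n = len(values)
    mean = sum(transformed) / n
    variance = sum((t - mean) ** 2 for t in transformed) / n
    return (-0.5 * n * math.log(variance)
            + (lam - 1.0) * sum(math.log(v) for v in values))


def best_lambda(values, grid):
    """Pick the lambda on the grid with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: box_cox_log_likelihood(values, lam))


def z_scores(values):
    """Standardize values; large positive scores suggest enriched sites."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


# Invented read counts for eight sites; site 4 is strongly enriched.
enriched = [3, 9, 4, 7, 120, 5, 8, 2]
unenriched = [4, 8, 5, 6, 5, 6, 7, 3]

ratios = enrichment_ratios(enriched, unenriched)
grid = [i / 4 for i in range(-8, 9)]          # lambda from -2.0 to 2.0
lam = best_lambda(ratios, grid)
scores = z_scores(box_cox(ratios, lam))
top_site = scores.index(max(scores))
```

Because the Box-Cox transform is monotonically increasing for any lambda, the site with the largest ratio also receives the largest z-score. Significance would then be judged against the fitted normal distribution.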

Subject(s)

Nucleotides/genetics , RNA/genetics , Algorithms , Escherichia coli/genetics , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, RNA/methods

4.

Fast processing of microscopic images using object-based extended depth of field.

Intarapanich, Apichart; Kaewkamnerd, Saowaluck; Pannarut, Montri; Shaw, Philip J; Tongsima, Sissades.

BMC Bioinformatics ; 17(Suppl 19): 516, 2016 Dec 22.

Article in English | MEDLINE | ID: mdl-28155648

ABSTRACT

BACKGROUND: Microscopic analysis requires that foreground objects of interest, e.g. cells, are in focus. In a typical microscopic specimen, the foreground objects may lie on different depths of field necessitating capture of multiple images taken at different focal planes. The extended depth of field (EDoF) technique is a computational method for merging images from different depths of field into a composite image with all foreground objects in focus. Composite images generated by EDoF can be applied in automated image processing and pattern recognition systems. However, current algorithms for EDoF are computationally intensive and impractical, especially for applications such as medical diagnosis where rapid sample turnaround is important. Since foreground objects typically constitute a minor part of an image, the EDoF technique could be made to work much faster if only foreground regions are processed to make the composite image. We propose a novel algorithm called object-based extended depths of field (OEDoF) to address this issue. METHODS: The OEDoF algorithm consists of four major modules: 1) color conversion, 2) object region identification, 3) good contrast pixel identification and 4) detail merging. First, the algorithm employs color conversion to enhance contrast followed by identification of foreground pixels. A composite image is constructed using only these foreground pixels, which dramatically reduces the computational time. RESULTS: We used 250 images obtained from 45 specimens of confirmed malaria infections to test our proposed algorithm. The resulting composite images with all in-focus objects were produced using the proposed OEDoF algorithm. We measured the performance of OEDoF in terms of image clarity (quality) and processing time. 
The features of interest selected by the OEDoF algorithm are comparable in quality with equivalent regions in images processed by the state-of-the-art complex wavelet EDoF algorithm; however, OEDoF required four times less processing time. CONCLUSIONS: This work presents a modification of the extended depth of field approach for efficiently enhancing microscopic images. This selective object processing scheme used in OEDoF can significantly reduce the overall processing time while maintaining the clarity of important image features. The empirical results from parasite-infected red cell images revealed that our proposed method efficiently and effectively produced in-focus composite images. With the speed improvement of OEDoF, this proposed algorithm is suitable for processing large numbers of microscope images, e.g., as required for medical diagnosis.
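
The core idea of processing only foreground pixels when merging focal planes can be sketched as follows. This is a simplified illustration, not the OEDoF algorithm itself: the color-conversion module is omitted, the foreground test (deviation from an assumed background gray level), the 4-neighbour contrast measure and the names `local_contrast` and `merge_foreground` are all assumptions.

```python
def local_contrast(image, row, col):
    """Absolute difference between a pixel and the mean of its 4-neighbours."""
    height, width = len(image), len(image[0])
    neighbours = [image[r][c]
                  for r, c in ((row - 1, col), (row + 1, col),
                               (row, col - 1), (row, col + 1))
                  if 0 <= r < height and 0 <= c < width]
    return abs(image[row][col] - sum(neighbours) / len(neighbours))


def merge_foreground(stack, background_level, tolerance):
    """Build a composite from a focal stack, touching only foreground pixels.

    stack: list of grayscale images (lists of rows) taken at different focal planes.
    A pixel counts as foreground if any plane differs from the background level
    by more than the tolerance (an assumed, deliberately simple rule).
    """
    reference = stack[0]
    composite = [list(row) for row in reference]   # background copied as-is
    for row in range(len(reference)):
        for col in range(len(reference[0])):
            is_foreground = any(
                abs(plane[row][col] - background_level) > tolerance
                for plane in stack)
            if is_foreground:
                # Keep the value from the plane where this pixel is sharpest.
                sharpest = max(stack,
                               key=lambda plane: local_contrast(plane, row, col))
                composite[row][col] = sharpest[row][col]
    return composite


# Two 3x3 focal planes: the centre object is blurred in one and sharp in the other.
blurred = [[200, 200, 200], [200, 150, 200], [200, 200, 200]]
sharp = [[200, 200, 200], [200, 50, 200], [200, 200, 200]]

composite = merge_foreground([blurred, sharp], background_level=200, tolerance=20)
```

Only the centre pixel is classified as foreground here, so the contrast comparison runs once instead of nine times, which is the source of the speed-up the abstract describes.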

Subject(s)

Algorithms , Image Processing, Computer-Assisted/methods , Malaria, Falciparum/diagnosis , Microscopy/methods , Models, Biological , Pattern Recognition, Automated/methods , Signal Processing, Computer-Assisted , Humans , Malaria, Falciparum/parasitology , Plasmodium falciparum/isolation & purification

5.

Automatic DNA Diagnosis for 1D Gel Electrophoresis Images using Bio-image Processing Technique.

Intarapanich, Apichart; Kaewkamnerd, Saowaluck; Shaw, Philip J; Ukosakit, Kittipat; Tragoonrung, Somvong; Tongsima, Sissades.

BMC Genomics ; 16 Suppl 12: S15, 2015.

Article in English | MEDLINE | ID: mdl-26681167

ABSTRACT

BACKGROUND: DNA gel electrophoresis is a molecular biology technique for separating different sizes of DNA fragments. Applications of DNA gel electrophoresis include DNA fingerprinting (genetic diagnosis), size estimation of DNA, and DNA separation for Southern blotting. Accurate interpretation of DNA banding patterns from electrophoretic images can be laborious and error prone when a large number of bands are interrogated manually. Although many bio-imaging techniques have been proposed, none of them can fully automate the typing of DNA owing to the complexities of migration patterns typically obtained. RESULTS: We developed an image-processing tool that automatically calls genotypes from DNA gel electrophoresis images. The image processing workflow comprises three main steps: 1) lane segmentation, 2) extraction of DNA bands and 3) band genotyping classification. The tool was originally intended to facilitate large-scale genotyping analysis of sugarcane cultivars. We tested the proposed tool on 10 gel images (433 cultivars) obtained from polyacrylamide gel electrophoresis (PAGE) of PCR amplicons for detecting intron length polymorphisms (ILP) on one locus of the sugarcanes. These gel images demonstrated many challenges in automated lane/band segmentation in image processing including lane distortion, band deformity, high degree of noise in the background, and bands that are very close together (doublets). Using the proposed bio-imaging workflow, lanes and DNA bands contained within are properly segmented, even for adjacent bands with aberrant migration that cannot be separated by conventional techniques. The software, called GELect, automatically performs genotype calling on each lane by comparing with an all-banding reference, which was created by clustering the existing bands into the non-redundant set of reference bands. The automated genotype calling results were verified by independent manual typing by molecular biologists. 
CONCLUSIONS: This work presents an automated genotyping tool for DNA gel electrophoresis images, called GELect, which was written in Java and made available through the ImageJ framework. With a novel automated image processing workflow, the tool can accurately segment lanes from a gel matrix and intelligently extract distorted and even doublet bands that are difficult to identify with existing image processing tools. Consequently, genotyping from DNA gel electrophoresis can be performed automatically, allowing users to efficiently conduct large-scale DNA fingerprinting via DNA gel electrophoresis. The software is freely available from http://www.biotec.or.th/gi/tools/gelect.

Subject(s)

DNA Fingerprinting/methods , DNA, Plant/analysis , Saccharum/genetics , Automation, Laboratory , Electrophoresis, Polyacrylamide Gel/methods , Genotype , Image Processing, Computer-Assisted/methods , Polymorphism, Genetic , Software

6.

iNJclust: Iterative Neighbor-Joining Tree Clustering Framework for Inferring Population Structure.

Limpiti, Tulaya; Amornbunchornvej, Chainarong; Intarapanich, Apichart; Assawamakin, Anunchai; Tongsima, Sissades.

IEEE/ACM Trans Comput Biol Bioinform ; 11(5): 903-14, 2014.

Article in English | MEDLINE | ID: mdl-26356862

ABSTRACT

Understanding genetic differences among populations is one of the most important issues in population genetics. Genetic variations, e.g., single nucleotide polymorphisms, are used to characterize commonality and difference of individuals from various populations. This paper presents an efficient graph-based clustering framework, called the iNJclust algorithm, which operates iteratively on the Neighbor-Joining (NJ) tree. The framework uses well-known genetic measurements, namely the allele-sharing distance, the neighbor-joining tree, and the fixation index. The behavior of the fixation index is utilized in the algorithm's stopping criterion. The algorithm provides an estimated number of populations, individual assignments, and relationships between populations as outputs. The clustering result is reported in the form of a binary tree, whose terminal nodes represent the final inferred populations and whose structure preserves the genetic relationships among them. The clustering performance and the robustness of the proposed algorithm are tested extensively using simulated and real data sets from bovine, sheep, and human populations. The results indicate that the number of populations within each data set is reasonably estimated, the individual assignment is robust, and the structure of the inferred population tree corresponds to the intrinsic relationships among populations within the data.
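
The allele-sharing distance named in the abstract is the input to neighbor-joining tree construction. A minimal sketch of one common definition is given below; the 0/1/2 genotype coding, the normalization to the range 0 to 1, the handling of missing calls as `None` and the function names are assumptions, and the abstract does not state which variant iNJclust uses.

```python
def allele_sharing_distance(genotypes_a, genotypes_b):
    """Average fraction of alleles NOT shared between two individuals.

    Genotypes are coded as counts of one allele (0, 1 or 2) per SNP, so the
    number of shared alleles at a locus is 2 - |a - b|. Loci with a missing
    call (None) in either individual are skipped.
    """
    pairs = [(a, b) for a, b in zip(genotypes_a, genotypes_b)
             if a is not None and b is not None]
    if not pairs:
        raise ValueError("no loci with genotype calls in both individuals")
    return sum(abs(a - b) for a, b in pairs) / (2 * len(pairs))


def distance_matrix(individuals):
    """Symmetric matrix of pairwise allele-sharing distances."""
    n = len(individuals)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = allele_sharing_distance(individuals[i], individuals[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Three invented individuals typed at four SNPs.
individuals = [
    [0, 1, 2, 0],
    [0, 1, 2, 1],
    [2, 1, 0, 2],
]
asd = distance_matrix(individuals)
```

A matrix like `asd` is what a neighbor-joining routine would consume; the first two individuals are close (distance 0.125) while the third is far from both.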

Subject(s)

Cluster Analysis , Genetics, Population/methods , Genomics/methods , Algorithms , Animals , Cattle , Computer Simulation , Databases, Genetic , Humans , Sheep , Software

7.

MetaSel: a metaphase selection tool using a Gaussian-based classification technique.

Uttamatanin, Ravi; Yuvapoositanon, Peerapol; Intarapanich, Apichart; Kaewkamnerd, Saowaluck; Phuksaritanon, Ratsapan; Assawamakin, Anunchai; Tongsima, Sissades.

BMC Bioinformatics ; 14 Suppl 16: S13, 2013.

Article in English | MEDLINE | ID: mdl-24564477

ABSTRACT

BACKGROUND: Identification of good metaphase spreads is an important step in chromosome analysis for identifying individuals with genetic disorders. The process of finding suitable metaphase chromosomes for accurate clinical analysis is, however, very time consuming since they are selected manually. The selection of suitable metaphase chromosome spreads thus represents a major bottleneck for conventional cytogenetic analysis. Although many algorithms have been developed for karyotyping, none have adequately addressed the critical bottleneck of selecting suitable chromosome spreads. In this paper, we present a software tool that uses a simple rule-based system to efficiently identify metaphase spreads suitable for karyotyping. RESULTS: The chromosome shapes can be classified by the software into four main classes. The first and the second classes refer to individual chromosomes with straight and skewed shapes, respectively. The third class is characterized as those chromosomes with overlapping bodies and the fourth class is for the non-chromosome objects. Good metaphase spreads should largely contain chromosomes of the first and the second classes, while the third class should be kept minimal. Several image parameters were examined and used for creating rule-based classification. The threshold value for each parameter is determined using a statistical model. We observed that the Gaussian model can represent the empirical probability density function of the parameters and, hence, the threshold value can be easily determined. The proposed rules can efficiently and accurately classify the individual chromosome with > 90% accuracy. CONCLUSIONS: The software tool, termed MetaSel, was developed. Using the Gaussian-based rules, the tool can be used to quickly rank hundreds of chromosome spread images so as to assist cytogeneticists to perform karyotyping effectively. 
Furthermore, MetaSel offers an intuitive, yet comprehensive, workflow to assist karyotyping, including tools for editing chromosomes (split, merge and fix) and a karyotyping editor (moving, rotating, and pairing homologous chromosomes). The program can be freely downloaded from http://www4a.biotec.or.th/GI/tools/metasel.
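
Deriving a rule threshold from a Gaussian fitted to an image parameter can be sketched as follows. This is a hypothetical illustration rather than MetaSel's actual rules: the parameter (an aspect ratio), the training values, the 95% coverage and the names `gaussian_thresholds` and `passes_rule` are all assumptions.

```python
from statistics import NormalDist


def gaussian_thresholds(values, coverage=0.95):
    """Fit a normal distribution and return the central interval with the given coverage."""
    model = NormalDist.from_samples(values)
    tail = (1.0 - coverage) / 2.0
    return model.inv_cdf(tail), model.inv_cdf(1.0 - tail)


def passes_rule(value, thresholds):
    """A rule accepts an object if its parameter lies inside the fitted interval."""
    low, high = thresholds
    return low <= value <= high


# Invented aspect ratios measured on chromosomes labelled as "straight".
straight_aspect_ratios = [3.8, 4.1, 4.0, 3.9, 4.2, 4.0, 4.1, 3.9]
thresholds = gaussian_thresholds(straight_aspect_ratios)

is_typical = passes_rule(4.0, thresholds)     # near the fitted mean
is_outlier = passes_rule(6.5, thresholds)     # far outside the fitted interval
```

One such interval per parameter and per class would give a set of rules that can be evaluated quickly on every object in a spread image.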

Subject(s)

Chromosomes/classification , Image Processing, Computer-Assisted/methods , Karyotyping/methods , Metaphase , Software , Algorithms , Humans , Models, Statistical , Normal Distribution

8.

An automatic device for detection and classification of malaria parasite species in thick blood film.

Kaewkamnerd, Saowaluck; Uthaipibull, Chairat; Intarapanich, Apichart; Pannarut, Montri; Chaotheing, Sastra; Tongsima, Sissades.

BMC Bioinformatics ; 13 Suppl 17: S18, 2012.

Article in English | MEDLINE | ID: mdl-23281600

ABSTRACT

BACKGROUND: Current malaria diagnosis relies primarily on microscopic examination of Giemsa-stained thick and thin blood films. This method requires rigorously trained technicians to efficiently detect and classify malaria parasite species such as Plasmodium falciparum (Pf) and Plasmodium vivax (Pv) for appropriate drug administration. However, accurate classification of parasite species is difficult to achieve because of inherent technical limitations and human inconsistency. To improve performance of malaria parasite classification, many researchers have proposed automated malaria detection devices using digital image analysis. These image processing tools, however, focus on detection of parasites on thin blood films, which may not detect the existence of parasites due to the parasite scarcity on the thin blood film. The problem is aggravated under low-parasitemia conditions. Automated detection and classification of parasites on thick blood films, which contain more parasites per detection area, would address the previous limitation. RESULTS: The prototype of an automatic malaria parasite identification system is equipped with mountable motorized units for controlling the movements of the objective lens and microscope stage. This unit was tested for its precision in moving the objective lens (vertical movement, z-axis) and the microscope stage (horizontal movements, x- and y-axes). The average precisions of the x-, y- and z-axis movements were 71.481 ± 7.266 µm, 40.009 ± 0.000 µm, and 7.540 ± 0.889 nm, respectively. Classification of parasites on 60 Giemsa-stained thick blood films (40 blood films containing infected red blood cells and 20 control blood films of normal red blood cells) was tested using the image analysis module. By comparing our results with the ones verified by trained malaria microscopists, the prototype detected parasite-positive and parasite-negative blood films at the rate of 95% and 68.5% accuracy, respectively. For classification performance, thick blood films with Pv parasites were correctly classified with a success rate of 75%, while the accuracy of Pf classification was 90%. CONCLUSIONS: This work presents an automatic device for both detection and classification of malaria parasite species on thick blood film. The system is based on digital image analysis and features motorized stage units designed to be easily mounted on most conventional light microscopes used in the endemic areas. The constructed motorized module could control the movements of the objective lens and microscope stage at high precision for effective acquisition of quality images for analysis. The analysis program could accurately classify parasite species, into Pf or Pv, based on distribution of chromatin size.

Subject(s)

Erythrocytes/parasitology , Image Processing, Computer-Assisted/methods , Malaria/diagnosis , Microscopy/methods , Plasmodium/classification , Plasmodium/isolation & purification , Animals , Chromatin/ultrastructure , Humans , Malaria/blood , Malaria/parasitology , Malaria, Falciparum/blood , Malaria, Falciparum/diagnosis , Malaria, Falciparum/parasitology , Parasitemia/blood , Parasitemia/parasitology , Plasmodium falciparum/classification , Plasmodium falciparum/isolation & purification , Plasmodium vivax/classification , Plasmodium vivax/isolation & purification

9.

iLOCi: a SNP interaction prioritization technique for detecting epistasis in genome-wide association studies.

Piriyapongsa, Jittima; Ngamphiw, Chumpol; Intarapanich, Apichart; Kulawonganunchai, Supasak; Assawamakin, Anunchai; Bootchai, Chaiwat; Shaw, Philip J; Tongsima, Sissades.

BMC Genomics ; 13 Suppl 7: S2, 2012.

Article in English | MEDLINE | ID: mdl-23281813

ABSTRACT

BACKGROUND: Genome-wide association studies (GWAS) do not provide a full account of the heritability of genetic diseases since gene-gene interactions, also known as epistasis, are not considered in single locus GWAS. To address this problem, a considerable number of methods have been developed for identifying disease-associated gene-gene interactions. However, these methods typically fail to identify interacting markers explaining more of the disease heritability over single locus GWAS, since many of the interactions significant for disease are obscured by uninformative marker interactions, e.g., linkage disequilibrium (LD). RESULTS: In this study, we present a novel SNP interaction prioritization algorithm, named iLOCi (Interacting Loci). This algorithm accounts for marker dependencies separately in case and control groups. Disease-associated interactions are then prioritized according to a novel ranking score calculated from the difference in marker dependencies for every possible pair between case and control groups. The analysis of a typical GWAS dataset can be completed in less than a day on a standard workstation with parallel processing capability. The proposed framework was validated using simulated data and applied to real GWAS datasets using the Wellcome Trust Case Control Consortium (WTCCC) data. The results from simulated data showed the ability of iLOCi to identify various types of gene-gene interactions, especially high-order interactions. From the WTCCC data, we found that among the top ranked interacting SNP pairs, several mapped to genes previously known to be associated with disease, and interestingly, other previously unreported genes with biologically related roles. CONCLUSION: iLOCi is a powerful tool for uncovering true disease interacting markers and thus can provide a more complete understanding of the genetic basis underlying complex disease. The program is available for download at http://www4a.biotec.or.th/GI/tools/iloci.
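
The ranking principle described in the abstract (measure the dependency of each SNP pair separately in cases and in controls, then rank pairs by the difference) can be sketched as follows. The abstract does not define the dependency measure or the score, so this sketch substitutes the Pearson correlation of genotype vectors and an absolute difference; both choices, the function names and the toy genotypes are assumptions and should not be read as the iLOCi score.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length vectors; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_pairs(cases, controls):
    """Rank SNP pairs by how much their dependency differs between cases and controls.

    cases / controls: dict mapping SNP name to a list of genotypes (0, 1 or 2),
    one entry per individual in that group.
    """
    scores = {}
    for a, b in combinations(sorted(cases), 2):
        dependency_cases = pearson(cases[a], cases[b])
        dependency_controls = pearson(controls[a], controls[b])
        scores[(a, b)] = abs(dependency_cases - dependency_controls)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: snp1 and snp2 move together in cases but not in controls.
cases = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],
    "snp3": [1, 0, 1, 2, 1, 0],
}
controls = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [2, 0, 1, 1, 2, 0],
    "snp3": [1, 0, 1, 2, 1, 0],
}

ranked = rank_pairs(cases, controls)
```

A dependency that is equally strong in both groups, such as one caused by linkage disequilibrium, yields a score near zero and drops to the bottom of the ranking, which matches the motivation given in the abstract.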

Subject(s)

Algorithms , Epistasis, Genetic/genetics , Genome-Wide Association Study , Polymorphism, Single Nucleotide/genetics , Humans , Linkage Disequilibrium , ROC Curve

10.

Study of large and highly stratified population datasets by combining iterative pruning principal component analysis and structure.

Limpiti, Tulaya; Intarapanich, Apichart; Assawamakin, Anunchai; Shaw, Philip J; Wangkumhang, Pongsakorn; Piriyapongsa, Jittima; Ngamphiw, Chumpol; Tongsima, Sissades.

BMC Bioinformatics ; 12: 255, 2011 Jun 23.

Article in English | MEDLINE | ID: mdl-21699684

ABSTRACT

BACKGROUND: The ever increasing sizes of population genetic datasets pose great challenges for population structure analysis. The Tracy-Widom (TW) statistical test is widely used for detecting structure. However, it has not been adequately investigated whether the TW statistic is susceptible to type I error, especially in large, complex datasets. Non-parametric, Principal Component Analysis (PCA) based methods for resolving structure have been developed which rely on the TW test. Although PCA-based methods can resolve structure, they cannot infer ancestry. Model-based methods are still needed for ancestry analysis, but they are not suitable for large datasets. We propose a new structure analysis framework for large datasets. This includes a new heuristic for detecting structure and incorporation of the structure patterns inferred by a PCA method to complement STRUCTURE analysis. RESULTS: A new heuristic called EigenDev for detecting population structure is presented. When tested on simulated data, this heuristic is robust to sample size. In contrast, the TW statistic was found to be susceptible to type I error, especially for large population samples. EigenDev is thus better-suited for analysis of large datasets containing many individuals, in which spurious patterns are likely to exist and could be incorrectly interpreted as population stratification. EigenDev was applied to the iterative pruning PCA (ipPCA) method, which resolves the underlying subpopulations. This subpopulation information was used to supervise STRUCTURE analysis to infer patterns of ancestry at an unprecedented level of resolution. To validate the new approach, a bovine and a large human genetic dataset (3945 individuals) were analyzed. We found new ancestry patterns consistent with the subpopulations resolved by ipPCA. CONCLUSIONS: The EigenDev heuristic is robust to sampling and is thus superior for detecting structure in large datasets. 
The application of EigenDev to the ipPCA algorithm improves the estimation of the number of subpopulations and the individual assignment accuracy, especially for very large and complex datasets. Furthermore, we have demonstrated that the structure resolved by this approach complements parametric analysis, allowing a much more comprehensive account of population structure. The new version of the ipPCA software with EigenDev incorporated can be downloaded from http://www4a.biotec.or.th/GI/tools/ippca.

Subject(s)

Algorithms , Cattle/genetics , Population Groups/genetics , Principal Component Analysis , Animals , Artificial Intelligence , Genetics, Population , Genome, Human , Haplotypes , Humans

11.

pHCR: a parallel haplotype configuration reduction algorithm for haplotype interaction analysis.

Makarasara, Wattanan; Kumasaka, Natsuhiko; Assawamakin, Anunchai; Takahashi, Atsushi; Intarapanich, Apichart; Ngamphiw, Chumpol; Kulawonganunchai, Supasak; Ruangrit, Uttapong; Fucharoen, Suthat; Kamatani, Naoyuki; Tongsima, Sissades.

J Hum Genet ; 54(11): 634-41, 2009 Nov.

Article in English | MEDLINE | ID: mdl-19927163

ABSTRACT

Finding gene interaction models is one of the most important issues in genotype-phenotype association studies. This paper presents a model-free nonparametric statistical interaction analysis known as Parallel Haplotype Configuration Reduction (pHCR). This technique extends the original Multifactor Dimensionality Reduction (MDR) algorithm by using haplotype contribution values (c-values) and a haplotype interaction scheme instead of analyzing interactions among single-nucleotide polymorphisms. The proposed algorithm uses the statistical power of haplotypes to obtain a gene-gene interaction model. pHCR computes a statistical value for each haplotype, which contributes to the phenotype, and then performs haplotype interaction analysis on the basis of the cumulative c-value of each individual haplotype. To address the high computational complexity of pHCR, this paper also presents a scalable parallel computing solution. Nine common two-locus disease models were used to evaluate the algorithm performance under different scenarios. In all cases, pHCR had higher power to detect gene-gene interaction than MDR run on the same data set. We also compared pHCR with FAMHAP, which mainly considers haplotypes in the association analysis. For every experiment on the simulated data set, pHCR correctly produced haplotype interactions with far fewer false positives. We also challenged pHCR with a real data set of beta-thalassemia/Hemoglobin E (HbE) disease. The result suggested an interaction between two previously reported quantitative trait loci of the fetal hemoglobin level, which is a major modifying factor, and disease severity of beta-thalassemia/HbE disease.

Subject(s)

Algorithms , Computational Biology/methods , Haplotypes , Alleles , Genetic Association Studies/methods , Genetic Predisposition to Disease , Hemoglobin E/genetics , Humans , Reproducibility of Results , beta-Thalassemia/genetics

12.

Iterative pruning PCA improves resolution of highly structured populations.

Intarapanich, Apichart; Shaw, Philip J; Assawamakin, Anunchai; Wangkumhang, Pongsakorn; Ngamphiw, Chumpol; Chaichoompu, Kridsadakorn; Piriyapongsa, Jittima; Tongsima, Sissades.

BMC Bioinformatics ; 10: 382, 2009 Nov 23.

Article in English | MEDLINE | ID: mdl-19930644

ABSTRACT

BACKGROUND: Non-random patterns of genetic variation exist among individuals in a population owing to a variety of evolutionary factors. Therefore, populations are structured into genetically distinct subpopulations. As genotypic datasets become ever larger, it is increasingly difficult to correctly estimate the number of subpopulations and assign individuals to them. The computationally efficient non-parametric, chiefly Principal Components Analysis (PCA)-based methods are thus becoming increasingly relied upon for population structure analysis. Current PCA-based methods can accurately detect structure; however, the accuracy in resolving subpopulations and assigning individuals to them is wanting. When subpopulations are closely related to one another, they overlap in PCA space and appear as a conglomerate. This problem is exacerbated when some subpopulations in the dataset are genetically far removed from others. We propose a novel PCA-based framework which addresses this shortcoming. RESULTS: A novel population structure analysis algorithm called iterative pruning PCA (ipPCA) was developed which assigns individuals to subpopulations and infers the total number of subpopulations present. Genotypic data from simulated and real population datasets with different degrees of structure were analyzed. For datasets with simple structures, the subpopulation assignments of individuals made by ipPCA were largely consistent with the STRUCTURE, BAPS and AWclust algorithms. On the other hand, highly structured populations containing many closely related subpopulations could be accurately resolved only by ipPCA, and not by other methods. CONCLUSION: The algorithm is computationally efficient and not constrained by the dataset complexity. 
This systematic subpopulation assignment approach removes the need for prior population labels, which could be advantageous when cryptic stratification is encountered in datasets containing individuals otherwise assumed to belong to a homogeneous population.

Subject(s)

Computational Biology/methods , Population/genetics , Principal Component Analysis/methods , Algorithms , Animals , Genetic Variation , Genetics, Population , Humans , Models, Genetic
