Search | VHL Regional Portal

Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure.

Gauch, Hugh G; Qian, Sheng; Piepho, Hans-Peter; Zhou, Linda; Chen, Rui.

PLoS One ; 14(6): e0218306, 2019.

Article in English | MEDLINE | ID: mdl-31211811

ABSTRACT

SNP datasets are high-dimensional, often with thousands to millions of SNPs and hundreds to thousands of samples or individuals. Accordingly, PCA graphs are frequently used to provide a low-dimensional visualization in order to display and discover patterns in SNP data from humans, animals, plants, and microbes, especially to elucidate population structure. PCA is not a single method that is always done the same way, but rather requires three choices which we explore as a three-way factorial: two kinds of PCA graphs by three SNP codings by six PCA variants. Our main three recommendations are simple and easily implemented: Use PCA biplots, SNP coding 1 for the rare allele and 0 for the common allele, and double-centered PCA (or AMMI1 if main effects are also of interest). We also document contemporary practices by a literature survey of 125 representative articles that apply PCA to SNP data, and find that virtually none implement our recommendations. The ultimate benefit from informed and optimal choices of PCA graph, SNP coding, and PCA variant is expected to be discovery of more biology, and thereby acceleration of medical, agricultural, and other vital applications.
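
The two computational recommendations in this abstract (coding the rare allele as 1, and double-centering before PCA) can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names, the SNPs-as-rows layout, the 0/1 input coding, and the symmetric square-root scaling of the biplot scores are all assumptions made for illustration.

```python
import numpy as np

def recode_rare_allele(genotypes):
    """Recode a 0/1 SNP-by-individual matrix so that 1 marks the rarer allele.

    Assumption: rows are SNPs, columns are individuals, entries are 0 or 1.
    """
    g = np.array(genotypes, dtype=float)  # copy, so the input is not mutated
    flip = g.mean(axis=1) > 0.5           # rows where the allele coded 1 is the common one
    g[flip] = 1.0 - g[flip]
    return g

def double_center(matrix):
    """Subtract row and column means and add back the grand mean."""
    x = np.asarray(matrix, dtype=float)
    return (x
            - x.mean(axis=1, keepdims=True)
            - x.mean(axis=0, keepdims=True)
            + x.mean())

def double_centered_pca(matrix, n_components=2):
    """PCA of the double-centered matrix via SVD.

    Returns row (SNP) scores and column (individual) scores, both of which
    would be plotted together in a biplot, plus all singular values.
    Scores use symmetric sqrt(singular value) scaling, one common convention.
    """
    dc = double_center(matrix)
    u, s, vt = np.linalg.svd(dc, full_matrices=False)
    k = n_components
    row_scores = u[:, :k] * np.sqrt(s[:k])
    col_scores = vt[:k].T * np.sqrt(s[:k])
    return row_scores, col_scores, s

# Hypothetical toy data: 3 SNPs (rows) by 4 individuals (columns).
toy = [[1, 1, 1, 0],
       [0, 0, 1, 0],
       [1, 0, 1, 1]]
coded = recode_rare_allele(toy)
snp_scores, individual_scores, singular_values = double_centered_pca(coded)
```

With this scaling, the product of the row scores and the transposed column scores over all components reproduces the double-centered matrix, which is what lets SNPs and individuals share one biplot.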

Subject(s)

Genetics, Population; Genotype; Molecular Biology/statistics & numerical data; Polymorphism, Single Nucleotide/genetics; Alleles; Animals; Avena/genetics; Bacteria/genetics; Genetic Predisposition to Disease; Humans; Plants/genetics; Principal Component Analysis

T-REX: software for the processing and analysis of T-RFLP data.

Culman, Steven W; Bukowski, Robert; Gauch, Hugh G; Cadillo-Quiroz, Hinsby; Buckley, Daniel H.

BMC Bioinformatics ; 10: 171, 2009 Jun 06.

Article in English | MEDLINE | ID: mdl-19500385

ABSTRACT

BACKGROUND: Despite increasing popularity and improvements in terminal restriction fragment length polymorphism (T-RFLP) and other microbial community fingerprinting techniques, there are still numerous obstacles that hamper the analysis of these datasets. Many steps are required to process raw data into a format ready for analysis and interpretation. These steps can be time-intensive, error-prone, and can introduce unwanted variability into the analysis. Accordingly, we developed T-REX, free, online software for the processing and analysis of T-RFLP data. RESULTS: Analysis of T-RFLP data generated from a multiple-factorial study was performed with T-REX. With this software, we were able to i) label raw data with attributes related to the experimental design of the samples, ii) determine a baseline threshold for identification of true peaks over noise, iii) align terminal restriction fragments (T-RFs) in all samples (i.e., bin T-RFs), iv) construct a two-way data matrix from labeled data and process the matrix in a variety of ways, v) produce several measures of data matrix complexity, including the distribution of variance between main and interaction effects and sample heterogeneity, and vi) analyze a data matrix with the additive main effects and multiplicative interaction (AMMI) model. CONCLUSION: T-REX provides a free, platform-independent tool to the research community that allows for an integrated, rapid, and more robust analysis of T-RFLP data.
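
Step v) above, the distribution of variance between main and interaction effects in a two-way data matrix, can be sketched as a sum-of-squares partition. This is a minimal illustration, not T-REX's implementation; the function name `partition_sum_of_squares`, the dictionary keys, and the toy matrix are hypothetical.

```python
import numpy as np

def partition_sum_of_squares(matrix):
    """Split the total sum of squares of a two-way matrix into
    row main effects, column main effects, and interaction."""
    x = np.asarray(matrix, dtype=float)
    n_rows, n_cols = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1, keepdims=True)
    col_means = x.mean(axis=0, keepdims=True)

    # Interaction residuals are what remains after removing additive effects;
    # an AMMI analysis would apply SVD to this residual matrix.
    interaction = x - row_means - col_means + grand

    return {
        "rows": n_cols * float(((row_means - grand) ** 2).sum()),
        "columns": n_rows * float(((col_means - grand) ** 2).sum()),
        "interaction": float((interaction ** 2).sum()),
        "total": float(((x - grand) ** 2).sum()),
    }

# Hypothetical purely additive 2x2 matrix: the interaction term is zero.
ss = partition_sum_of_squares([[1, 2], [3, 4]])
```

For a complete matrix the three components add up to the total, so each can be reported as a share of the overall variation.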

Subject(s)

DNA Fingerprinting/methods; Polymorphism, Restriction Fragment Length; Sequence Analysis, DNA/methods; Software; Analysis of Variance; Database Management Systems; Databases, Genetic; Genes, Bacterial; Internet; Models, Statistical; Sequence Alignment; User-Computer Interface

Simple sequence repeats in Neurospora crassa: distribution, polymorphism and evolutionary inference.

Kim, Tae-Sung; Booth, James G; Gauch, Hugh G; Sun, Qi; Park, Jongsun; Lee, Yong-Hwan; Lee, Kwangwon.

BMC Genomics ; 9: 31, 2008 Jan 23.

Article in English | MEDLINE | ID: mdl-18215294

ABSTRACT

BACKGROUND: Simple sequence repeats (SSRs) have been successfully used for various genetic and evolutionary studies in eukaryotic systems. The eukaryotic model organism Neurospora crassa is an excellent system to study evolution and biological function of SSRs. RESULTS: We identified and characterized 2749 SSRs of 963 SSR types in the genome of N. crassa. The distribution of tri-nucleotide (nt) SSRs, the most common SSRs in N. crassa, was significantly biased in exons. We further characterized the distribution of 19 abundant SSR types (AST), which account for 71% of total SSRs in the N. crassa genome, using a Poisson log-linear model. We also characterized the size variation of SSRs among natural accessions using Polymorphic Index Content (PIC) and ANOVA analyses and found that there are genome-wide, chromosome-dependent and local-specific variations. Using polymorphic SSRs, we have built linkage maps from three line-cross populations. CONCLUSION: Taking our computational, statistical and experimental data together, we conclude that 1) the distributions of the SSRs in the sequenced N. crassa genome differ systematically between chromosomes as well as between SSR types, 2) the size variation of tri-nt SSRs in exons might be an important mechanism in generating functional variation of proteins in N. crassa, 3) there are different levels of evolutionary forces in variation of amino acid repeats, and 4) SSRs are stable molecular markers for genetic studies in N. crassa.

Subject(s)

Evolution, Molecular; Microsatellite Repeats; Neurospora crassa/genetics; Polymorphism, Genetic; Expressed Sequence Tags; Genetic Markers; Genome, Fungal
