Search | VHL Regional Portal

1.

Michael Waterman's Contributions to Computational Biology and Bioinformatics.

Pevzner, Pavel; Vingron, Martin; Reidys, Christian; Sun, Fengzhu; Istrail, Sorin.

J Comput Biol ; 29(7): 601-615, 2022 07.

Article in English | MEDLINE | ID: mdl-35727100

ABSTRACT

On the occasion of Dr. Michael Waterman's 80th birthday, we review his major contributions to the field of computational biology and bioinformatics including the famous Smith-Waterman algorithm for sequence alignment, the probability and statistics theory related to sequence alignment, algorithms for sequence assembly, the Lander-Waterman model for genome physical mapping, combinatorics and predictions of ribonucleic acid structures, word counting statistics in molecular sequences, alignment-free sequence comparison, and algorithms for haplotype block partition and tagSNP selection related to the International HapMap Project. His books Introduction to Computational Biology: Maps, Sequences and Genomes for graduate students and Computational Genome Analysis: An Introduction geared toward undergraduate students played key roles in computational biology and bioinformatics education. We also highlight his efforts of building the computational biology and bioinformatics community as the founding editor of the Journal of Computational Biology and a founding member of the International Conference on Research in Computational Molecular Biology (RECOMB).

Subject(s)

Algorithms , Computational Biology , Genome , Humans , Sequence Alignment

2.

Combinatorial and statistical prediction of gene expression from haplotype sequence.

Alpay, Berk A; Demetci, Pinar; Istrail, Sorin; Aguiar, Derek.

Bioinformatics ; 36(Suppl_1): i194-i202, 2020 07 01.

Article in English | MEDLINE | ID: mdl-32657373

ABSTRACT

MOTIVATION: Genome-wide association studies (GWAS) have discovered thousands of significant genetic effects on disease phenotypes. By considering gene expression as the intermediary between genotype and disease phenotype, expression quantitative trait loci studies have interpreted many of these variants by their regulatory effects on gene expression. However, there remains a considerable gap between genotype-to-gene expression association and genotype-to-gene expression prediction. Accurate prediction of gene expression enables gene-based association studies to be performed post hoc for existing GWAS, reduces multiple testing burden, and can prioritize genes for subsequent experimental investigation. RESULTS: In this work, we develop gene expression prediction methods that relax the independence and additivity assumptions between genetic markers. First, we consider gene expression prediction from a regression perspective and develop the HAPLEXR algorithm which combines haplotype clusterings with allelic dosages. Second, we introduce the new gene expression classification problem, which focuses on identifying expression groups rather than continuous measurements; we formalize the selection of an appropriate number of expression groups using the principle of maximum entropy. Third, we develop the HAPLEXD algorithm that models haplotype sharing with a modified suffix tree data structure and computes expression groups by spectral clustering. In both models, we penalize model complexity by prioritizing genetic clusters that indicate significant effects on expression. We compare HAPLEXR and HAPLEXD with three state-of-the-art expression prediction methods and two novel logistic regression approaches across five GTEx v8 tissues. HAPLEXD exhibits significantly higher classification accuracy overall; HAPLEXR shows higher prediction accuracy on approximately half of the genes tested and the largest number of best predicted genes (r² > 0.1) among all methods. We show that variant and haplotype features selected by HAPLEXR are smaller in size than competing methods (and thus more interpretable) and are significantly enriched in functional annotations related to gene regulation. These results demonstrate the importance of explicitly modeling non-dosage dependent and intragenic epistatic effects when predicting expression. AVAILABILITY AND IMPLEMENTATION: Source code and binaries are freely available at https://github.com/rapturous/HAPLEX. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genome-Wide Association Study , Polymorphism, Single Nucleotide , Gene Expression , Haplotypes , Phenotype , Quantitative Trait Loci

3.

Proteinarium: Multi-sample protein-protein interaction analysis and visualization tool.

Armanious, David; Schuster, Jessica; Tollefson, George A; Agudelo, Anthony; DeWan, Andrew T; Istrail, Sorin; Padbury, James; Uzun, Alper.

Genomics ; 112(6): 4288-4296, 2020 11.

Article in English | MEDLINE | ID: mdl-32702417

ABSTRACT

We posit the likely architecture of complex diseases is that subgroups of patients share variants in genes in specific networks which are sufficient to give rise to a shared phenotype. We developed Proteinarium, a multi-sample protein-protein interaction (PPI) tool, to identify clusters of patients with shared gene networks. Proteinarium converts user defined seed genes to protein symbols and maps them onto the STRING interactome. A PPI network is built for each sample using Dijkstra's algorithm. Pairwise similarity scores are calculated to compare the networks and cluster the samples. A layered graph of PPI networks for the samples in any cluster can be visualized. To test this newly developed analysis pipeline, we reanalyzed publicly available data sets, from which modest outcomes had previously been achieved. We found significant clusters of patients with unique genes which enhanced the findings in the original study.

Subject(s)

Protein Interaction Mapping/methods , Software , Cluster Analysis , Computer Graphics , Female , Humans , Male , Pregnancy , Premature Birth , Prostatic Hyperplasia/genetics , Prostatic Hyperplasia/metabolism , Protein Interaction Maps , Transcriptome
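
The network-building and comparison steps described in the Proteinarium abstract above can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy interactome `TOY_INTERACTOME`, its edge costs, the rule that a sample network is the union of shortest paths between seed pairs, and the use of Jaccard similarity as the pairwise score. The abstract states only that Dijkstra's algorithm and pairwise similarity scores are used, so this is not the tool's actual implementation.

```python
import heapq
from itertools import combinations

# Hypothetical undirected interactome: node -> {neighbour: edge cost}.
TOY_INTERACTOME = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 1, "D": 5},
    "C": {"A": 4, "B": 1, "D": 1},
    "D": {"B": 5, "C": 1, "E": 1},
    "E": {"D": 1},
}

def dijkstra_path(graph, source, target):
    """Return the lowest-cost path from source to target, or None if unreachable."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry
        visited.add(node)
        if node == target:
            break
        for nbr, cost in graph.get(node, {}).items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

def sample_network(graph, seeds):
    """Union of the edges on shortest paths between every pair of seed genes."""
    edges = set()
    for a, b in combinations(sorted(seeds), 2):
        path = dijkstra_path(graph, a, b)
        if path:
            edges.update(frozenset(pair) for pair in zip(path, path[1:]))
    return edges

def jaccard(edges_1, edges_2):
    """Jaccard similarity of two edge sets (assumed similarity score)."""
    union = edges_1 | edges_2
    return len(edges_1 & edges_2) / len(union) if union else 1.0

net_1 = sample_network(TOY_INTERACTOME, {"A", "D"})
net_2 = sample_network(TOY_INTERACTOME, {"B", "E"})
similarity = jaccard(net_1, net_2)
```

A matrix of such pairwise scores could then be passed to any standard clustering routine to group samples.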

4.

Preface Special Issue: RECOMB 2018.

Istrail, Sorin.

J Comput Biol ; 27(3): 301, 2020 03.

Article in English | MEDLINE | ID: mdl-32160037

5.

Eric Davidson's Regulatory Genome for Computer Science: Causality, Logic, and Proof Principles of the Genomic cis-Regulatory Code.

Istrail, Sorin.

J Comput Biol ; 26(7): 653-684, 2019 07.

Article in English | MEDLINE | ID: mdl-31356126

Subject(s)

Computational Biology , Gene Expression Regulation , Genetic Code , Genome , Logic , Regulatory Sequences, Nucleic Acid/genetics , Binding Sites/genetics , Databases, Genetic , Gene Regulatory Networks , Semantics , Transcription Factors/metabolism

6.

How Does the Regulatory Genome Work?

Istrail, Sorin; Peter, Isabelle S.

J Comput Biol ; 26(7): 685-695, 2019 07.

Article in English | MEDLINE | ID: mdl-31166788

ABSTRACT

The regulatory genome controls genome activity throughout the life of an organism. This requires that complex information processing functions are encoded in, and operated by, the regulatory genome. Although much remains to be learned about how the regulatory genome works, we here discuss two cases where regulatory functions have been experimentally dissected in great detail and at the systems level, and formalized by computational logic models. Both examples derive from the sea urchin embryo, but assess two distinct organizational levels of genomic information processing. The first example shows how the regulatory system of a single gene, endo16, executes logic operations through individual transcription factor binding sites and cis-regulatory modules that control the expression of this gene. The second example shows information processing at the gene regulatory network (GRN) level. The GRN controlling development of the sea urchin endomesoderm has been experimentally explored at an almost complete level. A Boolean logic model of this GRN suggests that the modular logic functions encoded at the single-gene level show compositionality and suffice to account for integrated function at the network level. We discuss these examples both from a biological-experimental point of view and from a computer science-informational point of view, as both illuminate principles of how the regulatory genome works.

Subject(s)

Gene Regulatory Networks , Genome , Animals , Endoderm/embryology , Endoderm/metabolism , Gene Expression Regulation, Developmental , Mesoderm/embryology , Mesoderm/metabolism
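
To make the idea of a Boolean logic model of a gene regulatory network concrete, the sketch below simulates a toy network with synchronous updates. The four genes and their rules are hypothetical and are not taken from the endo16 or endomesoderm models discussed in the abstract; they show only how single-gene logic functions compose into network-level behavior.

```python
def step(state, rules):
    """Synchronously apply every gene's Boolean rule to the current state."""
    return {gene: bool(rule(state)) for gene, rule in rules.items()}

def simulate(state, rules, max_steps=20):
    """Iterate until a fixed point is reached or max_steps is exhausted."""
    trajectory = [dict(state)]
    for _ in range(max_steps):
        nxt = step(trajectory[-1], rules)
        if nxt == trajectory[-1]:
            break
        trajectory.append(nxt)
    return trajectory

# Hypothetical cis-regulatory logic: each rule reads the previous time step.
RULES = {
    "signal": lambda s: s["signal"],        # external input, held constant
    "activator": lambda s: s["signal"],     # switched on by the signal
    "repressor": lambda s: s["activator"],  # switched on by the activator
    "target": lambda s: s["activator"] and not s["repressor"],  # AND-NOT gate
}

INITIAL = {"signal": True, "activator": False, "repressor": False, "target": False}
TRAJECTORY = simulate(INITIAL, RULES)
TARGET_OVER_TIME = [s["target"] for s in TRAJECTORY]
```

In this toy circuit the target gene is expressed for a single time step before the delayed repressor shuts it off, a network-level behavior that follows entirely from the per-gene logic rules.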

7.

Global Comparison of Drug Resistance Mutations After First-Line Antiretroviral Therapy Across Human Immunodeficiency Virus-1 Subtypes.

Huang, Austin; Hogan, Joseph W; Luo, Xi; DeLong, Allison; Saravanan, Shanmugam; Wu, Yasong; Sirivichayakul, Sunee; Kumarasamy, Nagalingeswaran; Zhang, Fujie; Phanuphak, Praphan; Diero, Lameck; Buziba, Nathan; Istrail, Sorin; Katzenstein, David A; Kantor, Rami.

Open Forum Infect Dis ; 3(2): ofv158, 2016 Apr.

Article in English | MEDLINE | ID: mdl-27419147

ABSTRACT

Background. Human immunodeficiency virus (HIV)-1 drug resistance mutations (DRMs) often accompany treatment failure. Although subtype differences are widely studied, DRM comparisons between subtypes either focus on specific geographic regions or include populations with heterogeneous treatments. Methods. We characterized DRM patterns following first-line failure and their impact on future treatment in a global, multi-subtype reverse-transcriptase sequence dataset. We developed a hierarchical modeling approach to address the high-dimensional challenge of modeling and comparing frequencies of multiple DRMs in varying first-line regimens, durations, and subtypes. Drug resistance mutation co-occurrence was characterized using a novel application of a statistical network model. Results. Among 1425 sequences (202 subtype B, 696 C, 44 G, 351 circulating recombinant form [CRF] 01_AE, 58 CRF02_AG, and 74 from other subtypes), mutation frequencies were higher in subtypes C and CRF01_AE than in B overall. Mutation frequency increased by 9%-20% at reverse transcriptase positions 41, 67, 70, 184, 215, and 219 in subtype C and CRF01_AE vs B. Subtype C and CRF01_AE exhibited higher predicted cross-resistance (+12%-18%) to future therapy options compared with subtype B. Topologies of subtype mutation networks were mostly similar. Conclusions. We find clear differences in DRM outcomes following first-line failure, suggesting subtype-specific ecological or biological factors that determine DRM patterns.

8.

Eric Davidson: Master of the universe.

Istrail, Sorin.

Dev Biol ; 412(2 Suppl): S47-54, 2016 Apr 15.

Article in English | MEDLINE | ID: mdl-26825397

Subject(s)

Developmental Biology/history , Animals , Friends , Gene Regulatory Networks , History, 20th Century , History, 21st Century , Sea Urchins

9.

Transcriptome of American oysters, Crassostrea virginica, in response to bacterial challenge: insights into potential mechanisms of disease resistance.

McDowell, Ian C; Nikapitiya, Chamilani; Aguiar, Derek; Lane, Christopher E; Istrail, Sorin; Gomez-Chiarri, Marta.

PLoS One ; 9(8): e105097, 2014.

Article in English | MEDLINE | ID: mdl-25122115

ABSTRACT

The American oyster Crassostrea virginica, an ecologically and economically important estuarine organism, can suffer high mortalities in areas in the Northeast United States due to Roseovarius Oyster Disease (ROD), caused by the gram-negative bacterial pathogen Roseovarius crassostreae. The goals of this research were to provide insights into: 1) the responses of American oysters to R. crassostreae, and 2) potential mechanisms of resistance or susceptibility to ROD. The responses of oysters to bacterial challenge were characterized by exposing oysters from ROD-resistant and susceptible families to R. crassostreae, followed by high-throughput sequencing of cDNA samples from various timepoints after disease challenge. Sequence data was assembled into a reference transcriptome and analyzed through differential gene expression and functional enrichment to uncover genes and processes potentially involved in responses to ROD in the American oyster. While susceptible oysters experienced constant levels of mortality when challenged with R. crassostreae, resistant oysters showed levels of mortality similar to non-challenged oysters. Oysters exposed to R. crassostreae showed differential expression of transcripts involved in immune recognition, signaling, protease inhibition, detoxification, and apoptosis. Transcripts involved in metabolism were enriched in susceptible oysters, suggesting that bacterial infection places a large metabolic demand on these oysters. Transcripts differentially expressed in resistant oysters in response to infection included the immune modulators IL-17 and arginase, as well as several genes involved in extracellular matrix remodeling. The identification of potential genes and processes responsible for defense against R. crassostreae in the American oyster provides insights into potential mechanisms of disease resistance.

Subject(s)

Ostreidae/genetics , Rhodobacteraceae/pathogenicity , Transcriptome , Animals , Gene Expression Regulation , Ostreidae/microbiology

10.

Tumor haplotype assembly algorithms for cancer genomics.

Aguiar, Derek; Wong, Wendy S W; Istrail, Sorin.

Pac Symp Biocomput ; : 3-14, 2014.

Article in English | MEDLINE | ID: mdl-24297529

ABSTRACT

The growing availability of inexpensive high-throughput sequence data is enabling researchers to sequence tumor populations within a single individual at high coverage. However, cancer genome sequence evolution and mutational phenomena like driver mutations and gene fusions are difficult to investigate without first reconstructing tumor haplotype sequences. Haplotype assembly of single individual tumor populations is an exceedingly difficult task complicated by tumor haplotype heterogeneity, tumor or normal cell sequence contamination, polyploidy, and complex patterns of variation. While computational and experimental haplotype phasing of diploid genomes has seen much progress in recent years, haplotype assembly in cancer genomes remains uncharted territory. In this work, we describe HapCompass-Tumor, a computational modeling and algorithmic framework for haplotype assembly of copy number variable cancer genomes containing haplotypes at different frequencies and complex variation. We extend our polyploid haplotype assembly model and present novel algorithms for (1) complex variations, including copy number changes, as varying numbers of disjoint paths in an associated graph, (2) variable haplotype frequencies and contamination, and (3) computation of tumor haplotypes using simple cycles of the compass graph which constrain the space of haplotype assembly solutions. The model and algorithm are implemented in the software package HapCompass-Tumor, which is available for download from http://www.brown.edu/Research/Istrail_Lab/.

Subject(s)

Algorithms , Haplotypes , Neoplasms/genetics , Computational Biology , DNA Copy Number Variations , Genome, Human , Genomics/statistics & numerical data , Humans , Models, Genetic , Polyploidy , Translocation, Genetic

11.

Pathway-based analysis of genomic variation data.

Atias, Nir; Istrail, Sorin; Sharan, Roded.

Curr Opin Genet Dev ; 23(6): 622-6, 2013 Dec.

Article in English | MEDLINE | ID: mdl-24209906

ABSTRACT

A holy grail of genetics is to decipher the mapping from genotype to phenotype. Recent advances in sequencing technologies allow the efficient genotyping of thousands of individuals carrying a particular phenotype in an effort to reveal its genetic determinants. However, the interpretation of these data entails tackling significant statistical and computational problems that stem from the complexity of human phenotypes and the huge genotypic search space. Recently, an alternative pathway-level analysis has been employed to combat these problems. In this review we discuss these developments, describe the challenges involved and outline possible solutions and future directions for improvement.

Subject(s)

Computational Biology/methods , Gene Regulatory Networks , Genome-Wide Association Study/methods , Signal Transduction/genetics , Genotype , Humans , Models, Genetic , Phenotype , Polymorphism, Single Nucleotide

12.

Haplotype assembly in polyploid genomes and identical by descent shared tracts.

Aguiar, Derek; Istrail, Sorin.

Bioinformatics ; 29(13): i352-60, 2013 Jul 01.

Article in English | MEDLINE | ID: mdl-23813004

ABSTRACT

MOTIVATION: Genome-wide haplotype reconstruction from sequence data, or haplotype assembly, is at the center of major challenges in molecular biology and life sciences. For complex eukaryotic organisms like humans, the genome is vast and the population samples are growing so rapidly that algorithms processing high-throughput sequencing data must scale favorably in terms of both accuracy and computational efficiency. Furthermore, current models and methodologies for haplotype assembly (i) do not consider individuals sharing haplotypes jointly, which reduces the size and accuracy of assembled haplotypes, and (ii) are unable to model genomes having more than two sets of homologous chromosomes (polyploidy). Polyploid organisms are increasingly becoming the target of many research groups interested in the genomics of disease, phylogenetics, botany and evolution but there is an absence of theory and methods for polyploid haplotype reconstruction. RESULTS: In this work, we present a number of results, extensions and generalizations of compass graphs and our HapCompass framework. We prove the theoretical complexity of two haplotype assembly optimizations, thereby motivating the use of heuristics. Furthermore, we present graph theory-based algorithms for the problem of haplotype assembly using our previously developed HapCompass framework for (i) novel implementations of haplotype assembly optimizations (minimum error correction), (ii) assembly of a pair of individuals sharing a haplotype tract identical by descent and (iii) assembly of polyploid genomes. We evaluate our methods on 1000 Genomes Project, Pacific Biosciences and simulated sequence data. AVAILABILITY AND IMPLEMENTATION: HapCompass is available for download at http://www.brown.edu/Research/Istrail_Lab/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Genome, Human , Haplotypes , Polyploidy , Sequence Analysis, DNA/methods , Algorithms , Genomics/methods , Humans
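
The minimum error correction (MEC) objective named in the abstract above can be stated compactly for the diploid case: given sequence fragments and a candidate haplotype, count the fewest allele corrections needed so that every fragment agrees with either the haplotype or its complement. The sketch below is an illustration under assumptions, not the HapCompass implementation: fragments are hypothetical dictionaries mapping SNP index to a 0/1 allele, and the exhaustive search is exponential in the number of SNPs, which is one reason heuristics are used in practice.

```python
from itertools import product

# Hypothetical reads: SNP index -> observed allele (0 or 1).
FRAGMENTS = [
    {0: 0, 1: 0, 2: 1},
    {0: 1, 1: 1, 2: 0},
    {1: 0, 2: 1, 3: 1},
    {2: 0, 3: 0},
    {0: 0, 1: 1},  # conflicts with the first two reads
]

def mismatches(fragment, hap):
    """Number of positions where the fragment disagrees with the haplotype."""
    return sum(1 for pos, allele in fragment.items() if hap[pos] != allele)

def mec_score(fragments, hap):
    """Diploid MEC: each fragment is charged against its better-matching strand."""
    complement = [1 - allele for allele in hap]
    return sum(
        min(mismatches(f, hap), mismatches(f, complement)) for f in fragments
    )

def brute_force_mec(fragments, n_sites):
    """Exhaustively find a haplotype with the minimum MEC score (tiny inputs only)."""
    best = None
    for hap in product((0, 1), repeat=n_sites):
        if hap[0] == 1:
            continue  # a haplotype and its complement score identically
        score = mec_score(fragments, hap)
        if best is None or score < best[0]:
            best = (score, list(hap))
    return best

BEST_SCORE, BEST_HAPLOTYPE = brute_force_mec(FRAGMENTS, 4)
```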

13.

Intellectual disability is associated with increased runs of homozygosity in simplex autism.

Gamsiz, Ece D; Viscidi, Emma W; Frederick, Abbie M; Nagpal, Shailender; Sanders, Stephan J; Murtha, Michael T; Schmidt, Michael; Triche, Elizabeth W; Geschwind, Daniel H; State, Matthew W; Istrail, Sorin; Cook, Edwin H; Devlin, Bernie; Morrow, Eric M.

Am J Hum Genet ; 93(1): 103-9, 2013 Jul 11.

Article in English | MEDLINE | ID: mdl-23830515

ABSTRACT

Intellectual disability (ID), often attributed to autosomal-recessive mutations, occurs in 40% of autism spectrum disorders (ASDs). For this reason, we conducted a genome-wide analysis of runs of homozygosity (ROH) in simplex ASD-affected families consisting of a proband diagnosed with ASD and at least one unaffected sibling. In these families, probands with an IQ ≤ 70 show more ROH than their unaffected siblings, whereas probands with an IQ > 70 do not show this excess. Although ASD is far more common in males than in females, the proportion of females increases with decreasing IQ. Our data do support an association between ROH burden and autism diagnosis in girls; however, we are not able to show that this effect is independent of low IQ. We have also discovered several autism candidate genes on the basis of finding (1) a single gene that is within an ROH interval and that is recurrent in autism or (2) a gene that is within an autism ROH block and that harbors a homozygous, rare deleterious variant upon analysis of exome-sequencing data. In summary, our data suggest a distinct genetic architecture for participants with autism and co-occurring intellectual disability and that this architecture could involve a role for recessively inherited loci for this autism subgroup.

Subject(s)

Child Development Disorders, Pervasive/genetics , Genetic Association Studies/methods , Intellectual Disability/genetics , Child , Chromosomes, Human/genetics , Female , Genetic Diseases, Inborn/genetics , Genetic Predisposition to Disease/genetics , Genetics, Population/methods , Homozygote , Humans , Intelligence Tests , Male , Pedigree , Phenotype , Sex Factors
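
As a rough illustration of what a run of homozygosity (ROH) is, the sketch below scans a vector of SNP genotypes and reports stretches of consecutive homozygous calls. It is a deliberately simplified assumption, not the study's method: genotypes are coded as alternate-allele counts (0, 1, 2), a single heterozygous call ends a run, and the threshold `min_snps` is a hypothetical parameter. Production ROH callers typically also consider physical length, marker density, and tolerated heterozygous or missing calls.

```python
def runs_of_homozygosity(genotypes, min_snps=5):
    """Return (start, end) index pairs, end exclusive, of homozygous runs.

    genotypes: sequence of alternate-allele counts (0 or 2 = homozygous,
    1 = heterozygous). Runs shorter than min_snps are discarded.
    """
    runs = []
    start = None
    for i, g in enumerate(genotypes):
        if g in (0, 2):
            if start is None:
                start = i  # open a new run
        else:
            if start is not None and i - start >= min_snps:
                runs.append((start, i))
            start = None  # heterozygous call closes the run
    if start is not None and len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes)))
    return runs

# Hypothetical genotype vector for one individual along one chromosome.
GENOTYPES = [0, 2, 2, 0, 0, 2, 1, 0, 0, 1, 2, 2, 2, 2, 2]
ROH_SEGMENTS = runs_of_homozygosity(GENOTYPES, min_snps=5)
ROH_BURDEN = sum(end - start for start, end in ROH_SEGMENTS)
```

A per-individual burden such as `ROH_BURDEN` is the kind of quantity that can then be compared between probands and unaffected siblings.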

14.

A quantitative reference transcriptome for Nematostella vectensis early embryonic development: a pipeline for de novo assembly in emerging model systems.

Tulin, Sarah; Aguiar, Derek; Istrail, Sorin; Smith, Joel.

Evodevo ; 4: 16, 2013.

Article in English | MEDLINE | ID: mdl-23731568

ABSTRACT

BACKGROUND: The de novo assembly of transcriptomes from short shotgun sequences raises challenges due to random and non-random sequencing biases and inherent transcript complexity. We sought to define a pipeline for de novo transcriptome assembly to aid researchers working with emerging model systems where well annotated genome assemblies are not available as a reference. To detail this experimental and computational method, we used early embryos of the sea anemone, Nematostella vectensis, an emerging model system for studies of animal body plan evolution. We performed RNA-seq on embryos up to 24 h of development using Illumina HiSeq technology and evaluated independent de novo assembly methods. The resulting reads were assembled using either the Trinity assembler on all quality controlled reads or both the Velvet and Oases assemblers on reads passing a stringent digital normalization filter. A control set of mRNA standards from the National Institute of Standards and Technology (NIST) was included in our experimental pipeline to invest our transcriptome with quantitative information on absolute transcript levels and to provide additional quality control. RESULTS: We generated >200 million paired-end reads from directional cDNA libraries representing well over 20 Gb of sequence. The Trinity assembler pipeline, including preliminary quality control steps, resulted in more than 86% of reads aligning with the reference transcriptome thus generated. Nevertheless, digital normalization combined with assembly by Velvet and Oases required far less computing power and decreased processing time while still mapping 82% of reads. We have made the raw sequencing reads and assembled transcriptome publicly available. CONCLUSIONS: Nematostella vectensis was chosen for its strategic position in the tree of life for studies into the origins of the animal body plan; however, the challenge of reference-free transcriptome assembly is relevant to all systems for which well annotated gene models and independently verified genome assembly may not be available. To navigate this new territory, we have constructed a pipeline for library preparation and computational analysis for de novo transcriptome assembly. The gene models defined by this reference transcriptome define the set of genes transcribed in early Nematostella development and will provide a valuable dataset for further gene regulatory network investigations.

15.

Pathway-based genetic analysis of preterm birth.

Uzun, Alper; Dewan, Andrew T; Istrail, Sorin; Padbury, James F.

Genomics ; 101(3): 163-70, 2013 Mar.

Article in English | MEDLINE | ID: mdl-23298525

ABSTRACT

The preterm birth rate in the United States is now 12%. Multiple genes, gene networks, and variants have been associated with this disease. Using a custom database for preterm birth (dbPTB) with a refined set of genes extensively curated from literature and biological databases, we analyzed GWAS of preterm birth for complete genotype data on nearly 2000 preterm and term mothers. We used both the curated genes and a genome-wide approach to carry out a pathway-based analysis. Nineteen significant pathways withstood FDR correction for multiple testing and were identified using both the curated genes and the genome-wide approach. The analysis based on the curated genes was more significant than the genome-wide analysis in 15 of the 19 pathways. This approach demonstrates the use of a validated set of genes, in the analysis of otherwise unsuccessful GWAS data, to identify gene-gene interactions in a way that enhances statistical power and discovery.

Subject(s)

Genome-Wide Association Study , Metabolic Networks and Pathways/genetics , Premature Birth/genetics , Databases, Genetic , Epistasis, Genetic , Female , Genetic Predisposition to Disease , Humans , Infant, Newborn , Polymorphism, Single Nucleotide , Pregnancy , Premature Birth/physiopathology
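
The abstract above reports pathways that "withstood FDR correction" without naming the procedure. A common choice is the Benjamini-Hochberg step-up method, sketched below purely as an assumption about what such a correction might look like; the p-values are invented for illustration and are not results from the study.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True  # everything up to the cutoff rank is rejected
    return rejected

# Hypothetical pathway-level p-values.
PATHWAY_PVALUES = [0.001, 0.008, 0.039, 0.041, 0.6]
SIGNIFICANT = benjamini_hochberg(PATHWAY_PVALUES, alpha=0.05)
```

Restricting the analysis to a curated gene set reduces the number of tests `m`, which loosens each rank's threshold; this is one way a curated approach can gain statistical power.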

16.

Global analysis of sequence diversity within HIV-1 subtypes across geographic regions.

Huang, Austin; Hogan, Joseph W; Istrail, Sorin; Delong, Allison; Katzenstein, David A; Kantor, Rami.

Future Virol ; 7(5): 505-517, 2012 May.

Article in English | MEDLINE | ID: mdl-22822410

ABSTRACT

AIMS: HIV-1 sequence diversity can affect host immune responses and phenotypic characteristics such as antiretroviral drug resistance. Current HIV-1 sequence diversity classification uses phylogeny-based methods to identify subtypes and recombinants, which may overlook distinct subpopulations within subtypes. While local epidemic studies have characterized sequence-level clustering within subtypes using phylogeny, identification of new genotype-phenotype associations is based on mutational correlations at individual sequence positions. We perform a systematic, global analysis of position-specific pol gene sequence variation across geographic regions within HIV-1 subtypes to characterize subpopulation differences that may be missed by standard subtyping methods and sequence-level phylogenetic clustering analyses. MATERIALS & METHODS: Analysis was performed on a large, globally diverse, cross-sectional pol sequence dataset. Sequences were partitioned into subtypes and geographic subpopulations within subtypes. For each subtype, we identified positions that varied according to geography using VESPA (viral epidemiology signature pattern analysis) to identify sequence signature differences and a likelihood ratio test adjusted for multiple comparisons to characterize differences in amino acid (AA) frequencies, including minority mutations. The Synonymous Nonsynonymous Analysis Program (SNAP) was used to explore the role of evolutionary selection within subtype C. RESULTS: In 7693 protease (PR) and reverse transcriptase (RT) sequences from untreated patients in multiple geographic regions, 11 PR and 11 RT positions exhibited sequence signature differences within subtypes. Thirty-six PR and 80 RT positions exhibited within-subtype geography-dependent differences in AA distributions, including minority mutations, at both conserved and variable loci. Among subtype C samples from India and South Africa, nine PR and nine RT positions had significantly different AA distributions, including one PR and five RT positions that differed in consensus AA between regions. A selection analysis of subtype C using SNAP demonstrated that estimated rates of nonsynonymous and synonymous mutations are consistent with the possibility of positive selection across geographic subpopulations within subtypes. CONCLUSION: We characterized systematic genotypic pol differences across geographic regions within subtypes that are not captured by the subtyping nomenclature. Awareness of such differences may improve the interpretation of future studies determining the phenotypic consequences of genetic backgrounds.

17.

DELISHUS: an efficient and exact algorithm for genome-wide detection of deletion polymorphism in autism.

Aguiar, Derek; Halldórsson, Bjarni V; Morrow, Eric M; Istrail, Sorin.

Bioinformatics ; 28(12): i154-62, 2012 Jun 15.

Article in English | MEDLINE | ID: mdl-22689755

ABSTRACT

MOTIVATION: The understanding of the genetic determinants of complex disease is undergoing a paradigm shift. Genetic heterogeneity of rare mutations with deleterious effects is more commonly being viewed as a major component of disease. Autism is an excellent example where research is active in identifying matches between the phenotypic and genomic heterogeneities. A considerable portion of autism appears to be correlated with copy number variation, which is not directly probed by single nucleotide polymorphism (SNP) array or sequencing technologies. Identifying the genetic heterogeneity of small deletions remains a major unresolved computational problem partly due to the inability of algorithms to detect them. RESULTS: In this article, we present an algorithmic framework, which we term DELISHUS, that implements three exact algorithms for inferring regions of hemizygosity containing genomic deletions of all sizes and frequencies in SNP genotype data. We implement an efficient backtracking algorithm (one that processes a 1 billion entry genome-wide association study SNP matrix in a few minutes) to compute all inherited deletions in a dataset. We further extend our model to give an efficient algorithm for detecting de novo deletions. Finally, given a set of called deletions, we also give a polynomial time algorithm for computing the critical regions of recurrent deletions. DELISHUS achieves significantly lower false-positive rates and higher power than previously published algorithms partly because it considers all individuals in the sample simultaneously. DELISHUS may be applied to SNP array or sequencing data to identify the deletion spectrum for family-based association studies. AVAILABILITY: DELISHUS is available at http://www.brown.edu/Research/Istrail_Lab/.

Subject(s)

Algorithms , Autistic Disorder/genetics , Genome-Wide Association Study , Polymorphism, Single Nucleotide , Computational Biology/methods , DNA Copy Number Variations , Genotype , Humans , Inheritance Patterns , Phenotype , Sequence Deletion

18.

HapCompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data.

Aguiar, Derek; Istrail, Sorin.

J Comput Biol ; 19(6): 577-90, 2012 Jun.

Article in English | MEDLINE | ID: mdl-22697235

ABSTRACT

Genome assembly methods produce haplotype phase ambiguous assemblies due to limitations in current sequencing technologies. Determining the haplotype phase of an individual is computationally challenging and experimentally expensive. However, haplotype phase information is crucial in many bioinformatics workflows such as genetic association studies and genomic imputation. Current computational methods of determining haplotype phase from sequence data--known as haplotype assembly--have difficulties producing accurate results for large (1000 genomes-type) data or operate on restricted optimizations that are unrealistic considering modern high-throughput sequencing technologies. We present a novel algorithm, HapCompass, for haplotype assembly of densely sequenced human genome data. The HapCompass algorithm operates on a graph where single nucleotide polymorphisms (SNPs) are nodes and edges are defined by sequence reads and viewed as supporting evidence of co-occurring SNP alleles in a haplotype. In our graph model, haplotype phasings correspond to spanning trees. We define the minimum weighted edge removal optimization on this graph and develop an algorithm based on cycle basis local optimizations for resolving conflicting evidence. We then estimate the amount of sequencing required to produce a complete haplotype assembly of a chromosome. Using these estimates together with metrics borrowed from genome assembly and haplotype phasing, we compare the accuracy of HapCompass, the Genome Analysis ToolKit, and HapCut for 1000 Genomes Project and simulated data. We show that HapCompass performs significantly better for a variety of data and metrics. HapCompass is freely available for download (www.brown.edu/Research/Istrail_Lab/).

Subject(s)

Algorithms , Chromosome Mapping/methods , Computational Biology/methods , Genome, Human , Sequence Analysis, DNA/methods , Alleles , Haplotypes , High-Throughput Nucleotide Sequencing , Humans , Polymorphism, Single Nucleotide
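
The graph model described in the HapCompass abstract above (SNPs as nodes, read-derived edges, phasings as spanning trees) can be illustrated with a small sketch. This is not the HapCompass implementation: it omits the cycle basis optimization and the minimum weighted edge removal step, and simply resolves conflicts by keeping the strongest edges in a spanning tree. The input format (each read as a dict mapping SNP index to a 0/1 allele) and the function names `build_snp_graph` and `phase_by_spanning_tree` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def build_snp_graph(reads):
    """Return edge weights between SNP pairs covered by the same read.

    Assumed input: each read is a dict {snp_index: allele}, allele in {0, 1}.
    A positive weight means the reads mostly support equal alleles on the
    same haplotype; a negative weight means they mostly support opposite ones.
    """
    weights = defaultdict(int)
    for read in reads:
        for (i, a), (j, b) in combinations(sorted(read.items()), 2):
            weights[(i, j)] += 1 if a == b else -1
    return weights


def phase_by_spanning_tree(reads):
    """Phase SNPs along a spanning forest built from the strongest edges.

    Simplification (hypothetical, not the published method): edges are added
    greedily by absolute weight (Kruskal with union-find), and edges that
    would close a cycle are ignored rather than optimized over.
    """
    weights = build_snp_graph(reads)
    snps = sorted({s for read in reads for s in read})
    parent = {s: s for s in snps}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = defaultdict(list)
    for (i, j), w in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
        if w == 0:
            continue  # evidence cancels out; edge carries no phase signal
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree[i].append((j, w))
            tree[j].append((i, w))

    # Propagate phase from an arbitrary root (allele 0) in each component.
    phase = {}
    for root in snps:
        if root in phase:
            continue
        phase[root] = 0
        stack = [root]
        while stack:
            u = stack.pop()
            for v, w in tree[u]:
                if v not in phase:
                    phase[v] = phase[u] if w > 0 else 1 - phase[u]
                    stack.append(v)
    return phase


reads = [{0: 0, 1: 0}, {1: 0, 2: 1}, {0: 1, 1: 1}, {1: 1, 2: 0}, {0: 0, 1: 1}]
phase = phase_by_spanning_tree(reads)
```

In this toy input the last read conflicts with the first and third on the SNP pair (0, 1); the majority evidence wins, and the returned dict gives one haplotype, with the other being its complement.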

19.

dbPTB: a database for preterm birth.

Uzun, Alper; Laliberte, Alyse; Parker, Jeremy; Andrew, Caroline; Winterrowd, Emily; Sharma, Surendra; Istrail, Sorin; Padbury, James F.

Database (Oxford) ; 2012: bar069, 2012.

Article in English | MEDLINE | ID: mdl-22323062

ABSTRACT

Genome-wide association studies (GWAS) query the entire genome in a hypothesis-free, unbiased manner. Since they have the potential for identifying novel genetic variants, they have become a very popular approach to the investigation of complex diseases. Nonetheless, since the success of the GWAS approach varies widely, the identification of genetic variants for complex diseases remains a difficult problem. We developed a novel bioinformatics approach to identify the nominal genetic variants associated with complex diseases. To test the feasibility of our approach, we developed a web-based aggregation tool to organize the genes, genetic variations and pathways involved in preterm birth. We used semantic data mining to extract all published articles related to preterm birth. All articles were reviewed by a team of curators. Genes identified from public databases and archives of expression arrays were aggregated with genes curated from the literature. Pathway analysis was used to impute genes from pathways identified in the curations. The curated articles and collected genetic information form a unique resource for investigators interested in preterm birth. The Database for Preterm Birth exemplifies an approach that is generalizable to other disorders for which there is evidence of significant genetic contributions.

Subject(s)

Databases, Genetic , Premature Birth/genetics , Adult , Chromosomes, Human/genetics , Female , Genes , Humans , Infant, Newborn , Oligonucleotide Array Sequence Analysis , Periodicals as Topic , Signal Transduction/genetics , Workflow

20.

The Clark phaseable sample size problem: long-range phasing and loss of heterozygosity in GWAS.

Halldórsson, Bjarni V; Aguiar, Derek; Tarpine, Ryan; Istrail, Sorin.

J Comput Biol ; 18(3): 323-33, 2011 Mar.

Article in English | MEDLINE | ID: mdl-21385037

ABSTRACT

A phase transition is taking place today. The amount of data generated by genome resequencing technologies is so large that in some cases it is now less expensive to repeat the experiment than to store the information generated by the experiment. In the next few years, it is quite possible that millions of Americans will have been genotyped. The question then arises of how to make the best use of this information and jointly estimate the haplotypes of all these individuals. The premise of this article is that long shared genomic regions (or tracts) are unlikely unless the haplotypes are identical by descent. These tracts can be used as input for a Clark-like phasing method to obtain a phasing solution of the sample. We show on simulated data that the algorithm will get an almost perfect solution if the number of individuals being genotyped is large enough and the correctness of the algorithm grows with the number of individuals being genotyped. We also study a related problem that connects copy number variation with phasing algorithm success. A loss of heterozygosity (LOH) event is when, by the laws of Mendelian inheritance, an individual should be heterozygote but, due to a deletion polymorphism, is not. Such polymorphisms are difficult to detect using existing algorithms, but play an important role in the genetics of disease and will confuse haplotype phasing algorithms if not accounted for. We will present an algorithm for detecting LOH regions across the genomes of thousands of individuals. The design of the long-range phasing algorithm and the loss of heterozygosity inference algorithms was inspired by our analysis of the Multiple Sclerosis (MS) GWAS dataset of the International Multiple Sclerosis Genetics Consortium. We present similar results to those obtained from the MS data.

Subject(s)

Genome-Wide Association Study/methods , Genomics/methods , Loss of Heterozygosity , Algorithms , Computer Simulation , Genotype , Haplotypes , Humans , Models, Genetic , Sample Size
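
The definition of a loss of heterozygosity (LOH) event given in the abstract above can be made concrete with a short sketch. It assumes parent-child trio genotypes are available, which the abstract does not state, and it is not the authors' inference algorithm. A site is flagged when the child is homozygous, the genotype violates Mendelian inheritance, and the violation would be explained by one parent transmitting a deletion. Genotypes are assumed to be two-character strings such as "AA" or "AB"; all function names are hypothetical.

```python
def mendel_consistent(father, mother, child):
    """True if the child could inherit one allele from each parent."""
    a, b = child
    return (a in father and b in mother) or (b in father and a in mother)


def loh_candidate(father, mother, child):
    """Flag a site where a hemizygous deletion could explain the genotypes.

    The child looks homozygous, Mendelian inheritance is violated, yet the
    observed allele is carried by at least one parent (so the other parent
    may have transmitted a deleted copy).
    """
    a, b = child
    if a != b:
        return False
    if mendel_consistent(father, mother, child):
        return False
    return a in father or a in mother


def loh_runs(trios, min_len=2):
    """Return (start, end) index pairs of consecutive flagged sites.

    `min_len` is an illustrative threshold, not a value from the article.
    """
    flags = [loh_candidate(*t) for t in trios]
    runs, start = [], None
    for i, flagged in enumerate(flags + [False]):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_len:
                runs.append((start, i - 1))
            start = None
    return runs


# Each tuple is (father, mother, child) at one SNP, in genomic order.
trios = [("AA", "BB", "AA"), ("AB", "BB", "AA"), ("AA", "AA", "AA"), ("AA", "BB", "BB")]
runs = loh_runs(trios)
```

A child genotype carrying an allele that neither parent has is treated here as a probable genotyping error rather than an LOH candidate.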
