Results 1 - 7 of 7
1.
J Bioinform Comput Biol ; 14(4): 1650017, 2016 08.
Article in English | MEDLINE | ID: mdl-27216711

ABSTRACT

To reduce the cost of large-scale re-sequencing, multiple individuals can be pooled together and sequenced, an approach called pooled sequencing. Pooled sequencing could provide a cost-effective alternative to sequencing individuals separately. To facilitate the application of pooled sequencing in haplotype-based disease association analysis, the critical procedure is to accurately estimate haplotype frequencies from pooled samples. Here we present Ehapp2 for estimating haplotype frequencies from pooled sequencing data by utilizing a database which provides prior information on known haplotypes. We first translate the problem of estimating the frequency of each haplotype into finding a sparse solution for a system of linear equations, where the NNREG algorithm is employed to obtain the solution. Simulation experiments reveal that Ehapp2 is robust to sequencing errors and able to estimate the frequencies of haplotypes with less than 3% average relative difference for pooled sequencing of a mixture of real Drosophila haplotypes with 50× total coverage, even when the sequencing error rate is as high as 0.05. Because proportions for local haplotypes spanning multiple SNPs are accurately calculated first, Ehapp2 retains excellent estimation for recombinant haplotypes resulting from chromosomal crossover. Comparisons with existing methods reveal that Ehapp2 is state-of-the-art for many sequencing study designs and more suitable for current massively parallel sequencing.
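
The linear-system formulation in this abstract can be illustrated with a small sketch. It assumes that each known haplotype is a 0/1 vector over the SNPs and that the observed pooled allele frequency at each SNP is the frequency-weighted sum of the haplotype alleles. The solver is plain projected gradient descent for non-negative least squares, used here as a stand-in; it is not the NNREG algorithm named in the abstract, and the function name and example data are hypothetical.

```python
def estimate_frequencies(haplotypes, allele_freqs, iterations=5000):
    """Estimate non-negative haplotype frequencies f with A f ~= b.

    haplotypes   : list of 0/1 lists, one per known haplotype (assumed input)
    allele_freqs : observed pooled frequency of allele 1 at each SNP
    """
    n_haps = len(haplotypes)
    n_snps = len(allele_freqs)
    # One equation per SNP, plus a row of ones asking the frequencies to sum to 1.
    rows = [[float(haplotypes[j][i]) for j in range(n_haps)] for i in range(n_snps)]
    rows.append([1.0] * n_haps)
    target = list(allele_freqs) + [1.0]
    # Safe step size: 1 / (squared Frobenius norm), an upper bound on the curvature.
    step = 1.0 / sum(a * a for row in rows for a in row)
    f = [1.0 / n_haps] * n_haps
    for _ in range(iterations):
        residual = [sum(a * x for a, x in zip(row, f)) - t
                    for row, t in zip(rows, target)]
        grad = [sum(rows[i][j] * residual[i] for i in range(len(rows)))
                for j in range(n_haps)]
        # Gradient step followed by projection onto f >= 0.
        f = [max(0.0, x - step * g) for x, g in zip(f, grad)]
    total = sum(f)
    return [x / total for x in f]


# Hypothetical example: four known haplotypes over five SNPs, one of them absent.
known = [[0, 0, 1, 1, 0],
         [1, 0, 0, 1, 1],
         [1, 1, 0, 0, 0],
         [0, 1, 1, 0, 1]]
observed = [0.5, 0.2, 0.5, 0.8, 0.3]  # generated from frequencies 0.5, 0.3, 0.2, 0.0
estimated = estimate_frequencies(known, observed)
```

With real reads, the observed allele frequencies would be noisy estimates from read counts, so the fit would be approximate rather than exact.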


Subject(s)
Algorithms , Gene Frequency , Haplotypes , Animals , Databases, Genetic , Drosophila/genetics , High-Throughput Nucleotide Sequencing , Polymorphism, Single Nucleotide
2.
Bioinformatics ; 31(4): 515-22, 2015 Feb 15.
Article in English | MEDLINE | ID: mdl-25304780

ABSTRACT

MOTIVATION: A variety of hypotheses have been proposed for finding the missing heritability of complex diseases in genome-wide association studies. Studies have focused on the value of haplotypes to improve the power of detecting associations with disease. To facilitate haplotype-based association analysis, it is necessary to accurately estimate haplotype frequencies of pooled samples. RESULTS: Taking advantage of databases that contain prior haplotypes, we present Ehapp, which estimates the frequencies of haplotypes from pooled sequencing data using an algorithm for solving a system of linear equations. Effects of various sequencing factors on performance are evaluated using simulated data. Our method could estimate the frequencies of haplotypes with only about 3% average relative difference for pooled sequencing of a mixture of 10 haplotypes with total coverage of 50×. When unknown haplotypes exist, our method maintains excellent performance for haplotypes with actual frequencies >0.05. Comparisons with an existing method on simulated data in conjunction with publicly available Illumina sequencing data indicate that our method is state of the art for many sequencing study designs. We also demonstrate the feasibility of applying overlapping pool sequencing to identify rare haplotype carriers cost-effectively. AVAILABILITY AND IMPLEMENTATION: Ehapp (in Perl) for the Linux platforms is available online (http://bioinfo.seu.edu.cn/Ehapp/). CONTACT: xsun@seu.edu.cn SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Computational Biology/methods , Genome, Human , Haplotypes/genetics , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/methods , Software , Databases, Factual , Genome-Wide Association Study , Humans
3.
BMC Bioinformatics ; 15: 195, 2014 Jun 17.
Article in English | MEDLINE | ID: mdl-24934981

ABSTRACT

BACKGROUND: Genome-wide association studies have revealed that rare variants are responsible for a large portion of the heritability of some complex human diseases. This highlights the increasing importance of detecting and screening for rare variants. Although massively parallel sequencing technologies have greatly reduced the cost of DNA sequencing, the identification of rare variant carriers by large-scale re-sequencing remains prohibitively expensive because of the huge challenge of constructing libraries for thousands of samples. Recently, several studies have reported that techniques from group testing theory and compressed sensing could help identify rare variant carriers in large-scale samples with few pooled sequencing experiments and a dramatically reduced cost. RESULTS: Based on quantitative group testing, we propose an overlapping pool sequencing strategy that allows the efficient recovery of variant carriers among numerous individuals at much lower cost than conventional methods. We used random k-set pool designs to mix samples, and optimized the design parameters according to an indicative probability. Based on a mathematical model of sequencing depth distribution, an optimal threshold was selected to declare a pool positive or negative. Then, using the quantitative information contained in the sequencing results, we designed a heuristic Bayesian probability decoding algorithm to identify variant carriers. Finally, we conducted in silico experiments to find variant carriers among 200 simulated Escherichia coli strains. With the simulated pools and publicly available Illumina sequencing data, our method correctly identified the variant carriers for 91.5-97.9% of variants, with the variant frequency ranging from 0.5 to 1.5%. CONCLUSIONS: Using the number of reads, variant carriers could be identified precisely even though samples were randomly selected and pooled. Our method performed better than the published DNA Sudoku design and compressed sequencing, especially in reducing the required data throughput and cost.
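
The pooling idea in this abstract can be sketched as follows. The sketch assumes a random k-set design in which every sample is placed in k distinct pools chosen at random, and noiseless positive/negative pool outcomes. The decoder is simple elimination (a sample remains a candidate only if all of its pools are positive), a deliberately simpler stand-in for the heuristic Bayesian decoder described in the abstract. All names and parameter values are hypothetical.

```python
import random


def random_k_set_design(n_samples, n_pools, k, seed=0):
    """Assign each sample to k distinct pools chosen uniformly at random."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_pools), k)) for _ in range(n_samples)]


def pool_outcomes(design, carriers, n_pools):
    """A pool is positive if it contains at least one carrier (noiseless assumption)."""
    positive = [False] * n_pools
    for sample in carriers:
        for pool in design[sample]:
            positive[pool] = True
    return positive


def decode_by_elimination(design, positive):
    """Keep only samples whose pools are all positive; may include false positives."""
    return [s for s, pools in enumerate(design) if all(positive[p] for p in pools)]


# Hypothetical setting: 200 samples, 30 pools, each sample in 4 pools, 3 carriers.
design = random_k_set_design(n_samples=200, n_pools=30, k=4, seed=1)
carriers = [3, 57, 120]
positive = pool_outcomes(design, carriers, n_pools=30)
candidates = decode_by_elimination(design, positive)
```

In the noiseless case every true carrier is always among the candidates; using read counts per pool, as the abstract describes, is what allows the remaining false positives to be ruled out.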


Subject(s)
Genetic Variation , Genomics/methods , High-Throughput Nucleotide Sequencing/methods , Sequence Analysis, DNA/methods , Algorithms , Base Sequence , Bayes Theorem , Computer Simulation , Escherichia coli/genetics , Genome-Wide Association Study , Humans
4.
J Theor Biol ; 353: 9-18, 2014 Jul 21.
Article in English | MEDLINE | ID: mdl-24631045

ABSTRACT

Because a vast majority (99%) of microbes in a given community is likely to be non-cultivable, metagenomics has gradually entered the mainstream of microbial research methods. With the development of high-throughput sequencing techniques, an increasing number of sequencing read data sets of metagenomes from various microbial communities have become available. For these data sets, metagenomic analysis based on mapping reads to microbial genomes has been hampered by the limited number of microbial genomes that are available. Further, this type of analysis is computationally intensive. Thus alignment-free methods, which characterize the sequencing reads with a genomic signature instead of with genomic alignments, can be applied. However, the main requirement of these alignment-free methods is a stable genomic signature that performs reliably. Here, we propose a novel genomic signature of microbial genomes called the intrinsic correlation of oligonucleotides (ICOs). This signature represents the quantification of an intrinsic relationship between any two oligonucleotides. We analyzed microbial genomes at different taxonomic levels using ICO profiles and confirmed the wide availability of useful ICOs. We used intra-genomic and inter-genomic distances and relational grades to evaluate the performance of ICOs as a genomic signature. The results of these experiments showed that ICOs can characterize microbial genomes well, and ICOs were better at distinguishing species than tetranucleotide composition, not only in terms of whole genomes but also in terms of sequence fragments. In addition, we evaluated the performance of a hybrid feature that combined ICOs and tetranucleotide composition. The experimental results showed that the hybrid feature performed better than ICOs or tetranucleotide composition alone. ICOs can characterize microbial genomes successfully and are capable of distinguishing organisms at different taxonomic levels. 
ICOs perform better than tetranucleotide composition in characterizing microbial genomes. The hybrid feature that used a combination of the two kinds of sequence features had advantages over a single sequence feature.
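
The tetranucleotide composition used as the baseline in this abstract can be sketched as below. The ICO signature itself is not defined in the abstract, so it is not reproduced here. The sketch assumes a plain overlapping 4-mer frequency profile over A, C, G and T, without merging reverse complements, and compares two sequences by Euclidean distance; the function names are hypothetical.

```python
from itertools import product
from math import sqrt


def tetranucleotide_profile(sequence):
    """Relative frequency of each of the 256 overlapping 4-mers in the sequence."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=4)}
    sequence = sequence.upper()
    total = 0
    for i in range(len(sequence) - 3):
        word = sequence[i:i + 4]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
            total += 1
    if total == 0:
        return {word: 0.0 for word in counts}
    return {word: n / total for word, n in counts.items()}


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two tetranucleotide profiles."""
    return sqrt(sum((profile_a[w] - profile_b[w]) ** 2 for w in profile_a))


# "ACGTACGT" has five windows: ACGT, CGTA, GTAC, TACG, ACGT.
example = tetranucleotide_profile("ACGTACGT")
same = profile_distance(example, tetranucleotide_profile("acgtacgt"))
```

A small distance between the profiles of two fragments is then taken as evidence that they come from the same or closely related genomes.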


Subject(s)
Bacteria/genetics , Metagenome/genetics , Metagenomics/methods , Oligonucleotides/genetics , Base Pairing/genetics
5.
Genet Epidemiol ; 37(8): 820-30, 2013 Dec.
Article in English | MEDLINE | ID: mdl-24166758

ABSTRACT

Genome-wide association studies have identified hundreds of genetic variants associated with complex diseases, although most variants identified so far explain only a small proportion of heritability, suggesting that rare variants are responsible for the missing heritability. Identification of rare variants through large-scale resequencing is becoming increasingly important but is still prohibitively expensive despite the rapid decline in sequencing costs. Nevertheless, group-testing-based overlapping pool sequencing, in which pooled rather than individual samples are sequenced, will greatly reduce the effort of sample preparation as well as the cost of screening for rare variants. Here, we propose an overlapping pool sequencing strategy to screen for rare variants with optimal sequencing depth and a corresponding cost model. We formulated a model to compute the optimal depth for sufficient observations of variants in pooled sequencing. Using the shifted transversal design algorithm, appropriate parameters for overlapping pool sequencing could be selected to minimize cost and guarantee accuracy. Because of the mixing constraint and the high depth required for pooled sequencing, the results showed that it was more cost-effective to divide a large population into smaller blocks, each tested independently using optimized strategies. Finally, we conducted an experiment to screen for carriers of variants with a frequency of 1%. With simulated pools and publicly available human exome sequencing data, the experiment achieved 99.93% accuracy. Using overlapping pool sequencing, the cost of screening for carriers of variants with a frequency of 1% in 200 diploid individuals dropped to at least 66% when the target sequencing region was set to 30 Mb.
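
The depth calculation mentioned in this abstract can be illustrated with a simple binomial sketch. This is an assumption for illustration, not the model from the paper: it supposes a pool of diploid individuals containing one heterozygous carrier, error-free reads, and a variant read count that follows a binomial distribution. It then searches for the smallest depth at which at least a chosen number of variant reads is seen with a chosen probability. Function names and parameter values are hypothetical.

```python
from math import comb


def prob_at_least(depth, fraction, min_reads):
    """P(X >= min_reads) for X ~ Binomial(depth, fraction)."""
    below = sum(comb(depth, x) * fraction ** x * (1 - fraction) ** (depth - x)
                for x in range(min_reads))
    return 1.0 - below


def minimum_depth(pool_size, min_reads, confidence):
    """Smallest depth giving >= min_reads variant reads with the given probability.

    Assumes one heterozygous carrier among pool_size diploid individuals,
    so the variant allele makes up 1 / (2 * pool_size) of the pooled DNA.
    """
    fraction = 1.0 / (2 * pool_size)
    depth = min_reads
    while prob_at_least(depth, fraction, min_reads) < confidence:
        depth += 1
    return depth


# Hypothetical example: pools of 10 individuals, at least 3 variant reads wanted.
depth_90 = minimum_depth(pool_size=10, min_reads=3, confidence=0.90)
depth_99 = minimum_depth(pool_size=10, min_reads=3, confidence=0.99)
```

The required depth grows with pool size, which is consistent with the abstract's observation that splitting a large population into smaller blocks can be more cost-effective.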


Subject(s)
Genetic Variation/genetics , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/methods , Algorithms , Exome/genetics , Humans , Models, Genetic , Research Design , Sequence Analysis, DNA/standards
6.
Cell Res ; 23(9): 1091-105, 2013 Sep.
Article in English | MEDLINE | ID: mdl-23917531

ABSTRACT

Crocodilians are diving reptiles that can hold their breath under water for long periods of time and are crepuscular animals with excellent sensory abilities. They comprise a sister lineage of birds and have no sex chromosomes. Here we report the genome sequence of the endangered Chinese alligator (Alligator sinensis) and describe its unique features. Next-generation sequencing generated 314 Gb of raw sequence, yielding a genome size of 2.3 Gb. A total of 22 200 genes were predicted in Alligator sinensis using a de novo, homology- and RNA-based combined model. The genetic basis of long-diving behavior includes duplication of the bicarbonate-binding hemoglobin gene, co-functioning of routine phosphate-binding and special bicarbonate-binding oxygen transport, and positively selected energy metabolism, ammonium bicarbonate excretion and cardiac muscle contraction. Further, we elucidated the robust Alligator sinensis sensory system, including a significantly expanded olfactory receptor repertoire, rapidly evolving nerve-related cellular components and visual perception, and positive selection of the night vision-related opsin and sound detection-associated otopetrin. We also discovered a well-developed immune system with a considerable number of lineage-specific antigen-presentation genes for adaptive immunity as well as expansion of the tripartite motif-containing C-type lectin and butyrophilin genes for innate immunity and expression of antibacterial peptides. Multifluorescence in situ hybridization showed that alligator chromosome 3, which encodes DMRT1, exhibits significant synteny with chicken chromosome Z. Finally, population history analysis indicated population admixture 0.60-1.05 million years ago, when the Qinghai-Tibetan Plateau was uplifted.


Subject(s)
Alligators and Crocodiles/genetics , Genome/genetics , Alligators and Crocodiles/classification , Alligators and Crocodiles/metabolism , Animals , Base Composition/genetics , Base Sequence , Bicarbonates/metabolism , Biological Transport/genetics , DNA Transposable Elements/genetics , Energy Metabolism/genetics , Hemoglobins/genetics , Immune System , Muscle Contraction/genetics , Night Vision/genetics , Olfactory Pathways/cytology , Opsins/genetics , Oxygen/metabolism , Sequence Analysis, DNA , Sex Determination Processes/genetics , Smell/genetics , Transcription Factors/genetics , Visual Perception/genetics
7.
Nat Commun ; 4: 1426, 2013.
Article in English | MEDLINE | ID: mdl-23385571

ABSTRACT

Chinese tree shrews (Tupaia belangeri chinensis) possess many features valuable in animals used as experimental models in biomedical research. Currently, there are numerous attempts to employ tree shrews as models for a variety of human disorders: depression, myopia, hepatitis B and C virus infections, and hepatocellular carcinoma, to name a few. Here we present a publicly available annotated genome sequence for the Chinese tree shrew. Phylogenomic analysis of the tree shrew and other mammals strongly supports its close affinity to primates. By characterizing key factors and signalling pathways in the nervous and immune systems, we demonstrate that tree shrews possess both shared and unique features, and provide a genetic basis for the use of this animal as a potential model for biomedical research.


Subject(s)
Genome/genetics , Tupaia/genetics , Animals , China , Genetic Variation , Hepacivirus/physiology , Hepatitis C/genetics , Hepatitis C/virology , Humans , Immune System/metabolism , Inactivation, Metabolic/genetics , Mice , Nervous System/metabolism , Phylogeny , Sequence Analysis, DNA , Tupaia/immunology