Results 1 - 20 of 22
1.
PLoS Comput Biol ; 17(6): e1009078, 2021 06.
Article in English | MEDLINE | ID: mdl-34153026

ABSTRACT

It is computationally challenging to detect variation by aligning single-molecule sequencing (SMS) reads, or contigs from SMS assemblies. One approach to efficiently align SMS reads is sparse dynamic programming (SDP), where optimal chains of exact matches are found between the sequence and the genome. While straightforward implementations of SDP penalize gaps with a cost that is a linear function of gap length, biological variation is more accurately represented when gap cost is a concave function of gap length. We have developed a method, lra, that uses SDP with a concave-cost gap penalty, and used lra to align long-read sequences from PacBio and Oxford Nanopore (ONT) instruments as well as de novo assembly contigs. This alignment approach increases sensitivity and specificity for structural variant (SV) discovery, particularly for variants above 1 kb and when discovering variation from ONT reads, with runtimes that are comparable (1.05-3.76×) to those of current methods. When applied to calling variation from de novo assembly contigs, there is a 3.2% increase in Truvari F1 score compared to minimap2+htsbox. lra is available in Bioconda (https://anaconda.org/bioconda/lra) and on GitHub (https://github.com/ChaissonLab/LRA).
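
The contrast between linear and concave gap costs described in this abstract can be illustrated with a minimal sketch. The logarithmic form and the parameter names `per_base` and `scale` are assumptions chosen for illustration; the abstract does not state which concave function lra uses.

```python
import math


def linear_gap_cost(length, per_base=1.0):
    """Linear penalty: every additional gap base costs the same amount."""
    return per_base * length


def concave_gap_cost(length, scale=1.0):
    """Concave penalty (illustrative choice: a logarithm).

    Each additional gap base costs less than the previous one, so one
    long gap is cheaper than several short gaps of the same total length.
    """
    return scale * math.log(1 + length)


if __name__ == "__main__":
    for gap in (10, 100, 1000, 10000):
        print(gap, linear_gap_cost(gap), round(concave_gap_cost(gap), 3))
```

Under a concave cost, a single 2,000-base gap is penalized less than two separate 1,000-base gaps, which is the property that favors chaining across one large insertion or deletion.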


Subject(s)
Contig Mapping/statistics & numerical data , Sequence Alignment/statistics & numerical data , Software , Cluster Analysis , Computational Biology , Computer Simulation , Databases, Nucleic Acid/statistics & numerical data , Genetic Variation , Genome, Human , High-Throughput Nucleotide Sequencing , Humans , Programming, Linear , Sequence Analysis, DNA
2.
PLoS One ; 14(9): e0216885, 2019.
Article in English | MEDLINE | ID: mdl-31498807

ABSTRACT

Unknown sequences, or gaps, are present in many published genomes across public databases. Gap filling is an important finishing step in de novo genome assembly, especially in large genomes. The gap filling problem is nontrivial and while there are many computational tools partially solving the problem, several have shortcomings as to the reliability and correctness of the output, i.e. the gap filled draft genome. SSPACE-LongRead is a scaffolding tool that utilizes long reads from multiple third-generation sequencing platforms in finding links between contigs and combining them. The long reads potentially contain sequence information to fill the gaps created in the scaffolding, but SSPACE-LongRead currently lacks this functionality. We present an automated pipeline called gapFinisher to process SSPACE-LongRead output to fill gaps after the scaffolding. gapFinisher is based on the controlled use of a previously published gap filling tool FGAP and works on all standard Linux/UNIX command lines. We compare the performance of gapFinisher against two other published gap filling tools PBJelly and GMcloser. We conclude that gapFinisher can fill gaps in draft genomes quickly and reliably. In addition, the serial design of gapFinisher makes it scale well from prokaryote genomes to larger genomes with no increase in the computational footprint.


Subject(s)
Algorithms , Contig Mapping/statistics & numerical data , Genome , Genomics/methods , Sequence Analysis, DNA/statistics & numerical data , Software , Animals , Bacteria/genetics , Benchmarking , Databases, Genetic , Genomics/statistics & numerical data , High-Throughput Nucleotide Sequencing , Seals, Earless/genetics
3.
PLoS One ; 13(1): e0190938, 2018.
Article in English | MEDLINE | ID: mdl-29351302

ABSTRACT

When human samples are sequenced, many assembled contigs are "unknown", as conventional alignments find no similarity to known sequences. Hidden Markov models (HMM) exploit the positions of specific nucleotides in protein-encoding codons in various microbes. The algorithm HMMER3 implements HMM using a reference set of sequences encoding viral proteins, "vFam". We used HMMER3 analysis of "unknown" human sample-derived sequences and identified 510 contigs distantly related to viruses (Anelloviridae (n = 1), Baculoviridae (n = 34), Circoviridae (n = 35), Caulimoviridae (n = 3), Closteroviridae (n = 5), Geminiviridae (n = 21), Herpesviridae (n = 10), Iridoviridae (n = 12), Marseillevirus (n = 26), Mimiviridae (n = 80), Phycodnaviridae (n = 165), Poxviridae (n = 23), Retroviridae (n = 6) and 89 contigs related to described viruses not yet assigned to any taxonomic family). In summary, we find that analysis using the HMMER3 algorithm and the "vFam" database greatly extended the detection of viruses in biospecimens from humans.


Subject(s)
Microbiota , Viruses/genetics , Viruses/isolation & purification , Algorithms , Computational Biology , Contig Mapping/statistics & numerical data , Databases, Protein/statistics & numerical data , Humans , Markov Chains , Metagenomics/statistics & numerical data , Phylogeny , Viral Proteins/genetics , Viruses/classification
4.
Proc Natl Acad Sci U S A ; 114(47): 12512-12517, 2017 11 21.
Article in English | MEDLINE | ID: mdl-29078313

ABSTRACT

Accurate detection of variants and long-range haplotypes in genomes of single human cells remains very challenging. Common approaches require extensive in vitro amplification of genomes of individual cells using DNA polymerases and high-throughput short-read DNA sequencing. These approaches have two notable drawbacks. First, polymerase replication errors could generate tens of thousands of false-positive calls per genome. Second, relatively short sequence reads contain little to no haplotype information. Here we report a method, which is dubbed SISSOR (single-stranded sequencing using microfluidic reactors), for accurate single-cell genome sequencing and haplotyping. A microfluidic processor is used to separate the Watson and Crick strands of the double-stranded chromosomal DNA in a single cell and to randomly partition megabase-size DNA strands into multiple nanoliter compartments for amplification and construction of barcoded libraries for sequencing. The separation and partitioning of large single-stranded DNA fragments of the homologous chromosome pairs allows for the independent sequencing of each of the complementary and homologous strands. This enables the assembly of long haplotypes and reduction of sequence errors by using the redundant sequence information and haplotype-based error removal. We demonstrated the ability to sequence single-cell genomes with error rates as low as 10⁻⁸ and average 500-kb-long DNA fragments that can be assembled into haplotype contigs with N50 greater than 7 Mb. The performance could be further improved with more uniform amplification and more accurate sequence alignment. The ability to obtain accurate genome sequences and haplotype information from single cells will enable applications of genome sequencing for diverse clinical needs.
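
The N50 statistic quoted in this abstract has a standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. The sketch below computes it; the function name `n50` and the example lengths are illustrative and not taken from the article.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or haplotype block) lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downward until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```

For the lengths 2, 3, 4, 5 and 6 (total 20), the two longest contigs sum to 11, which first reaches half of the total, so the N50 is 5.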


Subject(s)
Contig Mapping/methods , Genome, Human , Haplotypes , Microfluidic Analytical Techniques/methods , Single-Cell Analysis/methods , Whole Genome Sequencing/methods , Alleles , Cell Line , Contig Mapping/statistics & numerical data , Fibroblasts/cytology , Fibroblasts/metabolism , HLA Antigens/genetics , HLA Antigens/metabolism , Humans , Microfluidic Analytical Techniques/instrumentation , Mutation , Polymorphism, Single Nucleotide , Single-Cell Analysis/instrumentation , Whole Genome Sequencing/instrumentation
5.
J Comput Biol ; 22(5): 367-76, 2015 May.
Article in English | MEDLINE | ID: mdl-25535824

ABSTRACT

Metatranscriptomic analysis provides information on how a microbial community reacts to environmental changes. Using next-generation sequencing (NGS) technology, biologists can study the microbe community by sampling short reads from a mixture of mRNAs (metatranscriptomic data). As most microbial genome sequences are unknown, it would seem that de novo assembly of the mRNAs is needed. However, NGS reads are short and mRNAs share many similar regions and differ tremendously in abundance levels, making de novo assembly challenging. The existing assembler, IDBA-MT, is designed specifically for the assembly of metatranscriptomic data and performs well only on highly expressed mRNAs. This article introduces IDBA-MTP, which adopts a novel approach to metatranscriptomic assembly that makes use of the fact that there is a database of millions of known protein sequences associated with mRNAs. How to effectively use the protein information is nontrivial given the size of the database and given that different mRNAs might lead to proteins with similar functions (because different amino acids might have similar characteristics). IDBA-MTP employs a similarity measure between mRNAs and protein sequences, dynamic programming techniques, and seed-and-extend heuristics to tackle the problem effectively and efficiently. Experimental results show that IDBA-MTP outperforms existing assemblers by reconstructing 14% more mRNAs.


Subject(s)
Bacterial Proteins/chemistry , Contig Mapping/statistics & numerical data , Microbial Consortia/genetics , RNA, Messenger/chemistry , Software , Transcriptome , Algorithms , Bacterial Proteins/genetics , Contig Mapping/methods , Data Mining , High-Throughput Nucleotide Sequencing , Metagenomics/methods , Metagenomics/statistics & numerical data , Proteome/chemistry , Proteome/genetics , RNA, Bacterial/chemistry , RNA, Bacterial/genetics , RNA, Messenger/genetics , Sequence Analysis, DNA
6.
Nat Commun ; 5: 5695, 2014 Dec 17.
Article in English | MEDLINE | ID: mdl-25517223

ABSTRACT

Closing gaps in draft genome assemblies can be costly and time-consuming, and published genomes are therefore often left 'unfinished.' Here we show that genome-wide chromosome conformation capture (3C) data can be used to overcome these limitations, and present a computational approach rooted in polymer physics that determines the most likely genome structure using chromosomal contact data. This algorithm, named GRAAL, generates high-quality assemblies of genomes in which repeated and duplicated regions are accurately represented and offers a direct probabilistic interpretation of the computed structures. We first validated GRAAL on the reference genome of Saccharomyces cerevisiae, as well as other yeast isolates, where GRAAL recovered both known and unknown complex chromosomal structural variations. We then applied GRAAL to the finishing of the assembly of Trichoderma reesei and obtained a number of contigs congruent with the known karyotype of this species. Finally, we showed that GRAAL can accurately reconstruct human chromosomes from either fragments generated in silico or contigs obtained from de novo assembly. In all these applications, GRAAL compared favourably to recently published programmes implementing related approaches.


Subject(s)
Algorithms , Chromosomes, Fungal , Chromosomes, Human , Contig Mapping/statistics & numerical data , Genome , Models, Statistical , Contig Mapping/methods , High-Throughput Nucleotide Sequencing , Humans , Karyotype , Saccharomyces cerevisiae/genetics , Sequence Analysis, DNA , Trichoderma/genetics
7.
Comput Biol Chem ; 53 Pt A: 97-107, 2014 Dec.
Article in English | MEDLINE | ID: mdl-25262360

ABSTRACT

Selecting the values of parameters used by de novo genomic assembly programs, or choosing an optimal de novo assembly from several runs obtained with different parameters or programs, are tasks that can require complex decision-making. A key parameter that must be supplied to typical next generation sequencing (NGS) assemblers is the k-mer length, i.e., the word size that determines which de Bruijn graph the program should map out and use. The topic of assembly selection criteria was recently revisited in the Assemblathon 2 study (Bradnam et al., 2013). Although no clear message was delivered with regard to optimal k-mer lengths, it was shown with examples that it is sometimes important to decide if one is most interested in optimizing the sequences of protein-coding genes (the gene space) or in optimizing the whole genome sequence including the intergenic DNA, as what is best for one criterion may not be best for the other. In the present study, our aim was to better understand how the assembly of unicellular fungi (which are typically intermediate in size and complexity between prokaryotes and metazoan eukaryotes) can change as one varies the k-mer values over a wide range. We used two different de novo assembly programs (SOAPdenovo2 and ABySS), and simple assembly metrics that also focused on success in assembling the gene space and repetitive elements. A recent increase in Illumina read length to around 150 bp allowed us to attempt de novo assemblies with a larger range of k-mers, up to 127 bp. We applied these methods to Illumina paired-end sequencing read sets of fungal strains of Paracoccidioides brasiliensis and other species. 
By visualizing the results in simple plots, we were able to track the effect of changing k-mer size and assembly program, and to demonstrate how such plots can readily reveal discontinuities or other unexpected characteristics that assembly programs can present in practice, especially when they are used in a traditional molecular microbiology laboratory with a 'genomics corner'. Here we propose and apply a component of a first pass validation methodology for benchmarking and understanding fungal genome de novo assembly processes.
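
The role of the k-mer length described in this abstract can be sketched as follows: each k-mer found in a read contributes one edge between its (k-1)-base prefix and its (k-1)-base suffix, so changing k changes the de Bruijn graph the assembler works with. This is a toy illustration only; the function name `de_bruijn_edges` is hypothetical, and SOAPdenovo2 and ABySS use far more elaborate data structures and error handling.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of width k along the read; reads shorter than k
        # contribute nothing.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


if __name__ == "__main__":
    for node, successors in sorted(de_bruijn_edges(["ACGTAC"], 3).items()):
        print(node, "->", sorted(successors))
```

A larger k yields fewer but more specific nodes, which helps resolve repeats, but it requires reads at least k bases long; this is why longer Illumina reads permit a wider range of k values.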


Subject(s)
Algorithms , Contig Mapping/statistics & numerical data , Genome, Fungal , Paracoccidioides/genetics , Sequence Analysis, DNA/statistics & numerical data , Benchmarking , DNA, Intergenic , High-Throughput Nucleotide Sequencing , Open Reading Frames , Repetitive Sequences, Nucleic Acid
8.
BMC Res Notes ; 7: 371, 2014 Jun 18.
Article in English | MEDLINE | ID: mdl-24938749

ABSTRACT

BACKGROUND: The fast reduction of prices of DNA sequencing allowed rapid accumulation of genome data. However, the process of obtaining complete genome sequences is still very time consuming and labor demanding. In addition, data produced from various sequencing technologies or alternative assemblies remain underexplored to improve assembly of incomplete genome sequences. FINDINGS: We have developed FGAP, a tool for closing gaps of draft genome sequences that takes advantage of different datasets. FGAP uses BLAST to align multiple contigs against a draft genome assembly aiming to find sequences that overlap gaps. The algorithm selects the best sequence to fill and eliminate the gap. CONCLUSIONS: FGAP reduced the number of gaps by 78% in an E. coli draft genome assembly using two different sequencing technologies, Illumina and 454. Using PacBio long reads, 98% of gaps were solved. In human chromosome 14 assemblies, FGAP reduced the number of gaps by 35%. All the inserted sequences were validated with a reference genome using QUAST. The source code and a web tool are available at http://www.bioinfo.ufpr.br/fgap/.


Subject(s)
Contig Mapping/methods , Escherichia coli/genetics , Genome, Bacterial , Genome, Human , Software , Algorithms , Base Sequence , Chromosomes, Human, Pair 14 , Contig Mapping/statistics & numerical data , High-Throughput Nucleotide Sequencing , Humans , Molecular Sequence Data
9.
Genome Biol ; 14(9): R101, 2013.
Article in English | MEDLINE | ID: mdl-24034426

ABSTRACT

BACKGROUND: The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Third-generation, single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem. RESULTS: To measure the benefit of single-molecule sequencing on microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacteria and archaea. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These single-library assemblies are also more accurate than typical short-read assemblies and hybrid assemblies of short and long reads. CONCLUSIONS: Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of completed genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.


Subject(s)
Contig Mapping/methods , Genome, Archaeal , Genome, Bacterial , Sequence Analysis, DNA/methods , Software , Algorithms , Base Sequence , Contig Mapping/statistics & numerical data , Escherichia coli/genetics , Francisella tularensis/genetics , Genome Size , Genomic Library , Mannheimia haemolytica/genetics , Molecular Sequence Data , Salmonella enterica/genetics , Sequence Analysis, DNA/economics , Sequence Analysis, DNA/statistics & numerical data
10.
Genome Biol ; 14(9): R100, 2013.
Article in English | MEDLINE | ID: mdl-24028704

ABSTRACT

BACKGROUND: Haplotypes are important for assessing genealogy and disease susceptibility of individual genomes, but are difficult to obtain with routine sequencing approaches. Experimental haplotype reconstruction based on assembling fragments of individual chromosomes is promising, but with variable yields due to incompletely understood parameter choices. RESULTS: We parameterize the clone-based haplotyping problem in order to provide theoretical and empirical assessments of the impact of different parameters on haplotype assembly. We confirm the intuition that long clones help link together heterozygous variants and thus improve haplotype length. Furthermore, given the length of the clones, we address how to choose the other parameters, including number of pools, clone coverage and sequencing coverage, so as to maximize haplotype length. We model the problem theoretically and show empirically the benefits of using larger clones with moderate number of pools and sequencing coverage. In particular, using 140 kb BAC clones, we construct haplotypes for a personal genome and assemble haplotypes with N50 values greater than 2.6 Mb. These assembled haplotypes are longer and at least as accurate as haplotypes of existing clone-based strategies, whether in vivo or in vitro. CONCLUSIONS: Our results provide practical guidelines for the development and design of clone-based methods to achieve long range, high-resolution and accurate haplotypes.


Subject(s)
Algorithms , Contig Mapping/methods , Genome, Human , HLA Antigens/genetics , Haplotypes , Molecular Typing/methods , Chromosomes, Artificial, Bacterial , Cloning, Molecular , Contig Mapping/statistics & numerical data , Humans , Molecular Typing/statistics & numerical data , Polymorphism, Single Nucleotide , Sequence Analysis, DNA
11.
BMC Res Notes ; 6: 334, 2013 Aug 22.
Article in English | MEDLINE | ID: mdl-23965294

ABSTRACT

BACKGROUND: The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments "read" by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. These "gold standards" can be expensive to produce and may only cover a small fraction of the genome, which limits their applicability to newly generated genome sequences. Here we introduce a de novo probabilistic measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. RESULTS: We demonstrate that our de novo score can be computed quickly and accurately in a practical setting even for large datasets, by estimating the score from a relatively small sample of the reads. To demonstrate the benefits of our score, we measure the quality of the assemblies generated in the GAGE and Assemblathon 1 assembly "bake-offs" with our metric. Even without knowledge of the true reference sequence, our de novo metric closely matches the reference-based evaluation metrics used in the studies and outperforms other de novo metrics traditionally used to measure assembly quality (such as N50). Finally, we highlight the application of our score to optimize assembly parameters used in genome assemblers, which enables better assemblies to be produced, even without prior knowledge of the genome being assembled. 
CONCLUSION: Likelihood-based measures, such as the one proposed here, will become the new standard for de novo assembly evaluation.


Subject(s)
Contig Mapping/statistics & numerical data , Genome, Bacterial , Rhodobacter sphaeroides/genetics , Software , Staphylococcus aureus/genetics , Staphylococcus epidermidis/genetics , Algorithms , Genomics/methods , Likelihood Functions , Sequence Analysis, DNA
12.
BMC Genomics ; 5: 84, 2004 Nov 03.
Article in English | MEDLINE | ID: mdl-15527499

ABSTRACT

BACKGROUND: The ongoing efforts to sequence the honey bee genome require additional initiatives to define its transcriptome. Towards this end, we employed the Open Reading frame ESTs (ORESTES) strategy to generate profiles for the life cycle of Apis mellifera workers. RESULTS: Of the 5,021 ORESTES, 35.2% matched with previously deposited Apis ESTs. The analysis of the remaining sequences defined a set of putative orthologs, most of which had their best-match hits with Anopheles and Drosophila genes. CAP3 assembly of the Apis ORESTES with the already existing 15,500 Apis ESTs generated 3,408 contigs. BLASTX comparison of these contigs with protein sets of organisms representing distinct phylogenetic clades revealed a total of 1,629 contigs that Apis mellifera shares with different taxa. Most (41%) represent genes that are common to all taxa, another 21% are shared between metazoans (Bilateria), and 16% are shared only within the Insecta clade. A set of 23 putative genes presented a best match with human genes, many of which encode factors related to cell signaling/signal transduction. 1,779 contigs (52%) did not match any known sequence. Applying a correction factor deduced from a parallel analysis performed with Drosophila melanogaster ORESTES, we estimate that approximately half of these no-match EST contigs (22%) should represent Apis-specific genes. CONCLUSIONS: The versatile and cost-efficient ORESTES approach produced minilibraries for honey bee life cycle stages. Such information on central gene regions contributes to genome annotation and also lends itself to cross-transcriptome comparisons to reveal evolutionary trends in insect genomes.


Subject(s)
Bees/genetics , Expressed Sequence Tags , Open Reading Frames/genetics , Transcription, Genetic/genetics , Animals , Anopheles/genetics , Caenorhabditis elegans , Classification , Cluster Analysis , Contig Mapping/statistics & numerical data , Drosophila melanogaster/genetics , Genes, Helminth/genetics , Genes, Insect/genetics , Genome , Genome, Fungal , Genome, Human , Genome, Protozoan , Humans
13.
BMC Genomics ; 5: 89, 2004 Nov 16.
Article in English | MEDLINE | ID: mdl-15546486

ABSTRACT

BACKGROUND: The cellular response of plants to water deficits has both economic and evolutionary importance, directly affecting plant productivity in agriculture and plant survival in the natural environment. Genes induced by water-deficit stress have been successfully enumerated in plants that are relatively sensitive to cellular dehydration; however, we have little knowledge as to the adaptive role of these genes in establishing tolerance to water loss at the cellular level. Our approach to address this problem has been to investigate the genetic responses of plants that are capable of tolerating extremes of dehydration, in particular the desiccation-tolerant bryophyte, Tortula ruralis. To establish a sound basis for characterizing the Tortula genome in regards to desiccation tolerance, we analyzed 10,368 expressed sequence tags (ESTs) from rehydrated rapid-dried Tortula gametophytes, a stage previously determined to exhibit the maximum stress induced change in gene expression. RESULTS: The 10,368 ESTs formed 5,563 EST clusters (contig groups representing individual genes) of which 3,321 (59.7%) exhibited similarity to genes present in the public databases and 2,242 were categorized as unknowns based on protein homology scores. The 3,321 clusters were classified by function using the Gene Ontology (GO) hierarchy and the KEGG database. The results indicate that the transcriptome contains a diverse population of transcripts that reflects, as expected, a period of metabolic upheaval in the gametophyte cells. Much of the emphasis within the transcriptome is centered on the protein synthetic machinery, ion and metabolite transport, and membrane biosynthesis and repair. Rehydrating gametophytes also have an abundance of transcripts that code for enzymes involved in oxidative stress metabolism and phosphorylating activities. 
The functional classifications reflect a remarkable consistency with what we have previously established with regards to the metabolic activities that are important in the recovery of the gametophytes from desiccation. A comparison of the GO distribution of Tortula clusters with an identical analysis of 9,981 clusters from the desiccation sensitive bryophyte species Physcomitrella patens, revealed, and accentuated, the differences between stressed and unstressed transcriptomes. Cross species sequence comparisons indicated that on the whole the Tortula clusters were more closely related to those from Physcomitrella than Arabidopsis (complete genome BLASTx comparison) although because of the differences in the databases there were more high scoring matches to the Arabidopsis sequences. The most abundant transcripts contained within the Tortula ESTs encode Late Embryogenesis Abundant (LEA) proteins that are normally associated with drying plant tissues. This suggests that LEAs may also play a role in recovery from desiccation when water is reintroduced into a dried tissue. CONCLUSION: The establishment of a rehydration EST collection for Tortula ruralis, an important plant model for plant stress responses and vegetative desiccation tolerance, is an important step in understanding the genome level response to cellular dehydration. The type of transcript analysis performed here has laid the foundation for more detailed functional and genome level analyses of the genes involved in desiccation tolerance in plants.


Subject(s)
Bryophyta/genetics , Bryophyta/metabolism , DNA, Plant/classification , Desiccation , Genes, Plant/genetics , Transcription, Genetic/genetics , Water/metabolism , Arabidopsis/genetics , Cluster Analysis , Conserved Sequence/genetics , Contig Mapping/statistics & numerical data , Databases, Genetic , Expressed Sequence Tags , Open Reading Frames/genetics
14.
Genome Res ; 14(4): 493-506, 2004 Apr.
Article in English | MEDLINE | ID: mdl-15059990

ABSTRACT

We assessed the content, structure, and distribution of segmental duplications (≥90% sequence identity, ≥5 kb length) within the published version of the Rattus norvegicus genome assembly (v.3.1). The overall fraction of duplicated sequence within the rat assembly (2.92%) is greater than that of the mouse (1%-1.2%) but significantly less than that of human (~5%). Duplications were nonuniformly distributed, occurring predominantly as tandem and tightly clustered intrachromosomal duplications. Regions containing extensive interchromosomal duplications were observed, particularly within subtelomeric and pericentromeric regions. We identified 41 discrete genomic regions greater than 1 Mb in size, termed "duplication blocks." These appear to have been the target of extensive duplication over millions of years of evolution. Gene content within duplicated regions (~1%) was lower than expected based on the genome representation. Interestingly, sequence contigs lacking chromosome assignment ("the unplaced chromosome") showed a marked enrichment for segmental duplication (45% of 75.2 Mb), indicating that segmental duplications have been problematic for sequence and assembly of the rat genome. Further targeted efforts are required to resolve the organization and complexity of these regions.


Subject(s)
Gene Duplication , Rats, Inbred BN/genetics , Animals , Base Composition/genetics , Chromosomes/genetics , Computational Biology/methods , Computational Biology/statistics & numerical data , Contig Mapping/methods , Contig Mapping/statistics & numerical data , Gene Conversion/genetics , Genes/genetics , Genome , Rats
15.
Genome Res ; 14(4): 679-84, 2004 Apr.
Article in English | MEDLINE | ID: mdl-15060010

ABSTRACT

CLONEPICKER is a software pipeline that integrates sequence data with BAC clone fingerprints to dynamically select a minimal overlapping clone set covering the whole genome. In the Rat Genome Sequencing Project (RGSP), a hybrid strategy of "clone by clone" and "whole genome shotgun" approaches was used to maximize the merits of both approaches. Like the "clone by clone" method, one key challenge for this strategy was to select a low-redundancy clone set that covered the whole genome while the sequencing is in progress. The CLONEPICKER pipeline met this challenge using restriction enzyme fingerprint data, BAC end sequence data, and sequences generated from individual BAC clones as well as WGS reads. In the RGSP, an average of 7.5 clones was identified from each side of a seed clone, and the minimal overlapping clones were reliably selected. Combined with the assembled BAC fingerprint map, a set of BAC clones that covered >97% of the genome was identified and used in the RGSP.


Subject(s)
Chromosomes, Artificial, Bacterial/genetics , Contig Mapping/methods , Genome , Sequence Analysis, DNA/methods , Animals , Computational Biology/methods , Computational Biology/statistics & numerical data , Contig Mapping/statistics & numerical data , DNA Fingerprinting/methods , DNA Fingerprinting/statistics & numerical data , Rats , Sequence Analysis, DNA/statistics & numerical data , Software/statistics & numerical data
16.
Genome Res ; 14(1): 99-108, 2004 Jan.
Article in English | MEDLINE | ID: mdl-14672978

ABSTRACT

Comprehensive identification of DNA cis-regulatory elements is crucial for a predictive understanding of transcriptional network dynamics. Strong evidence suggests that these DNA sequence motifs are highly conserved between related species, reflecting strong selection on the network of regulatory interactions that underlie common cellular behavior. Here, we exploit a systems-level aspect of this conservation (the network-level topology of these interactions) to map transcription factor (TF) binding sites on a genomic scale. Using network-level conservation as a constraint, our algorithm finds 71% of known TF binding sites in the yeast Saccharomyces cerevisiae, using only 12% of the sequence of a phylogenetic neighbor. Most of the novel predicted motifs show strong features of known TF binding sites, such as functional category and/or expression profile coherence of their corresponding genes. Network-level conservation should provide a powerful constraint for the systematic mapping of TF binding sites in the larger genomes of higher eukaryotes.


Subject(s)
Conserved Sequence/genetics , Genome, Fungal , Saccharomyces cerevisiae/genetics , Transcription Factors/genetics , Transcription Factors/metabolism , Algorithms , Base Composition/genetics , Binding Sites/genetics , Binding Sites/physiology , Contig Mapping/methods , Contig Mapping/statistics & numerical data , DNA, Fungal/genetics , Humans , Models, Genetic , Models, Statistical , Predictive Value of Tests
17.
Plant Cell ; 14(7): 1441-56, 2002 Jul.
Article in English | MEDLINE | ID: mdl-12119366

ABSTRACT

Analysis of a collection of 120,892 single-pass ESTs, derived from 26 different tomato cDNA libraries and reduced to a set of 27,274 unique consensus sequences (unigenes), revealed that 70% of the unigenes have identifiable homologs in the Arabidopsis genome. Genes corresponding to metabolism have remained most conserved between these two genomes, whereas genes encoding transcription factors are among the fastest evolving. The majority of the 10 largest conserved multigene families share similar copy numbers in tomato and Arabidopsis, suggesting that the multiplicity of these families may have occurred before the divergence of these two species. An exception to this multigene conservation was observed for the E8-like protein family, which is associated with fruit ripening and has higher copy number in tomato than in Arabidopsis. Finally, six BAC clones from different parts of the tomato genome were isolated, genetically mapped, sequenced, and annotated. The combined analysis of the EST database and these six sequenced BACs leads to the prediction that the tomato genome encodes approximately 35,000 genes, which are sequestered largely in euchromatic regions corresponding to less than one-quarter of the total DNA in the tomato nucleus.


Subject(s)
Expressed Sequence Tags , Genome, Plant , Solanum lycopersicum/genetics , Arabidopsis/genetics , Chromosome Mapping , Chromosomes, Bacterial/genetics , Cloning, Molecular , Consensus Sequence/genetics , Conserved Sequence/genetics , Contig Mapping/methods , Contig Mapping/statistics & numerical data , DNA, Bacterial/genetics , DNA, Plant/genetics , Evolution, Molecular , Genomic Library , Medicago/genetics , Molecular Sequence Data , Multigene Family , Sequence Analysis, DNA/methods
18.
Bioinformatics ; 18(3): 484-5, 2002 Mar.
Article in English | MEDLINE | ID: mdl-11934749

ABSTRACT

SUMMARY: One of the more common uses of the program FingerPrint Contigs (FPC) is to assemble random restriction digest 'fingerprints' of overlapping genomic clones into contigs. To improve the rate of assembling contigs from large fingerprint databases we have adapted FPC so that it can be run in parallel on multiple processors and servers. The current version of 'parallelized FPC' has been used in our laboratory to assemble mammalian BAC fingerprint databases, each containing more than 300,000 BAC fingerprints. AVAILABILITY: This parallelized version of FPC is available under the GNU GPL licence, and can be downloaded from ftp://ftp.bcgsc.bc.ca/pub/fpcd.


Subject(s)
Algorithms , Cloning, Molecular , Computing Methodologies , Contig Mapping/methods , Databases, Genetic , Animals , Base Sequence , Contig Mapping/statistics & numerical data , DNA Fingerprinting/methods , Humans , Internet , Mice , Molecular Sequence Data , Restriction Mapping , Sensitivity and Specificity , Sequence Tagged Sites , Software , Time Factors
19.
Genome Biol ; 3(12): RESEARCH0074, 2002.
Article in English | MEDLINE | ID: mdl-12537563

ABSTRACT

BACKGROUND: Cardiovascular diseases are the primary cause of death worldwide; the identification of genes specifically expressed in the heart is thus of major biomedical interest. We carried out a comprehensive analysis of gene-expression profiles using expressed sequence tags (ESTs) to identify genes overexpressed in the human adult heart. The initial set of genes expressed in the heart was constructed by clustering and assembling ESTs from heart cDNA libraries. Expression profiles were then generated for each gene by counting its cognate ESTs in all libraries. Differential expression was assessed by applying a previously published statistical procedure to these profiles. RESULTS: We identified 35 cardiac-specific genes overexpressed in the heart, some of which displayed significant coexpression. Some genes had no previously recognized cardiac function. Of the 35 genes, 32 were mapped back onto the human genome sequence. According to Online Mendelian Inheritance in Man (OMIM), five genes were previously known as heart-disease genes and one gene was located in the locus of a bleeding disorder. Analysis of the promoter regions of this collection of genes provides the first list of putative regulatory elements associated with differential cardiac expression. CONCLUSION: This study shows that ESTs are still a powerful tool to identify differentially expressed genes. We present a list of genes specifically expressed in the human heart, one of which is a candidate for a bleeding disorder. In addition, we provide the first set of putative regulatory elements, the combination of which appears correlated with heart-specific gene expression.
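
The profile-building step described above amounts to tallying ESTs per gene and per library. A minimal sketch, assuming each EST has already been assigned to a gene and is supplied as a hypothetical (gene, library) pair; the statistical test for differential expression is not specified in the abstract and is not reproduced here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def expression_profiles(ests: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Map each gene to a Counter of its EST counts per cDNA library."""
    profiles: Dict[str, Counter] = defaultdict(Counter)
    for gene, library in ests:
        profiles[gene][library] += 1
    return dict(profiles)


# Illustrative input only: gene and library names are placeholders.
example = [
    ("geneA", "heart"), ("geneA", "heart"), ("geneA", "liver"),
    ("geneB", "liver"),
]
profiles = expression_profiles(example)
```

A Counter returns 0 for libraries in which a gene has no ESTs, so profiles for different genes can be compared across the same set of libraries without filling in missing entries.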


Subject(s)
Expressed Sequence Tags , Gene Library , Myocardium/metabolism , Adult , Chromosome Mapping/methods , Cluster Analysis , Contig Mapping/methods , Contig Mapping/statistics & numerical data , Databases, Genetic , Gene Expression Profiling/methods , Gene Expression Profiling/statistics & numerical data , Gene Expression Regulation/genetics , Gene Expression Regulation/physiology , Heart Diseases/genetics , Humans , Muscle, Smooth, Vascular/chemistry , Muscle, Smooth, Vascular/metabolism , Myocardium/chemistry , Organ Specificity/genetics , Promoter Regions, Genetic/genetics
20.
Genes Chromosomes Cancer ; 32(2): 144-54, 2001 Oct.
Article in English | MEDLINE | ID: mdl-11550282

ABSTRACT

The Philadelphia translocation, t(9;22)(q34;q11), is the microscopically visible product of recombination between two genes, ABL1 on chromosome 9 and BCR on chromosome 22, and gives rise to a functional hybrid BCR-ABL1 gene with demonstrated leukemogenic properties. Breakpoints in BCR occur mostly within one of two regions: a 5 kb major breakpoint cluster region (M-Bcr) and a larger 35 kb minor breakpoint cluster region (m-Bcr) towards the 3' end of the first BCR intron. By contrast, breakpoints in ABL1 are reported to occur more widely across a >200 kb region which spans the large first and second introns. The mechanisms that determine preferential breakage sites in BCR, and which cause recombination between BCR and ABL1, are presently unknown. In some cases, Alu repeats have been identified at or near sequenced breakpoint sites in both genes, providing indications, albeit controversial, that they may be relevant. For the present study, we carried out a detailed analysis of genomic BCR and ABL1 sequences to identify, classify, and locate interspersed repeat sequences and to relate their distribution to precisely mapped BCR-ABL1 recombination sites. Our findings confirm that Alu are the most abundant class of repeat in both genes, but that they occupy fewer sites than previously estimated and that they are distributed nonrandomly. r-Scan statistics were applied to provide a measure of repeat distribution and to evaluate extremes in repeat spacing. A significant lack of Alu elements was observed across the major and minor breakpoint cluster regions of BCR and across a 25 kb region showing a high frequency of breakage in ABL1. These findings counter the suggestion that occurrence of Alu at BCR-ABL1 recombination sites is likely by chance because of the high density of Alu in these two genes. Instead, as yet unidentified DNA conformation or nucleotide characteristics peculiar to the preferentially recombining regions, including those Alu elements present within them, more likely influence their fragility.
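
r-Scan statistics summarise marker spacing by summing r consecutive gaps between ordered positions; unusually small values point to clustering and unusually large values to regions depleted of markers. The sketch below computes these sums and their extremes for hypothetical repeat coordinates. It does not include the significance thresholds a study of this kind would apply.

```python
from typing import List, Sequence, Tuple


def r_scans(positions: Sequence[int], r: int) -> List[int]:
    """Distance spanned by every run of r consecutive spacings."""
    pts = sorted(positions)
    if r < 1 or len(pts) <= r:
        raise ValueError("need r >= 1 and at least r + 1 positions")
    # pts[i + r] - pts[i] equals the sum of the r gaps starting at pts[i].
    return [pts[i + r] - pts[i] for i in range(len(pts) - r)]


def r_scan_extremes(positions: Sequence[int], r: int) -> Tuple[int, int]:
    """Smallest and largest r-scan, the values examined for clustering or depletion."""
    scans = r_scans(positions, r)
    return min(scans), max(scans)


# Hypothetical repeat start coordinates (bp) along one gene.
example_positions = [100, 300, 350, 400, 2000, 2100]
smallest, largest = r_scan_extremes(example_positions, r=2)
```

With r = 1 the scans are simply the gaps between neighbouring repeats; larger r smooths over single close pairs and is more sensitive to extended sparse or dense stretches.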


Subject(s)
Alu Elements/genetics , Chromosome Breakage/genetics , Genes, abl/genetics , Interspersed Repetitive Sequences/genetics , Oncogene Proteins/genetics , Protein-Tyrosine Kinases , Proto-Oncogene Proteins , Contig Mapping/methods , Contig Mapping/statistics & numerical data , Databases, Factual , Humans , Proto-Oncogene Proteins c-bcr , Statistical Distributions