A statistical approach designed for finding mathematically defined repeats in shotgun data and determining the length distribution of clone-inserts / Genomics, Proteomics & Bioinformatics (English edition)
Genomics, Proteomics & Bioinformatics; (4): 43-51, 2003.
Article in English | WPRIM | ID: wpr-339525
ABSTRACT
The large number of repeats, especially high-copy repeats, in the genomes of higher animals and plants makes whole genome assembly (WGA) quite difficult. To address this problem, we tried to identify repeats and mask them prior to assembly, even at the genome survey stage. Repeats of different copy number are known to have different probabilities of appearing in shotgun data. Based on this principle, we constructed a statistical model and inferred criteria for mathematically defined repeats (MDRs) at different shotgun coverages. Using these criteria, we developed the software MDRmasker to identify and mask MDRs in shotgun data. With repeats masked prior to assembly, assembly was faster and had a lower error probability. In addition, because clone-insert size affects the accuracy of repeat assembly and scaffold construction, we also used our model to design the length distribution of clone-inserts. The length distributions of repeats differed between our simulated human and rice genomes, so their optimal length distributions of clone-inserts were not the same. Thus, with an optimal length distribution of clone-inserts, a given genome could be assembled better at lower coverage.
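The principle stated in the abstract, that a sequence's copy number changes how often it appears in shotgun data, can be illustrated with a minimal sketch. It is not the authors' model or the MDRmasker implementation, neither of which is described in this record. It assumes, as a simplification, that the number of times a single-copy k-mer is sampled follows a Poisson distribution whose mean equals the shotgun coverage. Under that assumption, a k-mer counts as a repeat when its observed count is one that a single-copy k-mer would reach with probability below a chosen `alpha`. The function names, the k-mer representation, and the `alpha` parameter are all hypothetical.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def mdr_threshold(coverage, alpha=1e-3):
    """Smallest count that a single-copy k-mer reaches with probability < alpha.

    Assumption (not from the source): single-copy k-mer counts are
    Poisson-distributed with mean equal to the shotgun coverage.
    """
    k = 1
    while poisson_sf(k, coverage) >= alpha:
        k += 1
    return k


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def mask_reads(reads, k, coverage, alpha=1e-3):
    """Replace bases covered by over-represented k-mers with 'N'."""
    threshold = mdr_threshold(coverage, alpha)
    counts = count_kmers(reads, k)
    masked = []
    for read in reads:
        chars = list(read)
        for i in range(len(read) - k + 1):
            if counts[read[i:i + k]] >= threshold:
                chars[i:i + k] = "N" * k
        masked.append("".join(chars))
    return masked
```

In this sketch the threshold rises with coverage, which mirrors the abstract's point that the criteria for MDRs depend on the shotgun coverage at which the data were collected.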
Full text: Available
Index: WPRIM (Western Pacific)
Main subject: Oryza / Genome, Human / Models, Statistical / Genome / Cloning, Molecular / Sequence Analysis, DNA / Genomics / Genetics / Methods / Models, Genetic
Type of study: Diagnostic study / Prognostic study / Risk factors
Limits: Animals / Humans
Language: English
Journal: Genomics, Proteomics & Bioinformatics
Year: 2003
Type: Article