Search | VHL Regional Portal

Preparata, Franco P.

J Comput Biol ; 20(6): 424-32, 2013 Jun.

Article in English | MEDLINE | ID: mdl-23675928

ABSTRACT

This work revisits the classic problem of coverage in genomic shotgun assembly (the "Lander-Waterman statistics"). A novel formulation, based on the analysis of an autonomous Markov automaton, is presented, and two main conclusions are derived. The first is an evaluation of the minimum multiplicity ("coverage") required to achieve uninterrupted covering (one single contig) with a prescribed confidence level. The second is a detailed analysis of the effect of replacing the hypothesis of fixed-length genomic fragments with that of an arbitrary distribution of lengths over a finite interval.

Subject(s)

Contig Mapping , Genome , Genomics/methods , Models, Statistical , Sequence Analysis, DNA

Spectrum-based de novo repeat detection in genomic sequences.

Do, Huy Hoang; Choi, Kwok Pui; Preparata, Franco P; Sung, Wing Kin; Zhang, Louxin.

J Comput Biol ; 15(5): 469-87, 2008 Jun.

Article in English | MEDLINE | ID: mdl-18549302

ABSTRACT

A novel approach to the detection of genomic repeats is presented in this paper. The technique, dubbed SAGRI (Spectrum Assisted Genomic Repeat Identifier), is based on the spectrum (set of sequence k-mers, for some k) of the genomic sequence. Specifically, the genome is scanned twice. The first scan (FindHit) detects candidate pairs of repeat-segments, by effectively reconstructing portions of the Euler path of the (k-1)-mer graph of the genome only in correspondence with likely repeat sites. This process produces candidate repeat pairs, for which the location of the leftmost term is unknown. Candidate pairs are then subjected to validation in a second scan, in which the genome is labelled for hits in the (much smaller) spectrum of the repeat candidates: high hit density is taken as evidence of the location of the first segment of a repeat, and the pair of segments is then certified by pairwise alignment. The design parameters of the technique are selected on the basis of a careful probabilistic analysis (based on random sequences). SAGRI is compared with three leading repeat-finding tools on both synthetic and natural DNA sequences, and found to be uniformly superior in versatility (ability to detect repeats of different lengths) and accuracy (the central goal of repeat finding), while being quite competitive in speed. An executable program can be downloaded at http://sagri.comp.nus.edu.sg.

Subject(s)

Algorithms , Pattern Recognition, Automated , Repetitive Sequences, Nucleic Acid , Genome, Human , Humans , Probability , Sequence Analysis, DNA/methods

A novel approach to the detection of genomic approximate tandem repeats in the Levenshtein metric.

Domaniç, Nevzat Onur; Preparata, Franco P.

J Comput Biol ; 14(7): 873-91, 2007 Sep.

Article in English | MEDLINE | ID: mdl-17803368

ABSTRACT

An efficient algorithm for detecting approximate tandem repeats in genomic sequences is presented. The algorithm is based on innovative statistical criteria to detect candidate regions which may include tandem repeats; these regions are subsequently verified by alignments based on dynamic programming. No prior information about the period size or pattern is needed. Also, the algorithm is virtually capable of detecting repeats with any period. An implementation of the algorithm is compared with the two state-of-the-art tandem repeats detection tools to demonstrate its effectiveness both on natural and synthetic data. The algorithm is available at www.cs.brown.edu/people/domanic/tandem/.

Subject(s)

Algorithms , Genome , Tandem Repeat Sequences , Animals , Base Sequence , Humans , Mathematics , Models, Genetic , Sequence Analysis, DNA

Quick, practical selection of effective seeds for homology search.

Preparata, Franco P; Zhang, Louxin; Choi, Kwok Pui.

J Comput Biol ; 12(9): 1137-52, 2005 Nov.

Article in English | MEDLINE | ID: mdl-16305325

ABSTRACT

It has been observed that in homology search gapped seeds have better sensitivity than ungapped ones for the same cost (weight). In this paper, we propose a probability leakage model (a dissipative Markov system) to elucidate the mechanism that confers power to spaced seeds. Based on this model, we identify desirable features of gapped search seeds and formulate an extremely efficient procedure for seed design: it samples from the set of spaced seed exhibiting those features, evaluates their sensitivity, and then selects the best. The sensitivity of the constructed seeds is negligibly less than that of the corresponding known optimal seeds. While the challenging mathematical question of characterizing optimal search seeds remains open, we believe that our eminently efficient and effective approach represents a satisfactory solution from a practitioner's viewpoint.

Subject(s)

Sequence Alignment/statistics & numerical data , Algorithms , Computational Biology , DNA/genetics , Genomics/statistics & numerical data , Markov Chains , Mathematics , Models, Statistical , Sensitivity and Specificity

Adaptive control of hybridization noise in DNA sequencing-by-hybridization.

Leong, Hon-Wai; Preparata, Franco P; Sung, Wing-Kin; Willy, Hugo.

J Bioinform Comput Biol ; 3(1): 79-98, 2005 Feb.

Article in English | MEDLINE | ID: mdl-15751113

ABSTRACT

We consider the problem of sequence reconstruction in sequencing-by-hybridization in the presence of spectrum errors. As suggested by intuition, and reported in the literature, false-negatives (i.e., missing spectrum probes) are by far the leading cause of reconstruction failures. In a recent paper we have described an algorithm, called "threshold-theta", designed to recover from false negatives. This algorithm is based on overcompensating for missing extensions by allowing larger reconstruction subtrees. We demonstrated, both analytically and with simulations, the increasing effectiveness of the approach as the parameter theta grows, but also pointed out that for larger error rates the size of the extension trees translates into an unacceptable computational burden. To obviate this shortcoming, in this paper we propose an adaptive approach which is both effective and efficient. Effective, because for a fixed value of theta it performs as well as its single-threshold counterpart, efficient because it exhibits substantial speed-ups over it. The idea is that, for moderate error rates a small fraction of the target sequence can be involved in error recovery; thus, expectedly the remainder of the sequence is reconstructible by the standard noiseless algorithm, with the provision to switch to operation with increasingly higher thresholds after detecting failure. This policy generates interesting and complex interplays between fooling probes and false negatives. These phenomena are carefully analyzed for random sequences and the results are found to be in excellent agreement with the simulations. In addition, the experimental algorithmic speed-ups of the multithreshold approach are explained in terms of the interaction amongst the different threshold regimes.

Subject(s)

Algorithms , DNA/analysis , DNA/genetics , In Situ Hybridization/methods , Models, Genetic , Models, Statistical , Sequence Analysis, DNA/methods , DNA Probes/genetics , Feedback , Reproducibility of Results , Sensitivity and Specificity , Stochastic Processes

DNA sequencing by hybridization using semi-degenerate bases.

Preparata, Franco P; Oliver, John S.

J Comput Biol ; 11(4): 753-65, 2004.

Article in English | MEDLINE | ID: mdl-15579243

ABSTRACT

One way to enhance the performance of hybridization microarrrays for DNA de novo sequencing is the use of probing patterns with gaps of unsampled positions. Ideally, such gaps could be realized by the inclusion into microarray oligos (probes) of wild-card compounds, referred to as universal bases (which bind nonspecifically to natural bases). The suggested alternative is to deploy in the gap positions degenerate bases, i.e., uniform mixtures of the four natural bases, with ensuing deterioration of the hybridization signal. In this paper, we show that such signal loss is a minor shortcoming, compared with the fact that degenerate bases cannot be treated as universal. Indeed, the substantial spread of hybridization energies at any microarray feature is such that on overwhelming number of mismatches bind more strongly than legal matches. We observed, however, that much narrower energy spreads are exhibited by pairs of bases in the same strength class (A-T and C-G). We call semi-degenerate a gap position realized with bases in the same energy class and show that well-known sequence reconstruction algorithms can be modified to achieve substantial improvements in sequencing effectiveness. For example, with a 4(9)-feature microarray and an acceptable weakening of the hybridization signal, one may achieve lengths of about 4,000 bases (compared with < 250 of the standard uniform method). Our approach also incorporates the use of a spectrum expressed in terms of observed feature melting temperatures (analog spectrum), rather than binary decisions made directly at the biochemical level (digital spectrum). While universal bases represent the ultimate goal of sequencing by hybridization, semidegenerate natural bases are the most effective known substitute.

Subject(s)

Sequence Analysis, DNA/statistics & numerical data , Algorithms , Base Pairing , Computational Biology , DNA/chemistry , DNA/genetics , Nucleic Acid Hybridization , Oligonucleotide Array Sequence Analysis/statistics & numerical data , Thermodynamics

Sequencing-by-hybridization revisited: the analog-spectrum proposal.

Preparata, Franco P.

IEEE/ACM Trans Comput Biol Bioinform ; 1(1): 46-52, 2004.

Article in English | MEDLINE | ID: mdl-17048407

ABSTRACT

All published approaches to DNA sequencing by hybridization (SBH) consist of the biochemical acquisition of the spectrum of a target sequence (the set of its subsequences conforming to a given probing pattern) followed by the algorithmic reconstruction of the sequence from its spectrum. In the "standard" or "uniform" approach, the probing pattern is a string of length L and the length of reliably reconstructible sequences is known to be mlen = O(2(L)). For a fixed microarray area, higher sequencing performance can be achieved by inserting nonprobing gaps ("wild-cards") in the probing pattern. The reconstruction, however, must cope with the emergence of fooling probes due to the gaps and algorithmic failure occurs when the spectrum becomes too densely populated, although we can achieve mcomp = 0(4(L)). Despite the combinatorial success of gapped probing, all current approaches are based on a biochemically unrealistic spectrum-acquisition model (digital-spectrum). The reality of hybridization is much more complex. Departing from the conventional model, in this paper, we propose an alternative, called the analog-spectrum model, which more closely reflects the biochemical process. This novel modeling reestablishes probe length as the performance-governing factor, adopting "semidegenerate bases" as suitable emulators of currently inadequate universal bases. One important conclusion is that accurate biochemical measurements are pivotal to the success of SBH. The theoretical proposal presented in this paper should be a convincing stimulus for the needed biotechnological work.

Subject(s)

Algorithms , Computational Biology/methods , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , DNA/chemistry , DNA/genetics , Nucleic Acid Hybridization , Thermodynamics , Transition Temperature

Sequencing by hybridization by cooperating direct and reverse spectra.

Heath, Samuel A; Preparata, Franco P; Young, Joel.

J Comput Biol ; 10(3-4): 499-508, 2003.

Article in English | MEDLINE | ID: mdl-12935340

ABSTRACT

DNA sequencing by hybridization, potentially a powerful alternative to standard wet lab techniques, has received renewed interest after a novel probing scheme has been recently proposed whose performance for the first time asymptotically meets the information theory bound. After settlement of the question of asymptotic performance, there remains the issue of algorithmic fine tunings aimed at improving the performance "constants," with substantial practical implications. In this paper, we show that a probing scheme based on the joint use of direct and reverse spectra (tandem spectra) for a given gapped probing pattern achieves a performance improvement per unit of microarray area of about 5/4 and does not appear to be susceptible to further improvement by increasing the number of cooperating spectra. In other words, tandem-spectrum reconstruction is the best known technique for sequencing by hybridization.

Subject(s)

Computational Biology/methods , Data Interpretation, Statistical , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , Spectrum Analysis/methods

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

ABSTRACT

Subject(s)

SEND TO:

SELECTION OF CITATIONS

SEARCH DETAIL