Results 1 - 20 of 20
1.
J Comput Biol ; 31(7): 597-615, 2024 Jul.
Article in English | MEDLINE | ID: mdl-38980804

ABSTRACT

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. 
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.


Subject(s)
Algorithms , Computational Biology , Software , Computational Biology/methods , Sequence Alignment/methods , Humans , Sequence Analysis, DNA/methods
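
The decycling property described in this abstract can be checked directly on small cases. The following is a minimal sketch, assuming the node-centric de Bruijn graph (nodes are k-mers, with an edge from x to y when the last k-1 letters of x equal the first k-1 letters of y). The function name `is_decycling` is hypothetical and is not taken from the mdsscope code.

```python
from itertools import product


def is_decycling(kmer_set, alphabet, k):
    """Return True if removing kmer_set from the order-k de Bruijn graph
    leaves an acyclic graph (hypothetical helper, for illustration only)."""
    # Nodes that remain after deleting the candidate set.
    nodes = {"".join(p) for p in product(alphabet, repeat=k)} - set(kmer_set)
    # Successors of u: drop the first letter, append any letter.
    succ = {u: [u[1:] + c for c in alphabet if u[1:] + c in nodes] for u in nodes}
    indeg = {u: 0 for u in nodes}
    for u in nodes:
        for v in succ[u]:
            indeg[v] += 1
    # Kahn's algorithm: the graph is acyclic iff every node can be peeled off.
    stack = [u for u in nodes if indeg[u] == 0]
    peeled = 0
    while stack:
        u = stack.pop()
        peeled += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return peeled == len(nodes)
```

For the binary alphabet and k = 2, the set {00, 11, 01} passes this check, while {00, 11} does not because the cycle 01 -> 10 -> 01 survives.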
2.
Bioinformatics ; 40(Supplement_1): i11-i19, 2024 Jun 28.
Article in English | MEDLINE | ID: mdl-38940154

ABSTRACT

MOTIVATION: Wikipedia is a vital open educational resource in computational biology. The quality of computational biology coverage in English-language Wikipedia has improved steadily in recent years. However, there is an increasingly large 'knowledge gap' between computational biology resources in English-language Wikipedia, and Wikipedias in non-English languages. Reducing this knowledge gap by providing educational resources in non-English languages would reduce language barriers which disadvantage non-native English speaking learners across multiple dimensions in computational biology. RESULTS: Here, we provide a comprehensive assessment of computational biology coverage in Spanish-language Wikipedia, the second most accessed Wikipedia worldwide. Using Spanish-language Wikipedia as a case study, we generate quantitative and qualitative data before and after a targeted educational event, specifically, a Spanish-focused student editing competition. Our data demonstrates how such events and activities can narrow the knowledge gap between English and non-English educational resources, by improving existing articles and creating new articles. Finally, based on our analysis, we suggest ways to prioritize future initiatives to improve open educational resources in other languages. AVAILABILITY AND IMPLEMENTATION: Scripts for data analysis are available at: https://github.com/ISCBWikiTeam/spanish.


Subject(s)
Computational Biology , Computational Biology/methods , Humans , Language , Internet
3.
ArXiv ; 2023 Nov 06.
Article in English | MEDLINE | ID: mdl-37986724

ABSTRACT

Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set, an unavoidable set of k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties.
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.

4.
J Comput Biol ; 27(8): 1181-1189, 2020 08.
Article in English | MEDLINE | ID: mdl-32315544

ABSTRACT

Computational tools used for genomic analyses are becoming more accurate but also increasingly sophisticated and complex. This introduces a new problem in that these pieces of software have a large number of tunable parameters that often have a large influence on the results that are reported. We quantify the impact of parameter choice on transcript assembly and take some first steps toward generating a truly automated genomic analysis pipeline by developing a method for automatically choosing input-specific parameter values for reference-based transcript assembly using the Scallop tool. By choosing parameter values for each input, the area under the receiver operating characteristic curve (AUC) when comparing assembled transcripts to a reference transcriptome is increased by an average of 28.9% over using only the default parameter choices on 1595 RNA-Seq samples in the Sequence Read Archive. This approach is general, and when applied to StringTie, it increases the AUC by an average of 13.1% on a set of 65 RNA-Seq experiments from ENCODE. Parameter advisors for both Scallop and StringTie are available on GitHub.


Subject(s)
Computational Biology/trends , Genome/genetics , Sequence Analysis, RNA/methods , Software , Algorithms , Genomics , Molecular Sequence Annotation , RNA/genetics , Transcriptome/genetics
5.
F1000Res ; 8, 2019.
Article in English | MEDLINE | ID: mdl-31508204

ABSTRACT

Regional Student Groups (RSGs) of the International Society for Computational Biology Student Council (ISCB-SC) have been instrumental in connecting computational biologists globally and in creating more awareness about bioinformatics education. This article highlights the initiatives carried out by the RSGs both nationally and internationally to strengthen the present and future of the bioinformatics community. Moreover, we discuss the future directions the organization will take and the challenges of advancing further toward the ISCB-SC's main mission: "Nurture the new generation of computational biologists".


Subject(s)
Computational Biology , Students , Humans , Interprofessional Relations
6.
Bioinformatics ; 35(14): i127-i135, 2019 07 15.
Article in English | MEDLINE | ID: mdl-31510667

ABSTRACT

MOTIVATION: Sequence alignment is a central operation in bioinformatics pipelines and, despite many improvements, remains a computationally challenging problem. Locality-sensitive hashing (LSH) is one method used to estimate the likelihood that two sequences have a proper alignment. Using an LSH, it is possible to separate, with high probability and relatively low computation, the pairs of sequences that do not have a high-quality alignment from those that may. Therefore, an LSH reduces the overall computational requirement while not introducing many false negatives (i.e. omitting to report a valid alignment). However, current LSH methods treat sequences as a bag of k-mers and do not take into account the relative ordering of k-mers in sequences. In addition, due to the lack of a practical LSH method for edit distance, in practice, LSH methods for Jaccard similarity or Hamming similarity are used as a proxy. RESULTS: We present an LSH method, called Order Min Hash (OMH), for the edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH is sensitive not only to the k-mer contents of the sequences but also to the relative order of the k-mers in the sequences. We present theoretical guarantees of the OMH as a gapped LSH. AVAILABILITY AND IMPLEMENTATION: The code to generate the results is available at http://github.com/Kingsford-Group/omhismb2019. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Algorithms , Sequence Alignment , Software
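
The "bag of k-mers" limitation mentioned in this abstract can be illustrated with exact Jaccard similarity on k-mer sets. This is only a sketch of the limitation, not of OMH itself, and the helper names `kmer_set` and `jaccard` are hypothetical.

```python
def kmer_set(seq, k):
    """Set of all k-mers occurring in seq (order and multiplicity are lost)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)
```

For example, "ACGTAC" and "GTACGT" are different sequences, yet with k = 2 they have the same k-mer set {AC, CG, GT, TA}, so their Jaccard similarity is 1. A measure based only on k-mer content cannot tell them apart.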
7.
F1000Res ; 8, 2019.
Article in English | MEDLINE | ID: mdl-30647915

ABSTRACT

The Student Council of the International Society for Computational Biology (ISCB-SC) is a student-focused organization for researchers from all early career levels of training (undergraduates, masters, PhDs and postdocs) that organizes bioinformatics and computational biology activities across the globe. Among its activities, the ISCB-SC organizes several symposia on different continents, often with the help of the Regional Student Groups (RSGs) based in each region. In this editorial we highlight various key moments and lessons learned from the 14th Student Council Symposium (SCS, Chicago, USA), the 5th European Student Council Symposium (ESCS, Athens, Greece) and the 3rd Latin American Student Council Symposium (LA-SCS, Viña del Mar, Chile).


Subject(s)
Computational Biology , Leadership , Students , Chile , Humans , Research Personnel
8.
BMC Bioinformatics ; 19(Suppl 12): 347, 2018 Oct 09.
Article in English | MEDLINE | ID: mdl-30301451

ABSTRACT

This article describes the motivation, origin and evolution of the student symposia series organised by the ISCB Student Council. The meeting series started thirteen years ago in Madrid and has spread to four continents. The article concludes with the highlights of the most recent edition of the annual Student Council Symposium, held in conjunction with the 25th Conference on Intelligent Systems for Molecular Biology and the 16th European Conference on Computational Biology, in Prague, in July 2017.


Subject(s)
Computational Biology , Congresses as Topic , Students , Fellowships and Scholarships , Humans , Peer Review, Research , Publications , Research Support as Topic/economics
9.
Bioinformatics ; 34(13): i13-i22, 2018 07 01.
Article in English | MEDLINE | ID: mdl-29949995

ABSTRACT

Motivation: The minimizers technique is a method to sample k-mers that is used in much bioinformatics software to reduce computation, memory usage and run time. The number of applications using minimizers keeps growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizers schemes, and the related local and forward schemes, will allow the design of schemes with lower density, thereby making existing and future bioinformatics tools even more efficient. Results: From the analysis of the asymptotic behavior of minimizers, forward and local schemes, we show that the previously believed lower bound on minimizers schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the three types of schemes.


Subject(s)
Algorithms , Computational Biology/methods
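
The minimizer sampling and density notions in this abstract can be sketched as follows. This is an illustrative sketch that assumes the common formulation in which the smallest k-mer of each window of w consecutive k-mers is selected. The lexicographic default order, the leftmost tie-breaking, and the names `minimizer_positions` and `density` are assumptions, and this is not the asymptotically optimal order constructed in the paper.

```python
def minimizer_positions(seq, k, w, order=None):
    """Start positions of the k-mers picked by a minimizer scheme.

    In every window of w consecutive k-mers, the smallest k-mer under
    `order` is selected; ties go to the leftmost occurrence.
    """
    if order is None:
        order = lambda kmer: kmer  # assumption: plain lexicographic order
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        selected.add(min(window, key=lambda i: (order(kmers[i]), i)))
    return sorted(selected)


def density(seq, k, w, order=None):
    """Fraction of the k-mers of seq that the scheme selects."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, order)) / n_kmers
```

Passing a different `order` function changes which k-mers are selected and therefore the density, which is the dependence the abstract refers to.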
10.
J Comput Biol ; 25(7): 780-793, 2018 07.
Article in English | MEDLINE | ID: mdl-29889553

ABSTRACT

While mutation rates can vary markedly over the residues of a protein, multiple sequence alignment tools typically use the same values for their scoring-function parameters across a protein's entire length. We present a new approach, called adaptive local realignment, that in contrast automatically adapts to the diversity of mutation rates along protein sequences. This builds upon a recent technique known as parameter advising, which finds global parameter settings for an aligner, to now adaptively find local settings. Our approach in essence identifies local regions with low estimated accuracy, constructs a set of candidate realignments using a carefully-chosen collection of parameter settings, and replaces the region if a realignment has higher estimated accuracy. This new method of local parameter advising, when combined with prior methods for global advising, boosts alignment accuracy as much as 26% over the best default setting on hard-to-align protein benchmarks, and by 6.4% over global advising alone. Adaptive local realignment has been implemented within the Opal aligner using the Facet accuracy estimator.


Subject(s)
Computational Biology , Proteins/genetics , Software , Algorithms , Amino Acid Sequence/genetics , Sequence Alignment
11.
PLoS Comput Biol ; 14(1): e1005802, 2018 01.
Article in English | MEDLINE | ID: mdl-29346365

ABSTRACT

Education and training are two essential ingredients for a successful career. On one hand, universities provide students a curriculum for specializing in one's field of study, and on the other, internships complement coursework and provide invaluable training experience for a fruitful career. Consequently, undergraduates and graduates are encouraged to undertake an internship during the course of their degree. The opportunity to explore one's research interests in the early stages of their education is important for students because it improves their skill set and gives their career a boost. In the long term, this helps to close the gap between skills and employability among students across the globe and balance the research capacity in the field of computational biology. However, training opportunities are often scarce for computational biology students, particularly for those who reside in less-privileged regions. Aimed at helping students develop research and academic skills in computational biology and alleviating the divide across countries, the Student Council of the International Society for Computational Biology introduced its Internship Program in 2009. The Internship Program is committed to providing access to computational biology training, especially for students from developing regions, and improving competencies in the field. Here, we present how the Internship Program works and the impact of the internship opportunities so far, along with the challenges associated with this program.


Subject(s)
Computational Biology/education , Internship and Residency , Algorithms , Australia , Curriculum , Developing Countries , Europe , Geography , Humans , Program Development , Students , Universities
12.
IEEE/ACM Trans Comput Biol Bioinform ; 14(5): 1028-1041, 2017.
Article in English | MEDLINE | ID: mdl-28991725

ABSTRACT

While the multiple sequence alignment output by an aligner strongly depends on the parameter values used for the alignment scoring function (such as the choice of gap penalties and substitution scores), most users rely on the single default parameter setting provided by the aligner. A different parameter setting, however, might yield a much higher-quality alignment for the specific set of input sequences. The problem of picking a good choice of parameter values for specific input sequences is called parameter advising. A parameter advisor has two ingredients: (i) a set of parameter choices to select from, and (ii) an estimator that provides an estimate of the accuracy of the alignment computed by the aligner using a parameter choice. The parameter advisor picks the parameter choice from the set whose resulting alignment has highest estimated accuracy. In this paper, we consider for the first time the problem of learning the optimal set of parameter choices for a parameter advisor that uses a given accuracy estimator. The optimal set is one that maximizes the expected true accuracy of the resulting parameter advisor, averaged over a collection of training data. While we prove that learning an optimal set for an advisor is NP-complete, we show there is a natural approximation algorithm for this problem, and prove a tight bound on its approximation ratio. Experiments with an implementation of this approximation algorithm on biological benchmarks, using various accuracy estimators from the literature, show it finds sets for advisors that are surprisingly close to optimal. Furthermore, the resulting parameter advisors are significantly more accurate in practice than simply aligning with a single default parameter choice.


Subject(s)
Algorithms , Machine Learning , Sequence Alignment/methods , Sequence Analysis, Protein/methods , Computational Biology
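
The two ingredients of a parameter advisor named in this abstract (a set of parameter choices and an accuracy estimator) combine into a simple selection loop. This is a minimal sketch in which `align` and `estimate` are hypothetical callables standing in for a real aligner and a real accuracy estimator.

```python
from typing import Callable, Iterable, Tuple, TypeVar

P = TypeVar("P")  # a parameter choice
A = TypeVar("A")  # an alignment


def advise(
    param_choices: Iterable[P],
    align: Callable[[P], A],
    estimate: Callable[[A], float],
) -> Tuple[P, A, float]:
    """Return the (parameter choice, alignment, estimated accuracy) triple
    whose alignment has the highest estimated accuracy."""
    best = None
    for params in param_choices:
        alignment = align(params)    # run the aligner with this choice
        score = estimate(alignment)  # no reference alignment is needed
        if best is None or score > best[2]:
            best = (params, alignment, score)
    if best is None:
        raise ValueError("param_choices must not be empty")
    return best
```

The advisor can only be as good as its candidate set and its estimator, which is why the paper studies how to learn the set of parameter choices for a given estimator.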
13.
Genome Announc ; 5(30), 2017 Jul 27.
Article in English | MEDLINE | ID: mdl-28751397

ABSTRACT

Ophidiomyces ophiodiicola, which belongs to the order Onygenales, is an emerging fungal pathogen of snakes in the United States. This study reports the 21.9-Mb genome sequence of an isolate of this reptilian pathogen obtained from a black racer snake in Pennsylvania.

14.
Algorithms Mol Biol ; 12: 11, 2017.
Article in English | MEDLINE | ID: mdl-28435440

ABSTRACT

BACKGROUND: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted. RESULTS: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.

15.
F1000Res ; 6, 2017.
Article in English | MEDLINE | ID: mdl-29333232

ABSTRACT

Student Council Symposiums (SCSs) have been found to be very useful for students and young researchers. This is especially true given that the events are held directly before large international conferences, giving attendees a chance to gain exposure and to warm up to the social nuances involved in attending such a meeting. This was the second SCS held in Africa in conjunction with the International Society for Computational Biology (ISCB) and the African Society for Bioinformatics and Computational Biology's (ASBCB) biennial meeting. This symposium was organised by students within the society inside Africa and was held on the 10th of October 2017 in Entebbe, Uganda.

16.
PeerJ ; 4: e2359, 2016.
Article in English | MEDLINE | ID: mdl-27635331

ABSTRACT

We present the phylogeny analysis software SICLE (Sister Clade Extractor), an easy-to-use, high-throughput tool to describe the nearest neighbors to a node of interest in a phylogenetic tree as well as the support value for the relationship. The application is a command line utility that can be embedded into a phylogenetic analysis pipeline or can be used as a subroutine within another C++ program. As a test case, we applied this new tool to the published phylome of Salinibacter ruber, a species of halophilic Bacteroidetes, identifying 13 unique sister relationships to S. ruber across the 4,589 gene phylogenies. S. ruber grouped with bacteria, most often other Bacteroidetes, in the majority of phylogenies, but 91 phylogenies showed a branch-supported sister association between S. ruber and Archaea, an evolutionarily intriguing relationship indicative of horizontal gene transfer. This test case demonstrates how SICLE makes it possible to summarize the phylogenetic information produced by automated phylogenetic pipelines to rapidly identify and quantify the possible evolutionary relationships that merit further investigation. SICLE is available for free for noncommercial use at http://eebweb.arizona.edu/sicle/.

17.
BMC Bioinformatics ; 16 Suppl 2: A1-10, 2015.
Article in English | MEDLINE | ID: mdl-25708534

ABSTRACT

This report summarizes the scientific content and activities of the annual symposium organized by the Student Council of the International Society for Computational Biology (ISCB), held in conjunction with the Intelligent Systems for Molecular Biology (ISMB) conference in Boston, USA, on July 11th, 2014.


Subject(s)
Computational Biology , Drug Resistance, Multiple , High-Throughput Nucleotide Sequencing , Microsatellite Repeats/genetics , Peer Review, Research , Publishing , RNA, Messenger/metabolism , Sequence Analysis, DNA
18.
J Comput Biol ; 20(4): 259-79, 2013 Apr.
Article in English | MEDLINE | ID: mdl-23489379

ABSTRACT

We develop a novel and general approach to estimating the accuracy of multiple sequence alignments without knowledge of a reference alignment, and use our approach to address a new task that we call parameter advising: the problem of choosing values for alignment scoring function parameters from a given set of choices to maximize the accuracy of a computed alignment. For protein alignments, we consider twelve independent features that contribute to a quality alignment. An accuracy estimator is learned that is a polynomial function of these features; its coefficients are determined by minimizing its error with respect to true accuracy using mathematical optimization. Compared to prior approaches for estimating accuracy, our new approach (a) introduces novel feature functions that measure nonlocal properties of an alignment yet are fast to evaluate, (b) considers more general classes of estimators beyond linear combinations of features, and (c) develops new regression formulations for learning an estimator from examples; in addition, for parameter advising, we (d) determine the optimal parameter set of a given cardinality, which specifies the best parameter values from which to choose. Our estimator, which we call Facet (for "feature-based accuracy estimator"), yields a parameter advisor that on the hardest benchmarks provides more than a 27% improvement in accuracy over the best default parameter choice, and for parameter advising significantly outperforms the best prior approaches to assessing alignment quality.


Subject(s)
Proteins/chemistry , Sequence Alignment/methods , Amino Acid Sequence , Databases, Protein , Protein Structure, Secondary
19.
PLoS One ; 6(9): e24922, 2011.
Article in English | MEDLINE | ID: mdl-21949788

ABSTRACT

Invasive melanoma is the most lethal form of skin cancer. The treatment of melanoma-derived cell lines with 5-aza-2'-deoxycytidine (5-Aza-dC) markedly increases the expression of several miRNAs, suggesting that the miRNA-encoding genes might be epigenetically regulated, either directly or indirectly, by DNA methylation. We have identified a group of epigenetically regulated miRNA genes in melanoma cells, and have confirmed that the upstream CpG island sequences of several such miRNA genes are hypermethylated in cell lines derived from different stages of melanoma, but not in melanocytes and keratinocytes. We used direct DNA bisulfite and immunoprecipitated DNA (Methyl-DIP) to identify changes in CpG island methylation in distinct melanoma patient samples classified as primary in situ, regional metastatic, and distant metastatic. Two melanoma cell lines (WM1552C and A375, derived from stage 3 and stage 4 human melanoma, respectively) were engineered to ectopically express one of the epigenetically modified miRNAs, miR-34b. Expression of miR-34b reduced cell invasion and motility rates of both WM1552C and A375, suggesting that the enhanced cell invasiveness and motility observed in metastatic melanoma cells may be related to their reduced expression of miR-34b. Total RNA isolated from control or miR-34b-expressing WM1552C cells was subjected to deep sequencing to identify gene networks around miR-34b. We identified network modules that are potentially regulated by miR-34b, and which suggest a mechanism for the role of miR-34b in regulating normal cell motility and cytokinesis.


Subject(s)
Cell Movement , Epigenomics , Gene Expression Regulation, Neoplastic , Melanoma/genetics , Melanoma/secondary , MicroRNAs/genetics , Biomarkers, Tumor/genetics , Biomarkers, Tumor/metabolism , Blotting, Northern , Cell Adhesion , Cell Line, Tumor , CpG Islands , DNA Methylation , Gene Expression Profiling , Gene Silencing , Humans , Neoplasm Invasiveness , Oligonucleotide Array Sequence Analysis , Promoter Regions, Genetic/genetics , RNA, Messenger/genetics , Real-Time Polymerase Chain Reaction , Skin Neoplasms/genetics , Skin Neoplasms/metabolism , Skin Neoplasms/pathology , Wound Healing
20.
FEBS Lett ; 585(15): 2467-76, 2011 Aug 04.
Article in English | MEDLINE | ID: mdl-21723283

ABSTRACT

To identify epigenetically regulated miRNAs in melanoma, we treated a stage 3 melanoma cell line, WM1552C, with 5AzadC and/or 4-PBA. Several hypermethylated miRNAs were detected, one of which, miR-375, was highly methylated and was studied further. Minimal CpG island methylation was observed in melanocytes, keratinocytes, normal skin, and nevus, but hypermethylation was observed in patient tissue samples from primary, regional, distant, and nodular metastatic melanoma. Ectopic expression of miR-375 inhibited melanoma cell proliferation, invasion, and cell motility, and induced cell shape changes, strongly suggesting that miR-375 may have an important function in the development and progression of human melanomas.


Subject(s)
Epigenesis, Genetic/physiology , Melanoma/pathology , MicroRNAs/physiology , Cell Movement , Cell Proliferation , Cell Shape , DNA Methylation , Humans , Melanoma/genetics , MicroRNAs/analysis , Neoplasm Invasiveness , Tumor Cells, Cultured