Search | VHL Regional Portal

MCPNet: a parallel maximum capacity-based genome-scale gene network construction framework.

Pan, Tony C; Chockalingam, Sriram P; Aluru, Maneesha; Aluru, Srinivas.

Bioinformatics ; 39(6)2023 06 01.

Article in English | MEDLINE | ID: mdl-37289522

ABSTRACT

MOTIVATION: Gene network reconstruction from gene expression profiles is a compute- and data-intensive problem. Numerous methods based on diverse approaches including mutual information, random forests, Bayesian networks, correlation measures, as well as their transforms and filters such as data processing inequality, have been proposed. However, an effective gene network reconstruction method that performs well in all three aspects of computational efficiency, data size scalability, and output quality remains elusive. Simple techniques such as Pearson correlation are fast to compute but ignore indirect interactions, while more robust methods such as Bayesian networks are prohibitively time consuming to apply to tens of thousands of genes. RESULTS: We developed the maximum capacity path (MCP) score, a novel maximum-capacity-path-based metric to quantify the relative strengths of direct and indirect gene-gene interactions. We further present MCPNet, an efficient, parallelized gene network reconstruction software based on the MCP score, to reverse engineer networks in unsupervised and ensemble manners. Using synthetic and real Saccharomyces cerevisiae datasets as well as real Arabidopsis thaliana datasets, we demonstrate that MCPNet produces better quality networks as measured by AUPRC, is significantly faster than all other gene network reconstruction software, and also scales well to tens of thousands of genes and hundreds of CPU cores. Thus, MCPNet represents a new gene network reconstruction tool that simultaneously achieves quality, performance, and scalability requirements. AVAILABILITY AND IMPLEMENTATION: Source code is freely available for download at https://doi.org/10.5281/zenodo.6499747 and https://github.com/AluruLab/MCPNet, implemented in C++ and supported on Linux.
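
The abstract does not define the MCP score itself, so the following is only a minimal sketch of the underlying maximum-capacity (widest) path idea: the capacity of a path is its weakest edge, and the best indirect route between two genes is the path whose weakest edge is strongest. The function name, the toy weight matrix, and the use of a Floyd-Warshall-style triple loop are assumptions made for illustration, not the MCPNet implementation.

```python
def max_capacity_paths(weights):
    """All-pairs maximum-capacity (widest-path) values.

    weights[i][j] is a non-negative edge strength between genes i and j
    (for example a mutual-information value); 0 means "no edge".
    Returns cap, where cap[i][j] is the largest bottleneck achievable over
    all paths from i to j. This is a Floyd-Warshall variant in which
    (min, max) replaces the usual (+, min).
    """
    n = len(weights)
    cap = [row[:] for row in weights]  # copy so the input is not modified
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue  # leave the diagonal untouched
                via_k = min(cap[i][k], cap[k][j])  # bottleneck of the route i -> k -> j
                if via_k > cap[i][j]:
                    cap[i][j] = via_k
    return cap


# Toy example (hypothetical values): genes 0 and 2 are weakly linked
# directly (0.2) but strongly linked through gene 1 (0.9 and 0.8).
toy_weights = [
    [0.0, 0.9, 0.2],
    [0.9, 0.0, 0.8],
    [0.2, 0.8, 0.0],
]
toy_capacity = max_capacity_paths(toy_weights)
```

In this toy case the direct edge between genes 0 and 2 has weight 0.2 while the best path between them has capacity 0.8, which is the kind of direct-versus-indirect contrast a capacity-based score can expose.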

Subject(s)

Algorithms , Arabidopsis , Gene Regulatory Networks , Bayes Theorem , Software , Genome , Arabidopsis/genetics

EnGRaiN: a supervised ensemble learning method for recovery of large-scale gene regulatory networks.

Aluru, Maneesha; Shrivastava, Harsh; Chockalingam, Sriram P; Shivakumar, Shruti; Aluru, Srinivas.

Bioinformatics ; 38(5): 1312-1319, 2022 02 07.

Article in English | MEDLINE | ID: mdl-34888624

ABSTRACT

MOTIVATION: Reconstruction of genome-scale networks from gene expression data is an actively studied problem. A wide range of methods that differ in the types of interactions they uncover, with varying trade-offs between sensitivity and specificity, have been proposed. To leverage benefits of multiple such methods, ensemble network methods that combine predictions from resulting networks have been developed, promising results better than or as good as the individual networks. Perhaps owing to the difficulty in obtaining accurate training examples, these ensemble methods hitherto are unsupervised. RESULTS: In this article, we introduce EnGRaiN, the first supervised ensemble learning method to construct gene networks. The supervision for training is provided by small training datasets of true edge connections (positives) and edges known to be absent (negatives) among gene pairs. We demonstrate the effectiveness of EnGRaiN using simulated datasets as well as a curated collection of Arabidopsis thaliana datasets we created from microarray datasets available from public repositories. EnGRaiN not only shows better receiver operating characteristic and precision-recall (PR) characteristics for both real and simulated datasets compared with unsupervised methods for ensemble network construction, but also generates networks that can be mined for elucidating complex biological interactions. AVAILABILITY AND IMPLEMENTATION: EnGRaiN software and the datasets used in the study are publicly available at the GitHub repository: https://github.com/AluruLab/EnGRaiN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Subject(s)

Arabidopsis , Gene Regulatory Networks , Software , Genome , Arabidopsis/genetics , Machine Learning

An alignment-free heuristic for fast sequence comparisons with applications to phylogeny reconstruction.

Chockalingam, Sriram P; Pannu, Jodh; Hooshmand, Sahar; Thankachan, Sharma V; Aluru, Srinivas.

BMC Bioinformatics ; 21(Suppl 6): 404, 2020 Nov 18.

Article in English | MEDLINE | ID: mdl-33203364

ABSTRACT

BACKGROUND: Alignment-free methods for sequence comparisons have become popular in many bioinformatics applications, specifically in the estimation of sequence similarity measures to construct phylogenetic trees. Recently, the average common substring measure, ACS, and its k-mismatch counterpart, ACSk, have been shown to produce results as effective as multiple-sequence alignment based methods for reconstruction of phylogeny trees. Since computing ACSk takes O(n log^k n) time and is hence impractical for large datasets, multiple heuristics that can approximate ACSk have been introduced. RESULTS: In this paper, we present a novel linear-time heuristic to approximate ACSk, which is faster than computing the exact ACSk while being closer to the exact ACSk values compared to previously published linear-time greedy heuristics. Using four real datasets, containing both DNA and protein sequences, we evaluate our algorithm in terms of accuracy and runtime, and demonstrate its applicability for phylogeny reconstruction. Our algorithm provides better accuracy than previously published heuristic methods, while being comparable in its applications to phylogeny reconstruction. CONCLUSIONS: Our method produces a better approximation for ACSk and is applicable for the alignment-free comparison of biological sequences at highly competitive speed. The algorithm is implemented in the Rust programming language and the source code is available at https://github.com/srirampc/adyar-rs.
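
For readers unfamiliar with the ACS measure named above, here is a minimal sketch that assumes its usual definition: for every start position in one sequence, take the length of the longest substring beginning there that also occurs somewhere in the other sequence, then average those lengths. The function names are hypothetical, and the brute-force search is deliberately naive; it is unrelated to the linear-time heuristic of the paper.

```python
def longest_match_lengths(x, y):
    """For each start i in x, length of the longest x[i:i+l] that occurs in y."""
    lengths = []
    for i in range(len(x)):
        l = 0
        # Grow the candidate substring one character at a time while it
        # still occurs somewhere in y (naive substring search).
        while i + l < len(x) and x[i:i + l + 1] in y:
            l += 1
        lengths.append(l)
    return lengths


def acs(x, y):
    """Average common substring length of x with respect to y (not symmetric)."""
    if not x:
        raise ValueError("x must be non-empty")
    lengths = longest_match_lengths(x, y)
    return sum(lengths) / len(lengths)


# Toy example: per-position matches of "ACGT" against "CGTA" are
# "A", "CGT", "GT", "T".
toy_lengths = longest_match_lengths("ACGT", "CGTA")
toy_acs = acs("ACGT", "CGTA")
```

A larger average indicates more shared material between the sequences; distance measures built on ACS normalise this value, a step that is omitted here.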

Subject(s)

Computational Biology , Heuristics , Phylogeny , Algorithms , Sequence Alignment , Software

A greedy alignment-free distance estimator for phylogenetic inference.

Thankachan, Sharma V; Chockalingam, Sriram P; Liu, Yongchao; Krishnan, Ambujam; Aluru, Srinivas.

BMC Bioinformatics ; 18(Suppl 8): 238, 2017 Jun 07.

Article in English | MEDLINE | ID: mdl-28617225

ABSTRACT

BACKGROUND: Alignment-free sequence comparison approaches have been garnering increasing interest in various data- and compute-intensive applications such as phylogenetic inference for large-scale sequences. While k-mer based methods are predominantly used in real applications, the average common substring (ACS) approach is emerging as one of the prominent alignment-free approaches. This ACS approach has been further generalized by some recent work, either greedily or exactly, by allowing a bounded number of mismatches in the common substrings. RESULTS: We present ALFRED-G, a greedy alignment-free distance estimator for phylogenetic tree reconstruction based on the concept of the generalized ACS approach. In this algorithm, we have investigated a new heuristic to efficiently compute the lengths of common strings with mismatches allowed, and have further applied this heuristic to phylogeny reconstruction. Performance evaluation using real sequence datasets shows that our heuristic is able to reconstruct comparable, or even more accurate, phylogenetic tree topologies than the kmacs heuristic algorithm at highly competitive speed. CONCLUSIONS: ALFRED-G is an alignment-free heuristic for evolutionary distance estimation between two biological sequences. This algorithm is implemented in C++ and has been incorporated into our open-source ALFRED software package (http://alurulab.cc.gatech.edu/phylo).
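
To make the "greedily or exactly" distinction concrete, the sketch below contrasts a brute-force exact computation of the longest common substring with at most k mismatches against one simple greedy scheme: anchor at the position giving the longest exact match, then extend from that anchor past up to k mismatches. This is an illustrative simplification under assumed definitions, with hypothetical function names; it is not the ALFRED-G heuristic.

```python
def extend_with_mismatches(x, i, y, j, k):
    """Longest common prefix length of x[i:] and y[j:] with at most k mismatches."""
    length = 0
    mismatches = 0
    while i + length < len(x) and j + length < len(y):
        if x[i + length] != y[j + length]:
            if mismatches == k:
                break  # mismatch budget is exhausted
            mismatches += 1
        length += 1
    return length


def exact_kmismatch_lengths(x, y, k):
    """Brute force: for each i, try every start j in y and keep the best."""
    return [
        max((extend_with_mismatches(x, i, y, j, k) for j in range(len(y))), default=0)
        for i in range(len(x))
    ]


def greedy_kmismatch_lengths(x, y, k):
    """Greedy: pick the j with the longest exact match, then extend only there."""
    result = []
    for i in range(len(x)):
        best_j, best_exact = 0, -1
        for j in range(len(y)):
            exact = extend_with_mismatches(x, i, y, j, 0)
            if exact > best_exact:
                best_j, best_exact = j, exact
        result.append(extend_with_mismatches(x, i, y, best_j, k))
    return result


# Toy example where the greedy anchor is misleading: at position 0 the
# longest exact match ("ACG") sits at the very end of y and cannot be extended.
toy_x, toy_y = "ACGTT", "ACCTTACG"
toy_exact = exact_kmismatch_lengths(toy_x, toy_y, 1)
toy_greedy = greedy_kmismatch_lengths(toy_x, toy_y, 1)
```

The greedy values can never exceed the exact ones, because the exact version already maximises over every start position; the gap between the two is what better heuristics try to close.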

Subject(s)

Algorithms , Computational Biology/methods , Phylogeny , Sequence Analysis/methods

Efficient detection of viral transmissions with Next-Generation Sequencing data.

Rytsareva, Inna; Campo, David S; Zheng, Yueli; Sims, Seth; Thankachan, Sharma V; Tetik, Cansu; Chirag, Jain; Chockalingam, Sriram P; Sue, Amanda; Aluru, Srinivas; Khudyakov, Yury.

BMC Genomics ; 18(Suppl 4): 372, 2017 05 24.

Article in English | MEDLINE | ID: mdl-28589864

ABSTRACT

BACKGROUND: Hepatitis C is a major public health problem in the United States and worldwide. Outbreaks of hepatitis C virus (HCV) infections associated with unsafe injection practices, drug diversion, and other exposures to blood are difficult to detect and investigate. Molecular analysis has been frequently used in the study of HCV outbreaks and transmission chains, helping identify a cluster of sequences as linked by transmission if their genetic distances are below a previously defined threshold. However, HCV exists as a population of numerous variants in each infected individual and it has been observed that minority variants in the source are often the ones responsible for transmission, a situation that precludes the use of a single sequence per individual because many such transmissions would be missed. The use of Next-Generation Sequencing immensely increases the sensitivity of transmission detection but brings a considerable computational challenge because all sequences need to be compared among all pairs of samples. METHODS: We developed a three-step strategy that filters pairs of samples according to different criteria: (i) a k-mer Bloom filter, (ii) a Levenshtein filter and (iii) a filter of identical sequences. We applied these three filters on a set of samples that cover the spectrum of genetic relationships among HCV cases, from being part of the same transmission cluster, to belonging to different subtypes. RESULTS: Our three-step filtering strategy rapidly removes 85.1% of all the pairwise sample comparisons and 91.0% of all pairwise sequence comparisons, accurately establishing which pairs of HCV samples are below the relatedness threshold. CONCLUSIONS: We present a fast and efficient three-step filtering strategy that removes most sequence comparisons and accurately establishes transmission links of any threshold-based method. This highly efficient workflow will allow a faster response and molecular detection capacity, improving the rate of detection of viral transmissions with molecular data.
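
The abstract names the three filters but not their exact criteria or how they are combined, so the following is only a rough sketch of why cheap screens can avoid most expensive comparisons. It assumes that two samples count as linked when any pair of their sequences is within a given edit-distance threshold, uses a plain Python set as a stand-in for a Bloom filter, and treats "no shared k-mer" as a heuristic reason to skip a sequence. All function names, the value of k, and the order of the steps are assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance, keeping two rows in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def samples_linked(sample_a, sample_b, threshold, k=4):
    """Return True if some sequence pair across the two samples is within threshold."""
    # Cheapest check first: an identical sequence in both samples has distance 0.
    if set(sample_a) & set(sample_b):
        return True
    # k-mer screen (heuristic assumption): collect all k-mers of sample_b once;
    # a set stands in for the Bloom filter here.
    kmers_b = set().union(*(kmers(s, k) for s in sample_b))
    candidates = [s for s in sample_a if kmers(s, k) & kmers_b]
    # Only the surviving sequences pay for the quadratic edit-distance check.
    return any(levenshtein(s, t) <= threshold
               for s in candidates for t in sample_b)


# Toy samples (hypothetical sequences).
toy_a = ["ACGTACGT", "ACGTTCGT"]
toy_b = ["ACGTACGA"]
toy_c = ["TTTTTTTT"]
linked_ab = samples_linked(toy_a, toy_b, threshold=1)
linked_ac = samples_linked(toy_a, toy_c, threshold=1)
```

Here the pair (toy_a, toy_c) is rejected by the k-mer screen without computing a single edit distance, while (toy_a, toy_b) passes the screen and is confirmed by the edit-distance check.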

Subject(s)

Hepacivirus/genetics , Hepacivirus/physiology , High-Throughput Nucleotide Sequencing , Algorithms , Statistics as Topic

ALFRED: A Practical Method for Alignment-Free Distance Computation.

Thankachan, Sharma V; Chockalingam, Sriram P; Liu, Yongchao; Apostolico, Alberto; Aluru, Srinivas.

J Comput Biol ; 23(6): 452-60, 2016 06.

Article in English | MEDLINE | ID: mdl-27138275

ABSTRACT

Alignment-free approaches are gaining persistent interest in many sequence analysis applications such as phylogenetic inference and metagenomic classification/clustering, especially for large-scale sequence datasets. Besides the widely used k-mer methods, the average common substring (ACS) approach has emerged as one of the well-known alignment-free approaches. Two recent works further generalize this ACS approach by allowing a bounded number k of mismatches in the common substrings, relying on approximation (linear time) and exact computation, respectively. Despite its good worst-case time complexity [Formula: see text], the exact approach is complex and unlikely to be efficient in practice. Herein, we present ALFRED, an alignment-free distance computation method, which solves the generalized common substring search problem via exact computation. Compared to the theoretical approach, our algorithm is easier to implement and more practical to use, while still providing highly competitive theoretical performance with an expected run-time of [Formula: see text]. By applying our program to phylogenetic inference as a case study, we find that our program exactly reconstructs the topology of the reference phylogenetic tree for a set of 27 primate mitochondrial genomes, at reasonably acceptable speed. ALFRED is implemented in the C++ programming language and the source code is freely available online.

Subject(s)

Computational Biology/methods , Primates/genetics , Sequence Alignment/methods , Algorithms , Animals , Genome, Mitochondrial , Metagenomics , Phylogeny

A survey of error-correction methods for next-generation sequencing.

Yang, Xiao; Chockalingam, Sriram P; Aluru, Srinivas.

Brief Bioinform ; 14(1): 56-66, 2013 Jan.

Article in English | MEDLINE | ID: mdl-22492192

ABSTRACT

Error correction is important for most next-generation sequencing applications because highly accurate sequenced reads will likely lead to higher quality results. Many techniques for error correction of sequencing data from next-gen platforms have been developed in recent years. However, compared with the fast development of sequencing technologies, there is a lack of a standardized evaluation procedure for different error-correction methods, making it difficult to assess their relative merits and demerits. In this article, we provide a comprehensive review of many error-correction methods, and establish a common set of benchmark data and evaluation criteria to provide a comparative assessment. We present experimental results on quality, run-time, memory usage and scalability of several error-correction methods. Apart from providing explicit recommendations useful to practitioners, the review serves to identify the current state of the art and promising directions for future research. AVAILABILITY: All error-correction programs used in this article are downloaded from hosting websites. The evaluation tool kit is publicly available at: http://aluru-sun.ece.iastate.edu/doku.php?id=ecr.

Subject(s)

Sequence Analysis, DNA/trends , Software , Algorithms , Animals , Chromosome Mapping/statistics & numerical data , Chromosome Mapping/trends , Computational Biology , Databases, Genetic/statistics & numerical data , Databases, Genetic/trends , Forecasting , Humans , Sequence Alignment/statistics & numerical data , Sequence Alignment/trends , Sequence Analysis, DNA/statistics & numerical data
