Results 1 - 20 of 20
1.
Biol Direct ; 18(1): 46, 2023 08 14.
Article in English | MEDLINE | ID: mdl-37574542

ABSTRACT

BACKGROUND: Although the genome of Saccharomyces cerevisiae (S. cerevisiae) was the first eukaryotic genome to be fully sequenced (in 1996), a complete understanding of the potential of encoded biomolecular mechanisms has not yet been achieved. Here, we quantify how far away the goal of a full list of S. cerevisiae gene functions still is. RESULTS: The scientific literature about S. cerevisiae protein-coding genes has been mapped onto the yeast genome via mentions of names for genomic regions in scientific publications. The match was quantified with the ratio of a given gene name's occurrences to those of any gene names in the article. We find that ~230 elite genes with ≥ 75 full publication equivalents (FPEs, FPE = 1 is an idealized publication referring to just a single gene) command ~45% of all literature. At the same time, about two thirds of the genes (each with fewer than 10 FPEs) are described in just 12% of the literature (on average, each such gene has just ~1.5% of the literature of an elite gene). About 600 genes have not been mentioned in any dedicated article. Compared with other groups of genes, the literature growth rates were highest for uncharacterized or understudied genes until the late 1990s. Yet, these growth rates deteriorated and became negative thereafter. Thus, yeast function discovery for previously uncharacterized genes has returned to the level of ~1980. At the same time, literature for already well-studied genes (with a threshold T10 (≥ 10 FPEs) and higher) remains steadily growing. CONCLUSIONS: Did the early full genome sequencing of yeast boost gene function discovery? The data prove that the publication of the full genome in reality coincides with the onset of the decline of gene function discovery for previously uncharacterized genes. If the current status of literature about yeast molecular mechanisms can be extrapolated into the future, it will take about another 50 years to complete the yeast gene function list. We found that a small group of scientific journals contributed extraordinarily to publishing early reports relevant to yeast gene function discoveries.
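
The FPE measure described in this abstract can be illustrated with a minimal Python sketch. It assumes that a gene's FPE total is the sum, over all articles, of that gene's share of the gene-name mentions in each article; the abstract gives only the per-article ratio, so the summation, the function name and the example data are illustrative assumptions and not the authors' pipeline.

```python
from collections import Counter, defaultdict


def full_publication_equivalents(articles):
    """Sum each gene's per-article mention share (assumed FPE definition).

    `articles` is an iterable of lists; each list holds the gene names
    mentioned in one article, with repeats for repeated mentions.
    """
    fpe = defaultdict(float)
    for mentions in articles:
        counts = Counter(mentions)
        total = sum(counts.values())
        if total == 0:
            continue  # article mentions no gene names, contributes nothing
        for gene, n in counts.items():
            # Ratio of this gene's occurrences to all gene-name occurrences.
            fpe[gene] += n / total
    return dict(fpe)


# Hypothetical input: three articles and the gene names they mention.
example_articles = [
    ["CDC28"] * 4,                       # devoted to a single gene
    ["CDC28", "CLN3", "CLN3", "CLN3"],   # shared between two genes
    [],                                  # no gene names at all
]
fpe_scores = full_publication_equivalents(example_articles)
```

Under this assumption, an article devoted to a single gene contributes exactly 1 FPE to it, which matches the idealized publication mentioned in the abstract, and every article that mentions any gene distributes a total of 1 FPE across the genes it names.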


Subject(s)
Genomics , Saccharomyces cerevisiae , Saccharomyces cerevisiae/genetics , Base Sequence , Phenotype
3.
Biol Direct ; 18(1): 7, 2023 02 28.
Article in English | MEDLINE | ID: mdl-36855185

ABSTRACT

BACKGROUND: Although Escherichia coli (E. coli) is the most studied prokaryotic organism in the history of life sciences, many molecular mechanisms and gene functions encoded in its genome remain to be discovered. This work aims at quantifying the illumination of the E. coli gene function space by the scientific literature and how close we are to the goal of a complete list of E. coli gene functions. RESULTS: The scientific literature about E. coli protein-coding genes has been mapped onto the genome via mentions of names for genomic regions in scientific articles, both for the strain K-12 MG1655 and for the 95%-threshold softcore genome of 1324 E. coli strains with known complete genomes. The article match was quantified with the ratio of a given gene name's occurrences to the mentions of any gene names in the paper. The various genome regions have an extremely uneven literature coverage. A group of elite genes with ≥ 100 full publication equivalents (FPEs, FPE = 1 is an idealized publication devoted to just a single gene) attracts the lion's share of the papers. For K-12, ~65% of the literature covers just 342 elite genes; for the softcore genome, ~68% of the FPEs concern only 342 elite gene families (GFs). We also find that most genes/GFs are mentioned at least once in a dedicated scientific article (with the exception of at least 137 protein-coding transcripts for K-12 and 26 GFs from the softcore genome). Whereas the literature growth rates were highest for uncharacterized or understudied genes until 2005-2010 compared with other groups of genes, they became negative thereafter. At the same time, literature for already well-studied genes started to grow explosively with threshold T10 (≥ 10 FPEs). Typically, a body of ~20 actual articles generated over ~15 years of research effort was necessary to reach T10. Lineage-specific co-occurrence analysis of genes belonging to the accessory genome of E. coli, together with genomic co-localization and sequence-analytic exploration, hints that the previously completely uncharacterized genes yahV and yddL are associated with osmotic stress response/motility mechanisms. CONCLUSION: If the numbers of scientific articles about uncharacterized and understudied genes remain at least at present levels, full gene function lists for the strain K-12 MG1655 and the E. coli softcore genome are within reach within the next 25-30 years. Once the literature body for a gene crosses 10 FPEs, most of the critical fundamental research risk appears overcome and steady incremental research becomes possible.


Subject(s)
Escherichia coli , Lighting , Escherichia coli/genetics , Genomics
4.
BMC Biol ; 20(1): 146, 2022 06 16.
Article in English | MEDLINE | ID: mdl-35710371

ABSTRACT

BACKGROUND: Escherichia coli (E. coli) has been one of the most studied model organisms in the history of life sciences. Initially thought just to be commensal bacteria, E. coli has shown wide phenotypic diversity including pathogenic isolates with great relevance to public health. Though pangenome analysis has been attempted several times, there is no systematic functional characterization of the E. coli subgroups according to the gene profile. RESULTS: Systematically scanning for optimal parametrization, we have built the E. coli pangenome from 1324 complete genomes. The pangenome size is estimated to be ~25,000 gene families (GFs). Whereas the core genome diminishes as more genomes are added, the softcore genome (≥95% of strains) is stable with ~3000 GFs regardless of the total number of genomes. Apparently, the softcore genome (with a 92% or 95% generation threshold) can define the genome of a bacterial species listing the critically relevant, evolutionarily most conserved or important classes of GFs. Unsupervised clustering of common E. coli sequence types using the presence/absence GF matrix reveals distinct characteristics of E. coli phylogroups B1, B2, and E. We highlight the bi-lineage nature of B1, the variation of the secretion and of the iron acquisition systems in ST11 (E), and the incorporation of a highly conserved prophage into the genome of ST131 (B2). The tail structure of the prophage is evolutionarily related to R2-pyocin (a tailocin) from Pseudomonas aeruginosa PAO1. We hypothesize that this molecular machinery is highly likely to play an important role in protecting its own colonies; thus, contributing towards the rapid rise of pandemic E. coli ST131. CONCLUSIONS: This study has explored the optimized pangenome development in E. coli. We provide complete GF lists and the pangenome matrix as supplementary data for further studies. We identified biological characteristics of different E. coli subtypes, specifically for phylogroups B1, B2, and E. 
We found an operon-like genome region coding for a tailocin specific for ST131 strains. The latter is a potential killer weapon providing pandemic E. coli ST131 with an advantage in inter-bacterial competition and, suggestively, explains their dominance as human pathogen among E. coli strains.
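
The distinction between the core and the softcore genome drawn in this abstract can be sketched in a few lines of Python. The sketch assumes the pangenome is available as a mapping from gene family to the set of strains carrying it; the function name, the data layout and the toy numbers are illustrative assumptions, and only the ≥95% threshold comes from the abstract.

```python
def families_above_threshold(presence, n_strains, threshold):
    """Return gene families present in at least `threshold` of all strains.

    `presence` maps a gene family ID to the set of strain IDs carrying it
    (a sparse form of the presence/absence matrix).
    """
    return sorted(
        gf for gf, strains in presence.items()
        if len(strains) / n_strains >= threshold
    )


# Hypothetical toy pangenome with 40 strains.
strains = [f"strain{i}" for i in range(40)]
toy_presence = {
    "GF_A": set(strains),        # present in 40/40 strains
    "GF_B": set(strains[:39]),   # present in 39/40 strains (97.5%)
    "GF_C": set(strains[:30]),   # present in 30/40 strains (75%)
}

core = families_above_threshold(toy_presence, len(strains), 1.0)
softcore = families_above_threshold(toy_presence, len(strains), 0.95)
```

In this toy case a single strain lacking GF_B removes it from the strict core, while the softcore definition retains it. This is the mechanism by which a core genome shrinks as genomes are added whereas a softcore genome can stay stable.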


Subject(s)
Escherichia coli Infections , Escherichia coli Proteins , Escherichia coli/genetics , Escherichia coli/metabolism , Escherichia coli Infections/epidemiology , Escherichia coli Infections/microbiology , Escherichia coli Proteins/genetics , Genome, Bacterial , Humans , Pandemics , Phylogeny , Prophages
5.
Methods Mol Biol ; 2449: 299-324, 2022.
Article in English | MEDLINE | ID: mdl-35507269

ABSTRACT

The paradigm shift associated with the introduction of the pan-genome concept has drawn the attention from singular reference genomes toward the actual sequence diversity within organism populations, strain collections, clades, etc. A single genome is no longer sufficient to describe bacteria of interest, but instead, the genomic repertoire of all existing strains is the key to the metabolic, evolutionary, or pathogenic potential of a species. The classification of orthologous genes derived from a collection of taxonomically related genome sequences is central to bacterial pan-genome computational analysis. In this work, we present a review of methods for computing pan-genome gene clusters including their comparative analysis for the case of Streptococcus pyogenes strain genomes. We exhaustively scanned the parametrization space of the homologue searching procedures and find optimal parameters (sequence identity (60%) and coverage (50-60%) in the pairwise alignment) for the orthologous clustering of gene sequences. We find that the sequence identity threshold influences the number of gene families ~3 times stronger than the sequence coverage threshold.


Subject(s)
Genome, Bacterial , Streptococcus pyogenes , Cluster Analysis , Genomics/methods , Multigene Family , Phylogeny , Streptococcus pyogenes/genetics
6.
Cancer Res ; 81(10): 2788-2798, 2021 05 15.
Article in English | MEDLINE | ID: mdl-33558338

ABSTRACT

Gastric cancer cases are often diagnosed at an advanced stage with poor prognosis. Platinum-based chemotherapy has been internationally accepted as first-line therapy for inoperable or metastatic gastric cancer. To achieve greater benefits, selection of patients eligible for this treatment is critical. Although gene expression profiling has been widely used as a genomic classifier to identify molecular subtypes of gastric cancer and to stratify patients for different chemotherapy regimens, its prediction accuracy can be improved. Adenosine-to-inosine (A-to-I) RNA editing has emerged as a new player contributing to gastric cancer development and progression, offering potential clinical utility for diagnosis and treatment. Using a systematic computational approach followed by both in vitro validations and in silico validations in The Cancer Genome Atlas (TCGA), we conducted a transcriptome-wide RNA editing analysis of a cohort of 104 patients with advanced gastric cancer and identified an RNA editing (GCRE) signature to guide gastric cancer chemotherapy. RNA editing events stood as a prognostic and predictive biomarker in advanced gastric cancer. A GCRE score based on the GCRE signature consisted of 50 editing sites associated with 29 genes, predicting response to chemotherapy with a high accuracy (84%). Of note, patients demonstrating higher editing levels of this panel of sites presented a better overall response. Consistently, gastric cancer cell lines with higher editing levels showed higher chemosensitivity. Applying the GCRE score on TCGA dataset confirmed that responders had significantly higher levels of editing in advanced gastric cancer. Overall, this newly defined GCRE signature reliably stratifies patients with advanced gastric cancer and predicts response from chemotherapy. 
SIGNIFICANCE: This study describes a novel A-to-I RNA editing signature as a prognostic and predictive biomarker in advanced gastric cancer, providing a new tool to improve patient stratification and response to therapy.


Subject(s)
Adenocarcinoma/drug therapy , Antineoplastic Combined Chemotherapy Protocols/therapeutic use , Biomarkers, Tumor/genetics , Gene Expression Regulation, Neoplastic , Neoplasm Recurrence, Local/drug therapy , RNA Editing , Stomach Neoplasms/drug therapy , Adenocarcinoma/genetics , Adenocarcinoma/pathology , Clinical Trials as Topic , Cohort Studies , Gene Expression Profiling , Humans , Neoplasm Recurrence, Local/genetics , Neoplasm Recurrence, Local/pathology , Prognosis , Stomach Neoplasms/genetics , Stomach Neoplasms/pathology , Survival Rate
7.
Asian Bioeth Rev ; 11(2): 189-207, 2019 Jun.
Article in English | MEDLINE | ID: mdl-33717311

ABSTRACT

Whether due to simplicity or hypocrisy, the question of access to patient data for biomedical research is widely seen in the public discourse only from the angle of patient privacy. At the same time, the desire to live and to live without disability is of much higher value to the patients. This goal can only be achieved by extracting research insight from patient data in addition to working on model organisms, something that is well understood by many patients. Yet, most biomedical researchers working outside of clinics and hospitals are denied access to patient records when, at the same time, clinicians who guard the patient data are not optimally prepared for the data's analysis. Medical data collection is a time- and cost-intensive process that is most of all tedious, with few elements of intellectual and emotional satisfaction on its own. In this process, clinicians and bioinformaticians, each group with their own interests, have to join forces with the goal to generate medical data sets both from clinical trials and from routinely collected electronic health records that are, as much as possible, free from errors and obvious inconsistencies. The data cleansing effort as we have learned during curation of Singaporean clinical trial data is not a trivial task. The introduction of omics and sophisticated imaging modalities into clinical practice that are only partially interpreted in terms of diagnosis and therapy with today's level of knowledge warrant the creation of clinical databases with full patient history. This opens up opportunities for re-analyses and cross-trial studies at future time points with more sophisticated analyses of the same data, the collection of which is very expensive.

8.
Biol Direct ; 13(1): 2, 2018 02 12.
Article in English | MEDLINE | ID: mdl-29433547

ABSTRACT

BACKGROUND: Though earlier works on modelling transcript abundance from vertebrates to lower eukaryotes have specifically singled out Zipf's law, the observed distributions often deviate from a single power-law slope. In hindsight, while power-laws of critical phenomena are derived asymptotically under the conditions of infinite observations, real-world observations are finite, so finite-size effects set in and force a power-law distribution into an exponential decay, which consequently manifests as a curvature (i.e., varying exponent values) in a log-log plot. If transcript abundance is truly power-law distributed, the varying exponent signifies changing mathematical moments (e.g., mean, variance) and creates heteroskedasticity, which compromises statistical rigor in analysis. The impact of this deviation from the asymptotic power-law on sequencing count data has never truly been examined and quantified. RESULTS: The anecdotal description of transcript abundance being almost Zipf's law-like distributed can be conceptualized as the imperfect mathematical rendition of the Pareto power-law distribution when subjected to the finite-size effects in the real world; this is regardless of the advancement in sequencing technology, since sampling is finite in practice. Our conceptualization agrees well with our empirical analysis of two modern-day NGS (next-generation sequencing) datasets: an in-house generated dilution miRNA study of two gastric cancer cell lines (NUGC3 and AGS) and a publicly available spike-in miRNA dataset. Firstly, the finite-size effects cause the deviations of sequencing count data from Zipf's law and issues of reproducibility in sequencing experiments. Secondly, they manifest as heteroskedasticity among experimental replicates to bring about statistical woes. Surprisingly, a straightforward power-law correction that restores the distribution distortion to a single exponent value can dramatically reduce data heteroskedasticity to invoke an instant increase in signal-to-noise ratio by 50% and in the statistical/detection sensitivity by as high as 30%, regardless of the downstream mapping and normalization methods. Most importantly, the power-law correction improves concordance in significant calls among different normalization methods of a data series by 22% on average. When presented with a higher sequencing depth (4 times difference), the improvement in concordance is asymmetrical (32% for the higher sequencing depth instance versus 13% for the lower instance) and demonstrates that the simple power-law correction can increase significant detection with higher sequencing depths. Finally, the correction dramatically enhances the statistical conclusions and eludes the metastasis potential of the NUGC3 cell line against AGS of our dilution analysis. CONCLUSIONS: The finite-size effects due to undersampling generally plague transcript count data with reproducibility issues but can be minimized through a simple power-law correction of the count distribution. This distribution correction has direct implications for the biological interpretation of the study and the rigor of the scientific findings. REVIEWERS: This article was reviewed by Oliviero Carugo, Thomas Dandekar and Sandor Pongor.


Subject(s)
Models, Theoretical , Animals , Cell Line, Tumor , Humans , MicroRNAs/genetics
9.
Nat Commun ; 8(1): 653, 2017 09 21.
Article in English | MEDLINE | ID: mdl-28935855

ABSTRACT

The Singapore Integrative Omics Study provides valuable insights on establishing population reference measurements in 364 Chinese, Malay, and Indian individuals. These measurements include > 2.5 million genetic variants, expression of 21,649 transcripts, quantification of 282 lipid species, and 284 clinical, lifestyle, and dietary variables. This concept paper introduces the depth of the data resource, and investigates the extent of ethnic variation at these omics and non-omics biomarkers. It is evident that there are specific biomarkers in each of these platforms to differentiate between the ethnicities, and intra-population analyses suggest that Chinese and Indians are the most biologically homogeneous and heterogeneous, respectively, of the three groups. Consistent patterns of correlations between lipid species also suggest the possibility of lipid tagging to simplify future lipidomics assays. The Singapore Integrative Omics Study is expected to allow the characterization of intra-omic and inter-omic correlations within and across all three ethnic groups through a systems biology approach. The Singapore Genome Variation projects characterized the genetics of Singapore's Chinese, Malay, and Indian populations. The Singapore Integrative Omics Study introduced here goes further in providing multi-omic measurements in individuals from these populations, including genetic, transcriptome, lipidome, and lifestyle data, and will facilitate the study of common diseases in Asian communities.


Subject(s)
Lipid Metabolism , Metagenomics/standards , Polymorphism, Single Nucleotide , Asian People/genetics , Diet , Genetic Variation , Humans , Life Style , MicroRNAs , Pharmacogenomic Variants , Principal Component Analysis , Quality Control , Reference Standards , Singapore/ethnology
10.
PLoS One ; 9(9): e106681, 2014.
Article in English | MEDLINE | ID: mdl-25203698

ABSTRACT

Next-generation genotyping microarrays have been designed with insights from large-scale sequencing of exomes and whole genomes. The exome genotyping arrays promise to query the functional regions of the human genome at a fraction of the sequencing cost, thus allowing a large number of samples to be genotyped. However, two pertinent questions exist: firstly, how representative is the content of the exome chip for populations not involved in the design of the chip; secondly, can the content of the exome chip be imputed with the reference data from the 1000 Genomes Project (1KGP). By deep whole-genome sequencing of two Asian populations that are not part of the 1KGP, comprising 96 Southeast Asian Malays and 36 South Asian Indians for which the same samples have also been genotyped on both the Illumina 2.5 M and exome microarrays, we discovered that the exome chip is a poor representation of exonic content in our two populations. However, up to 94.1% of the variants on the exome chip that are polymorphic in our populations can be confidently imputed with existing non-exome-centric microarrays using the 1KGP panel. The coverage further increases if population-specific reference data from whole-genome sequencing exist. There is thus limited gain in using the exome chip for populations not involved in the microarray design. Instead, for the same cost of genotyping 2,000 samples on the exome chip, performing whole-genome sequencing of at least 35 samples in that population to complement the 1KGP may yield a higher coverage of the exonic content from imputation.


Subject(s)
Exome/genetics , Genomics/methods , Genotyping Techniques/methods , Oligonucleotide Array Sequence Analysis/methods , Asian People/genetics , Exons/genetics , Genome, Human/genetics , Genome-Wide Association Study , Haplotypes/genetics , Humans , Polymorphism, Single Nucleotide/genetics
11.
Hum Mol Genet ; 23(16): 4443-51, 2014 Aug 15.
Article in English | MEDLINE | ID: mdl-24698974

ABSTRACT

The major histocompatibility complex (MHC) containing the classical human leukocyte antigen (HLA) Class I and Class II genes is among the most polymorphic and diverse regions in the human genome. Despite the clinical importance of identifying the HLA types, very few databases jointly characterize densely genotyped single nucleotide polymorphisms (SNPs) and HLA alleles in the same samples. To date, the HapMap presents the only public resource that provides a SNP reference panel for predicting HLA alleles, constructed with four collections of individuals of north-western European, northern Han Chinese, cosmopolitan Japanese and Yoruba Nigerian ancestry. Owing to complex patterns of linkage disequilibrium in this region, it is unclear whether the HapMap reference panels can be appropriately utilized for other populations. Here, we describe a public resource for the Singapore Genome Variation Project with: (i) dense genotyping across ∼ 9000 SNPs in the MHC; (ii) four-digit HLA typing for eight Class I and Class II loci, in 96 southern Han Chinese, 89 Southeast Asian Malays and 83 Tamil Indians. This resource provides population estimates of the frequencies of HLA alleles at these eight loci in the three population groups, particularly for HLA-DPA1 and HLA-DPB1 that were not assayed in HapMap. Comparing between population-specific reference panels and a cosmopolitan panel created from all four HapMap populations, we demonstrate that more accurate imputation is obtained with population-specific panels than with the cosmopolitan panel, especially for the Malays and Indians but even when imputing between northern and southern Han Chinese. As with SNP imputation, common HLA alleles were imputed with greater accuracy than low-frequency variants.


Subject(s)
Alleles , HLA Antigens/genetics , HLA-DP alpha-Chains/genetics , HLA-DP beta-Chains/genetics , Polymorphism, Single Nucleotide , Asian People/genetics , Asian People/statistics & numerical data , Genetic Loci , Humans , Major Histocompatibility Complex/genetics
12.
Bioinformatics ; 30(12): 1714-20, 2014 Jun 15.
Article in English | MEDLINE | ID: mdl-24567545

ABSTRACT

MOTIVATION: Next-generation genotyping microarrays have been designed with insights from 1000 Genomes Project and whole-exome sequencing studies. These arrays additionally include variants that are typically present at lower frequencies. Determining the genotypes of these variants from hybridization intensities is challenging because there is less support to locate the presence of the minor alleles when the allele counts are low. Existing algorithms are mainly designed for calling common variants and are notorious for failing to generate accurate calls for low-frequency and rare variants. Here, we introduce a new calling algorithm, iCall, to call genotypes for variants across the whole spectrum of allele frequencies. RESULTS: We benchmarked iCall against four of the most commonly used algorithms, GenCall, optiCall, illuminus and GenoSNP, as well as a post-processing caller zCall that adopted a two-stage calling design. Normalized hybridization intensities for 12 370 individuals genotyped on the Illumina HumanExome BeadChip were considered, of which 81 individuals were also whole-genome sequenced. The sequence calls were used to benchmark the accuracy of the genotype calling, and our comparisons indicated that iCall outperforms all four single-stage calling algorithms in terms of call rates and concordance, particularly in the calling accuracy of minor alleles, which is the principal concern for rare and low-frequency variants. The application of zCall to post-process the output from iCall also produced marginally improved performance to the combination of zCall and GenCall. AVAILABILITY AND IMPLEMENTATION: iCall is implemented in C++ for use on Linux operating systems and is available for download at http://www.statgen.nus.edu.sg/∼software/icall.html.


Subject(s)
Algorithms , Exome , Genotyping Techniques/methods , High-Throughput Nucleotide Sequencing , Oligonucleotide Array Sequence Analysis , Polymorphism, Single Nucleotide , Sequence Analysis, DNA , Cluster Analysis , Gene Frequency , Genome, Human , Humans , Software
13.
PLoS One ; 8(9): e74432, 2013.
Article in English | MEDLINE | ID: mdl-24086345

ABSTRACT

Next-generation sequencing (NGS) studies in cancer are limited by the amount, quality and purity of tissue samples. In this situation, primary xenografts have proven useful preclinical models. However, the presence of mouse-derived stromal cells represents a technical challenge to their use in NGS studies. We examined this problem in an established primary xenograft model of small cell lung cancer (SCLC), a malignancy often diagnosed from small biopsy or needle aspirate samples. Using an in silico strategy that assigns reads according to species-of-origin, we prospectively compared NGS data from primary xenograft models with matched cell lines and with published datasets. We show here that low-coverage whole-genome analysis demonstrated remarkable concordance between published genome data and internal controls, despite the presence of mouse genomic DNA. Exome capture sequencing revealed that this enrichment procedure was highly species-specific, with less than 4% of reads aligning to the mouse genome. Human-specific expression profiling with RNA-Seq replicated array-based gene expression experiments, whereas mouse-specific transcript profiles correlated with published datasets from human cancer stroma. We conclude that primary xenografts represent a useful platform for complex NGS analysis in cancer research for tumours with limited sample resources, or those with prominent stromal cell populations.
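
The abstract does not state how reads were assigned to a species of origin. One common way to do this, shown here purely as a hypothetical sketch, is to align each read to both the human and the mouse reference and keep the species with the clearly better alignment score. The function name, the score inputs and the `min_margin` parameter are all assumptions made for illustration.

```python
def assign_species(human_score, mouse_score, min_margin=1):
    """Assign a read to 'human', 'mouse', 'ambiguous' or 'unmapped'.

    Scores are alignment scores against each reference (higher is better);
    None means the read did not align to that reference. This decision rule
    is an assumed illustration, not the rule used in the study.
    """
    if human_score is None and mouse_score is None:
        return "unmapped"
    if mouse_score is None:
        return "human"
    if human_score is None:
        return "mouse"
    diff = human_score - mouse_score
    if diff >= min_margin:
        return "human"
    if -diff >= min_margin:
        return "mouse"
    return "ambiguous"  # scores too close to call


# Hypothetical (human_score, mouse_score) pairs for four reads.
example_reads = [(60, 42), (35, 58), (50, 50), (None, None)]
labels = [assign_species(h, m) for h, m in example_reads]
```

With such labels, the fraction of reads labelled as mouse in an exome capture library would correspond to the kind of figure the abstract reports (less than 4% of reads aligning to the mouse genome).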


Subject(s)
Disease Models, Animal , High-Throughput Nucleotide Sequencing/methods , Neoplasms/genetics , Xenograft Model Antitumor Assays , Animals , Cell Line, Tumor , DNA Copy Number Variations/genetics , Exome/genetics , Gene Expression Profiling , Genome, Human/genetics , Humans , Mice , Mice, Nude , Oligonucleotide Array Sequence Analysis , Species Specificity
14.
Genome Res ; 19(11): 2154-62, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19700652

ABSTRACT

The Singapore Genome Variation Project (SGVP) provides a publicly available resource of 1.6 million single nucleotide polymorphisms (SNPs) genotyped in 268 individuals from the Chinese, Malay, and Indian population groups in Southeast Asia. This online database catalogs information and summaries on genotype and phased haplotype data, including allele frequencies, assessment of linkage disequilibrium (LD), and recombination rates in a format similar to the International HapMap Project. Here, we introduce this resource and describe the analysis of human genomic variation upon agglomerating data from the HapMap and the Human Genome Diversity Project, providing useful insights into the population structure of the three major population groups in Asia. In addition, this resource also surveyed across the genome for variation in regional patterns of LD between the HapMap and SGVP populations, and for signatures of positive natural selection using two well-established metrics: iHS and XP-EHH. The raw and processed genetic data, together with all population genetic summaries, are publicly available for download and browsing through a web browser modeled with the Generic Genome Browser.


Subject(s)
Databases, Genetic , Genetic Variation/genetics , Genome, Human/genetics , Haplotypes/genetics , China , Chromosome Mapping , Gene Frequency , Genetics, Population/methods , Genome-Wide Association Study/methods , Genomics/methods , Genotype , Humans , India , Linkage Disequilibrium , Malaysia , Polymorphism, Single Nucleotide , Principal Component Analysis , Selection, Genetic , Singapore
15.
J Biol Chem ; 283(19): 13205-15, 2008 May 09.
Article in English | MEDLINE | ID: mdl-18319255

ABSTRACT

Like other cancers, aberrant gene regulation features significantly in hepatocellular carcinoma (HCC). MicroRNAs (miRNAs) were recently found to regulate gene expression at the post-transcriptional/translational levels. The expression profiles of 157 miRNAs were examined in 19 HCC patients, and 19 up-regulated and 3 down-regulated miRNAs were found to be associated with HCC. Putative gene targets of these 22 miRNAs were predicted in silico and were significantly enriched in 34 biological pathways, most of which are frequently dysregulated during carcinogenesis. Further characterization of microRNA-224 (miR-224), the most significantly up-regulated miRNA in HCC patients, revealed that miR-224 increases apoptotic cell death as well as proliferation and targets apoptosis inhibitor-5 (API-5) to inhibit API-5 transcript expression. Significantly, miR-224 expression was found to be inversely correlated with API-5 expression in HCC patients (p < 0.05). Hence, our findings define a true in vivo target of miR-224 and reaffirm the important role of miRNAs in the dysregulation of cellular processes that may ultimately lead to tumorigenesis.


Subject(s)
Apoptosis Regulatory Proteins/genetics , Carcinoma, Hepatocellular/genetics , Gene Expression Regulation, Neoplastic/genetics , MicroRNAs/genetics , Nuclear Proteins/genetics , Up-Regulation/genetics , Apoptosis , Base Sequence , Carcinoma, Hepatocellular/pathology , Cell Transformation, Neoplastic/genetics , Gene Expression Profiling , Humans , Molecular Sequence Data , Substrate Specificity , Transcription, Genetic/genetics , Tumor Cells, Cultured
16.
J Theor Biol ; 252(1): 145-54, 2008 May 07.
Article in English | MEDLINE | ID: mdl-18342336

ABSTRACT

Remote homology detection refers to the detection of structure homology in evolutionarily related proteins with low sequence similarity. Supervised learning algorithms such as support vector machine (SVM) are currently the most accurate methods. In most of these SVM-based methods, efforts have been dedicated to developing new kernels to better use the pairwise alignment scores or sequence profiles. Moreover, amino acids' physicochemical properties are not generally used in the feature representation of protein sequences. In this article, we present a remote homology detection method that incorporates two novel features: (1) a protein's primary sequence is represented using amino acid's physicochemical properties and (2) the similarity between two proteins is measured using recurrence quantification analysis (RQA). An optimization scheme was developed to select different amino acid indices (up to 10 for a protein family) that are best to characterize the given protein family. The selected amino acid indices may enable us to draw better biological explanation of the protein family classification problem than using other alignment-based methods. An SVM-based classifier will then work on the space described by the RQA metrics. The classification scheme is named as SVM-RQA. Experiments at the superfamily level of the SCOP1.53 dataset show that, without using alignment or sequence profile information, the features generated from amino acid indices are able to produce results that are comparable to those obtained by the published state-of-the-art SVM kernels. In the future, better prediction accuracies can be expected by combining the alignment-based features with our amino acids property-based features. Supplementary information including the raw dataset, the best-performing amino acid indices for each protein family and the computed RQA metrics for all protein sequences can be downloaded from http://ym151113.ym.edu.tw/svm-rqa.


Subject(s)
Amino Acids/chemistry , Sequence Homology, Amino Acid , Amino Acid Sequence , Chemical Phenomena , Chemistry, Physical , Databases, Protein , Pattern Recognition, Automated/methods , Sequence Analysis, Protein/methods
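The two ideas in the abstract above (encoding a sequence by a physicochemical index, then comparing two proteins with recurrence metrics) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' procedure: the Kyte-Doolittle hydropathy scale stands in for whichever amino acid indices the optimization selects, the cross-recurrence matrix uses no embedding, and `radius` and `lmin` are hypothetical parameters. The resulting metrics would be the kind of features passed to an SVM.

```python
# Kyte-Doolittle hydropathy values, used here only as an example amino acid index.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def to_series(seq, index=KD):
    """Turn a protein sequence into a numeric series via an amino acid index."""
    return [index[aa] for aa in seq]

def cross_recurrence(x, y, radius):
    """R[i][j] is True when x[i] and y[j] lie within `radius` of each other."""
    return [[abs(a - b) <= radius for b in y] for a in x]

def recurrence_rate(R):
    """Fraction of all matrix cells that are recurrent."""
    total = sum(len(row) for row in R)
    return sum(sum(row) for row in R) / total

def determinism(R, lmin=2):
    """Fraction of recurrent points lying on diagonal runs of length >= lmin."""
    n, m = len(R), len(R[0])
    recurrent = sum(sum(row) for row in R)
    if recurrent == 0:
        return 0.0
    in_lines = 0
    for d in range(-(n - 1), m):      # walk every diagonal of the matrix
        i, j = max(0, -d), max(0, d)
        run = 0
        while i < n and j < m:
            if R[i][j]:
                run += 1
            else:
                if run >= lmin:
                    in_lines += run
                run = 0
            i += 1
            j += 1
        if run >= lmin:               # close a run that reaches the border
            in_lines += run
    return in_lines / recurrent

def rqa_features(seq_a, seq_b, radius=0.5, lmin=2):
    """Two RQA metrics for a pair of sequences, usable as classifier features."""
    R = cross_recurrence(to_series(seq_a), to_series(seq_b), radius)
    return recurrence_rate(R), determinism(R, lmin)
```

A call such as `rqa_features("MKVLAA", "MKILGA")` returns a (recurrence rate, determinism) pair; a fuller feature vector would repeat this for several indices and add further RQA metrics.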
17.
In Silico Biol ; 7(1): 61-75, 2007.
Article in English | MEDLINE | ID: mdl-17688428

ABSTRACT

P53 is probably the most important tumor suppressor known. Over the years, information about this gene has increased dramatically. We have built a comprehensive knowledgebase of p53, which aims to help wet-lab biologists formulate their experiments, newcomers learn what they need about the gene, and bioinformaticians make new discoveries through data analysis. Using the curated information, including mutation information, transcription factors, transcriptional targets, and single nucleotide polymorphisms, we have performed extensive bioinformatics analysis and made several new discoveries about p53. We have identified point missense mutations that are over-represented in cancers but lack functional studies. By assessing the capability of six p53 transcriptional targets' tag SNPs selected from HapMap to capture SNPs obtained from the National Institute of Environmental Health Sciences (NIEHS) Environmental Genome Project and vice versa, we conclude that NIEHS data is a better source for tag SNP selection for these genes in future association studies. Analysis of microRNA regulation in the transcriptional network of the p53 gene reveals potentially important regulatory relationships between oncogenic microRNAs and transcription factors of p53. By mapping transcription factors of p53 to pathways involved in cell cycle and apoptosis, we have identified distinctive transcriptional controls of p53 in these two physiological states.


Subject(s)
Genes, p53 , MicroRNAs/genetics , Mutation, Missense , Point Mutation , Polymorphism, Genetic , Tumor Suppressor Protein p53/metabolism , Apoptosis , Codon , Computational Biology/methods , Gene Expression Profiling , Gene Expression Regulation, Neoplastic , Humans , MicroRNAs/metabolism , Oligonucleotide Array Sequence Analysis , Polymorphism, Single Nucleotide , Transcription, Genetic
18.
Hum Mol Genet ; 16(11): 1367-80, 2007 Jun 01.
Article in English | MEDLINE | ID: mdl-17412754

ABSTRACT

Members of the ATP-binding cassette (ABC) superfamily of transporters have been implicated as major players in drug response. Single nucleotide polymorphisms (SNPs) in the ABC transporter genes may account for variation in drug response between individuals. Given the abundance of SNPs within the human genome, identification of functionally important SNPs is difficult. Here, we utilized signatures of recent positive selection (RPS) to identify SNPs in ABC genes that have potential functional significance by using the long-range-haplotype test to search for signatures of RPS at 18 ABC genes involved in drug transport. From the genotype data of these 18 ABC genes in four populations extracted from the HapMap database, at least one SNP in each of these genes displayed genomic signatures of RPS in at least one population. However, only 13 SNPs in 10 ABC genes from three populations retained statistical significance after Type I error reduction. The functional significance of six of these RPS SNPs, including those that failed multiple testing correction (MTC), has been reported previously. We experimentally confirmed a functional effect for two SNPs, including one that failed to show evidence of RPS after MTC. These observations suggest that Type I error reduction may inadvertently increase Type II error. Although the remaining positively selected SNPs have yet to be functionally validated, our study illustrates the feasibility of using this strategy to identify SNPs within 'adaptive' genes that may confer functional effect, prior to testing their roles in individual/population drug response variation or in complex disease susceptibility.


Subject(s)
ATP-Binding Cassette Transporters/genetics , Multigene Family , Selection, Genetic , Gene Frequency , Genetic Markers , Humans
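The trade-off noted in the abstract above, where reducing Type I error can raise Type II error, can be seen in a small sketch. The abstract does not say which multiple testing correction was applied, so the Bonferroni and Benjamini-Hochberg procedures below are illustrative assumptions, and the p-values are invented for the example.

```python
def uncorrected(pvals, alpha=0.05):
    """Per-test decision with no multiple testing correction."""
    return [p <= alpha for p in pvals]

def bonferroni(pvals, alpha=0.05):
    """Bonferroni: compare each p-value against alpha divided by the test count."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the false discovery rate."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value passes its rank-scaled threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    keep = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k:
            keep[i] = True
    return keep

# Hypothetical p-values for five SNPs (not taken from the study).
pvals = [0.001, 0.012, 0.025, 0.035, 0.2]
print(sum(uncorrected(pvals)), sum(bonferroni(pvals)), sum(benjamini_hochberg(pvals)))
```

With these made-up values the stricter correction keeps fewer SNPs, so any truly functional SNP among the discarded ones becomes a false negative, which is the pattern the authors describe.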
19.
BMC Bioinformatics ; 7: 525, 2006 Dec 01.
Article in English | MEDLINE | ID: mdl-17137522

ABSTRACT

BACKGROUND: The advent of genotype data from large-scale efforts that catalog the genetic variants of different populations has given rise to new avenues for multifactorial disease association studies. Recent work shows that genotype data from the International HapMap Project have a high degree of transferability to the wider population. This implies that the design of genotyping studies on local populations may be facilitated through inferences drawn from information contained in HapMap populations. RESULTS: To facilitate analysis of HapMap data for characterizing the haplotype structure of genes or any chromosomal regions, we have developed an integrated web-based resource, iHAP. In addition to incorporating genotype and haplotype data from the International HapMap Project and gene information from the UCSC Genome Browser Database, iHAP also provides capabilities for inferring haplotype blocks and selecting tag SNPs that are representative of haplotype patterns. The configurable options include block partitioning algorithms, block definitions, tag SNP definitions, as well as SNPs to be "force included" as tags. Based on the parameters defined at the input stage, iHAP performs on-the-fly analysis and displays the result graphically as a webpage. To facilitate analysis, intermediate and final result files can be downloaded. CONCLUSION: The iHAP resource, available at http://ihap.bii.a-star.edu.sg, provides a convenient yet flexible approach for the user community to analyze HapMap data and identify candidate targets for genotyping studies.


Subject(s)
Algorithms , Chromosome Mapping/methods , Databases, Genetic , Haplotypes/genetics , Information Storage and Retrieval/methods , Sequence Analysis, DNA/methods , Software , Base Sequence , Genetic Variation/genetics , Molecular Sequence Data
20.
BMC Genomics ; 7: 238, 2006 Sep 19.
Article in English | MEDLINE | ID: mdl-16982009

ABSTRACT

BACKGROUND: Recent advances in human genome sequencing and genotyping have revealed millions of single nucleotide polymorphisms (SNPs), which determine the variation among human beings. One particularly important project is the International HapMap Project, which provides a catalogue of human genetic variation for disease association studies. In this paper, we analyzed the genotype data in the HapMap project by using National Institute of Environmental Health Sciences Environmental Genome Project (NIEHS EGP) SNPs. We first determine whether the HapMap data are transferable to the NIEHS data. Then, we study how well the HapMap SNPs capture the untyped SNPs in the region. Finally, we provide general guidelines for determining whether the SNPs chosen from HapMap may be able to capture most of the untyped SNPs. RESULTS: Our analysis shows that HapMap data are not robust enough to capture the untyped variants for most human genes. The performance of SNPs for European and Asian samples is marginal, capturing approximately 55% of the untyped variants. As expected, the SNPs from the HapMap YRI panel can capture only approximately 30% of the variants. Although the overall performance is low, the SNPs for some genes perform very well and are able to capture most of the variants along the gene. This is observed in the European and Asian panels, but not in the African panel. From these observations, we conclude that, for a well-covered SNP reference panel, the SNP density and the association among reference SNPs are important for estimating the robustness of the chosen SNPs. CONCLUSION: We have analyzed the coverage of HapMap SNPs using NIEHS EGP data. The results show that HapMap SNPs are transferable to the NIEHS SNPs. However, HapMap SNPs cannot capture some of the untyped SNPs, and therefore resequencing may be needed to uncover more SNPs in the missing regions.


Subject(s)
Polymorphism, Single Nucleotide/genetics , Asian People/genetics , Black People/genetics , Chromosome Mapping/methods , Genetic Variation , Genome, Human , Humans , Models, Genetic , Reproducibility of Results , White People/genetics
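The notion of typed SNPs "capturing" untyped SNPs in the abstract above can be sketched as a coverage calculation. This is a minimal illustration under stated assumptions, not the authors' method: an untyped SNP is treated as captured when its pairwise r² with at least one typed SNP reaches a threshold (0.8 is a common convention but is assumed here), haplotypes are phased and coded 0/1, and the toy data are invented.

```python
def r_squared(a, b):
    """Pairwise linkage disequilibrium r^2 between two SNPs on phased haplotypes.

    `a` and `b` are equal-length lists of 0/1 alleles, one entry per haplotype.
    """
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(1 for x, y in zip(a, b) if x == 1 and y == 1) / n
    d = pab - pa * pb                       # disequilibrium coefficient D
    denom = pa * (1 - pa) * pb * (1 - pb)
    return 0.0 if denom == 0 else d * d / denom

def coverage(typed, untyped, threshold=0.8):
    """Fraction of untyped SNPs tagged by at least one typed SNP."""
    captured = sum(
        1 for u in untyped
        if any(r_squared(t, u) >= threshold for t in typed)
    )
    return captured / len(untyped)

# Toy data: one typed SNP and two untyped SNPs across four haplotypes.
typed = [[0, 0, 1, 1]]
untyped = [[0, 0, 1, 1], [0, 1, 0, 1]]
print(coverage(typed, untyped))
```

In this toy case the first untyped SNP is in complete linkage disequilibrium with the typed SNP and the second is independent of it, so half of the untyped SNPs count as captured.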