1.
BMC Bioinformatics ; 22(1): 498, 2021 Oct 15.
Article in English | MEDLINE | ID: mdl-34654363

ABSTRACT

BACKGROUND: Identifying gene interactions is a topic of great importance in genomics, and approaches based on network models provide a powerful tool for studying these. Assuming a Gaussian graphical model, a gene association network may be estimated from multiomic data based on the non-zero entries of the inverse covariance matrix. Inferring such biological networks is challenging because of the high dimensionality of the problem, making traditional estimators unsuitable. The graphical lasso is constructed for the estimation of sparse inverse covariance matrices in such situations, using L1-penalization on the matrix entries. The weighted graphical lasso is an extension in which prior biological information from other sources is integrated into the model. There are however issues with this approach, as it naïvely forces the prior information into the network estimation, even if it is misleading or does not agree with the data at hand. Further, if an associated network based on other data is used as the prior, the method often fails to utilize the information effectively. RESULTS: We propose a novel graphical lasso approach, the tailored graphical lasso, that aims to handle prior information of unknown accuracy more effectively. We provide an R package implementing the method, tailoredGlasso. Applying the method to both simulated and real multiomic data sets, we find that it outperforms the unweighted and weighted graphical lasso in terms of all performance measures we consider. In fact, the graphical lasso and weighted graphical lasso can be considered special cases of the tailored graphical lasso, and a parameter determined by the data measures the usefulness of the prior information. We also find that among a larger set of methods, the tailored graphical lasso is the most suitable for network inference from high-dimensional data with prior information of unknown accuracy. With our method, mRNA data are demonstrated to provide highly useful prior information for protein-protein interaction networks. CONCLUSIONS: The method we introduce utilizes useful prior information more effectively without involving any risk of loss of accuracy should the prior information be misleading.
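As a minimal sketch of the step described in the background, the edges of a gene association network can be read off the non-zero off-diagonal entries of an estimated inverse covariance (precision) matrix. The function name, the toy matrix and the tolerance below are illustrative assumptions and are not taken from the tailoredGlasso package.

```python
def edges_from_precision(theta, tol=1e-8):
    """Return undirected edges (i, j), i < j, implied by a precision matrix.

    theta is a square matrix given as a list of lists; an entry whose
    absolute value exceeds tol is treated as non-zero (tol is an assumption).
    """
    p = len(theta)
    return [
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(theta[i][j]) > tol
    ]


# Toy 3-gene precision matrix: genes 0-1 and 1-2 are conditionally dependent.
theta = [
    [2.0, -0.5, 0.0],
    [-0.5, 2.0, 0.3],
    [0.0, 0.3, 1.5],
]
print(edges_from_precision(theta))
```

A sparse estimator such as the graphical lasso sets many off-diagonal entries exactly to zero, which is what makes this read-off meaningful in high dimensions.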


Subject(s)
Algorithms , Gene Regulatory Networks , Genomics , Normal Distribution , Protein Interaction Maps
2.
Stat Med ; 39(25): 3549-3568, 2020 11 10.
Article in English | MEDLINE | ID: mdl-32851696

ABSTRACT

In many statistical regression and prediction problems, it is reasonable to assume monotone relationships between certain predictor variables and the outcome. Genomic effects on phenotypes are, for instance, often assumed to be monotone. However, in some settings, it may be reasonable to assume a partially linear model, where some of the covariates can be assumed to have a linear effect. One example is a prediction model using both high-dimensional gene expression data, and low-dimensional clinical data, or when combining continuous and categorical covariates. We study methods for fitting the partially linear monotone model, where some covariates are assumed to have a linear effect on the response, and some are assumed to have a monotone (potentially nonlinear) effect. Most existing methods in the literature for fitting such models are subject to the limitation that they have to be provided the monotonicity directions a priori for the different monotone effects. We here present methods for fitting partially linear monotone models which perform both automatic variable selection, and monotonicity direction discovery. The proposed methods perform comparably to, or better than, existing methods, in terms of estimation, prediction, and variable selection performance, in simulation experiments in both classical and high-dimensional data settings.


Subject(s)
Algorithms , Computer Simulation , Linear Models , Regression Analysis
3.
Brief Bioinform ; 21(5): 1523-1530, 2020 09 25.
Article in English | MEDLINE | ID: mdl-31624847

ABSTRACT

The generation and systematic collection of genome-wide data is ever-increasing. This vast amount of data has enabled researchers to study relations between a variety of genomic and epigenomic features, including genetic variation, gene regulation and phenotypic traits. Such relations are typically investigated by comparatively assessing genomic co-occurrence. Technically, this corresponds to assessing the similarity of pairs of genome-wide binary vectors. A variety of similarity measures have been proposed for this problem in other fields like ecology. However, while several of these measures have been employed for assessing genomic co-occurrence, their appropriateness for the genomic setting has never been investigated. We show that the choice of similarity measure may strongly influence results and propose two alternative modelling assumptions that can be used to guide this choice. On both simulated and real genomic data, the Jaccard index is strongly altered by dataset size and should be used with caution. The Forbes coefficient (fold change) and tetrachoric correlation are less influenced by dataset size, but one should be aware of increased variance for small datasets. All results on simulated and real data can be inspected and reproduced at https://hyperbrowser.uio.no/sim-measure.
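To make the comparison of similarity measures concrete, here is a small sketch that computes the Jaccard index and the Forbes coefficient for a pair of binary vectors. The Forbes coefficient is taken here in its usual form, observed overlap divided by the overlap expected under independence; the function names and the toy vectors are assumptions made for illustration and do not come from the linked HyperBrowser resource.

```python
def overlap_counts(x, y):
    """Return (both, only_x_or_y_or_both, n_x, n_y) for two binary vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both, either, sum(1 for a in x if a), sum(1 for b in y if b)


def jaccard(x, y):
    """Jaccard index: size of the intersection over size of the union."""
    both, either, _, _ = overlap_counts(x, y)
    return both / either if either else float("nan")


def forbes(x, y):
    """Forbes coefficient: observed overlap relative to n_x * n_y / n."""
    both, _, n_x, n_y = overlap_counts(x, y)
    if n_x == 0 or n_y == 0:
        return float("nan")
    return len(x) * both / (n_x * n_y)


# Toy genome of 8 positions with two features of 2 positions each.
x = [1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0, 0, 0]
print(jaccard(x, y), forbes(x, y))
```

In this toy case the features share one of three covered positions (Jaccard 1/3), while the overlap is twice what independence would predict (Forbes 2.0), which illustrates how the two measures answer different questions.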


Subject(s)
Genomics/methods , Algorithms , Datasets as Topic , Gene Expression Regulation , Genetic Variation , Humans
4.
PLoS Comput Biol ; 15(2): e1006731, 2019 02.
Article in English | MEDLINE | ID: mdl-30779737

ABSTRACT

Graph-based representations are considered to be the future for reference genomes, as they allow integrated representation of the steadily increasing data on individual variation. Currently available tools allow de novo assembly of graph-based reference genomes, alignment of new read sets to the graph representation as well as certain analyses like variant calling and haplotyping. We here present a first method for calling ChIP-Seq peaks on read data aligned to a graph-based reference genome. The method is a graph generalization of the peak caller MACS2, and is implemented in an open source tool, Graph Peak Caller. By using the existing tool vg to build a pan-genome of Arabidopsis thaliana, we validate our approach by showing that Graph Peak Caller with a pan-genome reference graph can trace variants within peaks that are not part of the linear reference genome, and find peaks that in general are more motif-enriched than those found by MACS2.


Subject(s)
Chromatin Immunoprecipitation/methods , Genomics/methods , Sequence Analysis, DNA/methods , Algorithms , Arabidopsis/genetics , Genome/genetics , Protein Binding , Software , Transcription Factors
5.
BMC Med Genomics ; 11(1): 24, 2018 03 07.
Article in English | MEDLINE | ID: mdl-29514638

ABSTRACT

BACKGROUND: Using high-dimensional penalized regression we studied genome-wide DNA-methylation in bone biopsies of 80 postmenopausal women in relation to their bone mineral density (BMD). The women showed BMD varying from severely osteoporotic to normal. Global gene expression data from the same individuals were available, and since DNA-methylation often affects gene expression, the overall aim of this paper was to include both of these omics data sets into an integrated analysis. METHODS: The classical penalized regression uses one penalty, but we incorporated individual penalties for each of the DNA-methylation sites. These individual penalties were guided by the strength of association between DNA-methylations and gene transcript levels. DNA-methylations that were highly associated with one or more transcripts got lower penalties and were therefore favored compared to DNA-methylations showing less association with expression. Because of the complex pathways and interactions among genes, we investigated both the association between DNA-methylations and their corresponding cis gene, as well as the association between DNA-methylations and trans-located genes. Two integrating penalized methods were used: first, an adaptive group-regularized ridge regression, and second, variable selection through a modified version of the weighted lasso. RESULTS: When information from gene expressions was integrated, predictive performance was considerably improved, in terms of predictive mean square error, compared to classical penalized regression without data integration. We found a 14.7% improvement in the ridge regression case and a 17% improvement for the lasso case. Our version of the weighted lasso with data integration found a list of 22 interesting methylation sites. Several corresponded to genes that are known to be important in bone formation. Using BMD as response and these 22 methylation sites as covariates, least squares regression analyses resulted in R² = 0.726, compared with an average R² = 0.438 for 10,000 randomly selected groups of DNA-methylations with group size 22. CONCLUSIONS: Two recent types of penalized regression methods were adapted to integrate DNA-methylation and its association with gene expression in the analysis of bone mineral density. In both cases predictions clearly benefit from including the additional information on gene expressions.


Subject(s)
Bone Density/genetics , DNA Methylation , Data Analysis , Gene Expression Profiling , Postmenopause/genetics , Postmenopause/physiology , Cohort Studies , Female , Genomics , Humans , Multivariate Analysis , Regression Analysis
6.
Epigenetics ; 12(8): 674-687, 2017 08.
Article in English | MEDLINE | ID: mdl-28650214

ABSTRACT

DNA methylation affects expression of associated genes and may contribute to the missing genetic effects from genome-wide association studies of osteoporosis. To improve insight into the mechanisms of postmenopausal osteoporosis, we combined transcript profiling with DNA methylation analyses in bone. RNA and DNA were isolated from 84 bone biopsies of postmenopausal donors varying markedly in bone mineral density (BMD). In all, 2529 CpGs in the top 100 genes most significantly associated with BMD were analyzed. The methylation levels at 63 CpGs differed significantly between healthy and osteoporotic women at 10% false discovery rate (FDR). Five of these CpGs at 5% FDR could explain 14% of BMD variation. To test whether blood DNA methylation reflects the situation in bone (as shown for other tissues), an independent cohort was selected and BMD association was demonstrated in blood for 13 of the 63 CpGs. Four transcripts representing inhibitors of bone metabolism (MEPE, SOST, WIF1, and DKK1) showed correlation to a high number of methylated CpGs, at 5% FDR. Our results link DNA methylation to the genetic influence modifying the skeleton, and the data suggest a complex interaction between CpG methylation and gene regulation. This is the first study in the hitherto largest number of postmenopausal women to demonstrate a strong association among bone CpG methylation, transcript levels, and BMD/fracture. This new insight may have implications for evaluation of osteoporosis stage and susceptibility.


Subject(s)
DNA Methylation , Osteoporosis, Postmenopausal/genetics , Adaptor Proteins, Signal Transducing/genetics , Adaptor Proteins, Signal Transducing/metabolism , Aged , Aged, 80 and over , Blood Cells/metabolism , Bone Density/genetics , Bone Morphogenetic Proteins/genetics , Bone Morphogenetic Proteins/metabolism , Bone and Bones/metabolism , Case-Control Studies , CpG Islands , Extracellular Matrix Proteins/genetics , Extracellular Matrix Proteins/metabolism , Female , Genetic Markers/genetics , Glycoproteins/genetics , Glycoproteins/metabolism , Humans , Intercellular Signaling Peptides and Proteins/genetics , Intercellular Signaling Peptides and Proteins/metabolism , Middle Aged , Phosphoproteins/genetics , Phosphoproteins/metabolism , Repressor Proteins/genetics , Repressor Proteins/metabolism
7.
Gigascience ; 6(7): 1-12, 2017 07 01.
Article in English | MEDLINE | ID: mdl-28459977

ABSTRACT

Background: Recent large-scale undertakings such as ENCODE and Roadmap Epigenomics have generated experimental data mapped to the human reference genome (as genomic tracks) representing a variety of functional elements across a large number of cell types. Despite the high potential value of these publicly available data for a broad variety of investigations, little attention has been given to the analytical methodology necessary for their widespread utilisation. Findings: We here present a first principled treatment of the analysis of collections of genomic tracks. We have developed novel computational and statistical methodology to permit comparative and confirmatory analyses across multiple and disparate data sources. We delineate a set of generic questions that are useful across a broad range of investigations and discuss the implications of choosing different statistical measures and null models. Examples include contrasting analyses across different tissues or diseases. The methodology has been implemented in a comprehensive open-source software system, the GSuite HyperBrowser. To make the functionality accessible to biologists, and to facilitate reproducible analysis, we have also developed a web-based interface providing an expertly guided and customizable way of utilizing the methodology. With this system, many novel biological questions can flexibly be posed and rapidly answered. Conclusions: Through a combination of streamlined data acquisition, interoperable representation of dataset collections, and customizable statistical analysis with guided setup and interpretation, the GSuite HyperBrowser represents a first comprehensive solution for integrative analysis of track collections across the genome and epigenome. The software is available at: https://hyperbrowser.uio.no.


Subject(s)
Datasets as Topic/standards , Epigenesis, Genetic , Epigenomics/methods , Genome, Human , Software , Whole Genome Sequencing/methods , Epigenomics/standards , Humans , Whole Genome Sequencing/standards
8.
BMC Bioinformatics ; 18(1): 263, 2017 May 18.
Article in English | MEDLINE | ID: mdl-28521770

ABSTRACT

BACKGROUND: It has been proposed that future reference genomes should be graph structures in order to better represent the sequence diversity present in a species. However, there is currently no standard method to represent genomic intervals, such as the positions of genes or transcription factor binding sites, on graph-based reference genomes. RESULTS: We formalize offset-based coordinate systems on graph-based reference genomes and introduce methods for representing intervals on these reference structures. We show the advantage of our methods by representing genes on a graph-based representation of the newest assembly of the human genome (GRCh38) and its alternative loci for regions that are highly variable. CONCLUSION: More complex reference genomes, containing alternative loci, require methods to represent genomic data on these structures. Our proposed notation for genomic intervals makes it possible to fully utilize the alternative loci of the GRCh38 assembly and potential future graph-based reference genomes. We have made a Python package for representing such intervals on offset-based coordinate systems, available at https://github.com/uio-cels/offsetbasedgraph . An interactive web-tool using this Python package to visualize genes on a graph created from GRCh38 is available at https://github.com/uio-cels/genomicgraphcoords .


Subject(s)
Computer Graphics , Genome, Human , Genomics/methods , Algorithms , Genetic Loci , Humans , Internet , RNA, Messenger/genetics , RNA, Messenger/metabolism , Sequence Analysis, DNA , Software
9.
Nucleic Acids Res ; 41(Web Server issue): W133-41, 2013 Jul.
Article in English | MEDLINE | ID: mdl-23632163

ABSTRACT

The immense increase in availability of genomic scale datasets, such as those provided by the ENCODE and Roadmap Epigenomics projects, presents unprecedented opportunities for individual researchers to pose novel falsifiable biological questions. With this opportunity, however, researchers are faced with the challenge of how to best analyze and interpret their genome-scale datasets. A powerful way of representing genome-scale data is as feature-specific coordinates relative to reference genome assemblies, i.e. as genomic tracks. The Genomic HyperBrowser (http://hyperbrowser.uio.no) is an open-ended web server for the analysis of genomic track data. Through the provision of several highly customizable components for processing and statistical analysis of genomic tracks, the HyperBrowser opens up a range of genomic investigations, related to, e.g., gene regulation, disease association or epigenetic modifications of the genome.


Subject(s)
Genomics/methods , Software , Data Interpretation, Statistical , Genome , Internet
10.
Nucleic Acids Res ; 41(10): 5164-74, 2013 May 01.
Article in English | MEDLINE | ID: mdl-23571755

ABSTRACT

The study of chromatin 3D structure has recently gained much focus owing to novel techniques for detecting genome-wide chromatin contacts using next-generation sequencing. A deeper understanding of the architecture of the DNA inside the nucleus is crucial for gaining insight into fundamental processes such as transcriptional regulation, genome dynamics and genome stability. Chromatin conformation capture-based methods, such as Hi-C and ChIA-PET, are now paving the way for routine genome-wide studies of chromatin 3D structure in a range of organisms and tissues. However, appropriate methods for analyzing such data are lacking. Here, we propose a hypothesis test and an enrichment score of 3D co-localization of genomic elements that handles intra- or interchromosomal interactions, both separately and jointly, and that adjusts for biases caused by structural dependencies in the 3D data. We show that maintaining structural properties during resampling is essential to obtain valid estimation of P-values. We apply the method on chromatin states and a set of mutated regions in leukemia cells, and find significant co-localization of these elements, with varying enrichment scores, supporting the role of chromatin 3D structure in shaping the landscape of somatic mutations in cancer.


Subject(s)
Chromatin/chemistry , Cell Line, Tumor , Chromosomes, Human/chemistry , Data Interpretation, Statistical , Genome , Humans , Leukemia/genetics , Mutation , Nucleic Acid Conformation , Sequence Analysis, DNA
11.
PLoS Comput Biol ; 7(12): e1002292, 2011 Dec.
Article in English | MEDLINE | ID: mdl-22144885

ABSTRACT

Integration of retroviral vectors in the human genome follows non-random patterns that favor insertional deregulation of gene expression and may cause risks of insertional mutagenesis when used in clinical gene therapy. Understanding how viral vectors integrate into the human genome is a key issue in predicting these risks. We provide a new statistical method to compare retroviral integration patterns. We identified the positions where vectors derived from the Human Immunodeficiency Virus (HIV) and the Moloney Murine Leukemia Virus (MLV) show different integration behaviors in human hematopoietic progenitor cells. Non-parametric density estimation was used to identify candidate comparative hotspots, which were then tested and ranked. We found 100 significant comparative hotspots, distributed throughout the chromosomes. HIV hotspots were wider and contained more genes than MLV ones. A Gene Ontology analysis of HIV targets showed enrichment of genes involved in antigen processing and presentation, reflecting the high HIV integration frequency observed at the MHC locus on chromosome 6. Four histone modifications/variants had a different mean density in comparative hotspots (H2AZ, H3K4me1, H3K4me3, H3K9me1), while gene expression within the comparative hotspots did not differ from background. These findings suggest the existence of epigenetic or nuclear three-dimensional topology contexts guiding retroviral integration to specific chromosome areas.


Subject(s)
Genetic Vectors/genetics , Genome, Human , HIV/genetics , Models, Genetic , Moloney murine leukemia virus/genetics , Virus Integration , Antigens, CD34/genetics , Chromosomes, Human, Pair 6 , Genetic Loci , HLA Antigens/genetics , Hematopoietic Stem Cells , Histones/genetics , Humans , Reproducibility of Results
12.
BMC Genomics ; 12: 353, 2011 Jul 07.
Article in English | MEDLINE | ID: mdl-21736759

ABSTRACT

BACKGROUND: Transcription factors in disease-relevant pathways represent potential drug targets, by impacting a distinct set of pathways that may be modulated through gene regulation. The influence of transcription factors is typically studied on a per disease basis, and no current resources provide a global overview of the relations between transcription factors and disease. Furthermore, existing pipelines for related large-scale analysis are tailored for particular sources of input data, and there is a need for generic methodology for integrating complementary sources of genomic information. RESULTS: We here present a large-scale analysis of multiple diseases versus multiple transcription factors, with a global map of over- and under-representation of 446 transcription factors in 1010 diseases. This map, referred to as the differential disease regulome, provides a first global statistical overview of the complex interrelationships between diseases, genes and controlling elements. The map is visualized using the Google map engine, due to its very large size, and provides a range of detailed information in a dynamic presentation format. The analysis is achieved through a novel methodology that performs a pairwise, genome-wide comparison on the cartesian product of two distinct sets of annotation tracks, e.g. all combinations of one disease and one TF. The methodology was also used to extend with maps using alternative data sets related to transcription and disease, as well as data sets related to Gene Ontology classification and histone modifications. We provide a web-based interface that allows users to generate other custom maps, which could be based on precisely specified subsets of transcription factors and diseases, or, in general, on any categorical genome annotation tracks as they are improved or become available. CONCLUSION: We have created a first resource that provides a global overview of the complex relations between transcription factors and disease. As the accuracy of the disease regulome depends mainly on the quality of the input data, forthcoming ChIP-seq based binding data for many TFs will provide improved maps. We further believe our approach to genome analysis could allow an advance from the current typical situation of one-time integrative efforts to reproducible and upgradable integrative analysis. The differential disease regulome and its associated methodology is available at http://hyperbrowser.uio.no.


Subject(s)
Disease/genetics , Genomics/methods , Transcription Factors/genetics , Transcription Factors/metabolism , Computer Graphics , Humans , Internet , Molecular Sequence Annotation
13.
Stat Appl Genet Mol Biol ; 10(1), 2011 Aug 29.
Article in English | MEDLINE | ID: mdl-23089821

ABSTRACT

The lasso is one of the most commonly used methods for high-dimensional regression, but can be unstable and lacks satisfactory asymptotic properties for variable selection. We propose to use weighted lasso with integrated relevant external information on the covariates to guide the selection towards more stable results. Weighting the penalties with external information gives each regression coefficient a covariate specific amount of penalization and can improve upon standard methods that do not use such information by borrowing knowledge from the external material. The method is applied to two cancer data sets, with gene expressions as covariates. We find interesting gene signatures, which we are able to validate. We discuss various ideas on how the weights should be defined and illustrate how different types of investigations can utilize our method exploiting different sources of external data. Through simulations, we show that our method outperforms the lasso and the adaptive lasso when the external information is from relevant to partly relevant, in terms of both variable selection and prediction.
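The idea of giving each regression coefficient its own amount of penalization can be sketched with a small coordinate-descent solver. This is an illustration only, assuming the objective (1/2n)·||y − Xβ||² + λ·Σ w_j·|β_j|; the function names, the toy data and the weights are hypothetical, and the abstract does not state how the authors define or fit their weights.

```python
def soft_threshold(z, t):
    """Shrink z towards zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def weighted_lasso(X, y, lam, weights, n_sweeps=100):
    """Coordinate descent for a lasso with covariate-specific penalties.

    Minimizes (1/2n) * ||y - X beta||^2 + lam * sum_j weights[j] * |beta_j|.
    X is a list of rows; a smaller weight means a weaker penalty.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals y - X beta, kept up to date
    for _ in range(n_sweeps):
        for j in range(p):
            col = [row[j] for row in X]
            scale = sum(v * v for v in col) / n
            if scale == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, resid)) / n
            new_beta = soft_threshold(rho, lam * weights[j]) / scale
            delta = new_beta - beta[j]
            if delta != 0.0:
                resid = [r - v * delta for v, r in zip(col, resid)]
            beta[j] = new_beta
    return beta


# Toy design with two orthogonal covariates.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 1.0, 2.0, 1.0]
print(weighted_lasso(X, y, lam=0.25, weights=[1.0, 1.0]))  # both covariates kept
print(weighted_lasso(X, y, lam=0.25, weights=[1.0, 4.0]))  # second covariate removed
```

With equal weights this reduces to the ordinary lasso; raising the weight of the second covariate, as external information judged irrelevant might do, pushes its coefficient to exactly zero while leaving the first untouched.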


Subject(s)
Computational Biology/methods , Regression Analysis , Software , Computer Simulation , Disease Progression , Gene Dosage , Gene Expression Regulation, Neoplastic , Genes, Neoplasm , Genome-Wide Association Study/methods , Head and Neck Neoplasms/genetics , Head and Neck Neoplasms/pathology , Humans , Predictive Value of Tests , Reproducibility of Results , Statistics, Nonparametric , Survival Analysis
14.
Genome Biol ; 11(12): R121, 2010.
Article in English | MEDLINE | ID: mdl-21182759

ABSTRACT

The immense increase in the generation of genomic scale data poses an unmet analytical challenge, due to a lack of established methodology with the required flexibility and power. We propose a first principled approach to statistical analysis of sequence-level genomic information. We provide a growing collection of generic biological investigations that query pairwise relations between tracks, represented as mathematical objects, along the genome. The Genomic HyperBrowser implements the approach and is available at http://hyperbrowser.uio.no.


Subject(s)
Computational Biology/methods , Genome , Genomics/methods , Sequence Analysis/methods , Software , Base Pairing , Exons , Gene Expression , Histones/metabolism , Models, Biological , Nucleic Acid Denaturation , Polymorphism, Single Nucleotide
15.
PLoS Genet ; 5(11): e1000719, 2009 Nov.
Article in English | MEDLINE | ID: mdl-19911042

ABSTRACT

Integrative analysis of gene dosage, expression, and ontology (GO) data was performed to discover driver genes in the carcinogenesis and chemoradioresistance of cervical cancers. Gene dosage and expression profiles of 102 locally advanced cervical cancers were generated by microarray techniques. Fifty-two of these patients were also analyzed with the Illumina expression method to confirm the gene expression results. An independent cohort of 41 patients was used for validation of gene expressions associated with clinical outcome. Statistical analysis identified 29 recurrent gains and losses and 3 losses (on 3p, 13q, 21q) associated with poor outcome after chemoradiotherapy. The intratumor heterogeneity, assessed from the gene dosage profiles, was low for these alterations, showing that they had emerged prior to many other alterations and probably were early events in carcinogenesis. Integration of the alterations with gene expression and GO data identified genes that were regulated by the alterations and revealed five biological processes that were significantly overrepresented among the affected genes: apoptosis, metabolism, macromolecule localization, translation, and transcription. Four genes on 3p (RYBP, GBE1) and 13q (FAM48A, MED4) correlated with outcome at both the gene dosage and expression level and were satisfactorily validated in the independent cohort. These integrated analyses yielded 57 candidate drivers of 24 genetic events, including novel loci responsible for chemoradioresistance. Further mapping of the connections among genetic events, drivers, and biological processes suggested that each individual event stimulates specific processes in carcinogenesis through the coordinated control of multiple genes. The present results may provide novel therapeutic opportunities of both early and advanced stage cervical cancers.


Subject(s)
Gene Dosage , Gene Expression Regulation, Neoplastic , Uterine Cervical Neoplasms/genetics , Adult , Aged , Cohort Studies , Female , Genes, Neoplasm , Humans , Kaplan-Meier Estimate , Middle Aged , Oligonucleotide Array Sequence Analysis , Proportional Hazards Models , Regression Analysis , Uterine Cervical Neoplasms/drug therapy , Uterine Cervical Neoplasms/pathology , Uterine Cervical Neoplasms/radiotherapy
16.
BMC Genomics ; 9: 258, 2008 May 30.
Article in English | MEDLINE | ID: mdl-18513391

ABSTRACT

BACKGROUND: Oligoarrays have become an accessible technique for exploring the transcriptome, but it is presently unclear how absolute transcript data from this technique compare to the data achieved with tag-based quantitative techniques, such as massively parallel signature sequencing (MPSS) and serial analysis of gene expression (SAGE). By use of the TransCount method we calculated absolute transcript concentrations from spotted oligoarray intensities, enabling direct comparisons with tag counts obtained with MPSS and SAGE. The tag counts were converted to number of transcripts per cell by assuming that the sum of all transcripts in a single cell was 5 × 10^5. Our aim was to investigate whether the less resource demanding and more widespread oligoarray technique could provide data that were correlated to and had the same absolute scale as those obtained with MPSS and SAGE. RESULTS: A total of 1,777 unique transcripts were detected in common for the three technologies and served as the basis for our analyses. The correlations involving the oligoarray data were not weaker than, but similar to, the correlation between the MPSS and SAGE data, both when the entire concentration range was considered and at high concentrations. The data sets were more strongly correlated at high transcript concentrations than at low concentrations. On an absolute scale, the number of transcripts per cell and gene was generally higher based on oligoarrays than on MPSS and SAGE, and ranged from 1.6 to 9,705 for the 1,777 overlapping genes. The MPSS data were on the same scale as the SAGE data, ranging from 0.5 to 3,180 (MPSS) and 9 to 1,268 (SAGE) transcripts per cell and gene. The sum of all transcripts per cell for these genes was 3.8 × 10^5 (oligoarrays), 1.1 × 10^5 (MPSS) and 7.6 × 10^4 (SAGE), whereas the corresponding sum for all detected transcripts was 1.1 × 10^6 (oligoarrays), 2.8 × 10^5 (MPSS) and 3.8 × 10^5 (SAGE). CONCLUSION: The oligoarrays and TransCount provide quantitative transcript concentrations that are correlated to MPSS and SAGE data, but the absolute scale of the measurements differs across the technologies. The discrepancy questions whether the sum of all transcripts within a single cell might be higher than the number of 5 × 10^5 suggested in the literature and used to convert tag counts to transcripts per cell. If so, this may explain the apparently higher transcript detection efficiency of the oligoarrays, and has to be clarified before absolute transcript concentrations can be interchanged across the technologies. The ability to obtain transcript concentrations from oligoarrays opens up the possibility of efficient generation of universal transcript databases with low resource demands.
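The tag-count conversion mentioned in the background can be illustrated with a short sketch. It assumes the conversion is a simple proportional scaling, so that the converted values sum to the assumed 5 × 10^5 transcripts per cell; the function name and the toy tag counts are hypothetical and are not taken from the study.

```python
def tags_to_transcripts_per_cell(tag_counts, transcripts_per_cell=5e5):
    """Scale tag counts so that they sum to the assumed transcripts per cell.

    tag_counts maps a gene identifier to its observed tag count. The default
    total of 5e5 is the literature value quoted in the abstract.
    """
    total_tags = sum(tag_counts.values())
    if total_tags == 0:
        raise ValueError("no tags observed")
    return {
        gene: count * transcripts_per_cell / total_tags
        for gene, count in tag_counts.items()
    }


# Toy library of 100 tags spread over three genes.
counts = {"geneA": 30, "geneB": 10, "geneC": 60}
per_cell = tags_to_transcripts_per_cell(counts)
print(per_cell)
```

Because every value scales linearly with the assumed total, an underestimate of the true number of transcripts per cell would lower all converted MPSS and SAGE values by the same factor, which is the possibility the conclusion raises.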


Subject(s)
Gene Expression Profiling/methods , Oligonucleotide Array Sequence Analysis/methods , Animals , Expressed Sequence Tags , Mice , RNA, Messenger/genetics , RNA, Messenger/metabolism , Retina/metabolism
17.
Bioinformatics ; 21(23): 4272-9, 2005 Dec 01.
Article in English | MEDLINE | ID: mdl-16216830

ABSTRACT

MOTIVATION: Missing values are problematic for the analysis of microarray data. Imputation methods have been compared in terms of the similarity between imputed and true values in simulation experiments and not of their influence on the final analysis. The focus has been on values missing at random, although entries may also be missing not at random. RESULTS: We investigate the influence of imputation on the detection of differentially expressed genes from cDNA microarray data. We apply ANOVA for microarrays and SAM and look at the differentially expressed genes that are lost because of imputation. We show that this new measure provides useful information that the traditional root mean squared error cannot capture. We also show that the type of missingness matters: imputing 5% missing not at random has the same effect as imputing 10-30% missing at random. We propose a new method for imputation (LinImp), fitting a simple linear model for each channel separately, and compare it with the widely used KNNimpute method. For 10% missing at random, KNNimpute leads to twice as many lost differentially expressed genes as LinImp. AVAILABILITY: The R package for LinImp is available at http://folk.uio.no/idasch/imp.
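The two ways of judging an imputation method that the abstract contrasts can be sketched as follows: the root mean squared error between true and imputed values, and the set of differentially expressed genes found on the complete data but no longer found after imputation. The function names and toy inputs are assumptions for illustration; the gene lists would in practice come from a method such as SAM, which is not reimplemented here.

```python
import math


def rmse(true_values, imputed_values):
    """Root mean squared error between true and imputed entries."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


def lost_genes(de_complete, de_imputed):
    """Genes called differentially expressed on complete data only."""
    return set(de_complete) - set(de_imputed)


# Toy example: two imputed entries and two hypothetical gene lists.
print(rmse([1.0, 2.0], [1.0, 4.0]))
print(sorted(lost_genes({"g1", "g2", "g3"}, {"g1", "g4"})))
```

The first measure only looks at the imputed entries themselves, whereas the second looks at the downstream result, so two methods with similar error can still differ in how many genes they cause to be lost.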


Subject(s)
Computational Biology/methods , Gene Expression Regulation , Oligonucleotide Array Sequence Analysis/methods , Algorithms , Analysis of Variance , Cluster Analysis , DNA, Complementary/metabolism , Data Interpretation, Statistical , Gene Expression Profiling , Likelihood Functions , Linear Models , Mathematical Computing , Models, Genetic , Models, Statistical , Models, Theoretical , Multigene Family , Normal Distribution , Reproducibility of Results , Sensitivity and Specificity , Sequence Analysis, DNA , Software , Statistics as Topic
18.
Nucleic Acids Res ; 33(17): e143, 2005 Oct 04.
Article in English | MEDLINE | ID: mdl-16204447

ABSTRACT

A method providing absolute transcript concentrations from spotted microarray intensity data is presented. The number of transcripts per µg of total RNA, per µg of mRNA, or per cell is obtained for each gene, enabling comparisons of transcript levels within and between tissues. The method is based on Bayesian statistical modelling incorporating available information about the experiment from target preparation to image analysis, leading to realistically large confidence intervals for estimated concentrations. The method was validated in experiments using transcripts at known concentrations, showing accuracy and reproducibility of estimated concentrations, which were also in excellent agreement with results from quantitative real-time PCR. We determined the concentration for 10,157 genes in cervix cancers and a pool of cancer cell lines and found values in the range of 10^5 to 10^10 transcripts per µg total RNA. The precision of our estimates was sufficiently high to detect significant concentration differences between two tumours and between different genes within the same tumour, comparisons that are not possible with standard intensity ratios. Our method can be used to explore the regulation of pathways and to develop individualized therapies, based on absolute transcript concentrations. It can be applied broadly, facilitating the construction of the transcriptome, continuously updating it by integrating future data.


Subject(s)
Genomics/methods , Oligonucleotide Array Sequence Analysis/methods , RNA, Messenger/analysis , RNA, Neoplasm/analysis , Bayes Theorem , Cell Line, Tumor , Female , Humans , Transcription, Genetic , Uterine Cervical Neoplasms/genetics
19.
J Steroid Biochem Mol Biol ; 95(1-5): 105-11, 2005 May.
Article in English | MEDLINE | ID: mdl-16023338

ABSTRACT

Intratumoral levels of E1 (oestrone), E1S (oestrone sulphate) and E2 (oestradiol) are significantly reduced by treatment with the aromatase inhibitor anastrozole regardless of treatment response. The purpose of the present pilot study was to look for additional markers of biochemical response to aromatase inhibitors at the mRNA expression level. Whole genome expression was studied using microarray analysis of breast cancer tissue from 12 patients with locally advanced tumors, both before and following 15 weeks of treatment with the aromatase inhibitor anastrozole (Arimidex). Intratumoral mRNA levels for a subset of genes coding for steroid metabolizing enzymes, hormone receptors and some growth mediators involved in cell cycle control were analysed by quantitative RT-PCR. There was a correlation between the two methods for some but not all genes. The mRNA expression levels of the different genes were correlated to each other and to the intratumoral levels of E1, E2 and E1S, before and after the treatment. Notably, a correlation of the E1/E2 metabolic ratio to the mRNA levels of CYP19A1 was observed before treatment (r=0.745, p<0.005). Whole genome expression analysis of these 12 breast cancer patients revealed similar tumor classification to previously published larger studies. Tumors with no or low expression of ESR1 (oestrogen receptor) clustered together and were characterized by a strong basal-like signature highly expressing keratins 5/17, cadherin 3, frizzled and apolipoprotein D, among others. The luminal epithelial tumor cluster, on the other hand, highly expressed ESR1, GATA binding protein 3 and N-acetyl transferase. An evident ERBB2 cluster was observed due to the marked over-expression of the ERBB2 gene and GRB7 and PPARBP in this patient material. Using significance analysis of microarrays (SAM), we identified 298 genes significantly differentially expressed between the partial response and progressive disease groups.


Subject(s)
Antineoplastic Agents, Hormonal/therapeutic use , Aromatase Inhibitors/therapeutic use , Breast Neoplasms/drug therapy , Breast Neoplasms/genetics , Gene Expression/drug effects , Nitriles/therapeutic use , Triazoles/therapeutic use , Anastrozole , Antineoplastic Agents, Hormonal/pharmacology , Aromatase Inhibitors/pharmacology , Biomarkers, Tumor/genetics , Female , Humans , Nitriles/pharmacology , Oligonucleotide Array Sequence Analysis , Triazoles/pharmacology
20.
Bioinformatics ; 21(6): 821-2, 2005 Mar.
Article in English | MEDLINE | ID: mdl-15531610

ABSTRACT

SUMMARY: CGH-Explorer is a program for visualization and statistical analysis of microarray-based comparative genomic hybridization (array-CGH) data. The program has preprocessing facilities, tools for graphical exploration of individual arrays or groups of arrays, and tools for statistical identification of regions of amplification and deletion.


Subject(s)
DNA Mutational Analysis/methods , Gene Expression Profiling/methods , In Situ Hybridization/methods , Oligonucleotide Array Sequence Analysis/methods , Sequence Analysis, DNA/methods , Software , User-Computer Interface , Computer Graphics , Gene Dosage , Genetic Variation/genetics