Results 1 - 20 of 170
1.
Article in English | MEDLINE | ID: mdl-38879443

ABSTRACT

OBJECTIVE: Investigate the use of advanced natural language processing models to streamline the time-consuming process of writing and revising scholarly manuscripts. MATERIALS AND METHODS: For this purpose, we integrate large language models into the Manubot publishing ecosystem to suggest revisions for scholarly texts. Our AI-based revision workflow employs a prompt generator that incorporates manuscript metadata into templates, generating section-specific instructions for the language model. The model then generates revised versions of each paragraph for human authors to review. We evaluated this methodology through 5 case studies of existing manuscripts, including the revision of this manuscript. RESULTS: Our results indicate that these models, despite some limitations, can grasp complex academic concepts and enhance text quality. All changes to the manuscript are tracked using a version control system, ensuring transparency in distinguishing between human- and machine-generated text. CONCLUSIONS: Given the significant time researchers invest in crafting prose, incorporating large language models into the scholarly writing process can significantly improve the type of knowledge work performed by academics. Our approach also enables scholars to concentrate on critical aspects of their work, such as the novelty of their ideas, while automating tedious tasks like adhering to specific writing styles. Although the use of AI-assisted tools in scientific authoring is controversial, our approach, which focuses on revising human-written text and provides change-tracking transparency, can mitigate concerns regarding AI's role in scientific writing.
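
The prompt-generation step described above can be illustrated with a minimal sketch. The template wording, the section-specific instructions, and the function name `build_prompt` are all hypothetical; the abstract does not give the actual templates used with Manubot.

```python
from string import Template

# Hypothetical template; the real wording is not given in the abstract.
SECTION_TEMPLATE = Template(
    "You are revising the $section section of a manuscript titled '$title' "
    "with keywords: $keywords. $instruction\n\nParagraph:\n$paragraph"
)

# Hypothetical section-specific instructions.
SECTION_INSTRUCTIONS = {
    "abstract": "Keep it concise and state context, methods, and results.",
    "methods": "Preserve technical details, equations, and citations.",
}
DEFAULT_INSTRUCTION = "Improve clarity while preserving the meaning."


def build_prompt(metadata, section, paragraph):
    """Fill the template with manuscript metadata and a section-specific instruction."""
    instruction = SECTION_INSTRUCTIONS.get(section.lower(), DEFAULT_INSTRUCTION)
    return SECTION_TEMPLATE.substitute(
        section=section,
        title=metadata["title"],
        keywords=", ".join(metadata["keywords"]),
        instruction=instruction,
        paragraph=paragraph,
    )
```

Each paragraph would get its own prompt, and the model's revision would then be committed separately so that human- and machine-generated text stay distinguishable in version control.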

2.
Elife ; 12, 2024 May 28.
Article in English | MEDLINE | ID: mdl-38804191

ABSTRACT

Science journalism is a critical way for the public to learn about and benefit from scientific findings. Such journalism shapes the public's view of the current state of science and legitimizes experts. Journalists can only cite and quote a limited number of sources, who they may discover in their research, including recommendations by other scientists. Biases in either process may influence who is identified and ultimately included as a source. To examine potential biases in science journalism, we analyzed 22,001 non-research articles published by Nature and compared these with Nature-published research articles with respect to predicted gender and name origin. We extracted cited authors' names and those of quoted speakers. While citations and quotations within a piece do not reflect the entire information-gathering process, they can provide insight into the demographics of visible sources. We then predicted gender and name origin of the cited authors and speakers. We compared articles with a comparator set made up of first and last authors within primary research articles in Nature and a subset of Springer Nature articles in the same time period. In our analysis, we found a skew toward quoting men in Nature science journalism. However, quotation is trending toward equal representation at a faster rate than authorship rates in academic publishing. Gender disparity in Nature quotes was dependent on the article type. We found a significant over-representation of names with predicted Celtic/English origin and under-representation of names with a predicted East Asian origin in both extracted quotes and journal citations, though the effect was dampened in citations.


Subjects
Journalism , Humans , Male , Female , Science , Authorship , Sex Factors , Periodicals as Topic/statistics & numerical data , Bibliometrics , Sexism/statistics & numerical data
3.
Article in English | MEDLINE | ID: mdl-38780898

ABSTRACT

BACKGROUND: High-grade serous carcinoma (HGSC) gene expression subtypes are associated with differential survival. We characterized HGSC gene expression in Black individuals and considered whether gene expression differences by self-identified race may contribute to poorer HGSC survival among Black versus White individuals. METHODS: We included newly generated RNA-Seq data from Black and White individuals, and array-based genotyping data from four existing studies of White and Japanese individuals. We used K-means clustering, a method with no predefined number of clusters or dataset-specific features, to assign subtypes. Cluster- and dataset-specific gene expression patterns were summarized by moderated t-scores. We compared cluster-specific gene expression patterns across datasets by calculating the correlation between the summarized vectors of moderated t-scores. Following mapping to The Cancer Genome Atlas (TCGA)-derived HGSC subtypes, we used Cox proportional hazards models to estimate subtype-specific survival by dataset. RESULTS: Cluster-specific gene expression was similar across gene expression platforms and racial groups. Comparing the Black population to the White and Japanese populations, the immunoreactive subtype was more common (39% versus 23%-28%) and the differentiated subtype less common (7% versus 22%-31%). Patterns of subtype-specific survival were similar between the Black and White populations with RNA-Seq data; compared to mesenchymal cases, the risk of death was similar for proliferative and differentiated cases and suggestively lower for immunoreactive cases (Black population HR=0.79 [0.55, 1.13], White population HR=0.86 [0.62, 1.19]). CONCLUSIONS: While the prevalence of HGSC subtypes varied by race, subtype-specific survival was similar. IMPACT: HGSC subtypes can be consistently assigned across platforms and self-identified racial groups.

4.
eNeuro ; 11(6), 2024 Jun.
Article in English | MEDLINE | ID: mdl-38789274

ABSTRACT

High-throughput gene expression profiling measures individual gene expression across conditions. However, genes are regulated in complex networks, not as individual entities, limiting the interpretability of gene expression data. Machine learning models that incorporate prior biological knowledge are a powerful tool to extract meaningful biology from gene expression data. Pathway-level information extractor (PLIER) is an unsupervised machine learning method that defines biological pathways by leveraging the vast amount of published transcriptomic data. PLIER converts gene expression data into known pathway gene sets, termed latent variables (LVs), to substantially reduce data dimensionality and improve interpretability. In the current study, we trained the first mouse PLIER model on 190,111 mouse brain RNA-sequencing samples, the greatest amount of training data ever used by PLIER. We then validated the mousiPLIER approach in a study of microglia and astrocyte gene expression across mouse brain aging. mousiPLIER identified biological pathways that are significantly associated with aging, including one latent variable (LV41) corresponding to striatal signal. To gain further insight into the genes contained in LV41, we performed k-means clustering on the training data to identify studies that respond strongly to LV41. We found that the variable was relevant to striatum and aging across the scientific literature. Finally, we built a Web server (http://mousiplier.greenelab.com/) for users to easily explore the learned latent variables. Taken together, this study defines mousiPLIER as a method to uncover meaningful biological processes in mouse brain transcriptomic studies.


Subjects
Brain , Animals , Mice , Brain/metabolism , Gene Expression Profiling , Aging/physiology , Unsupervised Machine Learning , Transcriptome , Astrocytes/metabolism , Microglia/metabolism , Machine Learning , Male , Mice, Inbred C57BL
5.
Microbiol Spectr ; 12(4): e0315723, 2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38385740

ABSTRACT

Chronic Pseudomonas aeruginosa lung infections are a feature of cystic fibrosis (CF) that many patients experience even with the advent of highly effective modulator therapies. Identifying factors that impact P. aeruginosa in the CF lung could yield novel strategies to eradicate infection or otherwise improve outcomes. To complement published P. aeruginosa studies using laboratory models or RNA isolated from sputum, we analyzed transcripts of strain PAO1 after incubation in sputum from different CF donors prior to RNA extraction. We compared PAO1 gene expression in this "spike-in" sputum model to that for P. aeruginosa grown in synthetic cystic fibrosis sputum medium to determine key genes, which are among the most differentially expressed or most highly expressed. Using the key genes, gene sets with correlated expression were determined using the gene expression analysis tool eADAGE. Gene sets were used to analyze the activity of specific pathways in P. aeruginosa grown in sputum from different individuals. Gene sets that we found to be more active in sputum showed similar activation in published data that included P. aeruginosa RNA isolated from sputum relative to corresponding in vitro reference cultures. In the ex vivo samples, P. aeruginosa had increased levels of genes related to zinc and iron acquisition which were suppressed by metal amendment of sputum. We also found a significant correlation between expression of the H1-type VI secretion system and CFTR corrector use by the sputum donor. An ex vivo sputum model or synthetic sputum medium formulation that imposes metal restriction may enhance future CF-related studies. IMPORTANCE: Identifying the gene expression programs used by Pseudomonas aeruginosa to colonize the lungs of people with cystic fibrosis (CF) will illuminate new therapeutic strategies. To capture these transcriptional programs, we cultured the common P. aeruginosa laboratory strain PAO1 in expectorated sputum from CF patient donors.
Through bioinformatic analysis, we defined sets of genes that are more transcriptionally active in real CF sputum compared to a synthetic cystic fibrosis sputum medium. Many of the most differentially active gene sets contained genes related to metal acquisition, suggesting that these gene sets play an active role in scavenging for metals in the CF lung environment which may be inadequately represented in some models. Future studies of P. aeruginosa transcript abundance in CF may benefit from the use of an expectorated sputum model or media supplemented with factors that induce metal restriction.


Subjects
Cystic Fibrosis , Pseudomonas Infections , Humans , Pseudomonas aeruginosa/metabolism , Sputum , Gene Expression Profiling , Metals , Culture Media/metabolism , RNA/metabolism
6.
Gigascience ; 13, 2024 Jan 02.
Article in English | MEDLINE | ID: mdl-38323677

ABSTRACT

Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network's specific connections using network permutation to generate features that depend only on degree. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Researchers seeking to predict new or missing edges in biological networks should use our permutation approach to obtain a baseline for performance that may be nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
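
As a rough illustration of the degree-preserving permutation idea, the sketch below shuffles an undirected edge list with double-edge swaps. It is not the xswap package's code or API; the function names are hypothetical, and the input is assumed to be a simple graph (no self-loops or duplicate edges) with at least two edges.

```python
import random
from collections import Counter


def degree_preserving_permutation(edges, n_swaps, seed=0):
    """Shuffle an undirected edge list with double-edge swaps.

    Each accepted swap turns (a, b), (c, d) into (a, d), (c, b), so every
    node keeps its original degree while specific connections are randomized.
    """
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    edge_set = set(edge_list)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        new_1, new_2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        # Reject swaps that would create self-loops or duplicate edges.
        if a == d or c == b or new_1 in edge_set or new_2 in edge_set:
            continue
        edge_set -= {edge_list[i], edge_list[j]}
        edge_set |= {new_1, new_2}
        edge_list[i], edge_list[j] = new_1, new_2
    return edge_list


def degrees(edges):
    """Count how many edges touch each node."""
    return Counter(node for edge in edges for node in edge)
```

Edge-prediction features computed on many such permuted networks can depend only on degree, so they give a baseline against which performance on the real network can be compared.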


Subjects
Algorithms , Probability
7.
J Am Vet Med Assoc ; 262(5): 1-8, 2024 May 01.
Article in English | MEDLINE | ID: mdl-38417257

ABSTRACT

OBJECTIVE: To compare pedigree documentation and genetic test results to evaluate whether user-provided photographs influence the breed ancestry predictions of direct-to-consumer (DTC) genetic tests for dogs. ANIMALS: 12 registered purebred pet dogs representing 12 different breeds. METHODS: Each dog owner submitted 6 buccal swabs, 1 to each of 6 DTC genetic testing companies. Experimenters registered each sample per manufacturer instructions. For half of the dogs, the registration included a photograph of the DNA donor. For the other half of the dogs, photographs were swapped between dogs. DNA analysis and breed ancestry prediction were conducted by each company. The effect of condition (ie, matching vs shuffled photograph) was evaluated for each company's breed predictions. As a positive control, a convolutional neural network was also used to predict breed based solely on the photograph. RESULTS: Results from 5 of the 6 tests always included the dog's registered breed. One test and the convolutional neural network were unlikely to identify the registered breed and frequently returned results that were more similar to the photograph than the DNA. Additionally, differences in the predictions made across all tests underscored the challenge of identifying breed ancestry, even in purebred dogs. CLINICAL RELEVANCE: Veterinarians are likely to encounter patients who have conducted DTC genetic testing and may be asked to explain the results of genetic tests they did not order. This systematic comparison of commercially available tests provides context for interpreting results from consumer-grade DTC genetic testing kits.

8.
Am J Hum Genet ; 111(1): 11-23, 2024 Jan 04.
Article in English | MEDLINE | ID: mdl-38181729

ABSTRACT

Precision medicine initiatives across the globe have led to a revolution of repositories linking large-scale genomic data with electronic health records, enabling genomic analyses across the entire phenome. Many of these initiatives focus solely on research insights, leading to limited direct benefit to patients. We describe the biobank at the Colorado Center for Personalized Medicine (CCPM Biobank) that was jointly developed by the University of Colorado Anschutz Medical Campus and UCHealth to serve as a unique, dual-purpose research and clinical resource accelerating personalized medicine. This living resource currently has more than 200,000 participants with ongoing recruitment. We highlight the clinical, laboratory, regulatory, and HIPAA-compliant informatics infrastructure along with our stakeholder engagement, consent, recontact, and participant engagement strategies. We characterize aspects of genetic and geographic diversity unique to the Rocky Mountain region, the primary catchment area for CCPM Biobank participants. We leverage linked health and demographic information of the CCPM Biobank participant population to demonstrate the utility of the CCPM Biobank to replicate complex trait associations in the first 33,674 genotyped individuals across multiple disease domains. Finally, we describe our current efforts toward return of clinical genetic test results, including high-impact pathogenic variants and pharmacogenetic information, and our broader goals as the CCPM Biobank continues to grow. Bringing clinical and research interests together fosters unique clinical and translational questions that can be addressed from the large EHR-linked CCPM Biobank resource within a HIPAA- and CLIA-certified environment.


Subjects
Learning Health System , Precision Medicine , Humans , Biological Specimen Banks , Colorado , Genomics
10.
Bioinform Adv ; 4(1): vbae004, 2024.
Article in English | MEDLINE | ID: mdl-38282973

ABSTRACT

Motivation: Most models can be fit to data using various optimization approaches. While model choice is frequently reported in machine-learning-based research, optimizers are not often noted. We applied two different implementations of LASSO logistic regression implemented in Python's scikit-learn package, using two different optimization approaches (coordinate descent, implemented in the liblinear library, and stochastic gradient descent, or SGD), to predict mutation status and gene essentiality from gene expression across a variety of pan-cancer driver genes. For varying levels of regularization, we compared performance and model sparsity between optimizers. Results: After model selection and tuning, we found that liblinear and SGD tended to perform comparably. liblinear models required more extensive tuning of regularization strength, performing best for high model sparsities (more nonzero coefficients), but did not require selection of a learning rate parameter. SGD models required tuning of the learning rate to perform well, but generally performed more robustly across different model sparsities as regularization strength decreased. Given these tradeoffs, we believe that the choice of optimizers should be clearly reported as a part of the model selection and validation process, to allow readers and reviewers to better understand the context in which results have been generated. Availability and implementation: The code used to carry out the analyses in this study is available at https://github.com/greenelab/pancancer-evaluation/tree/master/01_stratified_classification. Performance/regularization strength curves for all genes in the Vogelstein et al. (2013) dataset are available at https://doi.org/10.6084/m9.figshare.22728644.
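
To make the link between regularization strength and model sparsity concrete, here is a small pure-Python sketch of L1-penalized logistic regression fit by proximal stochastic gradient descent. It is a stand-in for illustration only, not the scikit-learn, liblinear, or SGD code used in the study; the function names, the omission of an intercept, and the default hyperparameters are assumptions.

```python
import math
import random


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def fit_l1_logistic_sgd(X, y, alpha, learning_rate=0.1, epochs=50, seed=0):
    """L1-penalized logistic regression fit by proximal SGD (no intercept)."""
    rng = random.Random(seed)
    weights = [0.0] * len(X[0])
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = sum(w * x for w, x in zip(weights, X[i]))
            z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
            prob = 1.0 / (1.0 + math.exp(-z))
            error = prob - y[i]
            # Gradient step on the log loss, then shrink by the L1 penalty.
            weights = [
                soft_threshold(w - learning_rate * error * x, learning_rate * alpha)
                for w, x in zip(weights, X[i])
            ]
    return weights


def count_nonzero(weights):
    """Number of coefficients the model actually uses."""
    return sum(1 for w in weights if w != 0.0)
```

Sweeping `alpha` and recording `count_nonzero` alongside held-out performance gives the kind of performance versus sparsity curve the abstract compares across optimizers; the learning rate is the extra parameter that SGD needs tuned.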

11.
bioRxiv ; 2024 Apr 04.
Article in English | MEDLINE | ID: mdl-37503097

ABSTRACT

While single-cell experiments provide deep cellular resolution within a single sample, some single-cell experiments are inherently more challenging than bulk experiments due to dissociation difficulties, cost, or limited tissue availability. This creates a situation where we have deep cellular profiles of one sample or condition, and bulk profiles across multiple samples and conditions. To bridge this gap, we propose BuDDI (BUlk Deconvolution with Domain Invariance). BuDDI utilizes domain adaptation techniques to effectively integrate available corpora of case-control bulk and reference scRNA-seq observations to infer cell-type-specific perturbation effects. BuDDI achieves this by learning independent latent spaces within a single variational autoencoder (VAE) encompassing at least four sources of variability: 1) cell type proportion, 2) perturbation effect, 3) structured experimental variability, and 4) remaining variability. Since each latent space is encouraged to be independent, we simulate perturbation responses by independently composing each latent space to simulate cell-type-specific perturbation responses. We evaluated BuDDI's performance on simulated and real data with experimental designs of increasing complexity. We first validated that BuDDI could learn domain invariant latent spaces on data with matched samples across each source of variability. Then we validated that BuDDI could accurately predict cell-type-specific perturbation response when no single-cell perturbed profiles were used during training; instead, only bulk samples had both perturbed and non-perturbed observations. Finally, we validated BuDDI on predicting sex-specific differences, an experimental design where it is not possible to have matched samples. In each experiment, BuDDI outperformed all other comparative methods and baselines. 
As more reference atlases are completed, BuDDI provides a path to combine these resources with bulk-profiled treatment or disease signatures to study perturbations, sex differences, or other factors at single-cell resolution.

12.
bioRxiv ; 2023 Dec 02.
Article in English | MEDLINE | ID: mdl-37961178

ABSTRACT

Introduction: High-grade serous carcinoma (HGSC) gene expression subtypes are associated with differential survival. We characterized HGSC gene expression in Black individuals and considered whether gene expression differences by race may contribute to poorer HGSC survival among Black versus non-Hispanic White individuals. Methods: We included newly generated RNA-Seq data from Black and White individuals, and array-based genotyping data from four existing studies of White and Japanese individuals. We assigned subtypes using K-means clustering. Cluster- and dataset-specific gene expression patterns were summarized by moderated t-scores. We compared cluster-specific gene expression patterns across datasets by calculating the correlation between the summarized vectors of moderated t-scores. Following mapping to The Cancer Genome Atlas (TCGA)-derived HGSC subtypes, we used Cox proportional hazards models to estimate subtype-specific survival by dataset. Results: Cluster-specific gene expression was similar across gene expression platforms. Comparing the Black study population to the White and Japanese study populations, the immunoreactive subtype was more common (39% versus 23%-28%) and the differentiated subtype less common (7% versus 22%-31%). Patterns of subtype-specific survival were similar between the Black and White populations with RNA-Seq data; compared to mesenchymal cases, the risk of death was similar for proliferative and differentiated cases and suggestively lower for immunoreactive cases (Black population HR=0.79 [0.55, 1.13], White population HR=0.86 [0.62, 1.19]). Conclusions: A single, platform-agnostic pipeline can be used to assign HGSC gene expression subtypes. While the observed prevalence of HGSC subtypes varied by race, subtype-specific survival was similar.

13.
bioRxiv ; 2023 Oct 11.
Article in English | MEDLINE | ID: mdl-37873416

ABSTRACT

Understanding the factors that shape variation in the human microbiome is a major goal of research in biology. While other genomics fields have used large, pre-compiled compendia to extract systematic insights requiring otherwise impractical sample sizes, there has been no comparable resource for the 16S rRNA sequencing data commonly used to quantify microbiome composition. To help close this gap, we have assembled a set of 168,484 publicly available human gut microbiome samples, processed with a single pipeline and combined into the largest unified microbiome dataset to date. We use this resource, which is freely available at microbiomap.org, to shed light on global variation in the human gut microbiome. We find that Firmicutes, particularly Bacilli and Clostridia, are almost universally present in the human gut. At the same time, the relative abundances of the 65 most common microbial genera differ between at least two world regions. We also show that gut microbiomes in undersampled world regions, such as Central and Southern Asia, differ significantly from the more thoroughly characterized microbiomes of Europe and Northern America. Moreover, humans in these overlooked regions likely harbor hundreds of taxa that have not yet been discovered due to this undersampling, highlighting the need for diversity in microbiome studies. We anticipate that this new compendium can serve the community and enable advanced applied and methodological research.

14.
Genome Biol ; 24(1): 239, 2023 Oct 20.
Article in English | MEDLINE | ID: mdl-37864274

ABSTRACT

BACKGROUND: Single-cell gene expression profiling provides unique opportunities to understand tumor heterogeneity and the tumor microenvironment. Because of cost and feasibility, profiling bulk tumors remains the primary population-scale analytical strategy. Many algorithms can deconvolve these tumors using single-cell profiles to infer their composition. While experimental choices do not change the true underlying composition of the tumor, they can affect the measurements produced by the assay. RESULTS: We generated a dataset of high-grade serous ovarian tumors with paired expression profiles generated using multiple strategies to examine the extent to which experimental factors impact the results of downstream tumor deconvolution methods. We find that pooling samples for single-cell sequencing and subsequent demultiplexing has a minimal effect. We identify dissociation-induced differences that affect cell composition, leading to changes that may compromise the assumptions underlying some deconvolution algorithms. We also observe differences across mRNA enrichment methods that introduce additional discrepancies between the two data types. We also find that experimental factors change cell composition estimates and that the impact differs by method. CONCLUSIONS: Previous benchmarks of deconvolution methods have largely ignored experimental factors. We find that methods vary in their robustness to experimental factors. We provide recommendations for methods developers seeking to produce the next generation of deconvolution approaches and for scientists designing experiments using deconvolution to study tumor heterogeneity.


Subjects
Gene Expression Profiling , Ovarian Neoplasms , Humans , Female , Gene Expression Profiling/methods , Algorithms , Sequence Analysis, RNA/methods , Ovarian Neoplasms/genetics , Transcriptome , Tumor Microenvironment
15.
Nat Commun ; 14(1): 5562, 2023 Sep 09.
Article in English | MEDLINE | ID: mdl-37689782

ABSTRACT

Genes act in concert with each other in specific contexts to perform their functions. Determining how these genes influence complex traits requires a mechanistic understanding of expression regulation across different conditions. It has been shown that this insight is critical for developing new therapies. Transcriptome-wide association studies have helped uncover the role of individual genes in disease-relevant mechanisms. However, modern models of the architecture of complex traits predict that gene-gene interactions play a crucial role in disease origin and progression. Here we introduce PhenoPLIER, a computational approach that maps gene-trait associations and pharmacological perturbation data into a common latent representation for a joint analysis. This representation is based on modules of genes with similar expression patterns across the same conditions. We observe that diseases are significantly associated with gene modules expressed in relevant cell types, and our approach is accurate in predicting known drug-disease pairs and inferring mechanisms of action. Furthermore, using a CRISPR screen to analyze lipid regulation, we find that functionally important players lack associations but are prioritized in trait-associated modules by PhenoPLIER. By incorporating groups of co-expressed genes, PhenoPLIER can contextualize genetic associations and reveal potential targets missed by single-gene strategies.


Subjects
Clustered Regularly Interspaced Short Palindromic Repeats , Epistasis, Genetic , Causality , Gene Regulatory Networks , Transcriptome
16.
bioRxiv ; 2023 Aug 21.
Article in English | MEDLINE | ID: mdl-37662412

ABSTRACT

Chronic Pseudomonas aeruginosa lung infections are a distinctive feature of cystic fibrosis (CF) pathology that challenge adults with CF even with the advent of highly effective modulator therapies. Characterizing P. aeruginosa transcription in the CF lung and identifying factors that drive gene expression could yield novel strategies to eradicate infection or otherwise improve outcomes. To complement published P. aeruginosa gene expression studies in laboratory culture models designed to model the CF lung environment, we employed an ex vivo sputum model in which laboratory strain PAO1 was incubated in sputum from different CF donors. As part of the analysis, we compared PAO1 gene expression in this "spike-in" sputum model to that for P. aeruginosa grown in artificial sputum medium (ASM). Analyses focused on genes that were differentially expressed between sputum and ASM and genes that were most highly expressed in sputum. We present a new approach that used sets of genes with correlated expression, identified by the gene expression analysis tool eADAGE, to analyze the differential activity of pathways in P. aeruginosa grown in CF sputum from different individuals. A key characteristic of P. aeruginosa grown in expectorated CF sputum was related to zinc and iron acquisition, but this signal varied by donor sputum. In addition, a significant correlation between P. aeruginosa expression of the H1-type VI secretion system and corrector use by the sputum donor was observed. These methods may be broadly useful in looking for variable signals across clinical samples.

17.
bioRxiv ; 2023 Aug 15.
Article in English | MEDLINE | ID: mdl-37577575

ABSTRACT

High-throughput gene expression profiling is a powerful approach to generate hypotheses on the underlying causes of biological function and disease. Yet this approach is limited by its ability to infer underlying biological pathways and the burden of testing tens of thousands of individual genes. Machine learning models that incorporate prior biological knowledge are necessary to extract meaningful pathways and generate rational hypotheses from the vast amount of gene expression data generated to date. We adopted an unsupervised machine learning method, Pathway-level information extractor (PLIER), to train the first mouse PLIER model on 190,111 mouse brain RNA-sequencing samples, the greatest amount of training data ever used by PLIER. mousiPLIER converted gene expression data into latent variables that align to known pathway or cell marker gene sets, substantially reducing data dimensionality and improving interpretability. To determine the utility of mousiPLIER, we applied it to a mouse brain aging study of microglia and astrocyte transcriptomic profiling. We found a specific set of latent variables that are significantly associated with aging, including one latent variable (LV41) corresponding to striatal signal. We next performed k-means clustering on the training data to identify studies that respond strongly to LV41, finding that the variable is relevant to striatum and aging across the scientific literature. Finally, we built a web server (http://mousiplier.greenelab.com/) for users to easily explore the learned latent variables. Taken together, this study provides proof of concept that mousiPLIER can uncover meaningful biological processes in mouse transcriptomic studies.

19.
BioData Min ; 16(1): 16, 2023 May 05.
Article in English | MEDLINE | ID: mdl-37147665

ABSTRACT

While we often think of words as having a fixed meaning that we use to describe a changing world, words are also dynamic and changing. Scientific research can also be remarkably fast-moving, with new concepts or approaches rapidly gaining mind share. We examined scientific writing, both preprint and pre-publication peer-reviewed text, to identify terms that have changed and examine their use. One particular challenge that we faced was that the shift from closed to open access publishing meant that the size of available corpora changed by over an order of magnitude in the last two decades. We developed an approach to evaluate semantic shift by accounting for both intra- and inter-year variability using multiple integrated models. This analysis revealed thousands of change points in both corpora, including for terms such as 'cas9', 'pandemic', and 'sars'. We found that the consistent change points between pre-publication peer-reviewed and preprinted text are largely related to the COVID-19 pandemic. We also created a web app for exploration that allows users to investigate individual terms (https://greenelab.github.io/word-lapse/). To our knowledge, our research is the first to examine semantic shift in biomedical preprints and pre-publication peer-reviewed text, and provides a foundation for future work to understand how terms acquire new meanings and how peer review affects this process.

20.
Nat Methods ; 20(6): 803-814, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37248386

ABSTRACT

High-throughput profiling methods (such as genomics or imaging) have accelerated basic research and made deep molecular characterization of patient samples routine. These approaches provide a rich portrait of genes, molecular pathways and cell types involved in disease phenotypes. Machine learning (ML) can be a useful tool for extracting disease-relevant patterns from high-dimensional datasets. However, depending upon the complexity of the biological question, machine learning often requires many samples to identify recurrent and biologically meaningful patterns. Rare diseases are inherently limited in clinical cases, leading to few samples to study. In this Perspective, we outline the challenges and emerging solutions for using ML for small sample sets, specifically in rare diseases. Advances in ML methods for rare diseases are likely to be informative for applications beyond rare diseases for which few samples exist with high-dimensional data. We propose that the method community prioritize the development of ML techniques for rare disease research.


Subjects
Machine Learning , Rare Diseases , Humans , Rare Diseases/genetics , Genomics/methods