Search | VHL Regional Portal

1.

Automated Workflows for Data Curation and Machine Learning to Develop Quantitative Structure-Activity Relationships.

Gadaleta, Domenico.

Methods Mol Biol ; 2834: 115-130, 2025.

Article in English | MEDLINE | ID: mdl-39312162

ABSTRACT

The recent advancements in machine learning and the new availability of large chemical datasets have made the development of tools and protocols for computational chemistry a topic of high interest. In this chapter, a standard procedure to develop Quantitative Structure-Activity Relationship (QSAR) models is presented and implemented in two freely available and easy-to-use workflows. The first workflow helps the user retrieve chemical data (SMILES) from the web, check their correctness, and curate them to produce consistent and ready-to-use datasets for cheminformatics. The second workflow implements six machine learning methods to develop classification QSAR models. Models can additionally be used to predict external chemicals. Calculation and selection of chemical descriptors, tuning of models' hyperparameters, and methods to handle data unbalancing are also incorporated in the workflow. Both workflows are implemented in KNIME and represent a useful tool for computational scientists, as well as an intuitive and straightforward introduction to QSAR.

Subject(s)

Data Curation , Machine Learning , Quantitative Structure-Activity Relationship , Workflow , Data Curation/methods , Software , Cheminformatics/methods , Computational Biology/methods

2.

Daily life in the Open Biologist's second job, as a Data Curator.

Scorza, Livia C T; Zielinski, Tomasz; Kalita, Irina; Lepore, Alessia; El Karoui, Meriem; Millar, Andrew J.

Wellcome Open Res ; 9: 523, 2024.

Article in English | MEDLINE | ID: mdl-39360219

ABSTRACT

Background: Data reusability is the driving force of the research data life cycle. However, implementing strategies to generate reusable data from the data creation to the sharing stages is still a significant challenge. Even when datasets supporting a study are publicly shared, the outputs are often incomplete and/or not reusable. The FAIR (Findable, Accessible, Interoperable, Reusable) principles were published as general guidance to promote data reusability in research, but the practical implementation of FAIR principles in research groups is still falling behind. In biology, the lack of standard practices for a large diversity of data types, data storage and preservation issues, and the lack of familiarity among researchers are some of the main factors impeding the achievement of FAIR data. Past literature describes biological curation from the perspective of data resources that aggregate data, often from publications. Methods: Our team works alongside data-generating, experimental researchers, so our perspective aligns with publication authors rather than aggregators. We detail the processes for organizing datasets for publication, showcasing practical examples from data curation to data sharing. We also recommend strategies, tools and web resources to maximize data reusability, while maintaining research productivity. Conclusion: We propose a simple approach to address research data management challenges for experimentalists, designed to promote FAIR data sharing. This strategy not only simplifies data management, but also enhances data visibility, recognition and impact, ultimately benefiting the entire scientific community.

Researchers should openly share data associated with their publications unless there is a valid reason not to. Additionally, datasets have to be described with enough detail to ensure that they are reproducible and reusable by others. Since most research institutions offer limited professional support in this area, the responsibility for data sharing largely falls to researchers themselves. However, many research groups still struggle to follow data reusability principles in practice. In this work, we describe our data curation (data organization and management) efforts working directly with the researchers who create the data. We show the steps we took to organize, standardize, and share several datasets in biological sciences, pointing out the main challenges we faced. Finally, we suggest simple and practical data management actions, as well as tools that experimentalists can integrate into their daily work, to make sharing data easier and more effective.

3.

Analyzing Racial Differences in Imaging Joint Replacement Registries Using Generative Artificial Intelligence: Advancing Orthopaedic Data Equity.

Khosravi, Bardia; Rouzrokh, Pouria; Erickson, Bradley J; Garner, Hillary W; Wenger, Doris E; Taunton, Michael J; Wyles, Cody C.

Arthroplast Today ; 29: 101503, 2024 Oct.

Article in English | MEDLINE | ID: mdl-39376670

ABSTRACT

Background: Discrepancies in medical data sets can perpetuate bias, especially when training deep learning models, potentially leading to biased outcomes in clinical applications. Understanding these biases is crucial for the development of equitable healthcare technologies. This study employs generative deep learning technology to explore and understand radiographic differences based on race among patients undergoing total hip arthroplasty. Methods: Utilizing a large institutional registry, we retrospectively analyzed pelvic radiographs from total hip arthroplasty patients, characterized by demographics and image features. Denoising diffusion probabilistic models generated radiographs conditioned on demographic and imaging characteristics. Fréchet Inception Distance assessed the generated image quality, showing the diversity and realism of the generated images. Sixty transition videos were generated that showed White pelvises transforming into their closest African American counterparts and vice versa while controlling for patients' sex, age, and body mass index. Two expert surgeons and 2 radiologists carefully studied these videos to understand the systematic differences that are present in the 2 races' radiographs. Results: Our data set included 480,407 pelvic radiographs, with a predominance of White patients over African Americans. The generative denoising diffusion probabilistic model created high-quality images and reached a Fréchet Inception Distance of 6.8. Experts identified 6 characteristics differentiating races, including interacetabular distance, osteoarthritis degree, obturator foramina shape, femoral neck-shaft angle, pelvic ring shape, and femoral cortical thickness. Conclusions: This study demonstrates the potential of generative models for understanding disparities in medical imaging data sets. By visualizing race-based differences, this method aids in identifying bias in downstream tasks, fostering the development of fairer healthcare practices.

4.

Two novel families with RUNX1 variants indicate glycine 168 as a new mutational hotspot: Implications for FPD/AML diagnosis.

Kamiya, Laureano J; Barozzi, Serena; Isidori, Federica; Ganiewich, Daiana; De Luca, Geraldine; Bozzi, Valeria; Marta, Rosana F; Melazzini, Federica; Pippucci, Tommaso; Heller, Paula G; Glembotsky, Ana C; Pecci, Alessandro.

Br J Haematol ; 2024 Oct 07.

Article in English | MEDLINE | ID: mdl-39375928

ABSTRACT

Correct interpretation of the pathogenicity of germline RUNX1 variants is essential for FPD/AML diagnosis, clinical management and leukaemia surveillance. We report two families with clear FPD/AML phenotypic features harbouring missense variants at RHD critical residue Gly168. Although classified as variants of unknown significance (VUS) by RUNX1-specific curation guidelines, these variants should rather be considered likely pathogenic, as supported by computational tools, structural modelling and dysregulated platelet expression of RUNX1 targets, adding Gly168 to the amino acids currently recognised as mutational hotspots. Our data could help reduce the number of variants classified as VUS, providing evidence for updating RUNX1 guidelines, thus improving FPD/AML diagnosis.

5.

Development of a SNOMED CT Mapping Process and Tool at a Data Integration Centre - Lessons Learned.

Riedel, Andrea; Deppenwiese, Noemi; Thiele, Lucas; Prokosch, Hans-Ulrich; Herzog, Annalena.

Stud Health Technol Inform ; 317: 160-170, 2024 Aug 30.

Article in English | MEDLINE | ID: mdl-39234719

ABSTRACT

INTRODUCTION: 16 million German-language free-text laboratory test results are the basis of the daily diagnostic routine of 17 laboratories within the University Hospital Erlangen. As part of the Medical Informatics Initiative, the local data integration centre is responsible for the accessibility of routine care data for medical research. Following the core data set, international interoperability standards such as FHIR and the English-language medical terminology SNOMED CT are used to create harmonised data. To represent each non-numeric laboratory test result within the base module profile ObservationLab, the need for a map and supporting tooling arose. STATE OF THE ART: Due to the requirement of an n:n map and a data safety-compliant local instance, publicly available tools (e.g., SNAP2SNOMED) were insufficient. CONCEPT AND IMPLEMENTATION: Therefore, we developed (1) an incremental mapping-validation process with different iteration cycles and (2) a customised mapping tool via Microsoft Access. Time, labour, and cost efficiency played a decisive role. First iterations were used to define requirements (e.g., multiple user access). LESSONS LEARNED: The successful process and tool implementation and the described lessons learned (e.g., cheat sheet) will assist other German hospitals in creating local maps for inter-consortia data exchange and research. In the future, qualitative and quantitative analysis results will be published.

Subject(s)

Systematized Nomenclature of Medicine , Germany , Humans , Electronic Health Records , Systems Integration

6.

Prediction of Ca²⁺ Binding Site in Proteins With a Fast and Accurate Method Based on Statistical Mechanics and Analysis of Crystal Structures.

Basit, Abdul; Choudhury, Devapriya; Bandyopadhyay, Pradipta.

Proteins ; 2024 Sep 11.

Article in English | MEDLINE | ID: mdl-39258438

ABSTRACT

Predicting the precise locations of metal binding sites within metalloproteins is a crucial challenge in biophysics. A fast, accurate, and interpretable computational prediction method can complement the experimental studies. In the current work, we have developed a method to predict the location of Ca²⁺ ions in calcium-binding proteins using a physics-based method with an all-atom description of the proteins, which is substantially faster than the molecular dynamics simulation-based methods with accuracy as good as data-driven approaches. Our methodology uses the three-dimensional reference interaction site model (3D-RISM), a statistical mechanical theory, to calculate Ca²⁺ ion density around protein structures, and the locations of the Ca²⁺ ions are obtained from the density. We have taken previously used datasets to assess the efficacy of our method as compared to previous works. Our accuracy is 88%, comparable with the FEATURE program, one of the well-known data-driven methods. Moreover, our method is physical, and the reasons for failures can be ascertained in most cases. We have thoroughly examined the failed cases using different structural and crystallographic measures, such as B-factor, R-factor, electron density map, and geometry at the binding site. It has been found that x-ray structures have issues in many of the failed cases, such as geometric irregularities and dubious assignment of ion positions. Our algorithm, along with the checks for structural accuracy, is a major step in predicting calcium ion positions in metalloproteins.

7.

Germline RTEL1 Variants in Telomere Biology Disorders.

Thompson, Ashley S; Niewisch, Marena R; Giri, Neelam; McReynolds, Lisa J; Savage, Sharon A.

Am J Med Genet A ; : e63882, 2024 Sep 16.

Article in English | MEDLINE | ID: mdl-39279436

ABSTRACT

Rare germline variation in regulator of telomere elongation helicase 1 (RTEL1) is associated with telomere biology disorders (TBDs). Biallelic RTEL1 variants result in childhood onset dyskeratosis congenita and Hoyeraal-Hreidarsson syndrome whereas heterozygous individuals usually present later in life with pulmonary fibrosis or bone marrow failure. We compiled all TBD-associated RTEL1 variants in the literature and assessed phenotypes and outcomes of 44 individuals from 14 families with mono- or biallelic RTEL1 variants enrolled in clinical trial NCT00027274. Variants were classified by adapting ACMG-AMP guidelines using clinical information, telomere length, and variant allele frequency data. Compared with heterozygotes, individuals with biallelic RTEL1 variants had an earlier age at diagnosis (median age 35.5 vs. 5.1 years, p < 0.01) and worse overall survival (median age 66.5 vs. 22.9 years, p < 0.001). There were 257 unique RTEL1 variants reported in 47 publications, and 209 had a gnomAD minor allele frequency <1%. Only 38.3% (80/209) met pathogenic/likely pathogenic criteria. Notably, 8 of 209 reported disease-associated variants were benign or likely benign and the rest were variants of uncertain significance. Given the considerable differences in outcomes of TBDs associated with RTEL1 germline variants and the extent of variation in the gene, systematic functional studies and standardization of variant curation are urgently needed to inform clinical management.

8.

Main challenges on the curation of large scale datasets for pancreas segmentation using deep learning in multi-phase CT scans: Focus on cardinality, manual refinement, and annotation quality.

Cavicchioli, Matteo; Moglia, Andrea; Pierelli, Ludovica; Pugliese, Giacomo; Cerveri, Pietro.

Comput Med Imaging Graph ; 117: 102434, 2024 Sep 13.

Article in English | MEDLINE | ID: mdl-39284244

ABSTRACT

Accurate segmentation of the pancreas in computed tomography (CT) is of paramount importance in diagnostics, surgical planning, and interventions. Recent studies have proposed supervised deep-learning models for segmentation, but their efficacy relies on the quality and quantity of the training data. Most such works employed small-scale public datasets, without proving the efficacy of generalization to external datasets. This study explored the optimization of pancreas segmentation accuracy by pinpointing the ideal dataset size, understanding resource implications, examining manual refinement impact, and assessing the influence of anatomical subregions. We present the AIMS-1300 dataset encompassing 1,300 CT scans. Its manual annotation by medical experts required 938 h. A 2.5D UNet was implemented to assess the impact of training sample size on segmentation accuracy by partitioning the original AIMS-1300 dataset into 11 smaller subsets of progressively increasing size. The findings revealed that training sets exceeding 440 CTs did not lead to better segmentation performance. In contrast, nnU-Net and UNet with Attention Gate reached a plateau at 585 CTs. Generalization tests on the publicly available AMOS-CT dataset confirmed this outcome. As the size of the partition of the AIMS-1300 training set increases, the number of error slices decreases, reaching a minimum with 730 and 440 CTs, for AIMS-1300 and AMOS-CT datasets, respectively. Segmentation metrics on the AIMS-1300 and AMOS-CT datasets improved more on the head than the body and tail of the pancreas as the dataset size increased. By carefully considering the task and the characteristics of the available data, researchers can develop deep learning models without sacrificing performance even with limited data. This could accelerate developing and deploying artificial intelligence tools for pancreas surgery and other surgical data science applications.
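The abstract above does not name its segmentation metrics. As an illustration only, the sketch below assumes the Dice similarity coefficient, a common overlap measure for segmentation masks; the function name and the toy masks are hypothetical and are not taken from the study.

```python
def dice_coefficient(pred, truth):
    """Dice = 2 * |A intersect B| / (|A| + |B|) for two binary masks.

    Masks are nested lists of 0/1 values with identical shapes.
    Returns 1.0 when both masks are empty (assumed convention).
    """
    flat_pred = [bool(v) for row in pred for v in row]
    flat_truth = [bool(v) for row in truth for v in row]
    if len(flat_pred) != len(flat_truth):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(1 for p, t in zip(flat_pred, flat_truth) if p and t)
    total = sum(flat_pred) + sum(flat_truth)
    if total == 0:
        return 1.0
    return 2 * intersection / total


# Hypothetical 2x3 masks: predicted vs. manually annotated pancreas pixels.
predicted = [[0, 1, 1],
             [0, 1, 0]]
annotated = [[0, 1, 0],
             [0, 1, 1]]
print(dice_coefficient(predicted, annotated))
```

Computing such a score per training-set size is one way a plateau like the one reported could be observed, though the study's actual evaluation code may differ.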

9.

The Long-Term Effect of Cochlear Implantation on Tinnitus: A Systematic Review and Meta-Analysis.

Li, Yutian; Yang, Huiwen; Niu, Xun; Sun, Yu.

Diagnostics (Basel) ; 14(18), 2024 Sep 13.

Article in English | MEDLINE | ID: mdl-39335707

ABSTRACT

OBJECTIVE: This systematic review investigates the long-term effect of cochlear implantation (CI) on clinical outcomes in tinnitus patients with sensorineural hearing loss (SNHL). DATABASE SOURCES: PubMed, Embase, and the Cochrane Library were searched from inception to 30 April 2024. Manual searches of reference lists supplemented these searches when necessary. REVIEW METHODS: Original studies included in the meta-analysis had to contain comparative pre- and postoperative data for SNHL patients who underwent CI. Outcomes measured were the Tinnitus Handicap Inventory (THI), Visual Analog Scale (VAS), and Tinnitus Questionnaire (TQ). RESULTS: A total of 28 studies comprising 853 patients showed significant tinnitus improvement after CI: THI mean difference (MD) -14.02 [95% CI -15.29 to -12.76, p < 0.001], TQ MD -15.85 [95% CI -18.97 to -12.74, p < 0.05], and VAS MD -3.12 [95% CI -3.49 to -2.76, p < 0.05]. Subgroup analysis indicated a significant difference between follow-up periods in THI (p < 0.0001) and VAS loudness (p = 0.02). CONCLUSIONS: Cochlear implantation substantially improves tinnitus in patients with hearing loss, though the effect may diminish over time. Further research is needed to confirm these findings.
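The abstract reports pooled mean differences with 95% confidence intervals but does not state the pooling model. As a rough sketch, the code below assumes fixed-effect inverse-variance pooling, one standard way such estimates are combined; the function name and the two input studies are hypothetical and do not reproduce the review's data.

```python
import math


def pool_fixed_effect(effects, std_errors, z=1.96):
    """Inverse-variance pooled effect with an approximate 95% CI.

    Each study is weighted by 1 / SE^2, so more precise studies count more.
    Returns (pooled_effect, ci_lower, ci_upper).
    """
    if len(effects) != len(std_errors) or not effects:
        raise ValueError("need one standard error per effect")
    weights = [1 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, effects)) / total_weight
    pooled_se = math.sqrt(1 / total_weight)
    return pooled, pooled - z * pooled_se, pooled + z * pooled_se


# Two made-up studies: mean differences in a questionnaire score and their SEs.
pooled, lower, upper = pool_fixed_effect([-10.0, -20.0], [2.0, 2.0])
print(f"MD {pooled:.2f} [95% CI {lower:.2f} to {upper:.2f}]")
```

A random-effects model, which adds a between-study variance term to each weight, is often preferred when studies are heterogeneous; which model the review used is not stated in the abstract.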

10.

Specifications of the ACMG/AMP variant curation guidelines for the analysis of germline ATM sequence variants.

Richardson, Marcy E; Holdren, Megan; Brannan, Terra; de la Hoya, Miguel; Spurdle, Amanda B; Tavtigian, Sean V; Young, Colin C; Zec, Lauren; Hiraki, Susan; Anderson, Michael J; Walker, Logan C; McNulty, Shannon; Turnbull, Clare; Tischkowitz, Marc; Schon, Katherine; Slavin, Thomas; Foulkes, William D; Cline, Melissa; Monteiro, Alvaro N; Pesaran, Tina; Couch, Fergus J.

Am J Hum Genet ; 2024 Sep 17.

Article in English | MEDLINE | ID: mdl-39317201

ABSTRACT

The ClinGen Hereditary Breast, Ovarian, and Pancreatic Cancer (HBOP) Variant Curation Expert Panel (VCEP) is composed of internationally recognized experts in clinical genetics, molecular biology, and variant interpretation. This VCEP made specifications for the American College of Medical Genetics and Association for Molecular Pathology (ACMG/AMP) guidelines for the ataxia telangiectasia mutated (ATM) gene according to the ClinGen protocol. These gene-specific rules for ATM were modified from the ACMG/AMP guidelines and were tested against 33 ATM variants of various types and classifications in a pilot curation phase. The pilot revealed a majority agreement between the HBOP VCEP classifications and the ClinVar-deposited classifications. Six pilot variants had conflicting interpretations in ClinVar, and re-evaluation with the VCEP's ATM-specific rules resulted in four that were classified as benign, one as likely pathogenic, and one as a variant of uncertain significance (VUS) by the VCEP, improving the certainty of interpretations in the public domain. Overall, 28 of the 33 pilot variants were not VUS, leading to an 85% classification rate. The ClinGen-approved, modified rules demonstrated value for improved interpretation of variants in ATM.

11.

Shifting power: data democracy in engineering solutions.

Cutts, Bethany B; Osia, Uchenna; Bray, Laura A; Harris, Angela R; C Long, Hanna; Goins, Hannah; McLean, Sallie; MacDonald Gibson, Jacqueline; Ben-Horin, Tal; Schnetzer, Astrid.

Environ Res Lett ; 19(10): 101004, 2024 Oct 01.

Article in English | MEDLINE | ID: mdl-39296316

12.

SparkDWM: a scalable design of a Data Washing Machine using Apache Spark.

Hagan, Nicholas Kofi Akortia; Talburt, John R.

Front Big Data ; 7: 1446071, 2024.

Article in English | MEDLINE | ID: mdl-39314986

ABSTRACT

Data volume has been one of the fastest-growing assets of most real-world applications. This increases the rate of human errors such as duplication of records, misspellings, and erroneous transpositions, among other data quality issues. Entity Resolution is an ETL process that aims to resolve data inconsistencies by ensuring entities are referring to the same real-world objects. One of the main challenges of most traditional Entity Resolution systems is ensuring their scalability to meet the rising data needs. This research aims to refactor a working proof-of-concept entity resolution system called the Data Washing Machine to be highly scalable using the Apache Spark distributed data processing framework. We solve the single-threaded design problem of the legacy Data Washing Machine by using PySpark's Resilient Distributed Dataset and improve the Data Washing Machine design to use intrinsic metadata information from references. We show that our system achieves the same results as the legacy Data Washing Machine using 18 synthetically generated datasets. We also test the scalability of our system using a variety of real-world benchmark ER datasets ranging in size from a few thousand to millions. Our experimental results show that our proposed system performs better than a MapReduce-based Data Washing Machine. We also compared our system with Famer and concluded that our system can find more clusters when given optimal starting parameters for clustering.
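To make the idea of entity resolution concrete, here is a minimal single-machine sketch using only the Python standard library. It is not the Data Washing Machine or SparkDWM algorithm: the similarity measure, the 0.85 threshold, the function names, and the sample references are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_entities(records: list[str], threshold: float = 0.85) -> list[list[int]]:
    """Cluster record indices whose pairwise similarity meets the threshold.

    Matching pairs are merged transitively with a union-find structure,
    so A~B and B~C put A, B, and C in one cluster.
    """
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Compare every pair; this is quadratic, which is the scaling problem
    # that blocking and distributed frameworks are meant to address.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters: dict[int, list[int]] = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Hypothetical references containing a misspelling and a transposition.
references = [
    "John Smith 12 Oak Street",
    "Jon Smith 12 Oak Street",
    "John Smith 21 Oak Street",
    "Mary Jones 5 Elm Road",
]
print(resolve_entities(references))
```

The all-pairs loop is the part that does not scale; a distributed design would typically partition the comparisons across workers rather than run them in one thread.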

13.

Treating chronic atrophic gastritis: identifying sub-population based on real-world TCM electronic medical records.

Wang, Yu-Man; Sun, Jian-Hui; Sun, Run-Xue; Liu, Xiao-Yu; Li, Jing-Fan; Li, Run-Ze; Du, Yan-Ru; Zhou, Xue-Zhong.

Front Pharmacol ; 15: 1444733, 2024.

Article in English | MEDLINE | ID: mdl-39170704

ABSTRACT

Background and Objective: Chronic atrophic gastritis (CAG) is a complex chronic disease caused by multiple factors that occurs frequently in the clinic. The worldwide prevalence of CAG is high. Interestingly, clinical CAG patients often present with a variety of symptom phenotypes, which makes the disease more difficult for clinicians to treat. Therefore, there is an urgent need to improve our understanding of the complexity of the clinical CAG population, obtain more accurate disease subtypes, and explore the relationship between clinical symptoms and medication. To this end, based on the integrated platform of complex networks and clinical research, we classified the collected patients with CAG according to their different clinical characteristics and conducted correlation analysis on the classification results to identify more accurate disease subtypes to aid in personalized clinical treatment. Method: Traditional Chinese medicine (TCM) offers an empirical understanding of the clinical subtypes of complicated disorders since TCM therapy is tailored to the patient's symptom profile. We gathered 6,253 TCM clinical electronic medical records (EMRs) from CAG patients and manually annotated, extracted, and preprocessed the data. A shared symptom-patient similarity network (PSN) was created. CAG patient subgroups were established, and their clinical features were determined through enrichment analysis employing community identification methods. Different clinical features of relevant subgroups were correlated based on effectiveness to identify symptom-botanical drug correspondence. Moreover, network pharmacology was employed to identify possible biological relationships between screened symptoms and medications and to identify various clinical and molecular aspects of the key subtypes using Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis. Results: 5,132 patients were included in the study: 2,699 males (52.60%) and 2,433 females (47.41%). The population was divided into 176 modules. We selected the first 3 modules (M29, M3, and M0) to illustrate the characteristic phenotypes and genotypes of CAG disease subtypes. The M29 subgroup was characterized by gastric fullness disease and internal syndrome of turbidity and poison. The M3 subgroup was characterized by epigastric pain and disharmony between the liver and stomach. The M0 subgroup was characterized by epigastric pain and dampness-heat syndrome. In symptom analysis, the top symptoms for symptom improvement in all three subgroups were stomach pain, bloating, insomnia, poor appetite, and heartburn. However, the three groups were different. The M29 subgroup was more likely to have stomach distention, anorexia, and palpitations. Citrus medica, Solanum nigrum, Jiangcan, Shan ci mushrooms, and Dillon were the most popular botanical drugs. The M3 subgroup had a higher incidence of yellow urine, a bitter tongue, and stomachaches. Smilax glabra, Cyperus rotundus, Angelica sinensis, Conioselinum anthriscoides, and Paeonia lactiflora were the botanical drugs used. Vomiting, nausea, stomach pain, and appetite loss were common in the M0 subgroup. The primary medications were Scutellaria baicalensis, Smilax glabra, Picrorhiza kurroa, Lilium lancifolium, and Artemisia scoparia. Through GO and KEGG pathway analysis, we found that in the M29 subgroup, Citrus medica, Solanum nigrum, Jiangcan, Shan ci mushrooms, and Dillon may exert their therapeutic effects on the symptoms of gastric distension, anorexia, and palpitations by modulating apoptosis and NF-κB signaling pathways. In the M3 subgroup, Smilax glabra, Cyperus rotundus, Angelica sinensis, Conioselinum anthriscoides, and Paeonia lactiflora may act through the NF-κB and JAK-STAT signaling pathways to treat stomach pain, bitter mouth, and yellow urine. In the M0 subgroup, Scutellaria baicalensis, Smilax glabra, Picrorhiza kurroa, Lilium lancifolium, and Artemisia scoparia may exert their therapeutic effects on poor appetite, stomach pain, vomiting, and nausea through the PI3K-Akt signaling pathway. Conclusion: Based on PSN identification and community detection analysis, CAG population division can provide useful recommendations for clinical CAG treatment. This method is useful for CAG illness classification and genotyping investigations and can be used for other complicated chronic diseases.

14.

Evidence-based recommendations for gene-specific ACMG/AMP variant classification from the ClinGen ENIGMA BRCA1 and BRCA2 Variant Curation Expert Panel.

Parsons, Michael T; de la Hoya, Miguel; Richardson, Marcy E; Tudini, Emma; Anderson, Michael; Berkofsky-Fessler, Windy; Caputo, Sandrine M; Chan, Raymond C; Cline, Melissa S; Feng, Bing-Jian; Fortuno, Cristina; Gomez-Garcia, Encarna; Hadler, Johanna; Hiraki, Susan; Holdren, Megan; Houdayer, Claude; Hruska, Kathleen; James, Paul; Karam, Rachid; Leong, Huei San; Martins, Alexandra; Mensenkamp, Arjen R; Monteiro, Alvaro N; Nathan, Vaishnavi; O'Connor, Robert; Pedersen, Inge Sokilde; Pesaran, Tina; Radice, Paolo; Schmidt, Gunnar; Southey, Melissa; Tavtigian, Sean; Thompson, Bryony A; Toland, Amanda E; Turnbull, Clare; Vogel, Maartje J; Weyandt, Jamie; Wiggins, George A R; Zec, Lauren; Couch, Fergus J; Walker, Logan C; Vreeswijk, Maaike P G; Goldgar, David E; Spurdle, Amanda B.

Am J Hum Genet ; 111(9): 2044-2058, 2024 Sep 05.

Article in English | MEDLINE | ID: mdl-39142283

ABSTRACT

The ENIGMA research consortium develops and applies methods to determine clinical significance of variants in hereditary breast and ovarian cancer genes. An ENIGMA BRCA1/2 classification sub-group, formed in 2015 as a ClinGen external expert panel, evolved into a ClinGen internal Variant Curation Expert Panel (VCEP) to align with Food and Drug Administration recognized processes for ClinVar contributions. The VCEP reviewed American College of Medical Genetics and Genomics/Association of Molecular Pathology (ACMG/AMP) classification criteria for relevance to interpreting BRCA1 and BRCA2 variants. Statistical methods were used to calibrate evidence strength for different data types. Pilot specifications were tested on 40 variants and documentation revised for clarity and ease of use. The original criterion descriptions for 13 evidence codes were considered non-applicable or overlapping with other criteria. Scenario of use was extended or re-purposed for eight codes. Extensive analysis and/or data review informed specification descriptions and weights for all codes. Specifications were applied to pilot variants with pre-existing ClinVar classification as follows: 13 uncertain significance or conflicting, 14 pathogenic and/or likely pathogenic, and 13 benign and/or likely benign. Review resolved classification for 11/13 uncertain significance or conflicting variants and retained or improved confidence in classification for the remaining variants. Alignment of pre-existing ENIGMA research classification processes with ACMG/AMP classification guidelines highlighted several gaps in the research processes and the baseline ACMG/AMP criteria. Calibration of evidence strength was key to justify utility and strength of different data types for gene-specific application. The gene-specific criteria demonstrated value for improving ACMG/AMP-aligned classification of BRCA1 and BRCA2 variants.

Subject(s)

BRCA1 Protein , BRCA2 Protein , Genetic Variation , Humans , BRCA2 Protein/genetics , BRCA1 Protein/genetics , Female , Breast Neoplasms/genetics , Genomics/methods , Databases, Genetic , Ovarian Neoplasms/genetics , Genetic Predisposition to Disease , Genetic Testing/methods

15.

ECOTOXr: An R package for reproducible and transparent retrieval of data from EPA's ECOTOX database.

de Vries, Pepijn.

Chemosphere ; 364: 143078, 2024 Sep.

Article in English | MEDLINE | ID: mdl-39181462

ABSTRACT

The US EPA ECOTOX database provides key ecotoxicological data that are crucial in environmental risk assessment. It can be used for computational predictions of toxicity or indications of hazard in a wide range of situations. There is no standardised or formalised method for extracting and subsetting data from the database for these purposes. Consequently, results in such meta-analyses are difficult to reproduce. The present study introduces the software package ECOTOXr, which provides the means to formalise data retrieval from the ECOTOX database in the R scripting language. Three cases are presented to evaluate the performance of the package in relation to earlier data extractions and searches on the website. These cases demonstrate that the package can reproduce data sets relatively well. Furthermore, they illustrate how future studies can further improve traceability and reproducibility by applying the package and adhering to some simple guidelines. This contributes to the FAIR principles, credibility and acceptance of research that uses data from the ECOTOX database.

Subject(s)

Databases, Factual , Software , United States Environmental Protection Agency , United States , Ecotoxicology/methods , Risk Assessment/methods , Reproducibility of Results

16.

Attitudes on data reuse among internal medicine residents.

LaPolla, Fred Willie Zametkin; Milliken, Genevieve; Gillespie, Colleen.

J Med Libr Assoc ; 112(2): 81-87, 2024 Apr 01.

Article in English | MEDLINE | ID: mdl-39119170

ABSTRACT

Background: NYU Langone Health offers a collaborative research block for PGY3 Primary Care residents that employs a secondary data analysis methodology. As discussions of data reuse and secondary data analysis have grown in the data library literature, we sought to understand what attitudes internal medicine residents at a large urban academic medical center had around secondary data analysis. This case report describes a novel survey on resident attitudes around data sharing. Methods: We surveyed internal medicine residents in three tracks: Primary Care (PC), Categorical, and Clinician-Investigator (CI) tracks as part of a larger pilot study on implementation of a research block. All three tracks are in our institution's internal medicine program. In discussions with residency directors and the chief resident, the term "secondary data analysis" was chosen over "data reuse" due to this being more familiar to clinicians, but examples were given to define the concept. Results: We surveyed a population of 162 residents, and 67 residents responded, representing a 41.36% response rate. Strong majorities of residents exhibited positive views of secondary data analysis. Moreover, in our sample, those with exposure to secondary data analysis research opined that secondary data analysis takes less time and is less difficult to conduct compared to the other residents without curricular exposure to secondary analysis. Discussion: The survey reflects that residents believe secondary data analysis is worthwhile and this highlights opportunities for data librarians. As current residents matriculate into professional roles as clinicians, educators, and researchers, libraries have an opportunity to bolster support for data curation and education.
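The reported response rate can be checked directly from the counts in the abstract (67 responses out of 162 residents surveyed). The helper name below is hypothetical; the arithmetic is the only point of the sketch.

```python
def response_rate(responded: int, surveyed: int) -> float:
    """Return the response rate as a percentage rounded to two decimals."""
    if surveyed <= 0:
        raise ValueError("surveyed must be positive")
    return round(100 * responded / surveyed, 2)


# Counts taken from the abstract: 67 of 162 residents responded.
print(response_rate(67, 162))
```

The result agrees with the 41.36% stated in the abstract.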

Subject(s)

Attitude of Health Personnel , Internal Medicine , Internship and Residency , Internship and Residency/statistics & numerical data , Humans , Internal Medicine/education , Surveys and Questionnaires , Male , Female , Adult , Information Dissemination/methods

17.

Erratum: Artificial intelligence based data curation: enabling a patient-centric European health data space.

Front Med (Lausanne) ; 11: 1455319, 2024.

Article in English | MEDLINE | ID: mdl-39045419

ABSTRACT

[This corrects the article DOI: 10.3389/fmed.2024.1365501.].

18.

Researchers' perceptions of the trustworthiness, for reuse purposes, of government health data in Victoria, Australia: Implications for policy and practice.

Riley, Merilyn; Kilkenny, Monique F; Robinson, Kerin; Leggat, Sandra G.

Health Inf Manag ; : 18333583241256049, 2024 Jul 24.

Article in English | MEDLINE | ID: mdl-39045683

ABSTRACT

In 2022 the Australian Data Availability and Transparency Act (DATA) commenced, enabling accredited "data users" to access data from "accredited data service providers." However, the DATA Scheme lacks guidance on "trustworthiness" of the data to be utilised for reuse purposes. Objectives: To determine: (i) Do researchers using government health datasets trust the data? (ii) What factors influence their perceptions of data trustworthiness? and (iii) What are the implications for government and data custodians? Method: Authors of published studies (2008-2020) that utilised Victorian government health datasets were surveyed via a case study approach. Twenty-eight trust constructs (identified via literature review) were grouped into data factors, management properties and provider factors. Results: Fifty experienced health researchers responded. Most (88%) believed that Victorian government health data were trustworthy. When grouped, data factors and management properties were more important than data provider factors in building trust. The most important individual trust constructs were: "compliant with ethical regulation" (100%) and "monitoring privacy and confidentiality" (98%). Constructs of least importance were knowledge of "participant consent" (56%) and "major focus of the data provider was research" (50%). Conclusion: Overall, the researchers trusted government health data, but data factors and data management properties were more important than data provider factors in building trust. Implications: Government should ensure the DATA Scheme incorporates mechanisms to validate that those data utilised by accredited data users and data providers have sufficient quality (intrinsic and extrinsic) to meet the requirements of "trustworthiness," and that evidentiary documentation is provided to support these "accredited data."

19.

Fructose-1,6-bisphosphatase deficiency: estimation of prevalence in the Chinese population and analysis of genotype-phenotype association.

Ni, Qi; Tang, Meiling; Chen, Xiang; Lu, Yulan; Wu, Bingbing; Wang, Huijun; Zhou, Wenhao; Dong, Xinran.

Front Genet ; 15: 1296797, 2024.

Article in English | MEDLINE | ID: mdl-39036704

ABSTRACT

Objective: Fructose-1,6-bisphosphatase deficiency (FBP1D) is a rare inborn error of metabolism due to mutations in the FBP1 gene. The genetic spectrum of FBP1D in China is unknown, and nonspecific manifestations complicate diagnosis. We systematically estimated the FBP1D prevalence in the Chinese population and explored genotype-phenotype association. Methods: We collected 101 FBP1 variants from our cohort and public resources, and manually curated pathogenicity of these variants. Ninety-seven pathogenic or likely pathogenic variants were used in our cohort to estimate Chinese FBP1D prevalence by three methods: 1) carrier frequency, 2) permutation and combination, 3) Bayesian framework. Allele frequencies (AFs) of these variants in our cohort, China Metabolic Analytics Project (ChinaMAP) and gnomAD were compared to reveal the different hotspots in Chinese and other populations. Clinical and genetic information of 122 FBP1D patients from our cohort and published literature were collected to analyze the genotype-phenotype association. Phenotypes of 68 hereditary fructose intolerance (HFI) patients from our previous study were used to compare the phenotypic differences between these two fructose metabolism diseases. Results: The estimated Chinese FBP1D prevalence was 1/1,310,034. In the Chinese population, c.490G>A and c.355G>A had significantly higher AFs than in the non-Finnish European population, and c.841G>A had a significantly lower AF than in the South Asian population (all p values < 0.05). The genotype-phenotype association analyses showed that patients carrying homozygous c.841G>A were more likely to present with increased urinary glycerol, those carrying two CNVs (especially homozygous exon 1 deletion) often had hepatic steatosis, those carrying compound heterozygous variants usually had lethargy, and those carrying homozygous variants usually had ketosis and hepatic steatosis (all p values < 0.05). Compared with HFI patients, FBP1D patients were more likely to present with hypoglycemia, metabolic acidosis, and seizures (all p values < 0.05). Conclusion: The prevalence of FBP1D in the Chinese population is extremely low. Genetic sequencing could effectively help to diagnose FBP1D.

20.

Consistency, completeness and external validity of ethnicity recording in NHS primary care records: a cohort study in 25 million patients' records at source using OpenSAFELY.

Andrews, Colm D; Mathur, Rohini; Massey, Jon; Park, Robin; Curtis, Helen J; Hopcroft, Lisa; Mehrkar, Amir; Bacon, Seb; Hickman, George; Smith, Rebecca; Evans, David; Ward, Tom; Davy, Simon; Inglesby, Peter; Dillingham, Iain; Maude, Steven; O'Dwyer, Thomas; Butler-Cole, Ben F C; Bridges, Lucy; Bates, Chris; Parry, John; Hester, Frank; Harper, Sam; Cockburn, Jonathan; Goldacre, Ben; MacKenna, Brian; Tomlinson, Laurie A; Walker, Alex J; Hulme, William J.

BMC Med ; 22(1): 288, 2024 Jul 10.

Article in English | MEDLINE | ID: mdl-38987774

ABSTRACT

BACKGROUND: Ethnicity is known to be an important correlate of health outcomes, particularly during the COVID-19 pandemic, where some ethnic groups were shown to be at higher risk of infection and adverse outcomes. The recording of patients' ethnic groups in primary care can support research and efforts to achieve equity in service provision and outcomes; however, the coding of ethnicity is known to present complex challenges. We therefore set out to describe ethnicity coding in detail with a view to supporting the use of this data in a wide range of settings, as part of wider efforts to robustly describe and define methods of using administrative data. METHODS: We describe the completeness and consistency of primary care ethnicity recording in the OpenSAFELY-TPP database, containing linked primary care and hospital records in > 25 million patients in England. We also compared the ethnic breakdown in OpenSAFELY-TPP with that of the 2021 UK census. RESULTS: 78.2% of patients registered in OpenSAFELY-TPP on 1 January 2022 had their ethnicity recorded in primary care records, rising to 92.5% when supplemented with hospital data. The completeness of ethnicity recording was higher for women than for men. The rate of primary care ethnicity recording ranged from 77% in the South East of England to 82.2% in the West Midlands. Ethnicity recording rates were higher in patients with chronic or other serious health conditions. For each of the five broad ethnicity groups, primary care recorded ethnicity was within 2.9 percentage points of the population rate as recorded in the 2021 Census for England as a whole. For patients with multiple ethnicity records, 98.7% of the latest recorded ethnicities matched the most frequently coded ethnicity. Patients whose latest recorded ethnicity was categorised as Other were most likely to have a discordant ethnicity recording (32.2%). CONCLUSIONS: Primary care ethnicity data in OpenSAFELY is present for over three quarters of all patients, and combined with data from other sources can achieve a high level of completeness. The overall distribution of ethnicities across all English OpenSAFELY-TPP practices was similar to the 2021 Census, with some regional variation. This report identifies the best available codelist for use in OpenSAFELY and similar electronic health record data.

Subject(s)

Ethnicity , Primary Health Care , State Medicine , Adult , Aged , Female , Humans , Male , Middle Aged , Cohort Studies , England , Ethnicity/statistics & numerical data , Primary Health Care/statistics & numerical data , Infant, Newborn , Infant , Child, Preschool , Child , Adolescent , Young Adult , Aged, 80 and over

