1.
Anal Chem ; 96(6): 2351-2359, 2024 Feb 13.
Article in English | MEDLINE | ID: mdl-38308813

ABSTRACT

The accurate prediction of suitable chiral stationary phases (CSPs) for resolving the enantiomers of a given compound poses a significant challenge in chiral chromatography. Previous attempts at developing machine learning models for structure-based CSP prediction have primarily relied on 1D SMILES strings [the simplified molecular-input line-entry system (SMILES) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings] or 2D graphical representations of molecular structures and have met with only limited success. In this study, we apply the recently developed 3D molecular conformation representation learning algorithm, which uses rapid conformational analysis and point clouds of atom positions in the 3D space, enabling efficient chemical structure-based machine learning. By harnessing the power of the rapid 3D molecular representation learning and a data set comprising over 300,000 chromatographic enantioseparation records sourced from the literature, our models afford notable improvements for the chemical structure-based choice of appropriate CSP for enantioseparation, paving the way for more efficient and informed decision-making in the field of chiral chromatography.

2.
Nat Commun ; 14(1): 7974, 2023 Dec 02.
Article in English | MEDLINE | ID: mdl-38042873

ABSTRACT

De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides us with a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we present PepNet, a fully convolutional neural network for high accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input, and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g. PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, thus being more suitable for the analysis of large-scale proteomics data.


Subject(s)
Sequence Analysis, Protein , Tandem Mass Spectrometry , Humans , Tandem Mass Spectrometry/methods , Sequence Analysis, Protein/methods , Peptides , Amino Acid Sequence , Neural Networks, Computer , Algorithms , Peptide Library
3.
Biochem Biophys Res Commun ; 671: 10-17, 2023 09 03.
Article in English | MEDLINE | ID: mdl-37290279

ABSTRACT

α-Amylase plays a crucial role in regulating metabolism and health by hydrolyzing starch and glycogen. Despite comprehensive studies of this classic enzyme spanning over a century, the function of its carboxyl terminal domain (CTD), which contains eight conserved β-strands, is still not fully understood. Amy63, identified from a marine bacterium, was reported as a novel multifunctional enzyme with amylase, agarase and carrageenase activities. In this study, the crystal structure of Amy63 was determined at 1.8 Å resolution, revealing high conservation with some other amylases. Interestingly, the independent amylase activity of the carboxyl terminal domain of Amy63 (Amy63_CTD) was newly discovered by a plate-based assay and mass spectrometry. To date, Amy63_CTD alone could be regarded as the smallest amylase subunit. Moreover, significant amylase activity of Amy63_CTD was measured over a wide range of temperature and pH, with optimal activity at 60 °C and pH 7.5. Small-angle X-ray scattering (SAXS) data showed that a high-order oligomeric assembly gradually formed with increasing concentration of Amy63_CTD, implying a novel catalytic mechanism as revealed by the assembly structure. Therefore, the discovery of the novel independent amylase activity of Amy63_CTD suggests a possible missing step or a new perspective in the complex catalytic process of Amy63 and other related α-amylases. This work may shed light on the design of nanozymes to process marine polysaccharides efficiently.


Subject(s)
Amylases , alpha-Amylases , Scattering, Small Angle , X-Ray Diffraction , alpha-Amylases/chemistry , alpha-Amylases/metabolism , Starch/metabolism , Hydrogen-Ion Concentration
4.
Bioinformatics ; 39(6), 2023 06 01.
Article in English | MEDLINE | ID: mdl-37252828

ABSTRACT

MOTIVATION: Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods were proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds' 3D conformations, and thus neglected critical structural information. RESULTS: We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The results showed that 3DMolMS predicted the spectra with average cosine similarities of 0.691 and 0.478 to the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as the elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification. AVAILABILITY AND IMPLEMENTATION: The code of 3DMolMS is available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.
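
The cosine similarity used as the evaluation metric in this abstract can be illustrated with a minimal sketch. It assumes spectra are given as lists of (m/z, intensity) peaks that are summed into fixed-width bins before comparison; the bin width, the m/z range, and the function names are illustrative assumptions and are not taken from the 3DMolMS code.

```python
import math

def bin_spectrum(peaks, bin_width=1.0, max_mz=1000.0):
    """Sum peak intensities into fixed-width m/z bins.

    bin_width and max_mz are assumed values chosen for illustration.
    """
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:
            vec[idx] += intensity
    return vec

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length intensity vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

# Hypothetical predicted and experimental peak lists (m/z, intensity).
predicted = [(100.2, 50.0), (250.7, 100.0)]
experimental = [(100.4, 40.0), (300.1, 80.0)]
similarity = cosine_similarity(bin_spectrum(predicted), bin_spectrum(experimental))
```

Only the peaks near m/z 100 fall into a shared bin here, so the similarity is low; two spectra with identical binned intensities would score 1.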


Subject(s)
Tandem Mass Spectrometry , Tandem Mass Spectrometry/methods , Chromatography, Liquid/methods , Molecular Conformation
5.
Proteomes ; 11(1), 2023 Feb 11.
Article in English | MEDLINE | ID: mdl-36810564

ABSTRACT

Staphylococcus aureus is one of the major community-acquired human pathogens, with growing multidrug resistance, leading to a major threat of more prevalent infections to humans. A variety of virulence factors and toxic proteins are secreted during infection via the general secretory (Sec) pathway, which requires an N-terminal signal peptide to be cleaved from the N-terminus of the protein. This N-terminal signal peptide is recognized and processed by a type I signal peptidase (SPase). SPase-mediated signal peptide processing is the crucial step in the pathogenicity of S. aureus. In the present study, SPase-mediated N-terminal protein processing and its cleavage specificity were evaluated using a combination of N-terminal amidination bottom-up and top-down proteomics-based mass spectrometry approaches. Secretory proteins were found to be cleaved by SPase, specifically and non-specifically, on both sides of the normal SPase cleavage site. The non-specific cleavages occur, to a lesser extent, at relatively small residues located next to the -1, +1, and +2 positions relative to the original SPase cleavage site. Additional random cleavages in the middle and near the C-terminus of some protein sequences were also observed. This additional processing could be a part of some stress conditions and unknown signal peptidase mechanisms.

6.
J Proteome Res ; 22(5): 1501-1509, 2023 05 05.
Article in English | MEDLINE | ID: mdl-36802412

ABSTRACT

Liquid chromatography coupled with tandem mass spectrometry is commonly adopted in large-scale glycoproteomic studies involving hundreds of disease and control samples. The software for glycopeptide identification in such data (e.g., the commercial software Byonic) analyzes each data set individually and does not exploit the redundant spectra of glycopeptides present in related data sets. Herein, we present a novel concurrent approach for glycopeptide identification in multiple related glycoproteomic data sets by using spectral clustering and spectral library searching. The evaluation on two large-scale glycoproteomic data sets showed that the concurrent approach can identify 105%-224% more spectra as glycopeptides compared to glycopeptide identification on individual data sets using Byonic alone. The improved glycopeptide identification also enabled the discovery of several potential biomarkers of protein glycosylation in hepatocellular carcinoma patients.


Subject(s)
Liver Neoplasms , Tandem Mass Spectrometry , Humans , Tandem Mass Spectrometry/methods , Glycopeptides/analysis , Chromatography, Liquid , Software
7.
Gut Microbes ; 14(1): 2135963, 2022.
Article in English | MEDLINE | ID: mdl-36289064

ABSTRACT

Clostridioides difficile infection (CDI) is a gastrointestinal (GI) infection that illustrates how perturbations in symbiotic host-microbiome interactions render the GI tract vulnerable to opportunistic pathogens. CDI also serves as an example of how such perturbations could be reversed via gut microbiota modulation mechanisms, especially fecal microbiota transplantation (FMT). However, microbiome-mediated diagnosis of CDI remains understudied. Here, we evaluated the diagnostic capabilities of the fecal microbiome for the prediction of CDI. We used the metagenomic sequencing data from ten previous studies, encompassing those acquired from CDI patients treated by FMT, CDI-negative patients presenting other intestinal health conditions, and healthy volunteers taking antibiotics. We designed a hybrid species/function profiling approach that determines the abundances of microbial species in the community contributing to its functional profile. These functionally informed taxonomic profiles were then used for classification of the microbial samples. We trained logistic regression (LR) models on these features, which showed high prediction accuracy (with an average AUC≥0.91), substantiating that the species/function composition of the gut microbiome is a robust diagnostic predictor of CDI. We further assessed the confounding impact of antibiotic therapy on CDI prediction and found that it is distinguishable from the CDI impact. Finally, we devised a log-odds score computed from the output of the LR models to quantify the likelihood of CDI in a gut microbiome sample and applied it to evaluating the effectiveness of FMT based on post-FMT microbiome samples. The results showed that the gut microbiome of patients exhibited a gradual but steady improvement after receiving successful FMT, indicating the restoration of normal microbiome functions.
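
The relationship between a logistic regression output and a log-odds score can be sketched as follows. This is a generic illustration of the standard logistic model, not the scoring procedure of the study: the feature values, weights, and intercept are hypothetical, and the abstract does not specify how the authors derive their score from the model output.

```python
import math

def log_odds(features, weights, intercept):
    """Linear predictor of a logistic regression model, i.e. log(p / (1 - p))."""
    return intercept + sum(w * x for w, x in zip(weights, features))

def probability(score):
    """Convert a log-odds score back into a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical two-feature sample and model parameters (not from the study).
sample = [2.0, 0.0]
weights = [0.5, -0.25]
intercept = 1.0

score = log_odds(sample, weights, intercept)  # positive values favor the positive class
p = probability(score)
```

A score of zero corresponds to a probability of 0.5, so tracking the score over successive samples shows on a symmetric scale whether a sample moves toward or away from the predicted class.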


Subject(s)
Clostridioides difficile , Clostridium Infections , Gastrointestinal Microbiome , Microbiota , Humans , Clostridioides difficile/genetics , Clostridium Infections/therapy , Fecal Microbiota Transplantation/methods , Anti-Bacterial Agents/therapeutic use , Treatment Outcome
8.
J Am Med Inform Assoc ; 29(12): 2182-2190, 2022 11 14.
Article in English | MEDLINE | ID: mdl-36164820

ABSTRACT

Concerns regarding inappropriate leakage of sensitive personal information as well as unauthorized data use are increasing with the growth of genomic data repositories. Therefore, privacy and security of genomic data have become increasingly important and need to be studied. Many protection techniques have been proposed, and their applicability in support of biomedical research should be well understood. For this purpose, we have organized a community effort over the past 8 years through the Integrating Data for Analysis, Anonymization and Sharing (iDASH) consortium to address this practical challenge. In this article, we summarize our experience from these competitions, report lessons learned from the events in 2020/2021 as examples, and discuss potential future research directions in this emerging field.


Subject(s)
Computer Security , Privacy , Data Analysis , Genomics , Genome
9.
Anal Chem ; 94(28): 10003-10010, 2022 07 19.
Article in English | MEDLINE | ID: mdl-35776110

ABSTRACT

Glycosylation is a post-translational modification involved in many important biological functions. The aberrant alteration of glycan structure is implicated in the malfunction of cells and possesses potential significance in the medical diagnosis of complex diseases such as cancer. Liquid chromatography tandem mass spectrometry (LC-MS/MS) has been commonly applied to the analysis of complex glycomic samples. However, the characterization of isomeric glycans from their MS/MS spectra in complex biological samples remains challenging. In this paper, we present a novel reciprocal best-hit glycan-spectrum matching (RB-GSM) approach toward characterizing N-glycans. In this method, the MS/MS spectra in the input data set are evaluated against all glycans with the matched precursor mass using customized scoring functions, where a glycan-spectrum matching (GSM) is considered to be true if it is a reciprocal best-hit, that is, it receives the highest score among not only the GSMs between the respective spectrum and all matched glycans, but also the GSMs between the respective glycan and all matched MS/MS spectra in the input data set. We evaluated this RB-GSM approach on N-glycan identification using MS/MS spectra acquired from glycan standards as well as those released from the model glycoprotein fetuin, immunoglobulin G, and human serum samples, which showed that RB-GSM is capable of distinguishing isomeric glycans.
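
The reciprocal best-hit criterion described in this abstract can be sketched in a few lines. The sketch assumes the glycan-spectrum scores have already been computed and are supplied as a dictionary; the identifiers and score values are hypothetical, and the customized scoring functions of RB-GSM are not reproduced here.

```python
def reciprocal_best_hits(scores):
    """Return (spectrum, glycan) pairs that are each other's top-scoring match.

    scores maps (spectrum_id, glycan_id) to a numeric score.
    Ties are resolved in favor of the pair encountered first.
    """
    best_for_spectrum = {}  # spectrum_id -> (glycan_id, score)
    best_for_glycan = {}    # glycan_id -> (spectrum_id, score)
    for (spectrum, glycan), score in scores.items():
        if spectrum not in best_for_spectrum or score > best_for_spectrum[spectrum][1]:
            best_for_spectrum[spectrum] = (glycan, score)
        if glycan not in best_for_glycan or score > best_for_glycan[glycan][1]:
            best_for_glycan[glycan] = (spectrum, score)
    return sorted(
        (spectrum, glycan)
        for spectrum, (glycan, _) in best_for_spectrum.items()
        if best_for_glycan[glycan][0] == spectrum
    )

# Hypothetical scores for two spectra and two candidate glycans.
example_scores = {
    ("s1", "gA"): 0.9,
    ("s1", "gB"): 0.4,
    ("s2", "gA"): 0.7,
    ("s2", "gB"): 0.6,
}
hits = reciprocal_best_hits(example_scores)
```

In this example spectrum s2 also scores best against glycan gA, but gA scores higher against s1, so only the pair (s1, gA) is accepted.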


Subject(s)
Polysaccharides , Tandem Mass Spectrometry , Chromatography, Liquid/methods , Glycosylation , Humans , Isomerism , Polysaccharides/chemistry , Tandem Mass Spectrometry/methods
10.
J Comput Biol ; 29(7): 738-751, 2022 07.
Article in English | MEDLINE | ID: mdl-35584271

ABSTRACT

Microbial organisms play important roles in many aspects of human health and diseases. Encouraged by the numerous studies that show the association between microbiomes and human diseases, computational and machine learning methods have been recently developed to generate and utilize microbiome features for prediction of host phenotypes such as disease versus healthy or cancer immunotherapy responder versus nonresponder. We have previously developed a subtractive assembly approach, which focuses on extraction and assembly of differential reads from metagenomic data sets that are likely sampled from differential genomes or genes between two groups of microbiome data sets (e.g., healthy vs. disease). In this article, we further improved our subtractive assembly approach by utilizing groups of k-mers with similar abundance profiles across multiple samples. We implemented a locality-sensitive hashing (LSH)-enabled approach (called kmerLSHSA) to group billions of k-mers into k-mer coabundance groups (kCAGs), which were subsequently used for the retrieval of differential kCAGs for subtractive assembly. Testing of the kmerLSHSA approach on simulated data sets and real microbiome data sets showed that, compared with the conventional approach that utilizes all genes, our approach can quickly identify differential genes that can be used for building promising predictive models for microbiome-based host phenotype prediction. We also discussed other potential applications of LSH-enabled clustering of k-mers according to their abundance profiles across multiple microbiome samples.
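
Grouping k-mers by abundance profile with locality-sensitive hashing can be illustrated with a minimal sketch. It uses random-hyperplane (sign) hashing as an assumed LSH scheme; the abstract does not state which hash family kmerLSHSA uses, and the k-mers, profiles, and parameter values below are hypothetical.

```python
import random

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random Gaussian hyperplanes; the seed is fixed for reproducibility."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(profile, hyperplanes):
    """One bit per hyperplane: which side of the plane the profile falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, profile)) >= 0 else 0
        for plane in hyperplanes
    )

def group_by_signature(profiles, hyperplanes):
    """Bucket k-mers whose abundance profiles hash to the same signature."""
    groups = {}
    for kmer, profile in profiles.items():
        groups.setdefault(signature(profile, hyperplanes), []).append(kmer)
    return groups

# Hypothetical abundance profiles across three samples.
profiles = {
    "ACGT": [1.0, 2.0, 3.0],
    "CGTA": [2.0, 4.0, 6.0],  # same shape as ACGT at twice the abundance
}
planes = make_hyperplanes(n_planes=8, dim=3)
groups = group_by_signature(profiles, planes)
```

Because sign hashing depends only on the direction of a profile, the two proportional profiles land in the same bucket. In practice profiles would likely be normalized or centered first, and several hash tables would be combined to trade off precision against recall.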


Subject(s)
Metagenomics , Microbiota , Cluster Analysis , Metagenome , Metagenomics/methods , Microbiota/genetics , Phenotype
11.
Front Chem ; 9: 707382, 2021.
Article in English | MEDLINE | ID: mdl-34211962

ABSTRACT

The retention time provides critical information for glycan annotation and quantification from Liquid Chromatography Mass Spectrometry (LC-MS) data. However, the precise retention time of glycans is highly dependent on experimental conditions such as the specific separating columns, MS instruments and/or the buffer used. This variation hampers the exploitation of retention time for glycan annotation from LC-MS data, especially when inter-laboratory data are compared. To compare the retention times of glycans across experiments, the Glucose Unit Index (GUI) can be computed using a dextrin ladder as an internal standard. The retention times of glycans are then calibrated with respect to glucose units derived from dextrin ladders. Despite the successful application of the GUI approach, the manual calibration process is quite tedious and often error prone. In this work, we present a standalone software tool, GlycanGUI, with a graphical user interface to automatically carry out the GUI-based glycan annotation/quantification and subsequent data analysis. When tested on experimental data, GlycanGUI reported accurate GUI values compared with manual calibration, and thus is ready to be used for automated glycan annotation and quantification using GUI.
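
Calibrating a retention time against a dextrin ladder can be sketched as an interpolation between ladder peaks. The sketch assumes piecewise-linear interpolation between neighboring ladder points, which is a simplification chosen for illustration (a polynomial fit is another common choice); the ladder values are hypothetical, and this is not GlycanGUI's implementation.

```python
def glucose_unit(retention_time, ladder):
    """Interpolate a glucose unit value for a retention time.

    ladder is a list of (retention_time, glucose_units) pairs with strictly
    increasing retention times, taken from the dextrin ladder run.
    """
    for (rt_lo, gu_lo), (rt_hi, gu_hi) in zip(ladder, ladder[1:]):
        if rt_lo <= retention_time <= rt_hi:
            fraction = (retention_time - rt_lo) / (rt_hi - rt_lo)
            return gu_lo + fraction * (gu_hi - gu_lo)
    raise ValueError("retention time lies outside the ladder range")

# Hypothetical ladder: retention time in minutes, glucose units of each peak.
ladder = [(10.0, 2), (14.0, 3), (20.0, 4)]
gu_value = glucose_unit(12.0, ladder)
```

A glycan eluting halfway between the ladder peaks for 2 and 3 glucose units receives a value of 2.5, which can then be compared across runs whose absolute retention times differ.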

12.
Bioinformatics ; 37(Suppl_1): i161-i168, 2021 07 12.
Article in English | MEDLINE | ID: mdl-34252973

ABSTRACT

MOTIVATION: The availability of human genomic data, together with the enhanced capacity to process them, is leading to transformative technological advances in biomedical science and engineering. However, the public dissemination of such data has been difficult due to privacy concerns. Specifically, it has been shown that the presence of a human subject in a case group can be inferred from the shared summary statistics of the group, e.g. the allele frequencies, or even the presence/absence of genetic variants (e.g. shared by the Beacon project) in the group. These methods rely on the availability of the target's genome, i.e. the DNA profile of a target human subject, and thus are often referred to as membership inference methods. RESULTS: In this article, we demonstrate that haplotypes, i.e. the sequences of single nucleotide variations (SNVs) showing strong genetic linkages in human genome databases, may be inferred from the summary of genomic data without using a target's genome. Furthermore, novel haplotypes that did not appear in the database may be reconstructed solely from the allele frequencies from genomic datasets. These reconstructed haplotypes can be used for a haplotype-based membership inference algorithm to identify target subjects in a case group with greater power than existing methods based on SNVs. AVAILABILITY AND IMPLEMENTATION: The implementation of the membership inference algorithms is available at https://github.com/diybu/Haplotype-based-membership-inferences.


Subject(s)
Genome, Human , Genomics , Algorithms , Gene Frequency , Haplotypes , Humans
13.
Nat Commun ; 12(1): 3445, 2021 06 08.
Article in English | MEDLINE | ID: mdl-34103512

ABSTRACT

To fully utilize the advances in omics technologies and achieve a more comprehensive understanding of human diseases, novel computational methods are required for integrative analysis of multiple types of omics data. Here, we present a novel multi-omics integrative method named Multi-Omics Graph cOnvolutional NETworks (MOGONET) for biomedical classification. MOGONET jointly explores omics-specific learning and cross-omics correlation learning for effective multi-omics data classification. We demonstrate that MOGONET outperforms other state-of-the-art supervised multi-omics integrative analysis approaches in different biomedical classification applications using mRNA expression data, DNA methylation data, and microRNA expression data. Furthermore, MOGONET can identify important biomarkers from different omics data types related to the investigated biomedical problems.


Subject(s)
Algorithms , Biomarkers/analysis , Genomics , Alzheimer Disease/genetics , Breast Neoplasms/genetics , Databases, Genetic , Female , Humans
14.
J Proteome Res ; 20(6): 3345-3352, 2021 06 04.
Article in English | MEDLINE | ID: mdl-34010560

ABSTRACT

Glycosylation is one of the most common post-translational modifications (PTMs), occurring in a large variety of proteins with important biological functions in humans and other higher organisms. Liquid chromatography tandem mass spectrometry (LC-MS/MS) has been routinely used to characterize site-specific protein glycosylation at high throughput in complex glycoproteomic samples. Recently, electron transfer/high-energy collision dissociation (EThcD) was introduced for glycopeptide identification, which offers rich structural information on glycopeptides with the fragment ions from the cleavages of both the glycan and the peptide backbone. Herein, we present the software GlycoHybridSeq for automated interpretation of EThcD-MS/MS spectra from glycoproteomic data using a customized scoring function, which enables the functionalities of identifying glycopeptides, characterizing glycosylation sites, and distinguishing some isomeric glycans. We evaluated GlycoHybridSeq on glycoproteomic data collected for cancer biomarker discovery. The results showed that it achieved comparable or better performance than that of Byonic and MSFragger. GlycoHybridSeq is released as open-source software and is ready to be used in large-scale glycoproteomic data analyses.


Subject(s)
Glycopeptides , Tandem Mass Spectrometry , Chromatography, Liquid , Electrons , Glycosylation , Humans
15.
Proc (Int Conf Dependable Syst Netw) ; 2021: 413-425, 2021 Jun.
Article in English | MEDLINE | ID: mdl-35919377

ABSTRACT

A trusted execution environment (TEE) such as Intel Software Guard Extensions (SGX) runs attestation to prove to a data owner the integrity of the initial state of an enclave, including the program to operate on her data. For this purpose, the data-processing program is supposed to be open to the owner or a trusted third party, so its functionality can be evaluated before trust is established. In the real world, however, there are increasingly application scenarios in which the program itself needs to be protected (e.g., a proprietary algorithm), so its compliance with privacy policies as expected by the data owner should be verified without exposing its code. To this end, this paper presents Deflection, a new model for TEE-based delegated and flexible in-enclave code verification. Given that the conventional solutions do not work well under the resource-limited and TCB-frugal TEE, we come up with a new design inspired by Proof-Carrying Code. Our design strategically moves most of the workload to the code generator, which is responsible for producing easy-to-check code, while keeping the consumer simple. Also, the whole consumer can be made public and verified through a conventional attestation. We implemented this model on Intel SGX and demonstrate that it introduces only a very small part of the TCB. We also thoroughly evaluated its performance on micro- and macro-benchmarks and real-world applications, showing that the design only incurs a small overhead when enforcing several categories of security policies.

16.
IEEE Int Conf Cloud Comput ; 2021: 733-743, 2021 Sep.
Article in English | MEDLINE | ID: mdl-35662807

ABSTRACT

Trusted execution environments (TEEs) such as Intel's Software Guard Extensions (SGX) have been widely studied to boost security and privacy protection for the computation of sensitive data such as human genomics. However, SGX often imposes a performance hurdle, especially because of the small enclave memory. In this paper, we propose a new Hybrid Secured Flow framework (called "HySec-Flow") for large-scale genomic data analysis using SGX platforms. Here, the data-intensive computing tasks can be partitioned into independent subtasks to be deployed into distinct secured and non-secured containers, therefore allowing for parallel execution while alleviating the limited size of Enclave Page Cache (EPC) memory in each enclave. We illustrate our contributions using a workflow supporting indexing, alignment, dispatching, and merging the execution of SGX-enabled containers. We provide details regarding the architecture of the trusted and untrusted components and the underlying Scone and Graphene support as generic shielding execution frameworks to port legacy code. We thoroughly evaluate the performance of our privacy-preserving reads mapping algorithm using real human genome sequencing data. The results demonstrate that the performance is enhanced by partitioning the time-consuming genomic computation into subtasks compared to the conventional execution of the data-intensive reads mapping algorithm in an enclave. The proposed HySec-Flow framework is made available as open source and can be adapted to the data-parallel computation of other large-scale genomic tasks requiring security and scalable computational resources.

17.
Nat Microbiol ; 6(1): 123-135, 2021 01.
Article in English | MEDLINE | ID: mdl-33139880

ABSTRACT

Viruses and plasmids (invasive mobile genetic elements (iMGEs)) have important roles in shaping microbial communities, but their dynamic interactions with CRISPR-based immunity remain unresolved. We analysed generation-resolved iMGE-host dynamics spanning one and a half years in a microbial consortium from a biological wastewater treatment plant using integrated meta-omics. We identified 31 bacterial metagenome-assembled genomes encoding complete CRISPR-Cas systems and their corresponding iMGEs. CRISPR-targeted plasmids outnumbered their bacteriophage counterparts by at least fivefold, highlighting the importance of CRISPR-mediated defence against plasmids. Linear modelling of our time-series data revealed that the variation in plasmid abundance over time explained more of the observed community dynamics than phages. Community-scale CRISPR-based plasmid-host and phage-host interaction networks revealed an increase in CRISPR-mediated interactions coinciding with a decrease in the dominant 'Candidatus Microthrix parvicella' population. Protospacers were enriched in sequences targeting genes involved in the transmission of iMGEs. Understanding the factors shaping the fitness of specific populations is necessary to devise control strategies for undesirable species and to predict or explain community-wide phenotypes.


Subject(s)
Bacteria/genetics , Bacteriophages/genetics , CRISPR-Cas Systems/genetics , Microbial Interactions/genetics , Plasmids/genetics , Bacteria/virology , Clustered Regularly Interspaced Short Palindromic Repeats/genetics , Genome, Bacterial/genetics , Metagenome/genetics , Microbial Consortia/genetics , Microbial Interactions/physiology , Sewage/microbiology , Water Purification
18.
Nat Commun ; 11(1): 5281, 2020 10 19.
Article in English | MEDLINE | ID: mdl-33077707

ABSTRACT

The development of reliable, mixed-culture biotechnological processes hinges on understanding how microbial ecosystems respond to disturbances. Here we reveal extensive phenotypic plasticity and niche complementarity in oleaginous microbial populations from a biological wastewater treatment plant. We perform meta-omics analyses (metagenomics, metatranscriptomics, metaproteomics and metabolomics) on in situ samples over 14 months at weekly intervals. Based on 1,364 de novo metagenome-assembled genomes, we uncover four distinct fundamental niche types. Throughout the time-series, we observe a major, transient shift in community structure, coinciding with substrate availability changes. Functional omics data reveals extensive variation in gene expression and substrate usage amongst community members. Ex situ bioreactor experiments confirm that responses occur within five hours of a pulse disturbance, demonstrating rapid adaptation by specific populations. Our results show that community resistance and resilience are a function of phenotypic plasticity and niche complementarity, and set the foundation for future ecological engineering efforts.


Subject(s)
Bacteria/genetics , Bacteria/metabolism , Microbiota , Wastewater/microbiology , Bacteria/classification , Bacteria/isolation & purification , Bioreactors/microbiology , Ecosystem , Metabolomics , Metagenome , Metagenomics , Proteomics , Time Factors
19.
Bioinformatics ; 36(Suppl_1): i128-i135, 2020 07 01.
Article in English | MEDLINE | ID: mdl-32657380

ABSTRACT

MOTIVATION: The generalized linear mixed model (GLMM) is an extension of the generalized linear model (GLM) in which the linear predictor takes random effects into account. Given its power of precisely modeling the mixed effects from multiple sources of random variations, the method has been widely used in biomedical computation, for instance in the genome-wide association studies (GWASs) that aim to detect genetic variance significantly associated with phenotypes such as human diseases. Collaborative GWAS on large cohorts of patients across multiple institutions is often impeded by the privacy concerns of sharing personal genomic and other health data. To address such concerns, we present in this paper a privacy-preserving Expectation-Maximization (EM) algorithm to build GLMM collaboratively when input data are distributed to multiple participating parties and cannot be transferred to a central server. We assume that the data are horizontally partitioned among participating parties: i.e. each party holds a subset of records (including observational values of fixed effect variables and their corresponding outcome), and for all records, the outcome is regulated by the same set of known fixed effects and random effects. RESULTS: Our collaborative EM algorithm is mathematically equivalent to the original EM algorithm commonly used in GLMM construction. The algorithm also runs efficiently when tested on simulated and real human genomic data, and thus can be practically used for privacy-preserving GLMM construction. We implemented the algorithm for collaborative GLMM (cGLMM) construction in R. The data communication was implemented using the rsocket package. AVAILABILITY AND IMPLEMENTATION: The software is released in open source at https://github.com/huthvincent/cGLMM. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subject(s)
Genome-Wide Association Study , Privacy , Genomics , Humans , Linear Models , Software