Results 1 - 20 of 64
1.
Data Brief ; 54: 110289, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38586142

ABSTRACT

We present the 'NoSQL Injection Dataset for MongoDB', a comprehensive collection of data obtained from diverse projects focusing on NoSQL attacks on MongoDB databases. In the present era, we can classify databases into three main types: structured, semi-structured, and unstructured. While structured databases have played a prominent role in the past, unstructured databases like MongoDB are currently experiencing remarkable growth, and consequently the vulnerabilities associated with these databases are also increasing. We have therefore gathered a comprehensive dataset comprising 400 NoSQL injection commands, segregated into two categories: 221 malicious commands and 179 benign commands. The dataset was meticulously curated by combining manually authored commands with commands acquired through web scraping from reputable sources. It serves as a valuable resource for studying and analysing NoSQL injection vulnerabilities, offering insights into potential security threats and aiding the development of robust protection mechanisms against such attacks. The dataset includes a blend of complex and simple commands that have been enhanced, and it is well suited for machine learning and data analysis. Security professionals can use this dataset to train or fine-tune AI models or LLMs to achieve higher attack detection accuracy, and security enthusiasts can augment it to generate more NoSQL commands and build robust security tools.
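A minimal sketch of how such a dataset might be used downstream: a keyword heuristic that flags MongoDB command strings containing operators commonly abused in injection payloads. The operator list and the sample commands are illustrative assumptions, not drawn from the dataset itself.

```python
# Hedged sketch: flag suspicious MongoDB operators of the kind a
# NoSQL-injection dataset might label as malicious.
SUSPICIOUS_OPERATORS = {"$where", "$ne", "$gt", "$regex", "$or"}

def looks_malicious(command: str) -> bool:
    """Return True if the raw command string contains an operator
    commonly abused in NoSQL injection payloads."""
    return any(op in command for op in SUSPICIOUS_OPERATORS)

benign = 'db.users.find({"name": "alice"})'
malicious = 'db.users.find({"password": {"$ne": null}})'
print(looks_malicious(benign))     # False
print(looks_malicious(malicious))  # True
```

A labeled dataset like the one described would allow replacing this hand-written rule with a trained classifier.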

2.
IEEE Trans Knowl Data Eng ; 35(10): 10295-10308, 2023 Oct 01.
Article in English | MEDLINE | ID: mdl-37954972

ABSTRACT

In the past decade, many approaches have been suggested to execute ML workloads on a DBMS. However, most of them have looked at in-DBMS ML from a training perspective, whereas ML inference has been largely overlooked. We think this is an important gap to fill for two main reasons: (1) in the near future, every application will be infused with some sort of ML capability; (2) behind every web page, application, and enterprise there is a DBMS, so in-DBMS inference is an appealing solution for efficiency (e.g., less data movement), performance (e.g., cross-optimizations between relational operators and ML), and governance. In this article, we study whether DBMSs are a good fit for prediction serving. We introduce a technique for translating trained ML pipelines containing both featurizers (e.g., one-hot encoding) and models (e.g., linear and tree-based models) into SQL queries, and we compare in-DBMS performance against popular ML frameworks such as Sklearn and ml.net. Our experiments show that, when pushed inside a DBMS, trained ML pipelines can have performance comparable to ML frameworks in several scenarios, while they perform quite poorly on text featurization and over (even simple) neural networks.
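The core idea of compiling a featurizer plus a model into SQL can be sketched as follows: a one-hot encoded categorical column and a linear model become a single CASE expression evaluated inside the database. The weights, table, and column names are invented for illustration; the paper's actual translator handles full Sklearn/ml.net pipelines.

```python
import sqlite3

# Assumed learned weights for a one-hot encoded "color" feature.
coef = {"red": 1.0, "green": 2.0, "blue": 3.0}
intercept = 0.5

# Compile the pipeline into one SQL expression: one-hot + dot product
# collapse into a CASE that picks the weight of the matching category.
case_expr = "CASE color " + " ".join(
    f"WHEN '{c}' THEN {w}" for c, w in coef.items()
) + " ELSE 0 END"
query = f"SELECT id, {case_expr} + {intercept} AS prediction FROM items"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, color TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(1, "red"), (2, "blue")])
for row in conn.execute(query):
    print(row)   # (1, 1.5) then (2, 3.5)
```

Running the prediction as a query means no data leaves the DBMS, which is the efficiency argument the abstract makes.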

3.
BMC Bioinformatics ; 24(1): 354, 2023 Sep 21.
Article in English | MEDLINE | ID: mdl-37735350

ABSTRACT

BACKGROUND: Plummeting DNA sequencing costs in recent years have enabled genome sequencing projects to scale up by several orders of magnitude, transforming genomics into a highly data-intensive field of research. This development provides the much-needed statistical power required for genotype-phenotype predictions in complex diseases. METHODS: To efficiently leverage this wealth of information, we assessed several genomic data science tools. We focused on on-premise installations to cover situations where data confidentiality and compliance regulations rule out cloud-based solutions. We established a comprehensive qualitative and quantitative comparison between BCFtools, SnpSift, Hail, GEMINI, and OpenCGA. The tools were compared in terms of data storage technology, query speed, scalability, annotation, data manipulation, visualization, data output representation, and availability. RESULTS: Tools that leverage sophisticated data structures proved the most suitable for large-scale projects, with varying degrees of scalability, in comparison to flat-file manipulation (e.g., BCFtools and SnpSift). Remarkably, for small to mid-size projects, even a lightweight relational database can be sufficient. CONCLUSION: The assessment criteria provide insights into the typical questions posed in scalable genomics and serve as guidance for the development of scalable computational infrastructure in genomics.


Subject(s)
Data Science , Genomics , Chromosome Mapping , Databases, Factual , Sequence Analysis, DNA
4.
Molecules ; 28(17)2023 Aug 30.
Article in English | MEDLINE | ID: mdl-37687188

ABSTRACT

A two-dimensional (2D) lamellar Zn metal-organic framework (Zn-MOF, 1) with fluorescent 1,6-di(pyridin-3-yl)pyrene (3-DPPy) and 1,4-benzenedicarboxylate (BDC²⁻) bridging linkers was prepared and structurally characterized. The chemical formula of 1 is [Zn(µ-3-DPPy)(µ-BDC)]n. The mononuclear Zn(II) ion, acting as a node, is coordinated by two 3-DPPy and two BDC linkers in a distorted tetrahedral geometry. Topology analysis shows that the Zn-MOF adopts the sql network structure. The undulated 2D sheets of 1 pack tightly together to form a lamellar structure, with the pyrene moieties oriented parallel to one another. The Zn-MOF is not porous, possibly because the mononuclear Zn(II) node did not form cluster-based secondary building units due to the less symmetric 3-DPPy. Steady-state fluorescence measurements indicate that the fluorescence signal of 1 is slightly blue-shifted compared with free 3-DPPy in the solid state: the excimer emission band at 463 nm for crystalline 3-DPPy shifts to 447 nm for 1, which is also blue-shifted relative to nonsubstituted pyrene crystals (470 nm). Despite its nonporosity, the surface Lewis acidic sites of 1 can catalyze the transesterification of esters; surface defect sites are responsible for this catalytic activity.

5.
PeerJ Comput Sci ; 9: e1317, 2023.
Article in English | MEDLINE | ID: mdl-37346735

ABSTRACT

The advent of big data technologies has had a profound impact on various facets of our lives and also presents an opportunity for Chinese audits. However, the heterogeneity of multi-source audit data, the intricacy of converting Chinese into SQL, and the inefficiency of data processing methods present significant obstacles to the growth of Chinese audits. In this article, we propose BDMCA, a big data management system designed for Chinese audits. We developed a hybrid management architecture for handling Chinese audit big data that alleviates the heterogeneity of multi-modal data. Moreover, we defined an R-HBase spatio-temporal meta-structure for auditing purposes, which exhibits almost linear response time and excellent scalability: compared with MD-HBase, R-HBase performs 4.5× and 3× better on range queries and kNN queries, respectively. In addition, we leveraged a slot-value filling method to generate templates and built a multi-topic representation learning model, MRo-SQL, which outperforms the state-of-the-art X-SQL parsing model with improvements of up to 5.2% in logical-form accuracy and up to 5.9% in execution accuracy.

6.
Am J Clin Pathol ; 160(3): 255-260, 2023 09 01.
Article in English | MEDLINE | ID: mdl-37167032

ABSTRACT

OBJECTIVES: Blood culture contamination is a major problem in health care, with significant impacts on both patient safety and cost. Initiatives to reduce blood culture contamination require a reliable, consistent metric to track the success of interventions. The objective of our project was to establish a standardized definition of blood culture contamination suitable for use in a Veterans Health Administration (VHA) national data query, then to validate this definition and query. A secondary objective was to construct a national VHA data dashboard to display the data from this query that could be used in VHA quality improvement projects aimed at reducing blood culture contamination. METHODS: A VHA microbiology expert work group was formed to generate a standardized definition and oversee the validation studies. The standardized definition was used to generate data for calendar year 2021 using a Structured Query Language data query. Twelve VHA hospital microbiology laboratories compared the data from the query against their own locally derived contamination data and recorded those data in a data collection worksheet that all sites used. Data were collated and presented to the work group. RESULTS: More than 50,000 blood culture accessions were in the validation data set, with more than 1,200 contamination events. The overall blood culture contamination rate for the 12 participating facilities was 2.56% with local definitions and data and 2.43% with the standardized definition and data query. The main differences noted between the 2 data sets were deemed to be issues in local definitions. The query and definition were then converted into a national data dashboard that all VHA facilities can now access. CONCLUSIONS: A standardized definition for blood culture contamination and a national data query were validated for enterprise-wide VHA use. To our knowledge, this represents the first reported standardized, validated, and automated approach for calculating and tracking blood culture contamination. This tool will be key in quality initiatives aimed at reducing contamination events in VHA.


Subject(s)
Blood Culture , Delivery of Health Care , Humans
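The rate calculation behind such a contamination query can be sketched in SQL. The table and column names below are assumptions for illustration, not the VHA's actual schema or standardized definition.

```python
import sqlite3

# Toy accession table: one row per blood culture, with a 0/1
# contamination flag assigned under some standardized definition.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE blood_cultures (
    accession    TEXT PRIMARY KEY,
    contaminated INTEGER)""")
conn.executemany("INSERT INTO blood_cultures VALUES (?, ?)",
                 [("A1", 0), ("A2", 1), ("A3", 0), ("A4", 0)])

# Contamination rate = contaminated accessions / all accessions.
(rate,) = conn.execute(
    "SELECT 100.0 * SUM(contaminated) / COUNT(*) FROM blood_cultures"
).fetchone()
print(f"{rate:.2f}%")   # 25.00%
```

The hard part in practice, as the abstract notes, is agreeing on the definition that sets the flag, not the arithmetic of the query.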
7.
Stud Health Technol Inform ; 302: 98-102, 2023 May 18.
Article in English | MEDLINE | ID: mdl-37203617

ABSTRACT

Access to high-quality historical data on hospital patients can facilitate predictive model development and data analysis experiments. This study presents a design for a data-sharing platform based on the available criteria for the Medical Information Mart for Intensive Care (MIMIC)-IV and the emergency department module MIMIC-ED. Tables containing columns of medical attributes and outcomes were studied by a team of 5 experts in medical informatics, who fully agreed on connecting the tables using subject_id, hadm_id, and stay_id as foreign keys. The tables of the two marts were considered along the intra-hospital patient transfer path with its various outcomes. Using these constraints, queries were generated and applied to the backend of the platform. The proposed user interface retrieves records based on various entry criteria and presents the output as a dashboard or a graph. This design is a step toward a platform that is useful for studies of patient trajectory analysis and medical outcome prediction, and for studies that require heterogeneous data entries.


Subject(s)
Medical Informatics , Patient Transfer , Humans , Data Warehousing , Hospitals
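Linking the two marts by their shared keys can be sketched as a SQL join. The tables and rows below are invented for illustration; only the key names (subject_id, hadm_id, stay_id) follow the MIMIC convention.

```python
import sqlite3

# Toy versions of an inpatient admissions table and an ED stays table,
# linked by the subject_id / hadm_id foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE admissions (subject_id INT, hadm_id INT, admit_type TEXT);
CREATE TABLE ed_stays   (subject_id INT, hadm_id INT, stay_id INT);
INSERT INTO admissions VALUES (10, 500, 'EMERGENCY');
INSERT INTO ed_stays   VALUES (10, 500, 9001);
""")

# Follow one patient from the ED stay into the hospital admission.
rows = conn.execute("""
SELECT a.subject_id, e.stay_id, a.admit_type
FROM admissions a
JOIN ed_stays e USING (subject_id, hadm_id)
""").fetchall()
print(rows)   # [(10, 9001, 'EMERGENCY')]
```

Queries of this shape are what the platform's backend would generate from the user's entry criteria.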
8.
Entropy (Basel) ; 25(3)2023 Mar 16.
Article in English | MEDLINE | ID: mdl-36981401

ABSTRACT

Text-to-SQL is the task of converting natural language questions into SQL queries. Recent text-to-SQL models employ two decoding methods, sketch-based and generation-based, but each has its own shortcomings: the sketch-based method is limited in performance because it does not reflect the relevance between SQL elements, while the generation-based method may increase inference time and cause syntactic errors. We therefore propose a novel decoding method, the Hybrid decoder, which combines both. It reflects inter-element SQL information and defines the elements that can be generated, enabling the generation of syntactically accurate SQL queries. Additionally, we introduce a Value prediction module for predicting values in the WHERE clause; it simplifies the decoding process and reduces the vocabulary size by predicting all values at once, regardless of the number of conditions. Our evaluation indicates that the Hybrid decoder improves performance over the sketch-based method by effectively incorporating mutual information among SQL elements, and that it generates SQL queries more efficiently than the generation-based method by simplifying the decoding process. In addition, we design a new evaluation measure to assess whether syntactically correct SQL queries are generated; the results demonstrate that the proposed model generates syntactically accurate SQL.

9.
Sensors (Basel) ; 23(4)2023 Feb 12.
Article in English | MEDLINE | ID: mdl-36850675

ABSTRACT

New techniques and tactics are being used to gain unauthorized access to the web and to harm, steal, and destroy information. Protecting systems against threats such as DDoS, SQL injection, and cross-site scripting is a persistent challenge. This work makes a comparative analysis between normal HTTP traffic and attack traffic to identify attack-indicating parameters and features. Different features of the standard ISCX, CSIC, and CICDDoS datasets were analyzed, and attack and normal traffic were compared across several parameters. A layered architecture model for detecting DDoS, XSS, and SQL injection attacks was developed using a dataset collected from a simulation environment. In the long short-term memory (LSTM)-based layered architecture, the first layer is a DDoS detection model with an accuracy of 97.57%, and the second is an XSS and SQL injection layer with an accuracy of 89.34%. High-rate HTTP traffic is inspected and filtered first, and the remainder is passed to the second layer. A web application firewall (WAF) adds an extra layer of security to the web application by providing application-level filtering that cannot be achieved by a traditional network firewall.

11.
J Autom Reason ; 66(4): 989-1030, 2022.
Article in English | MEDLINE | ID: mdl-36353685

ABSTRACT

SQL is the world's most popular declarative language, forming the basis of the multi-billion-dollar database industry. Although SQL has been standardized, the full standard is based on ambiguous natural language rather than formal specification. Commercial SQL implementations interpret the standard in different ways, so that, given the same input data, the same query can yield different results depending on the SQL system it is run on. Even for a particular system, mechanically checked formalization of all widely used features of SQL remains an open problem. The lack of a well-understood formal semantics makes it very difficult to validate the soundness of database implementations. Although formal semantics for fragments of SQL were designed in the past, they usually did not support set and bag operations, lateral joins, nested subqueries, and, crucially, null values. Null values complicate SQL's semantics in profound ways analogous to null pointers or side-effects in other programming languages. Since certain SQL queries are equivalent in the absence of null values, but produce different results when applied to tables containing incomplete data, semantics which ignore null values are able to prove query equivalences that are unsound in realistic databases. A formal semantics of SQL supporting all the aforementioned features was only proposed recently. In this paper, we report on our mechanization of SQL semantics covering set/bag operations, lateral joins, nested subqueries, and nulls, written in the Coq proof assistant, and describe the validation of key metatheoretic properties. Additionally, we are able to use the same framework to formalize the semantics of a flat relational calculus (with null values), and show a certified translation of its normal forms into SQL.
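The null-value pitfall described above can be made concrete with a small, self-contained example (illustrative, not from the paper's Coq development): the two queries below return the same rows on any table without NULLs, yet a single NULL makes the WHERE predicate evaluate to UNKNOWN, so that row silently disappears.

```python
import sqlite3

# A one-column table containing a NULL alongside ordinary values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

# Without NULLs, "x = 1 OR x <> 1" is a tautology and the queries agree.
all_rows = conn.execute("SELECT x FROM t").fetchall()
filtered = conn.execute("SELECT x FROM t WHERE x = 1 OR x <> 1").fetchall()
print(len(all_rows))   # 3
print(len(filtered))   # 2  -- the NULL row fails both comparisons
```

A semantics that ignores NULLs would wrongly prove these two queries equivalent, which is exactly the unsoundness the paper guards against.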

12.
Anal Biochem ; 655: 114845, 2022 10 15.
Article in English | MEDLINE | ID: mdl-35970411

ABSTRACT

Fetal serum supports the immortal growth of mammalian cell lines in culture, while adult serum leads to the terminal differentiation and death of cells in culture. Many of the proteins in fetal serum that support the indefinite division and growth of cancerous cell lines remain obscure. The peptides and proteins of fetal versus adult serum were analyzed by liquid chromatography, nanoelectrospray ionization, and tandem mass spectrometry (LC-ESI-MS/MS). Three batches of fetal serum contained the alpha-fetoprotein marker, while adult serum batches did not. Insulin (INS), insulin-like growth factor (ILGF), fibroblast growth factor (FGF), epidermal growth factor (EGF), and platelet-derived growth factor (PDGF) were increased in fetal serum. New fetal growth factors including MEGF, HDGFRP, and PSIP1, and soluble growth receptors such as TNFR, EGFR, NTRK2, and THRA were discovered. Addition of insulin or the homeotic transcription factor PSIP1, also referred to as Lens Epithelium Derived Growth Factor (LEDGF), partially restored the rounded phenotype of rapidly dividing cells but was not as effective as fetal serum. Thus, a new growth factor in fetal serum, LEDGF/PSIP1, was directly observed by tandem mass spectrometry and confirmed by add-back experiments to cell culture media alongside insulin.


Subject(s)
Insulins , Tandem Mass Spectrometry , Animals , Epidermal Growth Factor/pharmacology , Intercellular Signaling Peptides and Proteins , Mammals/metabolism , Transcription Factors/genetics
13.
J Magn Reson ; 342: 107268, 2022 09.
Article in English | MEDLINE | ID: mdl-35930941

ABSTRACT

NMR is a valuable experimental tool in the structural biologist's toolkit to elucidate the structures, functions, and motions of biomolecules. The progress of machine learning, particularly in structural biology, reveals the critical importance of large, diverse, and reliable datasets in developing new methods and understanding in structural biology and science more broadly. Biomolecular NMR research groups produce large amounts of data, and there is renewed interest in organizing these data to train new, sophisticated machine learning architectures and to improve biomolecular NMR analysis pipelines. The foundational data type in NMR is the free-induction decay (FID). There are opportunities to build sophisticated machine learning methods to tackle long-standing problems in NMR data processing, resonance assignment, dynamics analysis, and structure determination using NMR FIDs. Our goal in this study is to provide a lightweight, broadly available tool for archiving FID data as it is generated at the spectrometer, and grow a new resource of FID data and associated metadata. This study presents a relational schema for storing and organizing the metadata items that describe an NMR sample and FID data, which we call Spectral Database (SpecDB). SpecDB is implemented in SQLite and includes a Python software library providing a command-line application to create, organize, query, backup, share, and maintain the database. This set of software tools and database schema allow users to store, organize, share, and learn from NMR time domain data. SpecDB is freely available under an open source license at https://github.rpi.edu/RPIBioinformatics/SpecDB.


Subject(s)
Software , Magnetic Resonance Spectroscopy/methods , Nuclear Magnetic Resonance, Biomolecular/methods
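A toy relational schema in the spirit of SpecDB can be sketched in SQLite: one table for samples, one for FID records keyed to a sample. The column names below are assumptions made for illustration; the actual schema is in the SpecDB repository.

```python
import sqlite3

# Minimal two-table schema: sample metadata plus FID records that
# reference a sample via a foreign key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY,
    protein   TEXT,
    buffer    TEXT
);
CREATE TABLE fid (
    fid_id     TEXT PRIMARY KEY,
    sample_id  TEXT REFERENCES sample(sample_id),
    experiment TEXT,
    field_mhz  REAL
);
INSERT INTO sample VALUES ('S1', 'ubiquitin', '50 mM phosphate');
INSERT INTO fid VALUES ('F1', 'S1', 'HSQC', 600.13);
""")

# Query a FID together with the sample metadata that describes it.
row = conn.execute("""
SELECT s.protein, f.experiment FROM fid f
JOIN sample s ON s.sample_id = f.sample_id
""").fetchone()
print(row)   # ('ubiquitin', 'HSQC')
```

Keeping the metadata relational is what makes the archive queryable for machine learning pipelines, as the abstract argues.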
14.
Data Brief ; 42: 108211, 2022 Jun.
Article in English | MEDLINE | ID: mdl-35539028

ABSTRACT

Stakeholders of software development projects have various information needs when making rational decisions in their daily work. Satisfying these needs requires substantial knowledge of where and how the relevant information is stored, and consumes valuable time that is often not available. Easing this knowledge burden is an ideal text-to-SQL benchmark problem, a field where public datasets are scarce and needed. We propose the SEOSS-Queries dataset, consisting of natural language utterances and accompanying SQL queries extracted from previous studies, software projects, issue tracking tools, and expert surveys, covering a large variety of information-need perspectives. The dataset consists of 1,162 English utterances translating into 166 SQL queries; each query has four precise utterances and three more general ones. Furthermore, the dataset contains 393,086 labeled utterances extracted from issue tracker comments. We provide pre-trained SQLNet and RatSQL baseline models for benchmark comparisons and a replication package facilitating seamless application, and we discuss various other tasks that may be solved and evaluated using the dataset. The whole dataset, with paraphrased natural language utterances and SQL queries, is hosted at figshare.com/s/75ed49ef01ac2f83b3e2.

15.
J Diabetes Sci Technol ; 16(5): 1338-1339, 2022 09.
Article in English | MEDLINE | ID: mdl-35531917
16.
Cancer Inform ; 21: 11769351221097593, 2022.
Article in English | MEDLINE | ID: mdl-35586731

ABSTRACT

Advancements in cancer research have enabled researchers and clinicians to access massive amounts of data to aid cancer patients and to add to the existing body of research. However, despite the existence of reliable sources for extracting these data, it remains a challenge to accurately comprehend and draw conclusions from the entirety of available information. The current study therefore aimed to design and develop a database of the variants identified in 5 different cancer types using 20 cancer exomes. The exome data were retrieved from NCBI SRA, and an NGS data clean-up protocol was applied to obtain the best-quality reads. Reads that passed the quality checks were used for variant calling, and the resulting variants were processed and filtered. The filtered variants were then normalized, and the normalized data were used to develop the database. MutaXome, which stands for mutations in cancer exomes, was designed in SQL, with a front end in Bootstrap and HTML and a back end in PHP. The normalized variant data, including single nucleotide polymorphisms (SNPs), were added to MutaXome, which contains detailed information on each type of identified variant. The database, available online via http://www.vidyalab.rf.gd/, serves as a knowledge base for cancer exome variations and holds much potential for enrichment, for example by linking it to a decision support system in prospective studies.

17.
Front Artif Intell ; 5: 807320, 2022.
Article in English | MEDLINE | ID: mdl-35243337

ABSTRACT

Educational data mining research has demonstrated that the large volume of learning data collected by modern e-learning systems could be used to recognize student behavior patterns and group students into cohorts with similar behavior. However, few attempts have been made to connect and compare behavioral patterns with known dimensions of individual differences. To what extent is learner behavior defined by known individual differences? Which of them is a better predictor of learner engagement and performance? Could we use behavior patterns to build a data-driven model of individual differences that is more useful for predicting critical outcomes of the learning process than traditional models? Our paper attempts to answer these questions using a large volume of learner data collected in an online practice system. We apply a sequential pattern mining approach to build individual models of learner practice behavior and reveal latent student subgroups that exhibit considerably different practice behavior. Using these models we explored the connections between learner behavior and both the incoming and outgoing parameters of the learning process. Among incoming parameters we examined traditionally collected individual differences such as self-esteem, gender, and knowledge monitoring skills. We also attempted to bridge the gap between cluster-based behavior pattern models and traditional scale-based models of individual differences by quantifying learner behavior on a latent data-driven scale. Our research shows that this data-driven model of individual differences performs significantly better than traditional models of individual differences in predicting important parameters of the learning process, such as performance and engagement.

18.
Front Oncol ; 12: 1033478, 2022.
Article in English | MEDLINE | ID: mdl-36873303

ABSTRACT

Purpose: To establish a hepatocellular carcinoma (HCC) imaging database and structured imaging reports based on PACS, HIS, and a repository. Methods: This study was approved by the Institutional Review Board. The steps for establishing the database were as follows: 1) according to the standards required for the intelligent diagnosis of HCC, the corresponding functional modules were designed after a requirements analysis; 2) a three-tier architecture model based on the client/server (C/S) mode was adopted. The user interface (UI) receives data entered by users and displays processed data; the business logic layer (BLL) handles the business logic of the data; and the data access layer (DAL) saves the data in the database. Storage and management of the HCC imaging data were realized with the SQL Server database management system, using the Delphi and VC++ programming languages. Results: The test results showed that the proposed database could swiftly obtain pathological, clinical, and imaging data on HCC from the picture archiving and communication system (PACS) and hospital information system (HIS), and could store the data and visualize structured imaging reports. Based on the HCC imaging data, Liver Imaging Reporting and Data System (LI-RADS) assessment, standardized staging, and intelligent imaging analysis were carried out on the high-risk population to establish a one-stop imaging evaluation platform for HCC, strongly supporting clinicians in the diagnosis and treatment of HCC. Conclusions: The establishment of an HCC imaging database not only provides a large amount of imaging data for basic and clinical research on HCC, but also facilitates the scientific management and quantitative assessment of HCC. An HCC imaging database is also advantageous for the personalized treatment and follow-up of HCC patients.

19.
IFAC Pap OnLine ; 55(39): 437-442, 2022.
Article in English | MEDLINE | ID: mdl-38620881

ABSTRACT

The COVID-19 pandemic has impacted every aspect of our society, and countries' health systems are among the worst affected. Our goal is to provide a proof of concept for a cost-effective automated delivery system that hospitals can use to distribute medicine and food to patients in non-intensive wards, minimizing medical personnel's exposure to the virus. Only free and open-source software tools are used. A working proof of concept of the system was created, consisting of a robot platform running ROS, a SQL Server relational database, and a web app. Limitations were identified, and testing was successful. We have shown that, using free and open-source software and tools, it is possible to achieve the goal of creating such a system.

20.
JMIR Med Inform ; 9(12): e32698, 2021 Dec 08.
Article in English | MEDLINE | ID: mdl-34889749

ABSTRACT

BACKGROUND: Electronic medical records (EMRs) are usually stored in relational databases that require SQL queries to retrieve information of interest. Effectively composing such queries can be a challenging task for medical experts due to barriers in expertise. Existing text-to-SQL generation studies have not been fully embraced in the medical domain. OBJECTIVE: The objective of this study was to propose a neural generation model that can jointly consider the characteristics of medical text and the SQL structure to automatically transform medical texts into SQL queries for EMRs. METHODS: We proposed a medical text-to-SQL model (MedTS), which employed a pretrained Bidirectional Encoder Representations From Transformers model as the encoder and leveraged a grammar-based long short-term memory network as the decoder to predict an intermediate representation that can easily be transformed into the final SQL query. We adopted the syntax tree as the intermediate representation rather than directly treating the SQL query as an ordinary word sequence, which is more in line with the tree-structured nature of SQL and also effectively reduces the search space during generation. Experiments were conducted on the MIMICSQL dataset, and 5 competitor methods were compared. RESULTS: Experimental results demonstrated that MedTS achieved accuracies of 0.784 and 0.899 on the test set in terms of logical form and execution, respectively, significantly outperforming the existing state-of-the-art methods. Further analyses showed that the performance on each component of the generated SQL was relatively balanced and offered substantial improvements. CONCLUSIONS: The proposed MedTS was effective and robust for improving the performance of medical text-to-SQL generation, indicating strong potential for application in real medical scenarios.
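The advantage of a tree-shaped intermediate representation can be sketched as follows: the model predicts a structured object, and a deterministic renderer turns it into SQL, so word-level syntax errors cannot occur. This is only an illustration of the idea, not MedTS's actual grammar; the class fields and the example query are invented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Query:
    """A tiny syntax-tree IR for a single-table SELECT."""
    columns: List[str]
    table: str
    where: Optional[Tuple[str, str, str]] = None  # (column, op, value)

def render(q: Query) -> str:
    """Deterministically turn the tree into a SQL string; syntax is
    guaranteed by construction, not by the model's token choices."""
    sql = f"SELECT {', '.join(q.columns)} FROM {q.table}"
    if q.where:
        col, op, val = q.where
        sql += f" WHERE {col} {op} '{val}'"
    return sql

# A decoder would predict this tree; rendering it cannot go wrong.
tree = Query(["name", "dose"], "prescriptions", ("drug", "=", "aspirin"))
print(render(tree))
# SELECT name, dose FROM prescriptions WHERE drug = 'aspirin'
```

Predicting trees also shrinks the search space, since only well-formed structures are reachable, which is the second benefit the abstract cites.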
