Results 1 - 20 of 63
1.
Data Brief ; 56: 110821, 2024 Oct.
Article in English | MEDLINE | ID: mdl-39252785

ABSTRACT

Fruits are mature ovaries of flowering plants that are integral to human diets, providing essential nutrients such as vitamins, minerals, fiber and antioxidants that are crucial for health and disease prevention. Accurate classification and segmentation of fruits are crucial in the agricultural sector for enhancing the efficiency of sorting and quality control processes, which significantly benefit automated systems by reducing labor costs and improving product consistency. This paper introduces the "FruitSeg30_Segmentation Dataset & Mask Annotations", a novel dataset designed to advance the capability of deep learning models in fruit segmentation and classification. Comprising 1969 high-quality images across 30 distinct fruit classes, this dataset provides diverse visuals essential for a robust model. Utilizing a U-Net architecture, the model trained on this dataset achieved training accuracy of 94.72 %, validation accuracy of 92.57 %, precision of 94 %, recall of 91 %, f1-score of 92.5 %, IoU score of 86 %, and maximum dice score of 0.9472, demonstrating superior performance in segmentation tasks. The FruitSeg30 dataset fills a critical gap and sets new standards in dataset quality and diversity, enhancing agricultural technology and food industry applications.
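The IoU and Dice scores reported for the U-Net model are standard overlap metrics between a predicted mask and its ground-truth annotation. A minimal sketch in Python (the masks below are illustrative, not from the dataset):

```python
def iou_and_dice(pred, truth):
    """Compute IoU (Jaccard) and Dice for two flat binary masks (0/1 lists)."""
    inter = sum(p & t for p, t in zip(pred, truth))
    p_sum, t_sum = sum(pred), sum(truth)
    union = p_sum + t_sum - inter
    iou = inter / union if union else 1.0
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    return iou, dice

# illustrative 6-pixel masks: 1 = fruit, 0 = background
pred  = [1, 1, 1, 0, 0, 1]
truth = [1, 1, 0, 0, 1, 1]
iou, dice = iou_and_dice(pred, truth)   # 0.6, 0.75
```

Note that IoU is always less than or equal to Dice for the same pair of masks, which is consistent with the reported IoU of 86 % sitting below the maximum Dice score of 0.9472.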

2.
Data Brief ; 55: 110760, 2024 Aug.
Article in English | MEDLINE | ID: mdl-39183968

ABSTRACT

The ever-evolving global landscape of communication, driven by Information Technology advancements, underscores the importance of emotion detection in natural language processing. However, challenges persist in interpreting emotions within linguistically diverse contexts, notably in low-resource languages like Bengali, compounded by the emergence of Banglish. To address this gap, we present "Bengali & Banglish," an extensive dataset comprising 80,098 labelled samples across six emotion classes. Our dataset fills a void in fine-grained emotion classification for Bengali and pioneers in emotion detection in Banglish. We achieve significant performance metrics through meticulous annotation and rigorous evaluation, including a weighted F1 score of 71.30% for Bengali and 64.59% for Banglish using BanglaBERT. Also, our dataset facilitates Bengali-to-Banglish Machine Translation, contributing to the advancement of language processing models. Furthermore, our dataset demonstrates a high Cohen's Kappa score of 93.5%, affirming the reliability and consistency of our annotations. This research underscores the importance of linguistic diversity in NLP and provides a valuable resource for enhancing Emotion Detection capabilities in Bengali and Banglish across digital platforms.
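The Cohen's kappa reported for annotation reliability corrects raw inter-annotator agreement for agreement expected by chance. A minimal sketch (the emotion labels below are illustrative, not from the dataset):

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Cohen's kappa for two annotators' labels over the same items."""
    n = len(r1)
    po = sum(a == b for a, b in zip(r1, r2)) / n           # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    pe = sum(c1[k] * c2[k] for k in c1) / (n * n)          # chance agreement
    return (po - pe) / (1 - pe)

a = ["joy", "anger", "joy", "sad", "joy", "anger"]
b = ["joy", "anger", "joy", "joy", "joy", "anger"]
kappa = cohens_kappa(a, b)   # 0.7
```

A kappa of 93.5 %, as reported, indicates near-perfect agreement between annotators even after discounting chance.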

3.
BioData Min ; 17(1): 22, 2024 Jul 12.
Article in English | MEDLINE | ID: mdl-38997749

ABSTRACT

BACKGROUND: The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity. RESULTS: An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation. CONCLUSIONS: The results indicate that fusing the embeddings of all three data sources together provides the best results for the task of unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.
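The cluster-quality assessment via mutual information can be sketched directly: MI measures how much knowing a cluster assignment tells you about, e.g., the imaging modality. A toy example (the tiny assignments below are illustrative, not from the Rijeka dataset):

```python
from collections import Counter
from math import log

def mutual_information(clusters, labels):
    """Mutual information (in nats) between cluster assignments and reference labels."""
    n = len(clusters)
    pc, pl = Counter(clusters), Counter(labels)
    joint = Counter(zip(clusters, labels))
    return sum((n_cl / n) * log(n_cl * n / (pc[c] * pl[l]))
               for (c, l), n_cl in joint.items())

# clusters that align perfectly with modality carry log(2) nats of information
mi_aligned = mutual_information([0, 0, 1, 1], ["CT", "CT", "MR", "MR"])
# clusters independent of modality carry none
mi_indep = mutual_information([0, 1, 0, 1], ["CT", "CT", "MR", "MR"])
```

High MI against anatomical region or modality labels is what makes a clustering "concise" in the sense the study evaluates.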

4.
Sensors (Basel) ; 24(14)2024 Jul 11.
Article in English | MEDLINE | ID: mdl-39065873

ABSTRACT

In the context of LiDAR sensor-based autonomous vehicles, segmentation networks play a crucial role in accurately identifying and classifying objects. However, discrepancies between the types of LiDAR sensors used for training the network and those deployed in real-world driving environments can lead to performance degradation due to differences in the input tensor attributes, such as x, y, and z coordinates, and intensity. To address this issue, we propose novel intensity rendering and data interpolation techniques. Our study evaluates the effectiveness of these methods by applying them to object tracking in real-world scenarios. The proposed solutions aim to harmonize the differences between sensor data, thereby enhancing the performance and reliability of deep learning networks for autonomous vehicle perception systems. Additionally, our algorithms prevent performance degradation, even when different types of sensors are used for the training data and real-world applications. This approach allows for the use of publicly available open datasets without the need to spend extensive time on dataset construction and annotation using the actual sensors deployed, thus significantly saving time and resources. When applying the proposed methods, we observed an approximate 20% improvement in mIoU performance compared to scenarios without these enhancements.

5.
Diagnostics (Basel) ; 14(11)2024 May 22.
Article in English | MEDLINE | ID: mdl-38893600

ABSTRACT

In order to generate a machine learning algorithm (MLA) that can support ophthalmologists with the diagnosis of glaucoma, a carefully selected dataset that is based on clinically confirmed glaucoma patients as well as borderline cases (e.g., patients with suspected glaucoma) is required. The clinical annotation of datasets is usually performed at the expense of the data volume, which results in poorer algorithm performance. This study aimed to evaluate the application of an MLA for the automated classification of physiological optic discs (PODs), glaucomatous optic discs (GODs), and glaucoma-suspected optic discs (GSODs). Annotation of the data to the three groups was based on the diagnosis made in clinical practice by a glaucoma specialist. Color fundus photographs and 14 types of metadata (including visual field testing, retinal nerve fiber layer thickness, and cup-disc ratio) of 1168 eyes from 584 patients (POD = 321, GOD = 336, GSOD = 310) were used for the study. Machine learning (ML) was performed in the first step with the color fundus photographs only and in the second step with the images and metadata. Sensitivity, specificity, and accuracy of the classification of GSOD vs. GOD and POD vs. GOD were evaluated. Classification of GOD vs. GSOD and GOD vs. POD performed in the first step had AUCs of 0.84 and 0.88, respectively. By combining the images and metadata, the AUCs increased to 0.92 and 0.99, respectively. By combining images and metadata, excellent performance of the MLA can be achieved despite having only a small amount of data, thus supporting ophthalmologists with glaucoma diagnosis.
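The reported AUCs can be computed directly as the probability that a randomly chosen glaucomatous eye receives a higher model score than a randomly chosen physiological eye (the Mann-Whitney formulation); the scores below are hypothetical:

```python
def auc(scores_pos, scores_neg):
    """AUC = probability a positive scores above a negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for q in scores_neg:
            wins += 1.0 if p > q else 0.5 if p == q else 0.0
    return wins / (len(scores_pos) * len(scores_neg))

glaucoma    = [0.9, 0.8, 0.75]   # hypothetical model scores, GOD eyes
physiologic = [0.3, 0.8, 0.2]    # hypothetical model scores, POD eyes
a = auc(glaucoma, physiologic)
```

Under this reading, the jump from 0.88 to 0.99 when metadata are added means the combined model almost always ranks a glaucomatous eye above a physiological one.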

6.
Patterns (N Y) ; 5(5): 100955, 2024 May 10.
Article in English | MEDLINE | ID: mdl-38800367

ABSTRACT

Materials scientists usually collect experimental data to summarize experiences and predict improved materials. However, a crucial issue is how to proficiently utilize unstructured data to update existing structured data, particularly in applied disciplines. This study introduces a new natural language processing (NLP) task called structured information inference (SII) to address this problem. We propose an end-to-end approach to summarize and organize the multi-layered device-level information from the literature into structured data. After comparing different methods, we fine-tuned LLaMA with an F1 score of 87.14% to update an existing perovskite solar cell dataset with articles published since its release, allowing its direct use in subsequent data analysis. Using structured information, we developed regression tasks to predict the electrical performance of solar cells. Our results demonstrate comparable performance to traditional machine-learning methods without feature selection and highlight the potential of large language models for scientific knowledge acquisition and material development.

7.
J Chromatogr A ; 1728: 465010, 2024 Aug 02.
Article in English | MEDLINE | ID: mdl-38821033

ABSTRACT

Fufang Yinhua Jiedu granules (FYJG) is a Traditional Chinese Medicine (TCM) compound formula preparation comprising ten herbal drugs, which has been widely used for the treatment of wind-heat type influenza and upper respiratory tract infections. However, the phytochemical constituents of FYJG have rarely been reported, and its constituent composition still needs to be elucidated. The complexity of the natural ingredients of TCMs and the diversity of preparations are the major obstacles to fully characterizing their constituents. In this study, an innovative and intelligent analysis strategy was built to comprehensively characterize the constituents of FYJG and assign source attribution to all components. Firstly, a simple and highly efficient ultra-high-performance liquid chromatography coupled to quadrupole time-of-flight mass spectrometry (UHPLC-QTOF MSE) method was established to analyze the FYJG and ten single herbs. High-accuracy MS/MS data were acquired under two collision energies using high-definition MSE in the negative and positive modes. Secondly, a multistage intelligent data annotation strategy, integrated with various online software and data processing platforms, was developed and used to rapidly screen out and identify the compounds of FYJG. The in-house chemical library of 2949 compounds was created and operated in the UNIFI software to enable automatic peak annotation of the MSE data. Then, the acquired MS data were processed by MS-DIAL, and a feature-based molecular networking (FBMN) was constructed on the Global Natural Product Social Molecular Networking (GNPS) platform to infer potential compositions of FYJG through rapid classification and visualization. In parallel, the MZmine software was used to recognize the source attribution of the ingredients. On this basis, the unique chemical categories and characteristics of the herbal species were further used to verify the accuracy of the source attribution of the multi-components.
This comprehensive analysis successfully identified or tentatively characterized 279 compounds in FYJG, including flavonoids, phenolic acids, coumarins, saponins, alkaloids, lignans, and phenylethanoids. Notably, twelve indole alkaloids and four organic acids from Isatidis Folium were characterized in this formula for the first time. This study demonstrates the potential superiority of high-definition MSE combined with computer software-assisted structural analysis tools for identifying compounds in complex TCM formulas, as this approach can obtain high-quality MS/MS spectra, effectively distinguish isomers, and improve the coverage of trace components. This study elucidates the various components and sources of FYJG and provides a theoretical basis for its further clinical development and application.


Subject(s)
Drugs, Chinese Herbal , Tandem Mass Spectrometry , Drugs, Chinese Herbal/chemistry , Chromatography, High Pressure Liquid/methods , Tandem Mass Spectrometry/methods , Medicine, Chinese Traditional
8.
Diagnostics (Basel) ; 14(7)2024 Apr 02.
Article in English | MEDLINE | ID: mdl-38611668

ABSTRACT

The facet joint injection is the most common procedure used to relieve lower back pain. In this paper, we proposed a deep learning method for detecting and segmenting facet joints in ultrasound images based on convolutional neural networks (CNNs) and enhanced data annotation. In the enhanced data annotation, a facet joint was considered as the first target and the ventral complex as the second target to improve the capability of CNNs in recognizing the facet joint. A total of 300 cases of patients undergoing pain treatment were included. The ultrasound images were captured and labeled by two professional anesthesiologists, and then augmented to train a deep learning model based on the Mask Region-based CNN (Mask R-CNN). The performance of the deep learning model was evaluated using the average precision (AP) on the testing sets. The data augmentation and data annotation methods were found to improve the AP. The AP50 for facet joint detection and segmentation was 90.4% and 85.0%, respectively, demonstrating the satisfying performance of the deep learning model. We presented a deep learning method for facet joint detection and segmentation in ultrasound images based on enhanced data annotation and the Mask R-CNN. The feasibility and potential of deep learning techniques in facet joint ultrasound image analysis have been demonstrated.
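The AP50 metric counts a detection as correct when its box overlaps an unused ground-truth facet joint at IoU >= 0.5. A minimal sketch of the matching rule (boxes are illustrative `(x1, y1, x2, y2)` tuples, not actual annotations):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def precision_at_iou50(preds, gts):
    """Fraction of predicted boxes matching an unused ground-truth box at IoU >= 0.5."""
    unused, tp = list(gts), 0
    for p in preds:
        best = max(unused, key=lambda g: box_iou(p, g), default=None)
        if best is not None and box_iou(p, best) >= 0.5:
            tp += 1
            unused.remove(best)
    return tp / len(preds) if preds else 0.0

# illustrative boxes: one good detection, one false positive
preds = [(0, 0, 10, 10), (20, 20, 30, 30)]
gts = [(1, 0, 10, 10)]
p = precision_at_iou50(preds, gts)   # 0.5
```

Full AP additionally integrates precision over recall as the confidence threshold varies; this sketch shows only the IoU-matching step that defines a true positive.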

9.
Insights Imaging ; 15(1): 47, 2024 Feb 16.
Article in English | MEDLINE | ID: mdl-38361108

ABSTRACT

OBJECTIVES: MAchine Learning In MyelomA Response (MALIMAR) is an observational clinical study combining "real-world" and clinical trial data, both retrospective and prospective. Images were acquired on three MRI scanners over a 10-year window at two institutions, leading to a need for extensive curation. METHODS: Curation involved image aggregation, pseudonymisation, allocation between project phases, data cleaning, upload to an XNAT repository visible from multiple sites, annotation, incorporation of machine learning research outputs and quality assurance using programmatic methods. RESULTS: A total of 796 whole-body MR imaging sessions from 462 subjects were curated. A major change in scan protocol part way through the retrospective window meant that approximately 30% of available imaging sessions had properties that differed significantly from the remainder of the data. Issues were found with a vendor-supplied clinical algorithm for "composing" whole-body images from multiple imaging stations. Historic weaknesses in a digital video disk (DVD) research archive (already addressed by the mid-2010s) were highlighted by incomplete datasets, some of which could not be completely recovered. The final dataset contained 736 imaging sessions for 432 subjects. Software was written to clean and harmonise data. Implications for the subsequent machine learning activity are considered. CONCLUSIONS: MALIMAR exemplifies the vital role that curation plays in machine learning studies that use real-world data. A research repository such as XNAT facilitates day-to-day management, ensures robustness and consistency and enhances the value of the final dataset. The types of process described here will be vital for future large-scale multi-institutional and multi-national imaging projects. 
CRITICAL RELEVANCE STATEMENT: This article showcases innovative data curation methods using a state-of-the-art image repository platform; such tools will be vital for managing the large multi-institutional datasets required to train and validate generalisable ML algorithms and future foundation models in medical imaging. KEY POINTS: • Heterogeneous data in the MALIMAR study required the development of novel curation strategies. • Correction of multiple problems affecting the real-world data was successful, but implications for machine learning are still being evaluated. • Modern image repositories have rich application programming interfaces enabling data enrichment and programmatic QA, making them much more than simple "image marts".

10.
Data Brief ; 51: 109708, 2023 Dec.
Article in English | MEDLINE | ID: mdl-38020431

ABSTRACT

This dataset features a collection of 3832 high-resolution ultrasound images, each with dimensions of 959×661 pixels, focused on fetal heads. The images highlight specific anatomical regions: the brain, cavum septum pellucidum (CSP), and lateral ventricles (LV). The dataset was assembled under the Creative Commons Attribution 4.0 International license, using previously anonymized and de-identified images to maintain ethical standards. Each image is complemented by a CSV file detailing pixel size in millimeters (mm). For enhanced compatibility and usability, the dataset is available in 11 universally accepted formats, including Cityscapes, YOLO, CVAT, Datumaro, COCO, TFRecord, PASCAL, LabelMe, Segmentation mask, OpenImage, and ICDAR. This broad range of formats ensures adaptability for various computer vision tasks, such as classification, segmentation, and object detection. It is also compatible with multiple medical imaging software and deep learning frameworks. The reliability of the annotations is verified through a two-step validation process involving a Senior Attending Physician and a Radiologic Technologist. The Intraclass Correlation Coefficients (ICC) and Jaccard similarity indices (JS) are utilized to quantify inter-rater agreement. The dataset exhibits high annotation reliability, with ICC values averaging at 0.859 and 0.889, and JS values at 0.855 and 0.857 in two iterative rounds of annotation. This dataset is designed to be an invaluable resource for ongoing and future research projects in medical imaging and computer vision. It is particularly suited for applications in prenatal diagnostics, clinical diagnosis, and computer-assisted interventions. Its detailed annotations, broad compatibility, and ethical compliance make it a highly reusable and adaptable tool for the development of algorithms aimed at improving maternal and fetal health.

11.
Int J Med Inform ; 178: 105200, 2023 10.
Article in English | MEDLINE | ID: mdl-37703800

ABSTRACT

INTRODUCTION: Hospitals generate large amounts of data and this data is generally modeled and labeled in a proprietary way, hampering its exchange and integration. Manually annotating data element names to internationally standardized data element identifiers is a time-consuming effort. Tools can support performing this task automatically. This study aimed to determine what factors influence the quality of automatic annotations. METHODS: Data element names were used from the Dutch COVID-19 ICU Data Warehouse containing data on intensive care patients with COVID-19 from 25 hospitals in the Netherlands. In this data warehouse, the data had been merged using a proprietary terminology system while also storing the original hospital labels (synonymous names). Usagi, an OHDSI annotation tool, was used to perform the annotation for the data. A gold standard was used to determine if Usagi made correct annotations. Logistic regression was used to determine if the number of characters, number of words, match score (Usagi's certainty) and hospital label origin influenced Usagi's ability to annotate correctly. RESULTS: Usagi automatically annotated 30.5% of the data element names correctly and 5.5% of the synonymous names. The match score is the best predictor for Usagi finding the correct annotation. The AUC was 0.651 for the data element names and 0.752 for the synonymous names. The AUC for the individual hospital label origins varied between 0.460 and 0.905. DISCUSSION: The results show that Usagi performed better at annotating the data element names than the synonymous names. The hospital origin in the synonymous names dataset was associated with the number of correctly annotated concepts. Hospitals that performed better had shorter synonymous names and fewer words. Using shorter data element names or synonymous names should be considered to optimize the automatic annotating process.
Overall, the performance of Usagi is too poor to completely rely on for automatic annotation.
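The logistic-regression analysis (a match score predicting whether an annotation is correct) can be sketched with plain gradient descent; the score/outcome pairs below are hypothetical, not from the study:

```python
from math import exp

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """1-D logistic regression by gradient descent: P(correct) = sigmoid(w*x + b)."""
    w = b = 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = 1 / (1 + exp(-(w * x + b)))
            w += lr * (y - p) * x     # gradient of the log-likelihood
            b += lr * (y - p)
    return w, b

# hypothetical (Usagi match score, annotated correctly?) pairs
scores  = [0.95, 0.9, 0.8, 0.6, 0.5, 0.4]
correct = [1,    1,   1,   0,   0,   0]
w, b = fit_logistic(scores, correct)
p_high = 1 / (1 + exp(-(w * 0.9 + b)))
p_low  = 1 / (1 + exp(-(w * 0.4 + b)))
```

A positive fitted weight on the match score is exactly what "match score is the best predictor" means in the abstract's model.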


Subject(s)
COVID-19 , Humans , COVID-19/epidemiology , Netherlands
12.
J Cell Sci ; 136(17)2023 09 01.
Article in English | MEDLINE | ID: mdl-37555624

ABSTRACT

The extracellular matrix (ECM) is a complex meshwork of proteins that forms the scaffold of all tissues in multicellular organisms. It plays crucial roles in all aspects of life - from orchestrating cell migration during development, to supporting tissue repair. It also plays critical roles in the etiology or progression of diseases. To study this compartment, we have previously defined the compendium of all genes encoding ECM and ECM-associated proteins for multiple organisms. We termed this compendium the 'matrisome' and further classified matrisome components into different structural or functional categories. This nomenclature is now largely adopted by the research community to annotate '-omics' datasets and has contributed to advance both fundamental and translational ECM research. Here, we report the development of Matrisome AnalyzeR, a suite of tools including a web-based application and an R package. The web application can be used by anyone interested in annotating, classifying and tabulating matrisome molecules in large datasets without requiring programming knowledge. The companion R package is available to more experienced users, interested in processing larger datasets or in additional data visualization options.
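The core of such annotation is a lookup from gene symbol to matrisome division and category. A toy sketch (a four-gene table standing in for the full matrisome lists; the gene-category pairings shown are genuine examples, but the dictionary itself is not the tool's data structure):

```python
# Hypothetical mini-lookup; the real matrisome compendium is far larger.
MATRISOME = {
    "COL1A1": ("Core matrisome", "Collagens"),
    "FN1":    ("Core matrisome", "ECM Glycoproteins"),
    "LAMA1":  ("Core matrisome", "ECM Glycoproteins"),
    "MMP2":   ("Matrisome-associated", "ECM Regulators"),
}

def annotate(genes):
    """Tag each gene with its matrisome division/category, or None if not ECM."""
    return {g: MATRISOME.get(g) for g in genes}

hits = annotate(["COL1A1", "MMP2", "GAPDH"])
```

Applied to an '-omics' hit list, this separates core ECM components from ECM-associated ones and filters out non-matrisome genes such as GAPDH.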


Subject(s)
Extracellular Matrix Proteins , Extracellular Matrix , Extracellular Matrix/metabolism , Extracellular Matrix Proteins/genetics , Extracellular Matrix Proteins/metabolism , Cell Movement
13.
J Am Med Inform Assoc ; 30(12): 2036-2040, 2023 11 17.
Article in English | MEDLINE | ID: mdl-37555837

ABSTRACT

Despite recent methodology advancements in clinical natural language processing (NLP), the adoption of clinical NLP models within the translational research community remains hindered by process heterogeneity and human factor variations. Concurrently, these factors also dramatically increase the difficulty in developing NLP models in multi-site settings, which is necessary for algorithm robustness and generalizability. Here, we reported on our experience developing an NLP solution for Coronavirus Disease 2019 (COVID-19) signs and symptoms extraction in an open NLP framework from a subset of sites participating in the National COVID Cohort Collaborative (N3C). We then empirically highlight the benefits of multi-site data for both symbolic and statistical methods, as well as highlight the need for federated annotation and evaluation to resolve several pitfalls encountered in the course of these efforts.


Subject(s)
COVID-19 , Natural Language Processing , Humans , Electronic Health Records , Algorithms
14.
Heliyon ; 9(6): e17104, 2023 Jun.
Article in English | MEDLINE | ID: mdl-37484314

ABSTRACT

BACKGROUND: Deep learning is an important means to realize the automatic detection, segmentation, and classification of pulmonary nodules in computed tomography (CT) images. An entire CT scan cannot directly be used by deep learning models due to image size, image format, image dimensionality, and other factors. Between the acquisition of the CT scan and feeding the data into the deep learning model, there are several steps including data use permission, data access and download, data annotation, and data preprocessing. This paper aims to recommend a complete and detailed guide for researchers who want to engage in interdisciplinary lung nodule research of CT images and Artificial Intelligence (AI) engineering. METHODS: The data preparation pipeline used the following four popular large-scale datasets: LIDC-IDRI (Lung Image Database Consortium image collection), LUNA16 (Lung Nodule Analysis 2016), NLST (National Lung Screening Trial) and NELSON (The Dutch-Belgian Randomized Lung Cancer Screening Trial). The dataset preparation is presented in chronological order. FINDINGS: The different data preparation steps before deep learning were identified. These include both more generic steps and steps dedicated to lung nodule research. For each of these steps, the required process, necessity, and example code or tools for actual implementation are provided. DISCUSSION AND CONCLUSION: Depending on the specific research question, researchers should be aware of the various preparation steps required and carefully select datasets, data annotation methods, and image preprocessing methods. Moreover, it is vital to acknowledge that each auxiliary tool or code has its specific scope of use and limitations. This paper proposes a standardized data preparation process while clearly demonstrating the principles and sequence of different steps. A data preparation pipeline can be quickly realized by following these proposed steps and implementing the suggested example codes and tools.
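A typical preprocessing step in such a pipeline is intensity windowing: clipping Hounsfield units to a lung window and rescaling to [0, 1] before building tensors for the network. A minimal sketch (the window center/width are common lung-window defaults, not values prescribed by the paper):

```python
def window_hu(pixels, center=-600, width=1500):
    """Clip Hounsfield units to a window and scale linearly to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return [(min(max(v, lo), hi) - lo) / (hi - lo) for v in pixels]

row = [-1000, -600, 150, 400]      # air, lung tissue, soft tissue, bone
norm = window_hu(row)              # bone is clipped to the window ceiling
```

Analogous normalization (along with resampling to a uniform voxel spacing) is needed so that scans from LIDC-IDRI, LUNA16, NLST, and NELSON can feed the same model.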

15.
Diagnostics (Basel) ; 13(8)2023 Apr 14.
Article in English | MEDLINE | ID: mdl-37189517

ABSTRACT

Identifying Human Epithelial Type 2 (HEp-2) mitotic cells is a crucial procedure in anti-nuclear antibodies (ANAs) testing, which is the standard protocol for detecting connective tissue diseases (CTD). Due to the low throughput and labor-subjectivity of the ANAs' manual screening test, there is a need to develop a reliable HEp-2 computer-aided diagnosis (CAD) system. The automatic detection of mitotic cells from the microscopic HEp-2 specimen images is an essential step to support the diagnosis process and enhance the throughput of this test. This work proposes a deep active learning (DAL) approach to overcoming the cell labeling challenge. Moreover, deep learning detectors are tailored to automatically identify the mitotic cells directly in the entire microscopic HEp-2 specimen images, avoiding the segmentation step. The proposed framework is validated using the I3A Task-2 dataset over 5-fold cross-validation trials. Using the YOLO predictor, promising mitotic cell prediction results are achieved with an average of 90.011% recall, 88.307% precision, and 81.531% mAP. By comparison, average scores of 86.986% recall, 85.282% precision, and 78.506% mAP are obtained using the Faster R-CNN predictor. Employing the DAL method over four labeling rounds effectively enhances the accuracy of the data annotation, and hence, improves the prediction performance. The proposed framework could be practically applicable to support medical personnel in making rapid and accurate decisions about the mitotic cells' existence.
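The DAL idea is to spend annotation effort only on the cells the current model is least sure about. A minimal uncertainty-sampling sketch (the probabilities are illustrative; the paper's exact query strategy may differ):

```python
def least_confident(probs, k):
    """Pick the k samples whose top class probability is lowest (most uncertain)."""
    ranked = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return ranked[:k]

# hypothetical per-cell class probabilities (mitotic vs. non-mitotic)
probs = [(0.99, 0.01), (0.55, 0.45), (0.70, 0.30), (0.51, 0.49)]
to_label = least_confident(probs, 2)   # the two most ambiguous cells
```

Repeating this selection over labeling rounds, as the study does over four rounds, concentrates annotator time where the detector gains the most.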

16.
Brief Bioinform ; 24(3)2023 05 19.
Article in English | MEDLINE | ID: mdl-37183449

ABSTRACT

Undoubtedly, single-cell RNA sequencing (scRNA-seq) has changed the research landscape by providing insights into heterogeneous, complex and rare cell populations. Given that more such data sets will become available in the near future, their accurate assessment with compatible and robust models for cell type annotation is a prerequisite. Considering this, herein, we developed scAnno (scRNA-seq data annotation), an automated annotation tool for scRNA-seq data sets primarily based on the single-cell cluster levels, using a joint deconvolution strategy and logistic regression. We explicitly constructed a reference profile for human (30 cell types and 50 human tissues) and a reference profile for mouse (26 cell types and 50 mouse tissues) to support this novel methodology (scAnno). scAnno offers a possibility to obtain genes with high expression and specificity in a given cell type as cell type-specific genes (marker genes) by combining co-expression genes with seed genes as a core. Of importance, scAnno can accurately identify cell type-specific genes based on cell type reference expression profiles without any prior information. Particularly, in the peripheral blood mononuclear cell data set, the marker genes identified by scAnno showed cell type-specific expression, and the majority of marker genes matched exactly with those included in the CellMarker database. Besides validating the flexibility and interpretability of scAnno in identifying marker genes, we also proved its superiority in cell type annotation over other cell type annotation tools (SingleR, scPred, CHETAH and scmap-cluster) through internal validation of data sets (average annotation accuracy: 99.05%) and cross-platform data sets (average annotation accuracy: 95.56%). Taken together, we established the first novel methodology that utilizes a deconvolution strategy for automated cell typing and is capable of being a significant application in broader scRNA-seq analysis. 
scAnno is available at https://github.com/liuhong-jia/scAnno.
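The marker-gene idea used by scAnno, genes with both high expression and high specificity in one cell type, can be sketched with a simple score over a reference profile (the three-gene profile below is illustrative, though CD3D and CD19 are genuine T-cell and B-cell markers and ACTB is a housekeeping gene):

```python
def marker_rank(expr, cell_type):
    """Rank genes by expression-in-type times specificity across the reference profile.

    `expr` maps gene -> {cell type: mean expression}; values are illustrative.
    """
    scores = {}
    for gene, by_type in expr.items():
        total = sum(by_type.values())
        level = by_type.get(cell_type, 0.0)
        scores[gene] = level * (level / total) if total else 0.0
    return sorted(scores, key=scores.get, reverse=True)

expr = {
    "CD3D": {"T cell": 9.0, "B cell": 0.5, "Monocyte": 0.5},
    "CD19": {"T cell": 0.2, "B cell": 8.0, "Monocyte": 0.3},
    "ACTB": {"T cell": 9.0, "B cell": 9.0, "Monocyte": 9.0},
}
ranked = marker_rank(expr, "T cell")
```

The housekeeping gene ACTB is demoted despite its high expression because it lacks specificity, which mirrors why specificity matters in scAnno's marker identification.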


Subject(s)
Algorithms , Software , Animals , Mice , Humans , Gene Expression Profiling/methods , Leukocytes, Mononuclear , Single-Cell Analysis/methods , RNA/genetics , Sequence Analysis, RNA/methods
17.
Cancers (Basel) ; 15(5)2023 Mar 04.
Article in English | MEDLINE | ID: mdl-36900387

ABSTRACT

Objective: To summarize the available literature on using machine learning (ML) for palliative care practice as well as research and to assess the adherence of the published studies to the most important ML best practices. Methods: The MEDLINE database was searched for the use of ML in palliative care practice or research, and the records were screened according to PRISMA guidelines. Results: In total, 22 publications using machine learning for mortality prediction (n = 15), data annotation (n = 5), predicting morbidity under palliative therapy (n = 1), and predicting response to palliative therapy (n = 1) were included. Publications used a variety of supervised or unsupervised models, but mostly tree-based classifiers and neural networks. Two publications had code uploaded to a public repository, and one publication uploaded the dataset. Conclusions: Machine learning in palliative care is mainly used to predict mortality. Similarly to other applications of ML, external test sets and prospective validations are the exception.

18.
Neural Netw ; 161: 746-756, 2023 Apr.
Article in English | MEDLINE | ID: mdl-36857880

ABSTRACT

Reviews of songs play an important role in online music service platforms. Prior research shows that users can make quicker and more informed decisions when presented with meaningful song reviews. However, reviews of songs are generally long in length and most of them are non-informative for users. It is difficult for users to efficiently grasp meaningful messages for making decisions. To solve this problem, one practical strategy is to provide tips, i.e., short, concise, empathetic, and self-contained descriptions about songs. Tips are produced from song reviews and should express non-trivial insights about the songs. To the best of our knowledge, no prior studies have explored the tip generation task in music domain. In this paper, we create a dataset named MTips for the task and propose a learning-to-generate framework named GenTMS for automatically generating tips from song reviews. The dataset involves 8,003 Chinese tips/non-tips from 128 songs which are distributed in five different song genres. Experimental results show that GenTMS achieves top-10 precision at 85.56%, outperforming the baseline models by at least 3.34%. Besides, to simulate the practical usage of our proposed framework, we also experiment with previously-unseen songs, during which GenTMS also achieves the best performance with top-10 precision at 78.89% on average. The results demonstrate the effectiveness of the proposed framework in tip generation of the music domain.
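The top-10 precision used to evaluate GenTMS is simply the fraction of true tips among the ten highest-ranked generated candidates. A minimal sketch (k = 5 and toy identifiers for brevity):

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked candidates that are true tips."""
    top = ranked[:k]
    return sum(1 for c in top if c in relevant) / len(top)

ranked   = ["t1", "t2", "t3", "t4", "t5"]   # candidates, best first
relevant = {"t1", "t3", "t5"}               # annotated true tips
p = precision_at_k(ranked, relevant, k=5)   # 0.6
```

A top-10 precision of 85.56 % thus means that, on average, more than eight of the ten highest-ranked candidates are genuine tips.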


Subject(s)
Learning , Music , Problem Solving , Decision Making
19.
Front Neuroinform ; 17: 1301718, 2023.
Article in English | MEDLINE | ID: mdl-38348138

ABSTRACT

The study presents a novel approach, the State-Detecting Algorithm (SDA), designed to detect time-continuous states in time-series data. The SDA operates on unlabeled data and detects optimal change-points between intrinsic functional states based on an ensemble of Ward's hierarchical clustering with a time-connectivity constraint. The algorithm chooses the best number of states and the optimal state boundaries by maximizing clustering quality metrics. We also introduce a series of methods to estimate the performance and confidence of the SDA when ground-truth annotation is unavailable, including information value analysis, paired statistical tests, and predictive modeling analysis. The SDA was validated on EEG recordings of Guhyasamaja meditation practice, performed under a strict staged protocol by three experienced Buddhist practitioners in an ecological setting. The SDA used neurophysiological descriptors as inputs, including PSD, power indices, coherence, and PLV. Post-hoc analysis of the obtained EEG states revealed significant differences compared to the baseline and neighboring states. The SDA was stable with respect to state order, and when applied to randomly shuffled epochs (i.e., surrogate data used as controls) it showed poor clustering quality metrics and no statistically significant differences between states. The SDA can be considered a general data-driven approach to detecting hidden functional states associated with mental processes evolving during meditation or other ongoing mental and cognitive activity.
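The key idea behind the time-connectivity constraint is that only temporally adjacent segments may merge during Ward agglomeration, which guarantees each resulting cluster is a contiguous stretch of time. A minimal sketch of that idea on a single 1-D feature (the toy series and segment bookkeeping below are illustrative assumptions, not the authors' implementation, which clusters multivariate EEG descriptors):

```python
def segment_states(series, n_states):
    """Time-constrained Ward agglomeration on a 1-D feature series.

    Only temporally adjacent segments may merge, so the result is a set
    of contiguous segments (candidate functional states). Returns a list
    of (start, end) index pairs, end exclusive.
    """
    # Each segment tracks [start, end, count, mean].
    segs = [[i, i + 1, 1, x] for i, x in enumerate(series)]
    while len(segs) > n_states:
        # Ward cost of merging adjacent segments: n_a*n_b/(n_a+n_b)*(m_a-m_b)^2
        best, best_cost = None, None
        for j in range(len(segs) - 1):
            na, ma = segs[j][2], segs[j][3]
            nb, mb = segs[j + 1][2], segs[j + 1][3]
            cost = na * nb / (na + nb) * (ma - mb) ** 2
            if best_cost is None or cost < best_cost:
                best, best_cost = j, cost
        a, b = segs[best], segs.pop(best + 1)
        n = a[2] + b[2]
        a[1], a[3] = b[1], (a[2] * a[3] + b[2] * b[3]) / n
        a[2] = n
    return [(s[0], s[1]) for s in segs]

# Toy "EEG descriptor" with three flat regimes:
x = [0.1, 0.2, 0.1, 5.0, 5.1, 4.9, 9.0, 9.2, 9.1]
print(segment_states(x, 3))  # [(0, 3), (3, 6), (6, 9)]
```

The segment boundaries produced this way are exactly the change-points; an unconstrained clustering could instead scatter one state's epochs across the whole recording.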

20.
PeerJ Comput Sci ; 8: e1151, 2022.
Article in English | MEDLINE | ID: mdl-36532803

ABSTRACT

Since the inception of the COVID-19 pandemic, related misleading information has spread at a remarkable rate on social media, with serious implications for individuals and societies. Although COVID-19 appears to be subsiding in most places after the sharp shock of Omicron, severe new variants can still emerge and cause new waves, especially if they evade the insufficient immunity conferred by prior infection and incomplete vaccination. Fighting the fake news that promotes vaccine hesitancy, for instance, is crucial for the success of global vaccination programs and thus for achieving herd immunity. To combat the proliferation of COVID-19-related misinformation, considerable research effort has been, and is still being, dedicated to building and sharing COVID-19 misinformation detection datasets and models for Arabic and other languages. However, most of these datasets provide only binary (true/false) misinformation labels, and the few studies that support multi-class misinformation classification either deal with a small set of misinformation classes or mix them with situational information classes. False news stories about COVID-19 are not all equal; some have more sinister effects than others (e.g., fake cures and false vaccine information). Identifying the sub-type of misinformation is therefore critical for choosing a suitable response based on its level of seriousness, ranging from attaching warning labels to suspect posts to removing misleading posts outright. In this work, we develop comprehensive annotation guidelines that define 19 fine-grained misinformation classes. We then release the first Arabic COVID-19-related misinformation dataset, comprising about 6.7K tweets with multi-class and multi-label misinformation annotations. In addition, we release a version of the dataset that is the first Arabic Twitter dataset annotated exclusively with six situational information classes; identifying situational information (e.g., caution, help-seeking) helps authorities and individuals understand a situation during emergencies. To confirm the validity of the collected data, we define three classification tasks and experiment with various machine learning and transformer-based classifiers to provide baseline results for future research. The experimental results indicate the quality and validity of the data and its suitability for building misinformation and situational information classification models. The results also demonstrate the superiority of AraBERT-COV19, a transformer-based model pretrained on COVID-19-related tweets, with micro-averaged F-scores of 81.6% and 78.8% for the multi-class misinformation and situational information classification tasks, respectively. Among the presented methods for multi-label misinformation classification, Label Powerset with a linear SVC achieved the best performance, with a micro-averaged F-score of 76.69%.
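The Label Powerset strategy mentioned above reduces a multi-label problem to a single multi-class one by treating each distinct label combination as its own class, after which any single-label classifier (such as a linear SVC) applies directly. A minimal sketch of the transform, using a toy label matrix rather than the actual tweet annotations:

```python
def to_powerset(Y):
    """Label Powerset transform: map each distinct combination of binary
    labels to a single integer class. Returns the class sequence and the
    combination-to-class mapping."""
    classes = {}
    y = []
    for row in Y:
        key = tuple(row)
        y.append(classes.setdefault(key, len(classes)))
    return y, classes

def from_powerset(y, classes):
    """Invert the transform: recover the binary label rows from classes."""
    inv = {v: list(k) for k, v in classes.items()}
    return [inv[c] for c in y]

# Hypothetical multi-label rows over 3 misinformation classes:
Y = [[1, 0, 0], [1, 0, 0], [0, 1, 1], [1, 0, 0]]
y, classes = to_powerset(Y)
print(y)                               # [0, 0, 1, 0]
print(from_powerset(y, classes) == Y)  # True
```

The trade-off is that only label combinations seen in training can be predicted, which works well when, as here, a modest number of combinations covers the data.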
