Results 1 - 20 of 43
1.
Article in English | MEDLINE | ID: mdl-38781059

ABSTRACT

This paper proposes a novel transformer-based framework to generate accurate class-specific object localization maps for weakly supervised semantic segmentation (WSSS). Leveraging the insight that the attended regions of the single class token in a standard vision transformer can generate class-agnostic localization maps, we investigate the transformer's capacity to capture class-specific attention for class-discriminative object localization by learning multiple class tokens. We present the Multi-Class Token transformer, which incorporates multiple class tokens to enable class-aware interactions with patch tokens. This is facilitated by a class-aware training strategy that establishes a one-to-one correspondence between output class tokens and ground-truth class labels. We also introduce a Contrastive-Class-Token (CCT) module to enhance the learning of discriminative class tokens, enabling the model to better capture the unique characteristics of each class. Consequently, the proposed framework effectively generates class-discriminative object localization maps from the class-to-patch attentions associated with different class tokens. To refine these localization maps, we propose the utilization of patch-level pairwise affinity derived from the patch-to-patch transformer attention. Furthermore, the proposed framework seamlessly complements the Class Activation Mapping (CAM) method, yielding significant improvements in WSSS performance on PASCAL VOC 2012 and MS COCO 2014. These results underline the importance of the class token for WSSS. The code and models are publicly available.
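The central mechanism, reading class-specific localization maps off the class-to-patch attention of a ViT that prepends multiple class tokens, can be sketched as follows. This is a hedged NumPy illustration, not the authors' released code: the function name, the head-averaged attention input, and the min-max normalization are assumptions.

```python
import numpy as np

def class_localization_maps(attn, num_classes, grid):
    """Extract class-specific localization maps from a (hypothetical) ViT
    attention matrix whose first `num_classes` rows/columns are class tokens
    and whose remaining rows/columns are patch tokens.

    attn: (T, T) attention averaged over heads.
    grid: (H, W) patch grid with H * W == T - num_classes.
    """
    h, w = grid
    # class-to-patch attention: rows = class tokens, columns = patch tokens
    c2p = attn[:num_classes, num_classes:]          # (C, H*W)
    maps = c2p.reshape(num_classes, h, w)
    # min-max normalize each class map toward [0, 1]
    mn = maps.min(axis=(1, 2), keepdims=True)
    mx = maps.max(axis=(1, 2), keepdims=True)
    return (maps - mn) / (mx - mn + 1e-8)
```

In practice such maps would still be refined (e.g., with the patch-to-patch affinity described above) before serving as pseudo labels.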

2.
Artif Intell Med ; 152: 102872, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38701636

ABSTRACT

Accurately measuring the evolution of Multiple Sclerosis (MS) with magnetic resonance imaging (MRI) critically informs understanding of disease progression and helps to direct therapeutic strategy. Deep learning models have shown promise for automatically segmenting MS lesions, but the scarcity of accurately annotated data hinders progress in this area. Obtaining sufficient data from a single clinical site is challenging and does not address the heterogeneous need for model robustness. Conversely, the collection of data from multiple sites introduces data privacy concerns and potential label noise due to varying annotation standards. To address this dilemma, we explore the use of the federated learning framework while considering label noise. Our approach enables collaboration among multiple clinical sites without compromising data privacy under a federated learning paradigm that incorporates a noise-robust training strategy based on label correction. Specifically, we introduce a Decoupled Hard Label Correction (DHLC) strategy that considers the imbalanced distribution and fuzzy boundaries of MS lesions, enabling the correction of false annotations based on prediction confidence. We also introduce a Centrally Enhanced Label Correction (CELC) strategy, which leverages the aggregated central model as a correction teacher for all sites, enhancing the reliability of the correction process. Extensive experiments conducted on two multi-site datasets demonstrate the effectiveness and robustness of our proposed methods, indicating their potential for clinical applications in multi-site collaborations to train better deep learning models with lower cost in data collection and annotation.
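The DHLC idea, flipping annotations that the model contradicts with high confidence while keeping separate thresholds for the lesion and background classes, can be sketched like this. A minimal NumPy illustration under stated assumptions; the function name and threshold values are illustrative, not the paper's implementation.

```python
import numpy as np

def decoupled_hard_label_correction(prob, label, pos_thresh=0.9, neg_thresh=0.9):
    """Correct voxel labels that high-confidence predictions contradict.

    Decoupled thresholds for the positive (lesion) and negative (background)
    classes reflect the imbalanced distribution of MS lesions.
    prob: predicted lesion probability per voxel; label: 0/1 annotation.
    """
    corrected = label.copy()
    # confident lesion prediction over a background label -> relabel as lesion
    corrected[(label == 0) & (prob > pos_thresh)] = 1
    # confident background prediction over a lesion label -> relabel as background
    corrected[(label == 1) & (prob < 1 - neg_thresh)] = 0
    return corrected
```

In the CELC variant described above, `prob` would come from the aggregated central model acting as the correction teacher rather than from the local model.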


Subject(s)
Deep Learning , Magnetic Resonance Imaging , Multiple Sclerosis , Multiple Sclerosis/diagnostic imaging , Humans , Magnetic Resonance Imaging/methods , Image Interpretation, Computer-Assisted/methods , Image Processing, Computer-Assisted/methods
3.
iScience ; 27(4): 109550, 2024 Apr 19.
Article in English | MEDLINE | ID: mdl-38595796

ABSTRACT

During the evolution of large models, performance evaluation is necessary for assessing their capabilities. However, current model evaluations mainly rely on specific tasks and datasets, lacking a unified framework for assessing the multidimensional intelligence of large models. In this perspective, we advocate for a comprehensive framework of cognitive science-inspired artificial general intelligence (AGI) tests, covering crystallized, fluid, social, and embodied intelligence. The AGI tests consist of well-designed cognitive tests adapted from human intelligence tests, which are then naturally encapsulated in an immersive virtual community. We propose increasing the complexity of AGI testing tasks commensurate with advancements in large models, and emphasize the necessity of interpreting test results to avoid false negatives and false positives. We believe that cognitive science-inspired AGI tests will effectively guide the targeted improvement of large models in specific dimensions of intelligence and accelerate the integration of large models into human society.

4.
Article in English | MEDLINE | ID: mdl-38478447

ABSTRACT

Most existing weakly supervised semantic segmentation (WSSS) methods rely on class activation mapping (CAM) to extract coarse class-specific localization maps using image-level labels. Prior works have commonly used an offline heuristic thresholding process that combines the CAM maps with off-the-shelf saliency maps produced by a general pretrained saliency model to generate more accurate pseudo-segmentation labels. We propose AuxSegNet+, a weakly supervised auxiliary learning framework that explores the rich information in these saliency maps and the significant inter-task correlation between saliency detection and semantic segmentation. In the proposed AuxSegNet+, saliency detection and multilabel image classification are used as auxiliary tasks to improve the primary task of semantic segmentation with only image-level ground-truth labels. We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps. In particular, we propose a cross-task dual-affinity learning module to learn both pairwise and unary affinities, which are used to enhance the task-specific features and predictions by aggregating both query-dependent and query-independent global context for both saliency detection and semantic segmentation. The learned cross-task pairwise affinity can also be used to refine and propagate CAM maps to provide better pseudo labels for both tasks. Iterative improvement of segmentation performance is enabled by cross-task affinity learning and pseudo-label updating. Extensive experiments demonstrate the effectiveness of the proposed approach, which achieves new state-of-the-art WSSS results on the challenging PASCAL VOC and MS COCO benchmarks.

6.
IEEE Trans Pattern Anal Mach Intell ; 46(6): 4366-4380, 2024 Jun.
Article in English | MEDLINE | ID: mdl-38236683

ABSTRACT

Fine-grained image retrieval mainly focuses on learning salient features from seen subcategories as discriminative embeddings, while neglecting the problems behind zero-shot settings. We argue that retrieving fine-grained objects from unseen subcategories may rely on more diverse clues, which are easily overshadowed by the salient features learnt from seen subcategories. To address this issue, we propose a novel Content-aware Rectified Activation model that suppresses the activation on salient regions while preserving their discrimination, and spreads activation to adjacent non-salient regions, thus mining more diverse discriminative features for retrieving unseen subcategories. Specifically, we construct a content-aware rectified prototype (CARP) by perceiving the semantics of salient regions. CARP acts as a channel-wise non-destructive upper bound on activation and can be selectively applied to suppress salient regions, yielding rectified features. Moreover, two regularizations are proposed: 1) a semantic coherency constraint that enforces semantic coherency between CARP and salient regions, aiming to propagate the discriminative ability of salient regions to CARP; and 2) a feature-navigated constraint that further guides the model to adaptively balance the discrimination power of rectified features and the suppression power of salient features. Experimental results on fine-grained and product retrieval benchmarks demonstrate that our method consistently outperforms state-of-the-art methods.

7.
IEEE Trans Pattern Anal Mach Intell ; 46(5): 3537-3556, 2024 May.
Article in English | MEDLINE | ID: mdl-38145536

ABSTRACT

3D object detection from images, one of the fundamental and challenging problems in autonomous driving, has received increasing attention from both industry and academia in recent years. Benefiting from the rapid development of deep learning technologies, image-based 3D detection has achieved remarkable progress. In particular, more than 200 works have studied this problem from 2015 to 2021, encompassing a broad spectrum of theories, algorithms, and applications. However, to date no recent survey exists to collect and organize this knowledge. In this paper, we fill this gap in the literature and provide the first comprehensive survey of this novel and continuously growing research field, summarizing the most commonly used pipelines for image-based 3D detection and deeply analyzing each of their components. Additionally, we propose two new taxonomies to organize the state-of-the-art methods into different categories, with the intent of providing a more systematic review of existing methods and facilitating fair comparisons with future works. Looking back at what has been achieved so far, we also analyze the current challenges in the field and discuss future directions for image-based 3D detection research.

8.
Microbiol Spectr ; : e0226923, 2023 Sep 12.
Article in English | MEDLINE | ID: mdl-37698427

ABSTRACT

As an RNA virus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is known for frequent substitution mutations, and substitutions in important genome regions are often associated with viral fitness. However, whether indel mutations are related to viral fitness has generally been ignored. Here we developed a computational methodology to investigate fitness-linked indels occurring in over 9 million SARS-CoV-2 genomes. Remarkably, by analyzing 31,642,404 deletion records and 1,981,308 insertion records, our pipeline identified 26,765 deletion types and 21,054 insertion types, and discovered 65 indel types with a significant association with Pango lineages. We proposed the concept of featured indels, which represent specific Pango lineages and variants just as substitution mutations do, and termed these 65 indel types featured indels. The selective pressure on all indel types was assessed using a Bayesian model to explore the importance of indels. Our results show that indels, like substitution mutations, are under selective pressure, which is important for assessing viral fitness and consistent with previous in vitro studies. Evaluation of the growth rate of each viral lineage indicated that indels play key roles in SARS-CoV-2 evolution and deserve as much attention as substitution mutations. IMPORTANCE The fitness of indels in pathogen genome evolution has rarely been studied. We developed a computational methodology to investigate severe acute respiratory syndrome coronavirus 2 genomes and systematically analyze over 33 million indel records, ultimately proposing the concept of featured indels that can represent specific Pango lineages and identifying 65 featured indels. A machine learning model based on Bayesian inference, together with viral lineage growth rate evaluation, suggests that these featured indels exhibit selection pressure comparable to substitution mutations. In conclusion, indels are not negligible when evaluating viral fitness.

9.
IEEE Trans Pattern Anal Mach Intell ; 45(12): 15996-16012, 2023 Dec.
Article in English | MEDLINE | ID: mdl-37531304

ABSTRACT

Semantic segmentation has made huge progress through the adoption of deep Fully Convolutional Networks (FCNs). However, the performance of FCN-based models relies heavily on the amount of pixel-level annotations, which are expensive and time-consuming to obtain. Considering that bounding boxes also contain abundant semantic and objectness information, an intuitive solution is to learn segmentation with weak supervision from bounding boxes. How to make full use of the class-level and region-level supervision from bounding boxes to estimate the uncertain regions is the critical challenge for this weakly supervised learning task. In this paper, we propose a mixture model to address this problem. First, we introduce a box-driven class-wise masking model (BCM) to remove irrelevant regions of each class. Moreover, based on the pixel-level segment proposals generated from the bounding box supervision, we calculate the mean filling rate of each class to serve as an important prior cue that guides the model to ignore wrongly labeled pixels in the proposals. To realize finer-grained supervision at the instance level, we further propose an anchor-based filling-rate shifting module. Unlike previous methods that directly train models with the generated noisy proposals, our method adjusts model learning dynamically with an adaptive segmentation loss, which helps reduce the negative impact of wrongly labeled proposals. Besides, based on the high-quality proposals learned with the above pipeline, we explore two-stage learning to further boost performance. The proposed method is evaluated on the challenging PASCAL VOC 2012 benchmark and achieves 74.9% and 76.4% mean IoU accuracy under the weakly and semi-supervised modes, respectively. Extensive experimental results show that the proposed method is effective and is on par with, or even better than, current state-of-the-art methods.
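The mean filling rate prior, the fraction of each ground-truth box covered by its segment proposal, averaged per class, is simple to compute. A hedged sketch with NumPy; the function name and data layout (binary masks, `(x1, y1, x2, y2)` boxes) are illustrative assumptions, not the authors' code.

```python
import numpy as np

def mean_filling_rates(proposals, boxes, labels, num_classes):
    """Per-class mean filling rate: the fraction of a ground-truth box
    covered by the binary pixel-level segment proposal of that instance.

    proposals: list of (H, W) binary masks; boxes: list of (x1, y1, x2, y2);
    labels: class index per instance.
    """
    rates = {c: [] for c in range(num_classes)}
    for mask, (x1, y1, x2, y2), c in zip(proposals, boxes, labels):
        box_area = (x2 - x1) * (y2 - y1)
        fill = mask[y1:y2, x1:x2].sum() / box_area
        rates[c].append(fill)
    return {c: float(np.mean(v)) if v else 0.0 for c, v in rates.items()}
```

A proposal whose filling rate deviates far from its class mean is then a candidate for down-weighting in the adaptive segmentation loss.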

10.
J Chem Inf Model ; 63(19): 5971-5980, 2023 10 09.
Article in English | MEDLINE | ID: mdl-37589216

ABSTRACT

Many material properties are manifested in morphological appearance and characterized using microscopic images, such as scanning electron microscopy (SEM). Polymer miscibility is a key physical quantity of polymer materials and is commonly and intuitively judged from SEM images. However, human observation and judgment of these images are time-consuming, labor-intensive, and hard to quantify. Computer image recognition with machine learning methods can compensate for the shortcomings of manual judgment, giving accurate and quantitative results. We achieve automatic miscibility recognition using a convolutional neural network and transfer learning, and the model obtains up to 94% accuracy. We also put forward a quantitative criterion for polymer miscibility based on this model. The proposed method can be widely applied to the quantitative characterization of the microstructure and properties of various materials.


Subject(s)
Neural Networks, Computer , Polymers , Humans , Machine Learning
11.
Front Neurosci ; 17: 1196087, 2023.
Article in English | MEDLINE | ID: mdl-37483345

ABSTRACT

Introduction: Brain atrophy is a critical biomarker of disease progression and treatment response in neurodegenerative diseases such as multiple sclerosis (MS). Confounding factors such as inconsistent imaging acquisitions hamper the accurate measurement of brain atrophy in the clinic. This study aims to develop and validate a robust deep learning model to overcome these challenges, and to evaluate its impact on the measurement of disease progression. Methods: Voxel-wise pseudo-atrophy labels were generated using SIENA, a widely adopted tool for the measurement of brain atrophy in MS. Deformation maps were produced for 195 pairs of longitudinal 3D T1 scans from patients with MS. A 3D U-Net, namely DeepBVC, was specifically developed to overcome common variances in resolution, signal-to-noise ratio and contrast ratio between baseline and follow-up scans. The performance of DeepBVC was compared against SIENA using the McLaren test-retest dataset and 233 in-house MS subjects with MRI from multiple time points. Clinical evaluation included disability assessment with the Expanded Disability Status Scale (EDSS) and traditional imaging metrics such as lesion burden. Results: For the 3 subjects in the test-retest experiments, the median percent brain volume change (PBVC) for DeepBVC vs. SIENA was 0.105% vs. 0.198% (subject 1), 0.061% vs. 0.084% (subject 2), and 0.104% vs. 0.408% (subject 3). For consistency across multiple time points in individual MS subjects, the mean (± standard deviation) PBVC differences for DeepBVC and SIENA were 0.028% (± 0.145%) and 0.031% (± 0.154%), respectively. The linear correlations with baseline T2 lesion volume were r = -0.288 (p < 0.05) and r = -0.249 (p < 0.05) for DeepBVC and SIENA, respectively. There was no significant correlation of disability progression with PBVC as estimated by either method (p = 0.86, p = 0.84). Discussion: DeepBVC is a deep learning powered brain volume change estimation method for assessing brain atrophy using T1-weighted images. Compared to SIENA, DeepBVC demonstrates superior reproducibility and robustness to common clinical scan variances such as imaging contrast, voxel resolution, random bias field, and signal-to-noise ratio. The enhanced measurement robustness, automation, and processing speed of DeepBVC indicate its potential for utilisation in both research and clinical environments for monitoring disease progression and, potentially, evaluating treatment effectiveness.
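The PBVC figures reported above have a simple definition once two brain volumes are in hand. A minimal sketch, assuming volumes in the same units and a plain two-time-point comparison (SIENA and DeepBVC estimate the change via registration/deformation rather than raw volume subtraction):

```python
def percent_brain_volume_change(v_baseline: float, v_followup: float) -> float:
    """Percent brain volume change (PBVC) between two time points,
    relative to the baseline volume. Negative values indicate atrophy."""
    return 100.0 * (v_followup - v_baseline) / v_baseline
```

For example, a baseline volume of 1500 mL shrinking to 1485 mL gives a PBVC of -1.0%.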

12.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13636-13652, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37467085

ABSTRACT

In this work, we explore neat yet effective Transformer-based frameworks for visual grounding. Previous methods generally address the core problem of visual grounding, i.e., multi-modal fusion and reasoning, with manually designed mechanisms. Such heuristic designs are not only complicated but also make models easily overfit specific data distributions. To avoid this, we first propose TransVG, which establishes multi-modal correspondences with Transformers and localizes referred regions by directly regressing box coordinates. We empirically show that complicated fusion modules can be replaced by a simple stack of Transformer encoder layers with higher performance. However, the core fusion Transformer in TransVG stands apart from the uni-modal encoders and thus must be trained from scratch on limited visual grounding data, which makes it hard to optimize and leads to sub-optimal performance. To this end, we further introduce TransVG++ with two-fold improvements. First, we upgrade our framework to a purely Transformer-based one by leveraging the Vision Transformer (ViT) for visual feature encoding. Second, we devise a Language Conditioned Vision Transformer that removes the external fusion modules and reuses the uni-modal ViT for vision-language fusion at the intermediate layers. We conduct extensive experiments on five prevalent datasets and report a series of state-of-the-art records.

13.
IEEE Trans Pattern Anal Mach Intell ; 45(11): 13876-13892, 2023 Nov.
Article in English | MEDLINE | ID: mdl-37486845

ABSTRACT

Long-tailed distributions are widespread in real-world applications. Due to the extremely small ratio of instances, tail categories often show inferior accuracy. In this paper, we find that this performance bottleneck is mainly caused by imbalanced gradients, which can be divided into two parts: (1) a positive part, deriving from samples of the same category, and (2) a negative part, contributed by other categories. Based on comprehensive experiments, we also observe that the ratio of accumulated positive to negative gradients is a good indicator of how balanced a category's training is. Inspired by this, we come up with a gradient-driven training mechanism to tackle the long-tail problem: re-balancing the positive/negative gradients dynamically according to the current accumulated gradients, with the unified goal of achieving balanced gradient ratios. Taking advantage of this simple and flexible gradient mechanism, we introduce a new family of gradient-driven loss functions, namely equalization losses. We conduct extensive experiments on a wide spectrum of visual tasks, including two-stage/single-stage long-tailed object detection (LVIS), long-tailed image classification (ImageNet-LT, Places-LT, iNaturalist), and long-tailed semantic segmentation (ADE20K). Our method consistently outperforms the baseline models, demonstrating the effectiveness and generalization ability of the proposed equalization losses.
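The gradient-driven re-balancing step can be sketched as follows. This is a loose stand-in for the paper's equalization losses, not their formulation: the mapping from the accumulated positive/negative gradient ratio to a negative-gradient weight (and the `gamma` exponent) is an illustrative assumption.

```python
import numpy as np

def rebalance_weights(pos_grad, neg_grad, gamma=0.9):
    """Down-weight negative gradients for categories whose accumulated
    positive/negative gradient ratio is low (i.e., under-trained tails).

    pos_grad, neg_grad: accumulated gradient magnitudes per category.
    Returns a per-category multiplier for the negative gradients.
    """
    ratio = pos_grad / (neg_grad + 1e-12)
    # categories with a balanced ratio (~1) keep full negative-gradient
    # weight; tail categories (small ratio) have their negatives suppressed
    return np.minimum(ratio, 1.0) ** gamma
```

Applying these multipliers at each step nudges every category's accumulated gradient ratio back toward balance.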

14.
Front Neurosci ; 17: 1167612, 2023.
Article in English | MEDLINE | ID: mdl-37274196

ABSTRACT

Background and introduction: Federated learning (FL) has been widely employed for medical image analysis to facilitate multi-client collaborative learning without sharing raw data. Despite great success, FL's applications remain suboptimal in neuroimage analysis tasks such as lesion segmentation in multiple sclerosis (MS), due to variance in lesion characteristics imparted by different scanners and acquisition parameters. Methods: In this work, we propose the first FL framework for MS lesion segmentation, built on two effective re-weighting mechanisms. Specifically, a learnable weight is assigned to each local node during the aggregation process, based on its segmentation performance. In addition, the segmentation loss function of each client is re-weighted according to the lesion volume of its data during training. Results: The proposed method has been validated on two FL MS segmentation scenarios using public and clinical datasets. Specifically, the case-wise and voxel-wise Dice scores of the proposed method on the first (public) dataset are 65.20 and 74.30, respectively. On the second, in-house dataset, the case-wise and voxel-wise Dice scores are 53.66 and 62.31, respectively. Discussion and conclusions: Comparison experiments on the two FL MS segmentation scenarios demonstrate the effectiveness of the proposed method, which significantly outperforms other FL methods. Furthermore, FL incorporating our proposed aggregation mechanism can achieve segmentation performance comparable to that of centralized training with all the raw data.
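The performance-weighted aggregation step can be sketched as a FedAvg variant. A hedged illustration only: weighting each client by its validation Dice score and normalizing is an assumption standing in for the paper's learnable weights.

```python
import numpy as np

def weighted_aggregate(client_params, dice_scores):
    """FedAvg-style aggregation where each client's contribution is
    proportional to its validation segmentation performance (e.g., Dice).

    client_params: list of dicts mapping parameter name -> array.
    dice_scores: one non-negative score per client.
    """
    w = np.asarray(dice_scores, dtype=float)
    w = w / w.sum()  # normalize weights to sum to 1
    keys = client_params[0].keys()
    # weighted average of every parameter tensor across clients
    return {k: sum(wi * p[k] for wi, p in zip(w, client_params)) for k in keys}
```

With equal scores this reduces to plain FedAvg; clients whose local models segment poorly pull the global model less.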

15.
Article in English | MEDLINE | ID: mdl-37220047

ABSTRACT

Observing that existing model compression approaches only reduce the redundancy in convolutional neural networks (CNNs) along one particular dimension (e.g., the channel, spatial, or temporal dimension), in this work we propose a multidimensional pruning (MDP) framework, which can compress both 2-D CNNs and 3-D CNNs along multiple dimensions in an end-to-end fashion. Specifically, MDP simultaneously reduces redundancy along the channel dimension and along additional dimensions that depend on the input data: the spatial dimension for 2-D CNNs with image inputs, and the spatial and temporal dimensions for 3-D CNNs with video inputs. We further extend the MDP framework to the MDP-Point approach for compressing point cloud neural networks (PCNNs) whose inputs are irregular point clouds (e.g., PointNet). In this case, the additional dimension is the point dimension (i.e., the number of points). Comprehensive experiments on six benchmark datasets demonstrate the effectiveness of our MDP framework and its extended version MDP-Point for compressing CNNs and PCNNs, respectively.

16.
IEEE Trans Image Process ; 32: 3176-3187, 2023.
Article in English | MEDLINE | ID: mdl-37204946

ABSTRACT

Pedestrian detection is still a challenging task for computer vision, especially in crowded scenes where the overlaps between pedestrians tend to be large. Non-maximum suppression (NMS) plays an important role in removing redundant false positive detection proposals while retaining the true positive ones. However, highly overlapped true positives may be suppressed if the NMS threshold is set low, while a higher threshold introduces a larger number of false positives. To solve this problem, we propose an optimal threshold prediction (OTP) based NMS method that predicts a suitable NMS threshold for each human instance. First, a visibility estimation module is designed to obtain the visibility ratio. Then, we propose a threshold prediction subnet to automatically determine the optimal NMS threshold according to the visibility ratio and classification score. Finally, we re-formulate the objective function of the subnet and use a reward-guided gradient estimation algorithm to update it. Comprehensive experiments on CrowdHuman and CityPersons show the superior performance of the proposed method in pedestrian detection, especially in crowded scenes.
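Greedy NMS with a per-detection threshold, the mechanism the OTP subnet feeds into, can be sketched as follows. A hedged NumPy illustration: the function name and the choice of using the kept box's own predicted threshold are assumptions, not the paper's exact formulation.

```python
import numpy as np

def nms_adaptive(boxes, scores, thresholds):
    """Greedy NMS where each detection carries its own IoU threshold
    (here a stand-in for the per-instance threshold a subnet would predict).

    boxes: (N, 4) as x1, y1, x2, y2; scores, thresholds: (N,).
    Returns the indices of the kept boxes.
    """
    order = np.argsort(scores)[::-1]  # process highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # IoU of the kept box with all remaining candidates
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        # suppress only the candidates exceeding this box's own threshold
        order = rest[iou <= thresholds[i]]
    return keep
```

A heavily occluded pedestrian would receive a higher threshold, so its overlapping true-positive box survives suppression.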

17.
IEEE Trans Pattern Anal Mach Intell ; 45(10): 12550-12561, 2023 Oct.
Article in English | MEDLINE | ID: mdl-37159310

ABSTRACT

Trajectory forecasting for traffic participants (e.g., vehicles) is critical for autonomous platforms to make safe plans. Currently, most trajectory forecasting methods assume that object trajectories have been extracted and directly develop trajectory predictors based on ground truth trajectories. However, this assumption does not hold in practice. Trajectories obtained from object detection and tracking are inevitably noisy, which can cause serious forecasting errors for predictors built on ground truth trajectories. In this paper, we propose to predict trajectories directly from detection results, without relying on explicitly formed trajectories. Different from traditional methods, which encode the motion cues of an agent based on its clearly defined trajectory, we extract motion information only from the affinity cues among detection results, with an affinity-aware state update mechanism designed to manage the state information. In addition, considering that there may be multiple plausible matching candidates, we aggregate their states. These designs take the uncertainty of association into account, mitigating the undesirable effects of the noisy trajectories obtained from data association and improving the robustness of the predictor. Extensive experiments validate the effectiveness of our method and its generalization ability to different detectors and forecasting schemes.

18.
IEEE Trans Pattern Anal Mach Intell ; 45(4): 5296-5313, 2023 04.
Article in English | MEDLINE | ID: mdl-35939471

ABSTRACT

This paper investigates the task of 2D whole-body human pose estimation, which aims to localize dense landmarks on the entire human body including body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, to take into account the hierarchical structure of the full human body and solve the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to promote both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity for searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, namely COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.


Subject(s)
Algorithms , Human Body , Humans
19.
Nat Commun ; 13(1): 7681, 2022 12 12.
Article in English | MEDLINE | ID: mdl-36509809

ABSTRACT

As one of the most predominant interannual variabilities, the Indian Ocean Dipole (IOD) exerts great socio-economic impacts globally, especially on Asia, Africa, and Australia. While enormous efforts have been made since its discovery to improve both climate models and statistical methods for better prediction, current skill in IOD prediction is mostly limited to three months ahead. Here, we challenge this long-standing problem using a multi-task deep learning model that we name MTL-NET. Hindcasts of the IOD events during the past four decades indicate that MTL-NET can predict the IOD well up to seven months ahead, outperforming most of the world-class dynamical models used for comparison in this study. Moreover, MTL-NET can help assess the importance of different predictors and correctly capture the nonlinear relationships between the IOD and predictors. Given these merits, MTL-NET is demonstrated to be an efficient model for improved IOD prediction.


Subject(s)
Machine Learning , Indian Ocean , Asia , Australia , Africa
20.
IEEE Trans Image Process ; 31: 4884-4896, 2022.
Article in English | MEDLINE | ID: mdl-35839182

ABSTRACT

Motion modeling is crucial in modern action recognition methods. As motion dynamics such as tempo and amplitude may vary greatly across video clips, adaptively capturing the proper motion information poses a great challenge. To address this issue, we introduce a Motion Diversification and Selection (MoDS) module to generate diversified spatio-temporal motion features and then dynamically select the suitable motion representation for categorizing the input video. Specifically, we first propose a spatio-temporal motion generation (StMG) module to construct a bank of diversified motion features with varying spatial neighborhoods and time ranges. Then, a dynamic motion selection (DMS) module is leveraged to choose the most discriminative motion feature, both spatially and temporally, from the feature bank. As a result, our proposed method makes full use of diversified spatio-temporal motion information while maintaining computational efficiency at the inference stage. Extensive experiments on five widely used benchmarks demonstrate the effectiveness of the method, and we achieve state-of-the-art performance on Something-Something V1 & V2, which feature large motion variation.


Subject(s)
Algorithms , Benchmarking , Motion