Results 1 - 20 of 29

1.
IEEE Trans Image Process ; 33: 3907-3920, 2024.
Article in English | MEDLINE | ID: mdl-38900622

ABSTRACT

Inferring 3D human motion is fundamental in many applications, including understanding human activity and analyzing one's intention. While many fruitful efforts have been made toward human motion prediction, most approaches focus on pose-driven prediction and infer human motion in isolation from the contextual environment, thus neglecting the movement of the body's location within the scene. However, real-world human movements are goal-directed and highly influenced by the spatial layout of their surrounding scenes. In this paper, instead of planning future human motion in a "dark" room, we propose a Multi-Condition Latent Diffusion network (MCLD) that reformulates the human motion prediction task as a multi-condition joint inference problem based on the given historical 3D body motion and the current 3D scene contexts. Specifically, instead of directly modeling the joint distribution over the raw motion sequences, MCLD performs a conditional diffusion process within the latent embedding space, characterizing the cross-modal mapping from the past body movement and current scene context condition embeddings to the future human motion embedding. Extensive experiments on large-scale human motion prediction datasets demonstrate that our MCLD achieves significant improvements over state-of-the-art methods in both the realism and the diversity of its predictions.
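
To make the multi-condition idea concrete, below is a minimal, hypothetical sketch of a denoiser operating in a latent space and conditioned on past-motion and scene embeddings, trained with the standard noise-prediction diffusion objective. All module names, dimensions, and the noise schedule are illustrative assumptions, not MCLD's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    def __init__(self, latent_dim=256, cond_dim=256, hidden=512):
        super().__init__()
        # Fuse the noisy future-motion latent with past-motion and scene embeddings.
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z_t, t, past_emb, scene_emb):
        # z_t: noisy future-motion latent at diffusion step t.
        t = t.float().unsqueeze(-1) / 1000.0          # crude timestep encoding
        x = torch.cat([z_t, past_emb, scene_emb, t], dim=-1)
        return self.net(x)                            # predicted noise

# One training step of the standard DDPM noise-prediction objective on the latent:
denoiser = ConditionalDenoiser()
z0 = torch.randn(8, 256)                  # clean future-motion latents (from a motion encoder)
past, scene = torch.randn(8, 256), torch.randn(8, 256)
t = torch.randint(0, 1000, (8,))
alpha_bar = torch.cos(t.float() / 1000 * 3.1416 / 2).pow(2).unsqueeze(-1)  # toy schedule
noise = torch.randn_like(z0)
z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * noise
loss = ((denoiser(z_t, t, past, scene) - noise) ** 2).mean()
```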


Subject(s)
Movement , Humans , Movement/physiology , Algorithms , Neural Networks, Computer , Video Recording/methods , Imaging, Three-Dimensional/methods , Image Processing, Computer-Assisted/methods
2.
IEEE Trans Image Process ; 32: 5394-5407, 2023.
Article in English | MEDLINE | ID: mdl-37721874

ABSTRACT

Human parsing aims to segment each pixel of a human image into fine-grained semantic categories. However, current human parsers trained on clean data are easily confused by numerous image corruptions such as blur and noise. To improve the robustness of human parsers, we construct three corruption robustness benchmarks, termed LIP-C, ATR-C, and Pascal-Person-Part-C, to help evaluate the risk tolerance of human parsing models. Inspired by data augmentation strategies, we propose a novel heterogeneous augmentation-enhanced mechanism to bolster robustness under commonly corrupted conditions. Specifically, two types of data augmentation from different views, i.e., image-aware augmentation and model-aware image-to-image transformation, are integrated sequentially to adapt to unforeseen image corruptions. The image-aware augmentation enriches the diversity of training images with the help of common image operations, while the model-aware augmentation improves the diversity of input data by exploiting the model's randomness. The proposed method is model-agnostic and can be plugged into arbitrary state-of-the-art human parsing frameworks. The experimental results show that the proposed method demonstrates good universality, improving the robustness of human parsing models, and even semantic segmentation models, in the face of various common image corruptions, while still obtaining comparable performance on clean data.
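
As a rough illustration of the sequential composition described above, the sketch below chains a few common image operations with a stochastic image-to-image perturbation. The specific operations are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torchvision.transforms as T

image_aware = T.Compose([            # common image operations enriching diversity
    T.ColorJitter(0.4, 0.4, 0.4),
    T.GaussianBlur(kernel_size=5),
])

def model_aware(img: torch.Tensor, noise_std=0.05) -> torch.Tensor:
    # Stand-in for a stochastic image-to-image transformation whose output
    # varies from pass to pass (here: injected Gaussian noise).
    return (img + noise_std * torch.randn_like(img)).clamp(0.0, 1.0)

x = torch.rand(3, 224, 224)          # a training image in [0, 1]
x_aug = model_aware(image_aware(x))  # the two views composed sequentially
```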

3.
IEEE Trans Image Process ; 32: 4459-4471, 2023.
Article in English | MEDLINE | ID: mdl-37527313

ABSTRACT

Semi-supervised dense prediction tasks, such as semantic segmentation, can be greatly improved through the use of contrastive learning. However, this approach presents two key challenges: selecting informative negative samples from a highly redundant pool and implementing effective data augmentation. To address these challenges, we present an adversarial contrastive learning method specifically for semi-supervised semantic segmentation. Direct learning of adversarial negatives is adopted to retain discriminative information from the past, leading to higher learning efficiency. Our approach also leverages an advanced data augmentation strategy called AdverseMix, which combines information from under-performing classes to generate more diverse and challenging samples. Additionally, we use auxiliary labels and classifiers to prevent over-adversarial negatives from affecting the learning process. Our experiments on the Pascal VOC and Cityscapes datasets demonstrate that our method outperforms the state-of-the-art by a significant margin, even when using a small fraction of labeled data.
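
The following toy sketch illustrates one plausible reading of an AdverseMix-style augmentation: pixels belonging to the currently worst-performing classes of one image are pasted into another. The class-scoring signal and the mixing rule here are assumptions, not the authors' exact formulation.

```python
import torch

def adversemix(img_a, mask_a, img_b, mask_b, class_iou, k=2):
    # Pick the k classes with the lowest validation IoU (under-performing).
    worst = torch.topk(class_iou, k, largest=False).indices
    paste = torch.isin(mask_a, worst)                # pixels of weak classes in image A
    img = torch.where(paste.unsqueeze(0), img_a, img_b)
    mask = torch.where(paste, mask_a, mask_b)
    return img, mask

img_a, img_b = torch.rand(3, 128, 128), torch.rand(3, 128, 128)
mask_a = torch.randint(0, 21, (128, 128))            # 21 Pascal VOC classes
mask_b = torch.randint(0, 21, (128, 128))
iou = torch.rand(21)                                 # per-class scores from validation
mixed_img, mixed_mask = adversemix(img_a, mask_a, img_b, mask_b, iou)
```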

4.
IEEE Trans Image Process ; 32: 1978-1991, 2023.
Article in English | MEDLINE | ID: mdl-37030697

ABSTRACT

Recently, face super-resolution methods steered by deep convolutional neural networks (CNNs) have made great progress in restoring degraded facial details through joint training with facial priors. However, these methods have some obvious limitations. On the one hand, multi-task joint learning requires additional annotation of the dataset, and the introduced prior network significantly increases the computational cost of the model. On the other hand, the limited receptive field of CNNs reduces the fidelity and naturalness of the reconstructed facial images, resulting in suboptimal reconstructions. In this work, we propose an efficient CNN-Transformer Cooperation Network (CTCNet) for face super-resolution, which uses a multi-scale connected encoder-decoder architecture as its backbone. Specifically, we first devise a novel Local-Global Feature Cooperation Module (LGCM), composed of a Facial Structure Attention Unit (FSAU) and a Transformer block, to promote the restoration of local facial detail and global facial structure simultaneously. Then, we design an efficient Feature Refinement Module (FRM) to enhance the encoded features. Finally, to further improve the restoration of fine facial details, we present a Multi-scale Feature Fusion Unit (MFFU) to adaptively fuse the features from different stages of the encoder. Extensive evaluations on various datasets show that the proposed CTCNet significantly outperforms other state-of-the-art methods. Source code will be available at https://github.com/IVIPLab/CTCNet.
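
The sketch below illustrates the general shape of such a local-global cooperation block: a convolutional branch for local detail running in parallel with a self-attention branch for global structure, then fused. Layer sizes and the fusion rule are assumptions and do not reproduce CTCNet's actual LGCM.

```python
import torch
import torch.nn as nn

class LocalGlobalBlock(nn.Module):
    def __init__(self, ch=64, heads=4):
        super().__init__()
        self.local = nn.Sequential(                  # local facial-detail branch
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.attn = nn.MultiheadAttention(ch, heads, batch_first=True)
        self.fuse = nn.Conv2d(2 * ch, ch, 1)         # merge the two branches

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, HW, C) for global attention
        glob, _ = self.attn(tokens, tokens, tokens)
        glob = glob.transpose(1, 2).reshape(b, c, h, w)
        return self.fuse(torch.cat([self.local(x), glob], dim=1)) + x

out = LocalGlobalBlock()(torch.randn(2, 64, 32, 32))
```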


Subject(s)
Learning , Neural Networks, Computer , Software , Image Processing, Computer-Assisted
5.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10718-10730, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37030807

ABSTRACT

As a representative self-supervised method, contrastive learning has achieved great success in unsupervised representation learning. It trains an encoder by distinguishing positive samples from negative ones given query anchors. These positive and negative samples play critical roles in defining the objective that learns the discriminative encoder, preventing it from learning trivial features. While existing methods heuristically choose these samples, we present a principled method in which both positive and negative samples are directly learnable end-to-end with the encoder. We show that the positive and negative samples can be cooperatively and adversarially learned by minimizing and maximizing the contrastive loss, respectively. This yields cooperative positives and adversarial negatives with respect to the encoder, which are updated to continuously track the learned representation of the query anchors over mini-batches. The proposed method achieves 71.3% and 75.3% top-1 accuracy after 200 and 800 epochs, respectively, of pre-training a ResNet-50 backbone on ImageNet1K, without tricks such as multi-crop or stronger augmentations. With multi-crop, it can be further boosted to 75.7%. The source code and pre-trained model are released at https://github.com/maple-research-lab/caco.
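
The core mechanism, learning negatives end-to-end by maximizing the same contrastive loss the encoder minimizes, can be sketched in a few lines. The bank size, temperature, and single positive per query below are simplifying assumptions.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Linear(128, 64)                   # stand-in encoder
negatives = torch.randn(4096, 64, requires_grad=True)  # learnable negative bank
opt_enc = torch.optim.SGD(encoder.parameters(), lr=0.1)
opt_neg = torch.optim.SGD([negatives], lr=0.1)

x, x_pos = torch.randn(32, 128), torch.randn(32, 128)
q = F.normalize(encoder(x), dim=-1)
k = F.normalize(encoder(x_pos), dim=-1)
n = F.normalize(negatives, dim=-1)
logits = torch.cat([(q * k).sum(-1, keepdim=True), q @ n.t()], dim=1) / 0.2
loss = F.cross_entropy(logits, torch.zeros(32, dtype=torch.long))

opt_enc.zero_grad(); opt_neg.zero_grad()
loss.backward()
opt_enc.step()                                       # encoder: gradient descent
negatives.grad.neg_()                                # negatives: flip the gradient sign ...
opt_neg.step()                                       # ... i.e., gradient ascent on the loss
```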

6.
Article in English | MEDLINE | ID: mdl-37022403

ABSTRACT

Deep learning-based models have been shown to outperform human beings in many computer vision tasks when massive labeled training data are available. However, humans have an amazing ability to recognize images of novel categories by browsing only a few examples of those categories. Few-shot learning has therefore emerged to make machines learn from extremely limited labeled examples. One possible reason why human beings can learn novel concepts quickly and efficiently is that they have sufficient visual and semantic prior knowledge. Toward this end, this work proposes a novel knowledge-guided semantic transfer network (KSTNet) for few-shot image recognition from a supplementary perspective by introducing auxiliary prior knowledge. The proposed network jointly incorporates vision inference, knowledge transfer, and classifier learning into one unified framework for optimal compatibility. A category-guided visual learning module is developed in which a visual classifier is learned based on the feature extractor, optimized with cosine similarity and a contrastive loss. To fully explore prior knowledge of category correlations, a knowledge transfer network is then developed to propagate knowledge information among all categories to learn the semantic-visual mapping, thus inferring a knowledge-based classifier for novel categories from base categories. Finally, we design an adaptive fusion scheme to infer the desired classifiers by effectively integrating the above knowledge and visual information. Extensive experiments are conducted on the two widely used Mini-ImageNet and Tiered-ImageNet benchmarks to validate the effectiveness of KSTNet. Compared with the state of the art, the results show that the proposed method achieves favorable performance with minimal bells and whistles, especially in the one-shot setting.

7.
IEEE Trans Pattern Anal Mach Intell ; 45(5): 5549-5560, 2023 May.
Article in English | MEDLINE | ID: mdl-36049010

ABSTRACT

Representation learning has developed significantly with the advance of contrastive learning methods, most of which benefit from various data augmentations carefully designed to maintain instance identity so that the images transformed from the same instance can still be retrieved. However, those carefully designed transformations limit further exploration of the novel patterns exposed by other transformations. Meanwhile, as shown in our experiments, direct contrastive learning on strongly augmented images cannot learn representations effectively. Thus, we propose a general framework called Contrastive Learning with Stronger Augmentations (CLSA) to complement current contrastive learning approaches. Here, the distribution divergence between the weakly and strongly augmented images over a representation bank is adopted to supervise the retrieval of strongly augmented queries from a pool of instances. Experiments on the ImageNet dataset and downstream datasets show that the information from the strongly augmented images can significantly boost performance. For example, CLSA achieves a top-1 accuracy of 76.2% on ImageNet with a standard ResNet-50 architecture and a fine-tuned single-layer classifier, which is almost the same level as the 76.5% of the supervised result.
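
A minimal sketch of the distributional supervision described above: the similarity distribution of the weakly augmented view over a representation bank serves as the (stopped-gradient) target for the strongly augmented view. Bank size and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

bank = F.normalize(torch.randn(8192, 64), dim=-1)    # memory bank of representations
q_weak = F.normalize(torch.randn(32, 64), dim=-1)    # weakly augmented queries
q_strong = F.normalize(torch.randn(32, 64), dim=-1)  # strongly augmented queries

p_weak = F.softmax(q_weak @ bank.t() / 0.2, dim=-1).detach()   # target distribution
log_p_strong = F.log_softmax(q_strong @ bank.t() / 0.2, dim=-1)
ddm_loss = F.kl_div(log_p_strong, p_weak, reduction="batchmean")
```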

8.
IEEE Trans Image Process ; 31: 5599-5612, 2022.
Article in English | MEDLINE | ID: mdl-36001523

ABSTRACT

Most state-of-the-art instance-level human parsing models adopt two-stage anchor-based detectors and therefore cannot avoid heuristic anchor box design or the lack of pixel-level analysis. To address these two issues, we have designed an instance-level human parsing network that is anchor-free and solvable at the pixel level. It consists of two simple sub-networks: an anchor-free detection head for bounding box prediction and an edge-guided parsing head for human segmentation. The anchor-free detection head inherits pixel-level merits and effectively avoids sensitivity to hyper-parameters, as proved in object detection applications. By introducing a part-aware boundary clue, the edge-guided parsing head is capable of distinguishing adjacent human parts from one another, up to 58 parts in a single human instance, even in overlapping instances. Meanwhile, a refinement head integrating box-level scores and part-level parsing quality is exploited to improve the quality of the parsing results. Experiments on two multiple human parsing datasets (i.e., CIHP and LV-MHP-v2.0) and one video instance-level human parsing dataset (i.e., VIP) show that our method achieves the best global-level and instance-level performance over state-of-the-art one-stage top-down alternatives.

9.
IEEE Trans Image Process ; 31: 5484-5497, 2022.
Article in English | MEDLINE | ID: mdl-35943998

ABSTRACT

Spatial-temporal relation reasoning is a significant yet challenging problem for video action recognition. Previous works typically apply local operations like 2D or 3D CNNs to conduct space-time interactions in video sequences, or simply capture space-time long-range relations of a single fixed scale. However, this is inadequate for obtaining a comprehensive action representation. Besides, most models treat all input frames equally for the final classification, without selecting key frames and motion-sensitive regions. This introduces irrelevant video content and hurts the performance of models. In this paper, we propose a generic Spatial-Temporal Pyramid Graph Network (STPG-Net) to adaptively capture long-range spatial-temporal relations in video sequences at multiple scales. Specifically, we design a temporal attention (TA) module and a spatial-temporal attention (STA) module to learn the contribution of each frame and each space-time region to an action at a feature level, respectively. We then apply the selected key information to build spatial-temporal pyramid graphs for long-range relation reasoning and more comprehensive action representation learning. STPG-Net can be flexibly integrated into 2D and 3D backbone networks in a plug-and-play manner. Extensive experiments show that it brings consistent improvements over many challenging baselines on several standard action recognition benchmarks (i.e., Something-Something V1 & V2, and FineGym), demonstrating the effectiveness of our approach.
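
As a simplified illustration of frame-level selection, the sketch below scores each frame and pools features by the learned weights. The scoring layers are assumptions and do not reproduce STPG-Net's actual TA module.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(ch, ch // 4), nn.ReLU(), nn.Linear(ch // 4, 1))

    def forward(self, feats):                        # feats: (B, T, C) frame features
        w = torch.softmax(self.score(feats), dim=1)  # per-frame contribution (B, T, 1)
        return (w * feats).sum(dim=1), w             # attended feature and frame weights

video_feats = torch.randn(2, 8, 256)                 # 8 frames, 256-d features each
pooled, frame_weights = TemporalAttention()(video_feats)
```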

10.
IEEE Trans Pattern Anal Mach Intell ; 44(11): 8694-8700, 2022 11.
Article in English | MEDLINE | ID: mdl-34018928

ABSTRACT

In this paper, we propose K-Shot Contrastive Learning (KSCL) of visual features, which applies multiple augmentations to investigate the sample variations within individual instances. It aims to combine the advantages of inter-instance discrimination, learning discriminative features to distinguish between different instances, with those of intra-instance variation, matching queries against the variants of augmented samples over instances. In particular, for each instance it constructs an instance subspace to model how the significant factors of variation in K-shot augmentations can be combined to form the variants of augmentations. Given a query, the most relevant variant of an instance is then retrieved by projecting the query onto the instance subspaces to predict the positive instance class. This generalizes existing contrastive learning, which can be viewed as a special one-shot case. An eigenvalue decomposition is performed to configure the instance subspaces, and the embedding network can be trained end-to-end through the differentiable subspace configuration. Experimental results demonstrate that the proposed K-shot contrastive learning achieves superior performance to state-of-the-art unsupervised methods.
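
The subspace retrieval step can be sketched as follows: the K augmented variants of an instance span a subspace, and a query's relevance to that instance is measured by projecting it onto the subspace. Dimensions and the rank choice are assumptions; the paper's differentiable eigendecomposition is simplified here to a plain SVD.

```python
import torch

def subspace_score(query, variants, rank=4):
    # variants: (K, D) embeddings of the K-shot augmentations of one instance.
    U, _, _ = torch.linalg.svd(variants.t(), full_matrices=False)  # basis (D, K)
    basis = U[:, :rank]
    proj = basis @ (basis.t() @ query)               # projection onto the subspace
    return torch.dot(proj, query)                    # larger = more relevant instance

q = torch.nn.functional.normalize(torch.randn(64), dim=0)
variants = torch.nn.functional.normalize(torch.randn(8, 64), dim=1)  # K = 8 shots
score = subspace_score(q, variants)
```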


Subject(s)
Algorithms , Learning
11.
IEEE Trans Pattern Anal Mach Intell ; 44(6): 3300-3315, 2022 06.
Article in English | MEDLINE | ID: mdl-33434123

ABSTRACT

Human motion prediction aims to generate future motions based on observed human motions. Witnessing the success of Recurrent Neural Networks (RNNs) in modeling sequential data, recent works utilize RNNs to model human-skeleton motion on the observed motion sequence and predict future human motions. However, these methods disregard the spatial coherence among joints and the temporal evolution among skeletons, which reflect the crucial characteristics of human motion in spatiotemporal space. To this end, we propose a novel Skeleton-Joint Co-Attention Recurrent Neural Network (SC-RNN) to capture the spatial coherence among joints and the temporal evolution among skeletons simultaneously on a skeleton-joint co-attention feature map in spatiotemporal space. First, a skeleton-joint feature map is constructed as the representation of the observed motion sequence. Second, we design a new Skeleton-Joint Co-Attention (SCA) mechanism to dynamically learn a skeleton-joint co-attention feature map of this skeleton-joint feature map, which refines the useful observed motion information for predicting future motion. Third, a variant of GRU embedded with SCA collaboratively models the human-skeleton motion and human-joint motion in spatiotemporal space by regarding the skeleton-joint co-attention feature map as the motion context. Experimental results on human motion prediction demonstrate that the proposed method outperforms competing methods.


Subject(s)
Algorithms , Neural Networks, Computer , Attention , Humans , Motion , Skeleton
12.
IEEE Trans Pattern Anal Mach Intell ; 44(4): 2168-2187, 2022 04.
Article in English | MEDLINE | ID: mdl-33074801

ABSTRACT

Representation learning with small labeled data has emerged in many problems, since the success of deep neural networks often relies on the availability of a huge amount of labeled data that is expensive to collect. To address this, many efforts have been made to train sophisticated models with few labeled data in unsupervised and semi-supervised fashions. In this paper, we review recent progress on these two major categories of methods. A wide spectrum of models is categorized in a big picture, where we show how they interplay with each other to motivate the exploration of new ideas. We review the principles of learning transformation-equivariant, disentangled, self-supervised, and semi-supervised representations, all of which underpin the foundation of recent progress. Many implementations of unsupervised and semi-supervised generative models have been developed on the basis of these criteria, greatly expanding the territory of existing autoencoders, generative adversarial nets (GANs), and other deep networks by exploring the distribution of unlabeled data for more powerful representations. We discuss emerging topics by revealing the intrinsic connections between unsupervised and semi-supervised learning, and propose future directions to bridge the algorithmic and theoretical gap between transformation equivariance for unsupervised learning and supervised invariance for supervised learning, and to unify unsupervised pretraining and supervised finetuning. We also provide a broader outlook on future directions to unify transformation and instance equivariance for representation learning, connect unsupervised and semi-supervised augmentations, and explore the role of self-supervised regularization in many learning problems.


Subject(s)
Algorithms , Big Data , Neural Networks, Computer , Supervised Machine Learning
13.
IEEE Trans Pattern Anal Mach Intell ; 44(4): 2045-2057, 2022 Apr.
Article in English | MEDLINE | ID: mdl-33035159

ABSTRACT

Transformation equivariant representations (TERs) aim to capture the intrinsic visual structures that equivary to various transformations by expanding the notion of translation equivariance underlying the success of convolutional neural networks (CNNs). For this purpose, we present both deterministic AutoEncoding Transformations (AET) and probabilistic AutoEncoding Variational Transformations (AVT) models to learn visual representations from generic groups of transformations. While the AET is trained by directly decoding the transformations from the learned representations, the AVT is trained by maximizing the joint mutual information between the learned representation and transformations. This results in generalized TERs (GTERs) equivariant against transformations in a more general fashion by capturing complex patterns of visual structures beyond the conventional linear equivariance under a transformation group. The presented approach can be extended to (semi-)supervised models by jointly maximizing the mutual information of the learned representation with both labels and transformations. Experiments demonstrate the proposed models outperform the state-of-the-art models in both unsupervised and (semi-)supervised tasks. Moreover, we show that the unsupervised representation can even surpass the fully supervised representation pretrained on ImageNet when they are fine-tuned for the object detection task.
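
A compact sketch of the AET objective: a decoder regresses the transformation parameters from the representations of an image and its transformed version. The tiny encoder/decoder and the rotation-only transformation group below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision.transforms.functional as TF

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128), nn.ReLU())
decoder = nn.Linear(256, 1)                          # predicts the rotation angle

x = torch.rand(16, 3, 32, 32)
angles = torch.rand(16) * 360.0                      # sampled transformations
x_t = torch.stack([TF.rotate(img, float(a)) for img, a in zip(x, angles)])

z, z_t = encoder(x), encoder(x_t)                    # representations of both views
pred = decoder(torch.cat([z, z_t], dim=1)).squeeze(-1)
aet_loss = ((pred - angles / 360.0) ** 2).mean()     # decode the transformation
```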

14.
Insect Sci ; 29(2): 505-520, 2022 Apr.
Article in English | MEDLINE | ID: mdl-34050604

ABSTRACT

The fall armyworm (FAW), Spodoptera frugiperda (J.E. Smith), spread rapidly in Africa and Asia recently, causing huge economic losses in crop production. Fall armyworm caterpillars were first detected in South Korea and Japan in June 2019. Here, the migration timing and path for FAW into the countries were estimated by a trajectory simulation approach implementing the insect's flight behavior. The result showed that FAWs found in both South Korea and Japan were estimated to have come from eastern China by crossing the Yellow Sea or the East China Sea in 10-36 h in three series of migrations. In the first series, FAW moths that arrived on Jeju Island during 22-24 May were estimated to be from Zhejiang, Anhui and Fujian Provinces after 1-2 nights' flights. In the second series, it was estimated that FAW moths landed in southern Korea and Kyushu region of Japan simultaneously or successively during 5-9 June, and these moths mostly came from Guangdong and Fujian Provinces. The FAW moths in the third series were estimated to have immigrated from Taiwan Province onto Okinawa Islands during 19-24 June. During these migrations, southwesterly low-level jets extending from eastern China to southern Korea and/or Japan were observed in the northwestern periphery of the western Pacific Subtropical High. These results, for the first time, suggested that the overseas FAW immigrants invading Korea and Japan came from eastern and southern China. This study is helpful for future monitoring, early warning and the source control of this pest in the two countries.


Subject(s)
Emigration and Immigration , Moths , Animals , China , Japan , Spodoptera , Zea mays
15.
Insects ; 12(12)2021 Dec 10.
Article in English | MEDLINE | ID: mdl-34940192

ABSTRACT

Fall armyworm is recognized as one of the most destructive global agricultural pests. In January 2020, it first invaded Australia, posing a significant risk to the country's biosecurity, food security, and agricultural productivity. In this study, the migration paths and wind systems for the case of fall armyworm invading Australia were analyzed using a three-dimensional trajectory simulation approach, combined with the insect's flight behavior and NCEP meteorological reanalysis data. The analysis showed that fall armyworm in the Torres Strait most likely came from the surrounding islands of central Indonesia on two occasions via wind-borne migration. Specifically, fall armyworm moths detected on Saibai and Erub Islands might have arrived from southern Sulawesi Island, Indonesia, between January 15 and 16. The fall armyworm in Bamaga most likely arrived from the islands around the Arafura Sea and Sulawesi Island, Indonesia, between January 26 and 27. The high-risk period for the invasion of fall armyworm is likely to have occurred only in January-February due to monsoon winds, which were conducive to flight across the Timor Sea towards Australia. This case study is the first to confirm the immigration paths and timing of fall armyworm from Indonesia to Australia via its surrounding islands.

16.
IEEE Trans Pattern Anal Mach Intell ; 43(9): 2953-2970, 2021 09.
Article in English | MEDLINE | ID: mdl-33591909

ABSTRACT

Differentiable architecture search (DARTS) enables effective neural architecture search (NAS) using gradient descent, but suffers from high memory and computational costs. In this paper, we propose a novel approach, namely Partially-Connected DARTS (PC-DARTS), to achieve efficient and stable neural architecture search by reducing the channel and spatial redundancies of the super-network. At the channel level, partial channel connection is presented, randomly sampling a small subset of channels for operation selection to accelerate the search process and suppress the over-fitting of the super-network. A side operation is introduced to bypass the (non-sampled) channels, guaranteeing the performance of searched architectures under extremely low sampling rates. At the spatial level, input features are down-sampled to eliminate spatial redundancy and enhance the efficiency of the mixed computation for operation selection. Furthermore, edge normalization is developed to maintain the consistency of edge selection based on channel sampling with the architectural parameters for edges. Theoretical analysis shows that partial channel connection and the parameterized side operation are equivalent to regularizing the super-network on the weights and architectural parameters during bilevel optimization. Experimental results demonstrate that the proposed approach achieves higher search speed and training stability than DARTS. PC-DARTS obtains a top-1 error rate of 2.55 percent on CIFAR-10 within 0.07 GPU-days of architecture search, and a state-of-the-art top-1 error rate of 24.1 percent on ImageNet (under the mobile setting) within 2.8 GPU-days.
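
The partial channel connection idea can be sketched briefly: only 1/K of the channels pass through the weighted candidate operations while the rest bypass them. The operation set and K below are assumptions for illustration (the real method also shuffles channels and adds edge normalization).

```python
import torch
import torch.nn as nn

class PartialChannelMixedOp(nn.Module):
    def __init__(self, ch=64, k=4):
        super().__init__()
        self.k = k
        self.ops = nn.ModuleList([                   # candidate operations on ch/k channels
            nn.Conv2d(ch // k, ch // k, 3, padding=1),
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Identity(),
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        c = x.size(1) // self.k
        x_op, x_skip = x[:, :c], x[:, c:]            # sampled vs. bypassed channels
        w = torch.softmax(self.alpha, dim=0)
        mixed = sum(wi * op(x_op) for wi, op in zip(w, self.ops))
        return torch.cat([mixed, x_skip], dim=1)     # bypassed channels pass through

out = PartialChannelMixedOp()(torch.randn(2, 64, 16, 16))
```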

17.
IEEE Trans Image Process ; 30: 2060-2071, 2021.
Article in English | MEDLINE | ID: mdl-33460378

ABSTRACT

Person re-identification is the crucial task of identifying pedestrians of interest across multiple surveillance camera views. For person re-identification, a pedestrian is usually represented by features extracted from a rectangular image region that inevitably contains scene background, which introduces ambiguity in distinguishing different pedestrians and degrades accuracy. Thus, we propose an end-to-end foreground-aware network that discriminates the foreground from the background by learning a soft mask for person re-identification. In our method, in addition to the pedestrian ID as supervision for the foreground, we introduce the camera ID of each pedestrian image for background modeling. The foreground branch and the background branch are optimized collaboratively. By introducing a target attention loss, the pedestrian features extracted from the foreground branch become less sensitive to backgrounds, which greatly reduces the negative impact of changing backgrounds on pedestrian matching across different camera views. Notably, in contrast to existing methods, our approach does not require an additional dataset to train a human landmark detector or a segmentation model for locating background regions. The experimental results on three challenging datasets, i.e., Market-1501, DukeMTMC-reID, and MSMT17, demonstrate the effectiveness of our approach.
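
A minimal sketch of the soft-mask idea: a learned per-pixel foreground probability reweights the feature map before pooling, suppressing background. The mask head and feature shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 256, 24, 8)                    # backbone feature map of a pedestrian crop
mask_head = nn.Sequential(nn.Conv2d(256, 1, 1), nn.Sigmoid())
soft_mask = mask_head(feat)                          # per-pixel foreground probability

# Mask-weighted average pooling: a background-suppressed descriptor for matching.
fg_feat = (feat * soft_mask).flatten(2).sum(-1) / soft_mask.flatten(2).sum(-1)
```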


Subject(s)
Biometric Identification/methods , Image Processing, Computer-Assisted/methods , Machine Learning , Algorithms , Humans , Pedestrians , Video Recording
18.
IEEE Trans Image Process ; 30: 1639-1647, 2021.
Article in English | MEDLINE | ID: mdl-33347409

ABSTRACT

Deep neural networks have been successfully applied to many real-world applications. However, such successes rely heavily on large amounts of labeled data, which are expensive to obtain. Recently, many semi-supervised learning methods have been proposed and have achieved excellent performance. In this study, we propose a new EnAET framework to further improve existing semi-supervised methods with self-supervised information. To the best of our knowledge, all current semi-supervised methods improve performance using prediction consistency and confidence ideas; we are the first to explore the role of self-supervised representations in semi-supervised learning under a rich family of transformations. Consequently, our framework can integrate the self-supervised information as a regularization term to further improve all current semi-supervised methods. In the experiments, we use MixMatch, the current state-of-the-art semi-supervised learning method, as a baseline to test the proposed EnAET framework. We adopt the same hyper-parameters across different datasets, which greatly improves the generalization ability of the EnAET framework. Experimental results on different datasets demonstrate that the proposed EnAET framework greatly improves the performance of current semi-supervised algorithms. Moreover, the framework can also improve supervised learning by a large margin, including in the extremely challenging scenario with only 10 images per class. The code and experiment records are available at https://github.com/maple-research-lab/EnAET.

19.
IEEE Trans Pattern Anal Mach Intell ; 43(3): 1110-1118, 2021 03.
Article in English | MEDLINE | ID: mdl-31545711

ABSTRACT

In this work, we aim to address the problem of human interaction recognition in videos by exploring the long-term inter-related dynamics among multiple persons. Recently, Long Short-Term Memory (LSTM) has become a popular choice for modeling individual dynamics in single-person action recognition, due to its ability to capture temporal motion information over a range. However, most existing LSTM-based methods capture the dynamics of a human interaction only by simply combining all individual dynamics or modeling them as a whole. Such methods neglect the inter-related dynamics of how human interactions change over time. To this end, we propose a novel Hierarchical Long Short-Term Concurrent Memory (H-LSTCM) to model the long-term inter-related dynamics among a group of persons for recognizing human interactions. Specifically, we first feed each person's static features into a Single-Person LSTM to model the single-person dynamic. Subsequently, at each time step, the outputs of all Single-Person LSTM units are fed into a novel Concurrent LSTM (Co-LSTM) unit, which mainly consists of multiple sub-memory units, a new cell gate, and a new co-memory cell. In the Co-LSTM unit, each sub-memory unit stores individual motion information, while the Co-LSTM unit selectively integrates and stores inter-related motion information between multiple interacting persons from the multiple sub-memory units via the cell gate and co-memory cell, respectively. Extensive experiments on several public datasets validate the effectiveness of the proposed H-LSTCM by comparison against baseline and state-of-the-art methods.


Subject(s)
Algorithms , Neural Networks, Computer , Humans , Memory, Long-Term , Motion , Recognition, Psychology
20.
IEEE Trans Pattern Anal Mach Intell ; 42(1): 126-139, 2020 01.
Article in English | MEDLINE | ID: mdl-30296212

ABSTRACT

With the popularity of mobile sensor technology, smart wearable devices open an unprecedented opportunity to solve the challenging human activity recognition (HAR) problem by learning expressive representations from multi-dimensional daily sensor signals. This inspires us to develop a new algorithm applicable to both camera-based and wearable sensor-based HAR systems. Although competitive classification accuracy has been reported, existing methods often face the challenge of distinguishing visually similar activities composed of the same activity patterns in different temporal orders. In this paper, we propose a novel probabilistic algorithm to compactly encode the temporal orders of activity patterns for HAR. Specifically, the algorithm learns an optimal set of latent patterns such that their temporal structures really matter in recognizing different human activities. Then, a novel probabilistic First-Take-All (pFTA) approach is introduced to generate compact features from the orders of these latent patterns to encode the entire sequence, so that the temporal structural similarity between different sequences can be efficiently measured by the Hamming distance between their compact features. Experiments on three public HAR datasets show that the proposed pFTA approach achieves competitive performance in terms of accuracy as well as efficiency.
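
A toy sketch of the First-Take-All encoding: a sequence is summarized by which of each pair of latent patterns is detected first, and two sequences are compared by the Hamming distance between these codes. The random "hit times" below are stand-ins for learned latent-pattern detections.

```python
import numpy as np

def fta_code(first_hit_times):
    # first_hit_times[i]: time at which latent pattern i first occurs in the sequence.
    n = len(first_hit_times)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return np.array([first_hit_times[i] < first_hit_times[j] for i, j in pairs])

t_a = np.random.rand(6)                              # pattern hit times, sequence A
t_b = np.random.rand(6)                              # pattern hit times, sequence B
dist = np.sum(fta_code(t_a) != fta_code(t_b))        # Hamming distance between codes
```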


Subject(s)
Human Activities/classification , Pattern Recognition, Automated/methods , Algorithms , Databases, Factual , Humans , Image Processing, Computer-Assisted , Models, Statistical , Video Recording , Wearable Electronic Devices