
Selva, Javier; Johansen, Anders S; Escalera, Sergio; Nasrollahi, Kamal; Moeslund, Thomas B; Clapes, Albert.

IEEE Trans Pattern Anal Mach Intell ; 45(11): 12922-12943, 2023 Nov.

Article in English | MEDLINE | ID: mdl-37022830

ABSTRACT

Transformer models have shown great success handling long-range interactions, making them a promising tool for modeling video. However, they lack inductive biases and scale quadratically with input length. These limitations are further exacerbated when dealing with the high dimensionality introduced by the temporal dimension. While there are surveys analyzing the advances of Transformers for vision, none focus on an in-depth analysis of video-specific designs. In this survey, we analyze the main contributions and trends of works leveraging Transformers to model video. Specifically, we delve into how videos are handled at the input level first. Then, we study the architectural changes made to deal with video more efficiently, reduce redundancy, re-introduce useful inductive biases, and capture long-term temporal dynamics. In addition, we provide an overview of different training regimes and explore effective self-supervised learning strategies for video. Finally, we conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D ConvNets even with less computational complexity.
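The quadratic scaling mentioned in the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the survey: it assumes a clip is cut into non-overlapping patches (the 16-pixel patch size, the 224x224 resolution, and the helper names `num_tokens` and `attention_pairs` are all illustrative assumptions) and counts the entries of one full self-attention matrix.

```python
def num_tokens(frames, height, width, patch=16, tubelet=1):
    """Tokens when a clip is cut into non-overlapping patch x patch x tubelet blocks.

    The patch and tubelet sizes are illustrative assumptions, not values from the survey.
    """
    return (frames // tubelet) * (height // patch) * (width // patch)


def attention_pairs(tokens):
    """Entries in one full self-attention matrix (every token attends to every token)."""
    return tokens * tokens


# Doubling the number of frames doubles the tokens but quadruples the attention entries.
for frames in (8, 16, 32):
    n = num_tokens(frames, 224, 224)
    print(frames, n, attention_pairs(n))
```

Under these assumptions, each doubling of the clip length multiplies the attention matrix size by four, which is why adding the temporal dimension is costly for full self-attention.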

Deep learning with self-supervision and uncertainty regularization to count fish in underwater images.

Tarling, Penny; Cantor, Mauricio; Clapés, Albert; Escalera, Sergio.

PLoS One ; 17(5): e0267759, 2022.

Article in English | MEDLINE | ID: mdl-35507631

ABSTRACT

Effective conservation actions require effective population monitoring. However, accurately counting animals in the wild to inform conservation decision-making is difficult. Monitoring populations through image sampling has made data collection cheaper, wide-reaching and less intrusive but created a need to process and analyse this data efficiently. Counting animals from such data is challenging, particularly when densely packed in noisy images. Attempting this manually is slow and expensive, while traditional computer vision methods are limited in their generalisability. Deep learning is the state-of-the-art method for many computer vision tasks, but it has yet to be properly explored to count animals. To this end, we employ deep learning, with a density-based regression approach, to count fish in low-resolution sonar images. We introduce a large dataset of sonar videos, deployed to record wild Lebranche mullet schools (Mugil liza), with a subset of 500 labelled images. We utilise abundant unlabelled data in a self-supervised task to improve the supervised counting task. For the first time in this context, by introducing uncertainty quantification, we improve model training and provide an accompanying measure of prediction uncertainty for more informed biological decision-making. Finally, we demonstrate the generalisability of our proposed counting framework through testing it on a recent benchmark dataset of high-resolution annotated underwater images from varying habitats (DeepFish). From experiments on both contrasting datasets, we demonstrate our network outperforms the few other deep learning models implemented for solving this task. By providing an open-source framework along with training data, our study puts forth an efficient deep learning template for crowd counting aquatic animals thereby contributing effective methods to assess natural populations from the ever-increasing visual data.

Subject(s)

Deep Learning, Animals, Benchmarking, Ecosystem, Fishes, Uncertainty
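The density-based regression idea in the fish-counting abstract above can be sketched generically: each annotated animal contributes a small blob that sums to one, so the sum over the whole map equals the count. The code below is a minimal illustration of that general technique, not the authors' released framework; the Gaussian kernel, the `sigma` value, the map size, and the function names are assumptions.

```python
import math


def density_map(points, height, width, sigma=2.0):
    """Build a density map with one Gaussian per annotated point.

    Each Gaussian is normalised over the image so it sums to 1,
    which makes the sum of the whole map equal to the number of points.
    The kernel shape and sigma are illustrative assumptions.
    """
    dmap = [[0.0] * width for _ in range(height)]
    for py, px in points:
        weights = [
            [math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)
        ]
        total = sum(map(sum, weights))
        for y in range(height):
            for x in range(width):
                dmap[y][x] += weights[y][x] / total
    return dmap


def count_from_density(dmap):
    """Recover the count by summing (integrating) the density map."""
    return sum(map(sum, dmap))


# Hypothetical annotations: three fish at (row, column) positions.
points = [(5, 5), (6, 7), (20, 12)]
dmap = density_map(points, 32, 32)
print(round(count_from_density(dmap), 6))
```

In a learned model the network predicts the map and the count is read off the same way, which tends to cope better with densely packed objects than detecting each one individually.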

Action Recognition Using Single-Pixel Time-of-Flight Detection.

Ofodile, Ikechukwu; Helmi, Ahmed; Clapés, Albert; Avots, Egils; Peensoo, Kerttu Maria; Valdma, Sandhra-Mirella; Valdmann, Andreas; Valtna-Lukner, Heli; Omelkov, Sergey; Escalera, Sergio; Ozcinar, Cagri; Anbarjafari, Gholamreza.

Entropy (Basel) ; 21(4), 2019 Apr 18.

Article in English | MEDLINE | ID: mdl-33267128

ABSTRACT

Action recognition is a challenging task that plays an important role in many robotic systems, which depend heavily on visual input feeds. However, due to privacy concerns, it is important to find a method that can recognise actions without using a visual feed. In this paper, we propose a concept for detecting actions while preserving the test subject's privacy. Our proposed method relies only on recording the temporal evolution of light pulses scattered back from the scene. The data trace recorded for one action contains a sequence of one-dimensional arrays of voltage values acquired by a single-pixel detector at a 1 GHz repetition rate. Information about both the distance to the object and its shape is embedded in the traces. We apply machine learning in the form of recurrent neural networks for data analysis and demonstrate successful action recognition. The experimental results show that our proposed method achieves on average 96.47% accuracy on the actions walking forward, walking backwards, sitting down, standing up and waving hand, using a recurrent neural network.
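To make the data flow in the abstract concrete, the sketch below runs a plain Elman-style recurrent network over a sequence of one-dimensional traces and outputs a probability for each of the five actions. It is only an untrained illustration: the paper's actual architecture is not described here, and the trace length, hidden size, random weights, and synthetic input are all assumptions.

```python
import math
import random

ACTIONS = ["walking forward", "walking backwards", "sitting down",
           "standing up", "waving hand"]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rnn_classify(trace_seq, w_in, w_rec, w_out):
    """Run an Elman RNN over the traces and return softmax class probabilities."""
    hidden = [0.0] * len(w_rec)
    for trace in trace_seq:
        pre = [a + b for a, b in zip(matvec(w_in, trace), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
    logits = matvec(w_out, hidden)
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
trace_len, hidden_size = 64, 16  # assumed sizes, not taken from the paper
w_in = make_matrix(hidden_size, trace_len, rng)
w_rec = make_matrix(hidden_size, hidden_size, rng)
w_out = make_matrix(len(ACTIONS), hidden_size, rng)

# Synthetic stand-in for a recorded action: 10 traces of 64 voltage samples each.
sequence = [[rng.random() for _ in range(trace_len)] for _ in range(10)]
probs = rnn_classify(sequence, w_in, w_rec, w_out)
print(dict(zip(ACTIONS, (round(p, 3) for p in probs))))
```

With trained weights, the highest-probability entry would be taken as the recognised action; here the weights are random, so the printed probabilities carry no meaning beyond showing the shape of the output.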
