ABSTRACT
We investigate an edge-computing scenario for robot control, where two similar neural networks are running on one computational node. We test the feasibility of using a single object-detection model (YOLOv5) with the benefit of reduced computational resources against the potentially more accurate independent and specialized models. Our results show that using one single convolutional neural network (for object detection and hand-gesture classification) instead of two separate ones can reduce resource usage by almost 50%. For many classes, we observed an increase in accuracy when using the model trained with more labels. For small datasets (a few hundred instances per label), we found that it is advisable to add labels with many instances from another dataset to increase detection accuracy.
Subject(s)
Gestures , Running , Hand , Neural Networks, Computer , Upper ExtremityABSTRACT
Multimodal user interfaces promise natural and intuitive human-machine interactions. However, is the extra effort for the development of a complex multisensor system justified, or can users also be satisfied with only one input modality? This study investigates interactions in an industrial weld inspection workstation. Three unimodal interfaces, including spatial interaction with buttons augmented on a workpiece or a worktable, and speech commands, were tested individually and in a multimodal combination. Within the unimodal conditions, users preferred the augmented worktable, but overall, the interindividual usage of all input technologies in the multimodal condition was ranked best. Our findings indicate that the implementation and the use of multiple input modalities is valuable and that it is difficult to predict the usability of individual input modalities for complex systems.