Lightweight Vision Transformer Framework for Real-Time Human–Object Interaction Recognition
Author(s): Michael Turner¹, Olivia Reed², Ethan Walker³
Affiliation: ¹,²,³Department of Computer Engineering, Westbridge Institute of Technology, Wellington, New Zealand
Page No.: 28–33
Volume, Issue & Publishing Year: Volume 2, Issue 11, Nov-2025
Journal: International Journal of Advanced Engineering Application (IJAEA)
ISSN NO: 3048-6807
DOI: https://doi.org/10.5281/zenodo.17753367
Abstract:
Human–Object Interaction (HOI) recognition is a fundamental task in intelligent computing systems, enabling machines to understand how humans engage with surrounding objects in real-time environments. Traditional deep learning approaches to HOI rely heavily on convolutional architectures, which often struggle to model long-range dependencies and are computationally expensive for edge deployment. This paper proposes a Lightweight Vision Transformer Framework (LVTF) designed specifically for efficient and accurate real-time HOI recognition. The framework employs a patch-based visual encoder combined with optimized multi-head attention mechanisms to capture global contextual relationships between humans and objects. A lightweight decoder further refines these representations to generate interaction labels with minimal latency. Experimental evaluations on benchmark HOI datasets demonstrate that LVTF achieves competitive accuracy while reducing computational complexity by nearly 40% compared with conventional transformer- and CNN-based models. The reduced model footprint and low inference delay make the proposed approach well suited for real-time intelligent applications, including smart surveillance, assistive robotics, and human–computer interaction systems.
Keywords: Vision transformer, human–object interaction, real-time recognition, lightweight architecture, attention mechanism, intelligent systems.
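As a rough illustration of the encoder–attention–decoder pipeline described in the abstract, the following PyTorch sketch combines a patch-based embedding, a compact multi-head attention encoder, and a small decoder head that outputs interaction logits. This is a minimal sketch under assumed settings: the patch size, embedding width, depth, head count, and the number of interaction classes (`num_interactions=60`) are illustrative placeholders, not the configuration reported for LVTF.

```python
# Minimal sketch of a lightweight ViT-style HOI classifier (assumed hyperparameters).
import torch
import torch.nn as nn


class LightweightHOITransformer(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=192,
                 depth=4, heads=3, num_interactions=60):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch-based visual encoder: non-overlapping patches -> linear embeddings.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Compact multi-head attention encoder modelling global human-object context.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 2,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Lightweight decoder head mapping the pooled representation to interaction labels.
        self.decoder = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, num_interactions))

    def forward(self, images):
        # images: (B, 3, H, W) -> patch tokens (B, N, dim)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        # Interaction logits taken from the class token.
        return self.decoder(tokens[:, 0])


if __name__ == "__main__":
    model = LightweightHOITransformer()
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)  # torch.Size([2, 60])
```

The small embedding width and shallow depth keep the parameter count and attention cost low, which is the general route such lightweight designs take toward real-time inference on edge hardware.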
