International Journal of Advanced Engineering Application

ISSN: 3048-6807

Transformer-Based Multimodal Fusion Model for Real-Time Object Understanding

Author(s): Emily Carter¹, Daniel Morgan², Sophia Hayes³

Affiliation: ¹,²,³Department of Computer Engineering, Lakeview Institute of Technology & Management, Denver, Colorado, USA

Page No: 20-27

Volume, Issue & Publishing Year: Volume 2, Issue 11, November 2025

Journal: International Journal of Advanced Engineering Application (IJAEA)

DOI: https://doi.org/10.5281/zenodo.17753308

Abstract:
Real-time object understanding is a critical requirement in intelligent computing applications such as autonomous navigation, industrial automation, smart surveillance, and human–machine interaction. Traditional unimodal learning systems rely on visual data alone, which limits their performance under adverse conditions such as occlusion, low lighting, and sensor noise. To address these challenges, this paper proposes a Transformer-Based Multimodal Fusion Model (TMFM) that integrates heterogeneous data sources, including RGB images, depth maps, audio cues, and sensor metadata, into a unified semantic understanding framework. The model employs modality-specific encoders followed by cross-attention-driven fusion layers, enabling effective alignment and interaction among features from different modalities. A shared transformer decoder then performs high-level reasoning over the fused representation to produce accurate object-level predictions. Experimental evaluation on benchmark multimodal datasets shows that TMFM improves object recognition accuracy by up to 18% over existing CNN- and RNN-based fusion architectures while maintaining real-time inference, owing to its parallel processing design. The proposed model therefore shows strong potential for deployment in next-generation intelligent systems that require fast, robust, and context-aware object understanding.

Keywords: Multimodal fusion, transformer model, real-time object understanding, cross-attention, intelligent systems, deep learning, sensor integration
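
To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the fusion idea: modality-specific encoders project each input stream into a shared embedding space, a cross-attention layer lets the visual tokens attend to the other modalities, and a shared transformer decoder reasons over the fused features using a set of learned object queries. The linear encoders, DETR-style object queries, classification head, and all dimensions and layer counts are illustrative assumptions for exposition; the paper's actual encoder backbones, fusion configuration, and training setup are not specified in this abstract.

```python
import torch
import torch.nn as nn


class TMFMSketch(nn.Module):
    """Illustrative sketch of a transformer-based multimodal fusion model.

    The per-modality linear encoders, layer counts, and all dimensions are
    assumptions for exposition, not the configuration reported in the paper.
    """

    def __init__(self, d_model=256, n_heads=4, n_queries=16, n_classes=10,
                 rgb_dim=512, depth_dim=256, audio_dim=128, meta_dim=32):
        super().__init__()
        # Modality-specific encoders: project each modality's token features
        # into a shared d_model-dimensional embedding space.
        self.encoders = nn.ModuleDict({
            "rgb":   nn.Linear(rgb_dim, d_model),
            "depth": nn.Linear(depth_dim, d_model),
            "audio": nn.Linear(audio_dim, d_model),
            "meta":  nn.Linear(meta_dim, d_model),
        })
        # Cross-attention fusion: visual tokens attend to the other modalities.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse_norm = nn.LayerNorm(d_model)
        # Shared transformer decoder with learned object queries (a DETR-style
        # assumption) performs high-level reasoning over the fused features.
        self.queries = nn.Parameter(torch.randn(n_queries, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, inputs):
        # inputs: dict of (batch, tokens, feature_dim) tensors, one per modality.
        tokens = {name: enc(inputs[name]) for name, enc in self.encoders.items()}
        context = torch.cat(
            [tokens["depth"], tokens["audio"], tokens["meta"]], dim=1)
        # RGB tokens query the concatenated non-visual context (cross-attention).
        attended, _ = self.cross_attn(tokens["rgb"], context, context)
        fused = self.fuse_norm(tokens["rgb"] + attended)
        # Object queries decode object-level representations from the fused memory.
        queries = self.queries.unsqueeze(0).expand(fused.size(0), -1, -1)
        decoded = self.decoder(queries, fused)
        return self.classifier(decoded)  # (batch, n_queries, n_classes)


if __name__ == "__main__":
    # Random features stand in for pre-extracted per-modality token sequences.
    model = TMFMSketch()
    batch = {
        "rgb":   torch.randn(2, 49, 512),   # e.g. a 7x7 visual feature grid
        "depth": torch.randn(2, 49, 256),
        "audio": torch.randn(2, 20, 128),
        "meta":  torch.randn(2, 4, 32),
    }
    print(model(batch).shape)  # torch.Size([2, 16, 10])
```

Because every modality is reduced to a common token format before fusion, the cross-attention and decoder stages run as standard batched transformer operations, which is what allows the parallel, real-time inference claimed in the abstract.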
