An Enhanced Attention-Based Deep Learning System for Text Detection and Information Retrieval from Images: Exploiting Transformer Architectures and Multi-Modal Fusion
Subhakarrao Golla, Research Scholar, Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Kakinada, Kakinada, Andhra Pradesh, India. Email: subhakar.golla@gmail.com. ORCID: 0009-0008-6780-1968
Dr. B. Sujatha, Professor, Department of Computer Science and Engineering, Godavari Institute of Engineering & Technology, Rajahmundry, Andhra Pradesh, India. Email: bsujatha@giet.ac.in. ORCID: 0000-0001-9433-3647
Dr. L. Sumalatha, Professor, Department of Computer Science and Engineering, Jawaharlal Nehru Technological University Kakinada, Kakinada, Andhra Pradesh, India. Email: sumalatha.lingamgunta@gmail.com. ORCID: 0000-0002-8113-9340
Keywords: Text Detection, Vision Transformer, Scene Text Recognition, Graph Convolutional Networks, Multi-Modal Fusion, Cross-Modal Attention.
Abstract
This study builds on prior work in text detection and recognition from natural images and extends it substantially to meet the demands of complex visual scenes, with a focus on improving robustness and adaptability in real-world environments. The proposed system combines transformer-based architectures with multi-modal fusion strategies to achieve reliable detection and recognition in TMT. It adopts a Vision Transformer (ViT) backbone and employs a Cross-Modal Attention Module (CMAM) to exploit information from both the visual and semantic perspectives; this dual-stream processing improves both localization and recognition accuracy. Experimental results show significant gains, with an average precision of 96.8% for text detection and 94.3% accuracy for character recognition, improvements of 4.5% and 5.9% over previous work. The framework is notably robust to challenging scenarios involving extreme lighting conditions, severe occlusions, and heavily stylized text, and it generalizes well across diverse datasets and conditions. In addition, end-to-end inference has been tuned to approximately 52 ms per image, making the system suitable for real-time applications and practical deployment in time-sensitive settings. This paper sets new state-of-the-art benchmarks in scene text understanding and retrieval, with significant potential for automatic document processing, assistive devices for the visually impaired, and augmented reality applications.
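The abstract names a Cross-Modal Attention Module (CMAM) that lets visual tokens attend over semantic tokens, but gives no equations. The following is a minimal NumPy sketch of single-head scaled dot-product cross-attention under that assumption; the function and weight names (`cross_modal_attention`, `w_q`, `w_k`, `w_v`) are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_modal_attention(visual, semantic, w_q, w_k, w_v):
    """Single-head cross-attention sketch (assumed CMAM form).

    visual:   (n_v, d) visual token embeddings -> queries
    semantic: (n_s, d) semantic token embeddings -> keys/values
    Returns fused visual features of shape (n_v, d_v).
    """
    q = visual @ w_q                             # (n_v, d_k)
    k = semantic @ w_k                           # (n_s, d_k)
    v = semantic @ w_v                           # (n_s, d_v)
    scores = q @ k.T / np.sqrt(w_q.shape[1])     # (n_v, n_s) similarity
    weights = softmax(scores)                    # each visual token's attention
    return weights @ v                           # semantic-informed visual features
```

Each visual token is re-expressed as a weighted mixture of semantic-stream values, which is one plausible way a dual-stream design could inject semantic context into the detection and recognition heads.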