Hierarchical Attention and Semantic Refinement for Advanced Image Captioning
Maysoon Khazaal Abbas Maaroof, Department of Mathematics, College of Basic Education, University of Babylon, Babil. basic.maysoon.maroof@uobabylon.edu.iq (ORCID: 0000-0002-4035-0537)
Nuha Kareem Hameed Rasheed Al-Msarhed, Department of Information Security, College of Information Technology, University of Babylon, Babil. nuhakareem@uobabylon.edu.iq (ORCID: 0009-0005-8979-2140)
Farah Alaa A. Hassan, Department of Cyber Security, College of Information Technology, University of Babylon, Babil. inf883.frh.alaa@uobabylon.edu.iq (ORCID: 0009-0008-5082-5947)
Keywords: Image Captioning, Deep Learning, Hierarchical Attention, Semantic Refinement, Context-Aware Models, Vision-Language Integration, Knowledge Graphs, Computer Vision, Natural Language Processing.
Abstract
Automated image captioning, a pivotal task at the confluence of computer vision and natural language processing, strives to generate semantically rich and contextually accurate textual descriptions for visual scenes. Despite considerable progress with encoder-decoder architectures, contemporary models often exhibit limitations in capturing fine-grained visual details, understanding complex inter-object relationships, and maintaining robust semantic coherence, frequently resulting in generic or imprecise captions. This paper introduces the Hierarchical Context-Aware Attention and Semantic Refinement Network (HCASR-Net), a novel framework designed to address these persistent challenges. HCASR-Net integrates two core innovations: (1) a Hierarchical Context-Aware Attention (HCAA) mechanism that progressively fuses multi-scale visual features with the evolving textual context, enabling a more nuanced focus on both salient objects and subtle relational cues and improving feature utilization by an average of 9.5% according to gradient attribution analysis; and (2) a Semantic Refinement Module (SRM) that operates post-decoding and leverages a compact, learnable knowledge graph to iteratively refine generated captions, significantly reducing semantic inconsistencies and improving factual grounding, with a 15.2% reduction in identifiable semantic errors in a controlled study. Extensive evaluations on the MS COCO and Flickr30k benchmarks establish that HCASR-Net achieves new state-of-the-art performance, attaining a CIDEr score of 134.8 (a 1.0-point improvement over strong baselines) and a SPICE score of 23.6 (a 0.3-point improvement) on MS COCO. Qualitative assessments and rigorous human evaluation studies further underscore HCASR-Net's capacity to produce captions that are more detailed, contextually appropriate, and semantically sound, with human evaluators showing a clear preference for its outputs (42% vs. 31% for the next-best state-of-the-art model). This work advances image captioning by providing a robust mechanism for deeper visual-linguistic integration and post-hoc semantic validation.
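To make the HCAA idea concrete, the sketch below illustrates one plausible form of hierarchical context-aware attention: the decoder's current hidden state (the textual context) attends separately over two visual feature scales, and a learned gate fuses the two attended vectors into a single context vector. This is a minimal illustrative sketch, not the authors' released implementation; the class name, the choice of exactly two scales (region-level and grid-level), additive attention, and all dimensions are assumptions for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalContextAttention(nn.Module):
    """Illustrative sketch of a hierarchical context-aware attention step.

    Assumptions (not specified in the abstract): two visual scales
    (region-level and grid-level), additive attention per scale, and a
    learned gate that fuses the two attended vectors conditioned on the
    decoder's current hidden state.
    """

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        # Per-scale additive-attention projections and scoring layers.
        self.proj_feat = nn.ModuleList([nn.Linear(feat_dim, attn_dim) for _ in range(2)])
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.ModuleList([nn.Linear(attn_dim, 1) for _ in range(2)])
        # Gate deciding how much each scale contributes at this time step.
        self.gate = nn.Linear(hidden_dim + 2 * feat_dim, 2)

    def attend(self, scale_idx, feats, hidden):
        # feats: (B, N, feat_dim), hidden: (B, hidden_dim)
        energy = torch.tanh(
            self.proj_feat[scale_idx](feats) + self.proj_hidden(hidden).unsqueeze(1)
        )
        alpha = F.softmax(self.score[scale_idx](energy).squeeze(-1), dim=-1)  # (B, N)
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # (B, feat_dim)

    def forward(self, region_feats, grid_feats, hidden):
        # Attend over each scale independently, conditioned on the textual context.
        ctx_region = self.attend(0, region_feats, hidden)
        ctx_grid = self.attend(1, grid_feats, hidden)
        # Learned gate fuses the two scales into one context vector for the decoder.
        g = F.softmax(self.gate(torch.cat([hidden, ctx_region, ctx_grid], dim=-1)), dim=-1)
        return g[:, :1] * ctx_region + g[:, 1:] * ctx_grid


if __name__ == "__main__":
    attn = HierarchicalContextAttention(feat_dim=512, hidden_dim=512)
    regions = torch.randn(4, 36, 512)  # e.g. 36 detected object regions
    grids = torch.randn(4, 49, 512)    # e.g. 7x7 grid features
    hidden = torch.randn(4, 512)       # decoder hidden state at the current step
    print(attn(regions, grids, hidden).shape)  # torch.Size([4, 512])
```

In this reading, the "hierarchical" aspect comes from attending at multiple feature granularities, while the "context-aware" aspect comes from conditioning both the per-scale attention and the fusion gate on the decoder's evolving hidden state.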