Enhanced Vision Transformer and Attention-Based Multi-Scale Feature Fusion for High-Precision Non-Invasive Gender Classification in Silkworm Cocoon Datasets
B.H. Gowramma, Assistant Professor, Bapuji Institute of Engineering and Technology, Davangere, Karnataka, India; Visvesvaraya Technological University, Belagavi, Karnataka, India. gow.paru@gmail.com; ORCID: 0009-0008-3568-1519
Dr. B. Poornima, Professor and Head, Department of Information Science and Engineering, Bapuji Institute of Engineering and Technology, Davangere, Karnataka, India; Visvesvaraya Technological University, Belagavi, Karnataka, India. poornimateju@gmail.com; ORCID: 0000-0002-7050-5528
Swetha Parvatha Reddy Chandrasekhara, Assistant Professor, B.M.S. College of Engineering, Bangalore, Karnataka, India. swethapc.reddy@gmail.com; ORCID: 0000-0003-3041-1284
Dr. M.S. Mrutyunjaya, Associate Professor and Head, Department of Computer Science and Engineering, R L Jalappa Institute of Technology, Doddaballapura, Karnataka, India. mrutyunjayams@gmail.com; ORCID: 0009-0009-8040-7743
Dr. Kusuma Lingaiah, Scientist, Bivoltine Breeding Laboratory, Central Sericultural Research and Training Institute, Central Silk Board, Srirampura, Mysuru, India. kusuma.lingiah@gmail.com; ORCID: 0000-0002-2442-9755
This research introduces AMF-ViT-CocoonNet, an Enhanced Vision Transformer with Attention-Based Multi-Scale Feature Fusion that achieves high-accuracy, non-invasive gender classification of silkworm cocoons. Existing hand-crafted feature and traditional machine learning methods are susceptible to luminance fluctuations and background distortions, resulting in poor robustness under real-world conditions. While CNN-based models are effective at feature learning, they operate primarily on local receptive fields and often fail to capture the long-range spatial dependencies necessary for fine-grained gender discrimination. Vision Transformers model global context well and are less sensitive to low-level textures, but typically carry higher computational costs. To overcome these shortcomings, the proposed architecture integrates hierarchical self-attention with an Improved Feed-Forward Network (IFFN) that employs depth-wise separable convolutions and channel attention to retain local texture details. In addition, an adaptive attention-guided multi-scale feature fusion mechanism combines discriminative features across hierarchical levels while suppressing redundant information, and an integrated token reduction stage keeps the model computationally efficient without sacrificing performance. The proposed model was evaluated on an updated dataset of 5,900 high-resolution images covering the CSR2 and CSR26 bivoltine breeds. Experimental results, validated through 5-fold cross-validation, show that AMF-ViT-CocoonNet achieves a peak classification accuracy of 97.48%, with a precision of 97.20% and a recall of 97.30%. These results represent a significant improvement over baseline CNN architectures and standard Vision Transformers, establishing a robust framework for automated, non-invasive cocoon sorting in the sericulture industry.
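To make the two named architectural ingredients concrete, the sketch below shows one plausible form of the IFFN block in PyTorch: patch tokens are folded back into a 2-D feature map so that a depth-wise separable convolution and a squeeze-and-excitation-style channel attention can restore the local texture cues that global self-attention tends to wash out. All class names, layer choices, and hyperparameters here are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel gating (assumed form)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, 1, 1) channel summary
            nn.Conv2d(channels, channels // reduction, 1),  # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excite
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)                             # reweight channels


class IFFN(nn.Module):
    """Improved Feed-Forward Network: a depth-wise 3x3 convolution and channel
    attention are inserted between the two point-wise projections of a
    standard transformer FFN to reintroduce local texture context."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.expand = nn.Conv2d(dim, hidden_dim, 1)            # point-wise expansion
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3,
                                padding=1, groups=hidden_dim)  # depth-wise 3x3
        self.channel_attn = ChannelAttention(hidden_dim)
        self.project = nn.Conv2d(hidden_dim, dim, 1)           # point-wise projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) patch tokens with N = h * w
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)              # tokens -> 2-D map
        x = self.act(self.expand(x))
        x = self.act(self.dwconv(x))
        x = self.channel_attn(x)
        x = self.project(x)
        return x.flatten(2).transpose(1, 2)                    # map -> tokens (B, N, C)

Similarly, a minimal reading of the attention-guided multi-scale fusion is a learned soft weighting over pooled per-stage features, so that less informative stages receive low weight; the exact mechanism in AMF-ViT-CocoonNet may differ, and the stage widths used below are arbitrary.

import torch
import torch.nn as nn


class AttentionGuidedFusion(nn.Module):
    """Each stage's pooled feature is projected to a shared width, scored,
    and the scores are softmax-normalised across stages before summation."""

    def __init__(self, stage_dims: list, out_dim: int):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in stage_dims])
        self.score = nn.Linear(out_dim, 1)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: one globally pooled (B, dim_i) vector per transformer stage
        z = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, S, D)
        w = torch.softmax(self.score(z), dim=1)                           # (B, S, 1)
        return (w * z).sum(dim=1)                                         # fused (B, D)


# Toy usage: fuse two stages of widths 192 and 384 into a 256-D descriptor.
fusion = AttentionGuidedFusion([192, 384], out_dim=256)
fused = fusion([torch.randn(8, 192), torch.randn(8, 384)])  # shape (8, 256)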