Enhanced Vision Transformer and Attention-Based Multi-Scale Feature Fusion for High-Precision Non-Invasive Gender Classification in Silkworm Cocoon Datasets
B.H. Gowramma, Assistant Professor, Bapuji Institute of Engineering and Technology, Davangere, Karnataka, India; Visvesvaraya Technological University, Belagavi, Karnataka, India. gow.paru@gmail.com; ORCID: 0009-0008-3568-1519
Dr. B. Poornima, Professor and Head, Department of Information Science and Engineering, Bapuji Institute of Engineering and Technology, Davangere, Karnataka, India; Visvesvaraya Technological University, Belagavi, Karnataka, India. poornimateju@gmail.com; ORCID: 0000-0002-7050-5528
Swetha Parvatha Reddy Chandrasekhara, Assistant Professor, B.M.S. College of Engineering, Bangalore, Karnataka, India. swethapc.reddy@gmail.com; ORCID: 0000-0003-3041-1284
Dr. M.S. Mrutyunjaya, Associate Professor and Head, Department of Computer Science and Engineering, R L Jalappa Institute of Technology, Doddaballapura, Karnataka, India. mrutyunjayams@gmail.com; ORCID: 0009-0009-8040-7743
Dr. Kusuma Lingaiah, Scientist, Bivoltine Breeding Laboratory, Central Sericultural Research and Training Institute, Central Silk Board, Srirampura, Mysuru, India. kusuma.lingiah@gmail.com; ORCID: 0000-0002-2442-9755
This research introduces AMF-ViT-CocoonNet, an Enhanced Vision Transformer with Attention-Based Multi-Scale Feature Fusion that achieves high-accuracy, non-invasive gender classification of silkworm cocoons. Existing hand-crafted feature and traditional machine learning methods are susceptible to luminance fluctuations and background distortions, resulting in poor robustness under real-world conditions. While CNN-based models are effective at feature learning, they operate primarily on local receptive fields and often fail to capture the long-range spatial dependencies necessary for fine-grained gender discrimination. Vision Transformers model global context well and are less sensitive to low-level textures, but typically carry higher computational costs. To overcome these shortcomings, the proposed architecture integrates hierarchical self-attention with an Improved Feed-Forward Network (IFFN) that employs depth-wise separable convolutions and channel attention to retain local texture details. In addition, an adaptive attention-guided multi-scale feature fusion mechanism combines discriminative features across hierarchical levels while suppressing redundant information, and an integrated token reduction stage keeps the model computationally efficient without sacrificing performance. The proposed model was evaluated on an updated dataset of 5,900 high-resolution images covering the CSR2 and CSR26 bivoltine breeds. Experimental results, validated through 5-fold cross-validation, show that AMF-ViT-CocoonNet achieves a peak classification accuracy of 97.48%, with a precision of 97.20% and a recall of 97.30%. These results represent a significant improvement over baseline CNN architectures and standard Vision Transformers, establishing a robust framework for automated, non-invasive cocoon sorting in the sericulture industry.
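To make the two named architectural ingredients concrete, the sketch below shows one plausible form of the IFFN block in PyTorch: patch tokens are folded back into a 2-D feature map so that a depth-wise separable convolution and a squeeze-and-excitation-style channel attention can restore the local texture cues that global self-attention tends to wash out. All class names, layer choices, and hyperparameters here are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel gating (assumed form)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # (B, C, 1, 1) channel summary
            nn.Conv2d(channels, channels // reduction, 1),  # squeeze
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # excite
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.gate(x)                             # reweight channels


class IFFN(nn.Module):
    """Improved Feed-Forward Network: a depth-wise 3x3 convolution and channel
    attention are inserted between the two point-wise projections of a
    standard transformer FFN to reintroduce local texture context."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.expand = nn.Conv2d(dim, hidden_dim, 1)            # point-wise expansion
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3,
                                padding=1, groups=hidden_dim)  # depth-wise 3x3
        self.channel_attn = ChannelAttention(hidden_dim)
        self.project = nn.Conv2d(hidden_dim, dim, 1)           # point-wise projection
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) patch tokens with N = h * w
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, h, w)              # tokens -> 2-D map
        x = self.act(self.expand(x))
        x = self.act(self.dwconv(x))
        x = self.channel_attn(x)
        x = self.project(x)
        return x.flatten(2).transpose(1, 2)                    # map -> tokens (B, N, C)

Similarly, a minimal reading of the attention-guided multi-scale fusion is a learned soft weighting over pooled per-stage features, so that less informative stages receive low weight; the exact mechanism in AMF-ViT-CocoonNet may differ, and the stage widths used below are arbitrary.

import torch
import torch.nn as nn


class AttentionGuidedFusion(nn.Module):
    """Each stage's pooled feature is projected to a shared width, scored,
    and the scores are softmax-normalised across stages before summation."""

    def __init__(self, stage_dims: list, out_dim: int):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, out_dim) for d in stage_dims])
        self.score = nn.Linear(out_dim, 1)

    def forward(self, feats: list) -> torch.Tensor:
        # feats: one globally pooled (B, dim_i) vector per transformer stage
        z = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=1)  # (B, S, D)
        w = torch.softmax(self.score(z), dim=1)                           # (B, S, 1)
        return (w * z).sum(dim=1)                                         # fused (B, D)


# Toy usage: fuse two stages of widths 192 and 384 into a 256-D descriptor.
fusion = AttentionGuidedFusion([192, 384], out_dim=256)
fused = fusion([torch.randn(8, 192), torch.randn(8, 384)])  # shape (8, 256)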