A Foundation-Level Multi-Modal Ophthalmic Model for Unified Cross-Modal Representation Learning
Savita Mamadapur, Research Scholar, Department of Computer Science and Engineering, FET, Jain (Deemed-to-be-University), Bangalore, Karnataka, India. savita.ec048@gmail.com, ORCID: 0009-0000-7451-4416
Dr. P. Manikandan, Professor, Department of Computer Science and Engineering, FET, Jain (Deemed-to-be-University), Bangalore, Karnataka, India. mani.p.mk@gmail.com, ORCID: 0000-0003-3037-7688
Dr. P. Renukadevi, Assistant Professor, Department of Computer Science and Engineering, FET, Jain (Deemed-to-be-University), Bangalore, Karnataka, India. pgrenu@gmail.com, ORCID: 0009-0001-9533-5860
To develop and test a foundation-level multimodal ophthalmic model that learns shared cross-modal representations for automated classification of eye diseases (normal, diabetic retinopathy, glaucoma, cataract) using the Kaggle Eye Diseases Classification fundus image dataset and textual descriptors. The proposed framework combines a modality-agnostic vision encoder, initialized via transfer learning and trained on 4,217 color fundus images across the four categories, with a lightweight text encoder that ingests label- and description-level textual tokens. Fundus features and text embeddings are aligned in a shared latent space through image-text contrastive objectives, following recent multimodal ophthalmic foundation models such as EyeCLIP and EyeFound. A classification head operating in this shared space performs multi-class disease prediction, enabling both image-only and image-plus-text inference. With extensive data augmentation, the proposed multimodal model reaches a test accuracy of 95% on the same Kaggle dataset, comparable to or slightly higher than current EfficientNet-B3-based and transformer-ensemble baselines, which report test accuracies of approximately 95%. The model achieves high macro-averaged precision, recall, and F1 scores across all four classes, and shows substantially less confusion between cataract and glaucoma than the single-modal CNN and transformer baselines. Ablation experiments demonstrate that removing either the text arm or the contrastive alignment objective degrades performance and class balance, confirming the benefit of learning cross-modal representations jointly, consistent with previous multimodal ophthalmic studies.

A multimodal ophthalmic model, built on a foundation-model design and trained on a single public Kaggle fundus dataset, can acquire unified cross-modal representations that support robust multi-class eye disease classification. The approach can be extended to other imaging modalities (e.g., OCT) and to richer clinical text, in line with the direction of large multimodal foundation models in ophthalmology, which makes it an effective starting point for scalable, real-world ophthalmic AI systems.
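To make the dual-encoder design concrete, the sketch below illustrates the general idea described in the abstract: a vision encoder and a lightweight text encoder project into a shared latent space, a CLIP-style image-text contrastive loss aligns the two modalities, and a classification head on the image embedding predicts the four fundus classes. This is a minimal illustrative sketch, not the authors' released code; the encoder architectures, embedding dimension, vocabulary size, and equal loss weighting are assumptions.

```python
# Minimal sketch (assumed architecture, not the authors' implementation) of a
# dual-encoder model with image-text contrastive alignment and a classification head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoderClassifier(nn.Module):
    def __init__(self, embed_dim=256, vocab_size=5000, num_classes=4):
        super().__init__()
        # Vision encoder: a small CNN stand-in for the transfer-learned backbone.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )
        # Lightweight text encoder: token embeddings mean-pooled and projected.
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.text_proj = nn.Linear(embed_dim, embed_dim)
        # Classification head on the image embedding (image-only inference path).
        self.classifier = nn.Linear(embed_dim, num_classes)
        # Learnable temperature, initialized near CLIP's log(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def encode_image(self, images):
        return F.normalize(self.vision(images), dim=-1)

    def encode_text(self, token_ids):
        pooled = self.token_emb(token_ids).mean(dim=1)
        return F.normalize(self.text_proj(pooled), dim=-1)

    def forward(self, images, token_ids, labels):
        img = self.encode_image(images)
        txt = self.encode_text(token_ids)
        # Image-text contrastive loss: matched pairs are positives, others negatives.
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(img.size(0), device=img.device)
        contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                             F.cross_entropy(logits.t(), targets))
        # Supervised multi-class disease prediction from the shared latent space.
        cls_loss = F.cross_entropy(self.classifier(img), labels)
        return contrastive + cls_loss  # equal weighting is an assumption

# Toy usage with random tensors standing in for fundus images and text tokens.
model = DualEncoderClassifier()
images = torch.randn(8, 3, 224, 224)
token_ids = torch.randint(0, 5000, (8, 16))
labels = torch.randint(0, 4, (8,))
loss = model(images, token_ids, labels)
loss.backward()
```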