Implementation of Feature Selection and Data Split using Brute Force to Improve Accuracy
Mahmud Mustapa, Associate Professor, Department of Electronic Engineering Education, Universitas Negeri Makassar, mahmud.mustapa@unm.ac.id, ORCID 0000-0001-8974-9728
Ummiati Rahmah, Associate Professor, Department of Electronic Engineering Education, Universitas Negeri Makassar, ummiati.rahmah@unm.ac.id, ORCID 0009-0006-5102-4119
Pandu Adi Cakranegara, Assistant Professor, Universitas Presiden, pandu.cakranegara@president.ac.id, ORCID 0000-0001-8754-3646
Winci Firdaus, Badan Riset dan Inovasi, wincifirdaus1@gmail.com, ORCID 0000-0002-8261-4211
Dendi Pratama, Lecturer, Politeknik Bina Madani, dendi@poltekbima.ac.id, ORCID 0000-0003-2002-6358
Robbi Rahim, Lecturer, Sekolah Tinggi Ilmu Manajemen Sukma, usurobbi85@zoho.com, ORCID 0000-0001-6119-867X
This study classifies data using feature selection combined with brute force. The dataset contains irrelevant attributes, so feature selection affects both computing time and the resulting classification model. Testing used the YouTube Spam Collection from the UCI repository, which comprises five datasets with 1,956 real messages drawn from five popular videos (Shakira, Katy Perry, Psy, Eminem, and LMFAO). The feature selection technique weights attributes by information gain to find the best ones. The dataset is then split into two parts, training and testing, at a 70:30 ratio. The results are compared against standard C4.5 and Naive Bayes. For the five videos (Psy, Katy Perry, LMFAO, Eminem, and Shakira), the FS+BF+C4.5 approach achieves accuracies of 69.90%, 63.37%, 98.32%, 50.89%, and 91.75%, respectively. Standard C4.5 achieves 66.99%, 59.41%, 95.80%, 50.89%, and 88.66%, and Naive Bayes achieves 61.17%, 51.49%, 89.08%, 50.00%, and 79.38%. FS+BF+C4.5 obtains an overall average accuracy of 74.85%, which is 2.5 and 8.6 percentage points higher than C4.5 (72.35%) and Naive Bayes (66.22%), respectively. Using feature selection and brute force with C4.5 can therefore reduce classification error compared with standard C4.5 and Naive Bayes.
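To make the attribute-weighting step concrete, the following is a minimal Python sketch of weighting by information gain. The abstract does not give the formula or the tooling used, so the sketch assumes the standard entropy-based definition of information gain. The toy labels and the feature names `has_link` and `has_thanks` are hypothetical and are not taken from the YouTube Spam Collection.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after
    partitioning the rows by the value of one feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data (assumption): four comments, two binary features.
labels = ["spam", "spam", "ham", "ham"]
features = {
    "has_link": [1, 1, 0, 0],    # separates the two classes perfectly
    "has_thanks": [1, 0, 1, 0],  # carries no information about the class
}

# Weight every feature, then rank from most to least informative.
weights = {name: information_gain(values, labels)
           for name, values in features.items()}
ranking = sorted(weights, key=weights.get, reverse=True)

for name in ranking:
    print(f"{name}: {weights[name]:.3f}")
```

In a pipeline like the one described, the highest-weighted attributes would be kept and the rest discarded before training the classifier. How many attributes to keep is a design choice that the abstract does not specify.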