Implementation of Feature Selection and Data Split using Brute Force to Improve Accuracy
This study classifies data using feature selection combined with a brute-force data split. The dataset contains irrelevant attributes, so feature selection affects both computing time and the classification model. The YouTube Spam Collection from the UCI repository was used for testing. It comprises five datasets with 1,956 real messages taken from five popular videos (Shakira, Katy Perry, Psy, Eminem, and LMFAO). The feature selection technique weights attributes by information gain to find the best ones. The dataset is then split into training and testing parts, using 70:30 and 30:70 ratios. The results are compared with standard C4.5 and Naive Bayes. The FS+BF+C4.5 approach achieves accuracies of 69.90%, 63.37%, 98.32%, 50.89%, and 91.75% on the five videos (Psy, Katy Perry, LMFAO, Eminem, and Shakira, respectively). Standard C4.5 achieves 66.99%, 59.41%, 95.80%, 50.89%, and 88.66% on the same videos, and Naive Bayes achieves 61.17%, 51.49%, 89.08%, 50.00%, and 79.38%. FS+BF+C4.5 obtains an overall average accuracy of 74.85%, which is 2.5 and 8.6 percentage points higher than C4.5 (72.35%) and Naive Bayes (66.22%), respectively. Using feature selection and brute force with C4.5 can therefore reduce classification error compared with standard C4.5 and Naive Bayes.
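To illustrate how weighting attributes by information gain can rank features, the following is a minimal sketch in plain Python. It is not the study's implementation: the function names (`entropy`, `information_gain`, `rank_features`), the three word features, and the six toy comments are all hypothetical and chosen only to show the calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(feature_values, labels):
    """Label entropy minus the weighted entropy after splitting on a feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels, feature_names):
    """Return (name, weight) pairs sorted by information gain, highest first."""
    weights = []
    for idx, name in enumerate(feature_names):
        column = [row[idx] for row in rows]
        weights.append((name, information_gain(column, labels)))
    return sorted(weights, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data: 1 means the word appears in the comment.
feature_names = ["subscribe", "song", "http"]
rows = [
    (1, 0, 1),
    (1, 0, 0),
    (0, 1, 0),
    (0, 1, 0),
    (1, 1, 1),
    (0, 0, 0),
]
labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

ranking = rank_features(rows, labels, feature_names)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

In this toy example the feature that separates the two classes perfectly receives the highest weight, and a selection step would keep only the top-ranked attributes before training the classifier.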