- Akash Sundararaj
University of Massachusetts Dartmouth, Dartmouth, MA 02747 United States
asundararaj6@umassd.edu
- Gokhan Kul
University of Massachusetts Dartmouth, Dartmouth, MA 02747 United States
gkul@umassd.edu
Impact Analysis of Training Data Characteristics for Phishing Email Classification
E-mail is the most essential form of formal communication for organizations. However, phishing attacks delivered through e-mail remain a prevalent threat, and they continue to rise even though e-mail filters designed to stop them have become ubiquitous. Phishing is often one of the first steps of a major intrusion, such as an Advanced Persistent Threat (APT) attack or a ransomware attack. In this work, we examine the training data that phishing e-mail detectors use, in order to identify the dataset parameters that optimize phishing e-mail classifiers. To perform this assessment, we surveyed phishing e-mail detection methods in the literature and found that the majority of phishing e-mail detectors use either structural properties or text mining methods. Therefore, we analyze the optimal ratio of phishing to legitimate e-mails in the training data for these approaches. We design an experiment using the Enron dataset and a phishing e-mail collection to evaluate the effectiveness of these methods with varying numbers of legitimate and phishing e-mails, and to show empirically their strengths and weaknesses for specific data parameters. We show how balanced and unbalanced e-mail datasets influence the results produced by machine learning classifiers. Interestingly, unbalanced datasets yield better accuracy but consistently worse precision and recall than balanced datasets. The empirical results also suggest that phishing e-mail filters have not been perfected and that there is still room for development in this area. Our findings will help researchers avoid the common mistakes associated with this type of threat before building machine learning classifiers for this domain.
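The contrast between accuracy and precision/recall can be made concrete with a small sketch. The confusion-matrix counts below are hypothetical values invented purely for illustration; they are not results from our experiments, and the helper `metrics` is likewise an assumed name. The sketch only shows how a classifier evaluated on a set dominated by legitimate e-mails can score higher accuracy while doing worse on the phishing (positive) class.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) with phishing as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts, invented for illustration only (not measured results).
# Balanced: 1000 phishing and 1000 legitimate e-mails.
balanced = metrics(tp=900, fp=80, fn=100, tn=920)
# Unbalanced: 100 phishing and 1900 legitimate e-mails.
unbalanced = metrics(tp=60, fp=30, fn=40, tn=1870)

for name, (acc, prec, rec) in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(f"{name:>10}: accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

With these assumed counts, the large number of correctly classified legitimate e-mails dominates the accuracy of the unbalanced case, which is why accuracy alone can hide weak detection of the phishing class.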