Towards Detecting and Classifying Malicious URLs Using Deep Learning
Emails containing Uniform Resource Locators (URLs) pose substantial risks to organizations: general and spear-phishing campaigns aimed at employees can compromise both credentials and network security. The detection and classification of malicious URLs is therefore an important research problem with practical applications. With an appropriate machine learning model, an organization may protect itself by filtering incoming emails and the websites its employees visit, based on the maliciousness of the URLs contained in those emails and web pages. In this work, we compare the performance of traditional machine learning algorithms, such as Random Forest, CART, and kNN, against models built with popular deep learning frameworks, such as Fast.ai and Keras-TensorFlow, across CPU, GPU, and TPU architectures. Using the publicly available ISCX-URL-2016 dataset, we present the models' performance in binary and multiclass classification experiments. From the collected accuracy and timing metrics, we find that the Random Forest, Keras-TensorFlow, and Fast.ai models performed comparably, achieving the highest accuracies (above 96%) in both the detection and the classification of malicious URLs, with Random Forest as the preferable model given time, performance, and complexity constraints. Additionally, by ranking features and applying feature selection techniques, we determine that the top 5-10 features provide the best performance compared with using all the features provided in the dataset.
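
To make the comparison described above concrete, the following is a minimal sketch of how accuracy and training time might be collected for the three traditional classifiers, and how a top-k feature subset might be chosen by ranking. It is an illustration under stated assumptions, not the pipeline used in this work: it relies on scikit-learn, substitutes synthetic placeholder data for the ISCX-URL-2016 features, uses scikit-learn's `DecisionTreeClassifier` as a stand-in for CART, and ranks features by Random Forest importance, which is only one of several possible ranking techniques. The function names, hyperparameters, and data sizes are hypothetical.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def make_placeholder_data(n_samples=600, n_features=30, seed=0):
    """Synthetic stand-in for a URL feature table (NOT the ISCX-URL-2016 data)."""
    return make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=8,
        n_redundant=4,
        random_state=seed,
    )


def compare_models(X_train, X_test, y_train, y_test, seed=0):
    """Fit each classifier and return {name: (test_accuracy, fit_seconds)}."""
    models = {
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=seed),
        # scikit-learn's decision tree is an optimized CART variant.
        "CART": DecisionTreeClassifier(random_state=seed),
        "kNN": KNeighborsClassifier(n_neighbors=5),
    }
    results = {}
    for name, model in models.items():
        start = time.perf_counter()
        model.fit(X_train, y_train)
        fit_seconds = time.perf_counter() - start
        results[name] = (model.score(X_test, y_test), fit_seconds)
    return results


def top_k_features(X_train, y_train, k, seed=0):
    """Rank features by Random Forest importance; return indices of the top k."""
    forest = RandomForestClassifier(n_estimators=100, random_state=seed)
    forest.fit(X_train, y_train)
    ranked = forest.feature_importances_.argsort()[::-1]  # most important first
    return ranked[:k]


if __name__ == "__main__":
    X, y = make_placeholder_data()
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y
    )

    # Baseline: all features.
    for name, (acc, secs) in compare_models(X_tr, X_te, y_tr, y_te).items():
        print(f"all features  {name:14s} acc={acc:.3f} fit={secs:.3f}s")

    # Same comparison restricted to the 10 highest-ranked features.
    idx = top_k_features(X_tr, y_tr, k=10)
    reduced = compare_models(X_tr[:, idx], X_te[:, idx], y_tr, y_te)
    for name, (acc, secs) in reduced.items():
        print(f"top-10        {name:14s} acc={acc:.3f} fit={secs:.3f}s")
```

Because the data here is synthetic, the printed numbers say nothing about malicious URL detection. The sketch only shows the shape of the experiment: the same train/test split is reused for every model, and the feature ranking is computed on the training portion alone so that the held-out set does not influence which features are kept.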