An Analysis and Implementation of Methods for High Speed Lexical Classification of Malicious URLs
- Authors: Egan, Shaun P , Irwin, Barry V W
- Date: 2012
- Subjects: To be catalogued
- Language: English
- Type: text , article
- Identifier: http://hdl.handle.net/10962/429757 , vital:72637 , https://digifors.cs.up.ac.za/issa/2012/Proceedings/Research/58_ResearchInProgress.pdf
- Description: Several authors have put forward methods of using Artificial Neural Networks (ANN) to classify URLs as malicious or benign by using lexical features of those URLs. These methods have been compared to other methods of classification, such as blacklisting and spam filtering, and have been found to be as accurate. Early attempts also proved to be highly accurate. Fully featured classifications use lexical features as well as lookups to classify URLs; these lookups include (but are not limited to) blacklists, spam filters and reputation services. The lexical classifiers are based on the Online Perceptron Model, using a single neuron as a linear combiner, and use lexical features that rely on the presence (or lack thereof) of words belonging to a bag-of-words. Several obfuscation resistant features are also used to increase the positive classification rate of these perceptrons. Examples of these include URL length, number of directory traversals and length of arguments passed to the file within the URL. In this paper we describe how we implement the online perceptron model and the methods that we used to try to increase the accuracy of this model through the use of hidden layers and training cost validation. We discuss our results in relation to those of other papers, as well as other analyses performed on the training data and the neural networks themselves to best understand why they are so effective. We also describe the proposed model for developing these neural networks, how to implement them in the real world through the use of browser extensions, proxy plugins and spam filters for mail servers, and our current implementation. Finally, we describe work that is still in progress. This work includes other methods of increasing accuracy through the use of modern training techniques and testing in a real world environment.
- Date Issued: 2012
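The online perceptron described in the abstract above (a single neuron acting as a linear combiner over bag-of-words and obfuscation resistant features) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the vocabulary in `BAG_OF_WORDS`, the scaling constants, the learning rate and the function and class names are all assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical vocabulary; the abstract does not list the actual bag-of-words.
BAG_OF_WORDS = ["login", "secure", "update", "paypal", "account"]


def features(url):
    """Map a URL to a numeric feature vector (illustrative feature set)."""
    parts = urlsplit(url)
    tokens = set(re.split(r"[^a-z0-9]+", url.lower()))
    # Presence (1.0) or absence (0.0) of each vocabulary word.
    vec = [1.0 if word in tokens else 0.0 for word in BAG_OF_WORDS]
    vec.append(len(url) / 100.0)              # URL length (assumed scaling)
    vec.append(float(parts.path.count("/")))  # directory traversals
    vec.append(len(parts.query) / 100.0)      # length of arguments
    vec.append(1.0)                           # bias input
    return vec


class OnlinePerceptron:
    """Single neuron used as a linear combiner with a sign threshold."""

    def __init__(self, n_features, learning_rate=1.0):
        self.w = [0.0] * n_features
        self.lr = learning_rate

    def predict(self, x):
        score = sum(wi * xi for wi, xi in zip(self.w, x))
        return 1 if score > 0 else -1  # +1 malicious, -1 benign

    def update(self, x, y):
        """Adjust weights only on a mistake; return True if they changed."""
        if self.predict(x) != y:
            self.w = [wi + self.lr * y * xi for wi, xi in zip(self.w, x)]
            return True
        return False
```

Training is online: each labelled URL is converted with `features` and passed to `update` once as it arrives, so no batch of training data needs to be held in memory.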
Normandy: A Framework for Implementing High Speed Lexical Classification of Malicious URLs
- Authors: Egan, Shaun P , Irwin, Barry V W
- Date: 2012
- Language: English
- Type: text , article
- Identifier: http://hdl.handle.net/10962/427958 , vital:72476 , https://www.researchgate.net/profile/Barry-Ir-win/publication/326224974_Normandy_A_Framework_for_Implementing_High_Speed_Lexical_Classification_of_Malicious_URLs/links/5b3f21074585150d2309dd50/Normandy-A-Framework-for-Implementing-High-Speed-Lexical-Classification-of-Malicious-URLs.pdf
- Description: Research has shown that it is possible to classify malicious URLs using state-of-the-art techniques to train Artificial Neural Networks (ANN) using only lexical features of a URL. This approach is high speed and adds no overhead to classification, as it does not require look-ups from external services. This paper discusses our method for implementing and testing a framework which automates the generation of these neural networks, as well as the testing involved in trying to optimize the performance of these ANNs.
- Date Issued: 2012
An evaluation of lightweight classification methods for identifying malicious URLs
- Authors: Egan, Shaun P , Irwin, Barry V W
- Date: 2011
- Subjects: To be catalogued
- Language: English
- Type: text , article
- Identifier: http://hdl.handle.net/10962/429839 , vital:72644 , 10.1109/ISSA.2011.6027532
- Description: Recent research has shown that it is possible to identify malicious URLs through lexical analysis of their URL structures alone. This paper intends to explore the effectiveness of these lightweight classification algorithms when working with large real world datasets, including lists of malicious URLs obtained from Phishtank as well as largely filtered benign URLs obtained from proxy traffic logs. Lightweight algorithms are defined as methods by which URLs are analysed that do not use external sources of information such as WHOIS lookups, blacklist lookups and content analysis. The lexical parameters analysed include URL length, number of delimiters and the number of traversals through the directory structure, and are used throughout much of the research in the paradigm of lightweight classification. Methods which include external sources of information are often called fully featured classifications and have been shown to be only slightly more effective than a purely lexical analysis when considering both false-positives and false-negatives. Avoiding external lookups allows these algorithms to be run client side without the introduction of additional latency, while still providing a high level of accuracy through the use of modern techniques in training classifiers. Analysis of this type will also be useful in incident response, where large numbers of URLs need to be filtered for potentially malicious URLs as an initial step in information gathering, as well as in end-user implementations such as browser extensions, which could help protect the user from following potentially malicious links. Both AROW and CW classifier update methods will be used as prototype implementations and their effectiveness will be compared to fully featured analysis results. These methods are interesting because they are able to train on any labelled data, including instances in which their prediction is correct, allowing them to build a confidence in specific lexical features. This makes it possible for them to be trained using noisy input data, making them ideal for real world applications such as link filtering and information gathering.
- Date Issued: 2011
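The lightweight parameters named in the abstract above (URL length, number of delimiters and directory traversals) can all be computed from the URL string itself, with no WHOIS, blacklist or content lookups. The sketch below is one possible reading, not the authors' code: the delimiter set, the interpretation of "traversals" as the number of non-empty path segments, and the name `lexical_features` are assumptions.

```python
from urllib.parse import urlsplit

# Assumed delimiter set; the abstract does not enumerate the delimiters used.
DELIMITERS = set(".-_/?=&;:@%")


def lexical_features(url):
    """Compute purely lexical features; no network access is performed."""
    parts = urlsplit(url)
    # Non-empty path segments, used here as a proxy for directory traversals.
    segments = [s for s in parts.path.split("/") if s]
    return {
        "url_length": len(url),
        "delimiter_count": sum(1 for ch in url if ch in DELIMITERS),
        "path_depth": len(segments),
    }


if __name__ == "__main__":
    print(lexical_features("http://example.test/a/b/c.php?id=1&x=2"))
```

Because every value is derived from the string alone, the function can run client side (for example inside a browser extension) without adding lookup latency.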
High Speed Lexical Classification of Malicious URLs
- Authors: Egan, Shaun P , Irwin, Barry V W
- Date: 2011
- Language: English
- Type: text , article
- Identifier: http://hdl.handle.net/10962/428055 , vital:72483 , https://www.researchgate.net/profile/Barry-Ir-win/publication/326225046_High_Speed_Lexical_Classification_of_Malicious_URLs/links/5b3f20acaca27207851c60f9/High-Speed-Lexical-Classification-of-Malicious-URLs.pdf
- Description: It has been shown in recent research that it is possible to identify malicious URLs through lexical analysis of their URL structures alone. Lightweight algorithms are defined as methods by which URLs are analyzed that do not use external sources of information such as WHOIS lookups, blacklist lookups and content analysis. The lexical parameters analyzed include URL length, number of delimiters and the number of traversals through the directory structure, and are used throughout much of the research in the paradigm of lightweight classification. Methods which include external sources of information are often called fully featured classifications and have been shown to be only slightly more effective than a purely lexical analysis when considering both false-positives and false-negatives. Avoiding external lookups allows these algorithms to be run client side without the introduction of additional latency, while still providing a high level of accuracy through the use of modern techniques in training classifiers. Both AROW and CW classifier update methods will be used as prototype implementations and their effectiveness will be compared to fully featured analysis results. These methods are selected because they are able to train on any labeled data, including instances in which their prediction is correct, allowing them to build a confidence in specific lexical features.
- Date Issued: 2011
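The property highlighted in the abstract above, that AROW can learn from correctly predicted examples and builds a per-feature confidence, can be illustrated with a small sketch of the AROW update rule. It uses a diagonal covariance, which is a common simplification and an assumption here; the regularisation value `r=1.0` and the class name are likewise illustrative, and the CW update is not shown.

```python
class DiagonalAROW:
    """AROW-style online learner with per-feature variances (diagonal case)."""

    def __init__(self, n_features, r=1.0):
        self.mu = [0.0] * n_features     # mean weight vector
        self.sigma = [1.0] * n_features  # variance per feature; lower = more confident
        self.r = r                       # regularisation parameter (assumed value)

    def score(self, x):
        return sum(m * xi for m, xi in zip(self.mu, x))

    def predict(self, x):
        return 1 if self.score(x) > 0 else -1  # +1 malicious, -1 benign

    def update(self, x, y):
        """Update whenever the margin is below 1, even if the sign is right."""
        margin = y * self.score(x)
        if margin >= 1.0:
            return False
        # Uncertainty of this example under the current variances.
        variance = sum(s * xi * xi for s, xi in zip(self.sigma, x))
        beta = 1.0 / (variance + self.r)
        alpha = (1.0 - margin) * beta
        # Move the weights further along features we are less sure about.
        self.mu = [m + alpha * y * s * xi
                   for m, s, xi in zip(self.mu, self.sigma, x)]
        # Shrink the variance of every feature that appeared in x.
        self.sigma = [s - beta * (s * xi) ** 2
                      for s, xi in zip(self.sigma, x)]
        return True
```

Features that have been seen often end up with small variances and therefore change little on later updates, which is one way to read the "confidence in specific lexical features" mentioned in the abstract.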