The paper identifies several features that can be used to distinguish a phishing uniform resource locator (URL) from a legitimate one. A logistic regression filter based on these features achieves high accuracy (97.31 percent). An advantage of feature-based tests over content-based ones is that opening a page to observe its content may have undesired side effects, such as acknowledging receipt of a credit card. Google’s infrastructure data was used for testing, as well as for some of the features.
One set of features is page based (page rank and presence in the crawl database). Another feature is domain based (whether the URL’s domain appears on a list of known nonphishing sites). A third set concerns the hostname: whether the host is obfuscated, or whether a large number of characters follow the organization’s name in the hostname. Finally, there is a word-based feature: phishing URLs contain the words “login” and “signin” much more often than nonphishing URLs do. Six other words are also suggestive of a phishing URL.
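To make the URL-only features concrete, here is a minimal Python sketch of how some of them might be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `org` argument, and the `known_good_domains` set are hypothetical, a raw IP address is assumed to be one form of host obfuscation, and only the two words named in this review are checked. The page-based features (page rank, crawl database presence) depend on Google's data and are left out.

```python
import ipaddress
from urllib.parse import urlparse

# Only the two words named in the review; the six other words are not listed there.
SUSPICIOUS_WORDS = ("login", "signin")


def host_is_ip(host):
    """Return True if the host is a raw IP address (assumed form of obfuscation)."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def chars_after_org(host, org):
    """Count hostname characters that follow the organization's name, 0 if absent."""
    idx = host.find(org)
    if idx < 0:
        return 0
    return len(host) - (idx + len(org))


def url_features(url, org, known_good_domains):
    """Build a small feature dictionary for one URL (hypothetical feature names)."""
    host = (urlparse(url).hostname or "").lower()
    lowered = url.lower()
    return {
        "has_login_word": int(any(w in lowered for w in SUSPICIOUS_WORDS)),
        "host_is_ip": int(host_is_ip(host)),
        "chars_after_org": chars_after_org(host, org.lower()),
        "domain_known_good": int(host in known_good_domains),
    }


if __name__ == "__main__":
    # Made-up URL on a reserved test domain, for illustration only.
    print(url_features(
        "http://paypal.example-secure-update.test/signin",
        org="paypal",
        known_good_domains={"www.paypal.com"},
    ))
```

None of these checks fetches the page, which is the point of the feature-based approach: the URL is inspected as a string only.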
A logistic regression filter based on these features was trained on approximately 1,600 URLs and tested on 800 URLs, giving the 97.31 percent accuracy noted. The trained classifier analyzed Google data for 12 days and found that more than eight percent of the viewers of a phishing page are potential phishing victims. eBay and PayPal were the most popular phishing targets. The investigation appears valid and thorough, and the exposition is clear. This paper is worth reading.
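For readers unfamiliar with the classifier, the train-then-test procedure can be sketched in plain Python as follows. This is a toy illustration, not the paper's setup: the feature columns, the handful of data rows, the learning rate, and the epoch count are all made-up assumptions, and batch gradient descent stands in for whatever fitting procedure the authors used.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Label a feature vector as phishing (1) or not (0) at a 0.5 threshold."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def accuracy(w, b, X, y):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(X)


# Hypothetical columns: [has_login_word, host_is_ip, domain_known_good].
# Toy rows invented for illustration; 1 = phishing, 0 = not phishing.
X_train = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 1]]
y_train = [1, 1, 1, 0, 0, 0]
X_test = [[1, 1, 0], [0, 0, 1]]
y_test = [1, 0]

if __name__ == "__main__":
    w, b = train_logistic(X_train, y_train)
    print("held-out accuracy:", accuracy(w, b, X_test, y_test))
```

The accuracy figure reported in the paper is the same kind of measurement: the fraction of held-out URLs that the trained filter labels correctly.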