Malicious Website detection using machine learning

Mar 29, 2019| Category: Security| Tags: Security, Machine Learning, Artificial Intelligence

A malicious website is a site that attempts to install malware on your device. Usually this requires some action on your part; however, in the case of a drive-by download, the website attempts to install software without your permission. Malicious websites often look like legitimate ones, and your anti-virus software may fail to detect them because attackers deliberately craft them to evade detection.

To avoid this, is it possible to detect a suspicious website using machine learning?
Let us look at a dataset on Kaggle 1 and see if we can use the power of machine learning to detect malicious websites.

Table of Contents

1) Descriptive Statistics

2) Exploratory data analysis and data cleaning

3) Machine Learning

4) Summary

In [2]:
#Importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score
from sklearn import metrics
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
In [3]:
#We will read the URL dataset and perform some EDA
df = pd.read_csv('dataset.csv')

Descriptive Statistics

We have a labeled dataset of websites, each marked as malicious or benign. Each website was run through a feature generator that records network- and application-layer information. The goal is to predict whether a website is malicious.

Using the features generated from a website, is it possible to make use of machine learning to detect a malicious website?

In [4]:
df.head()
Out[4]:
URL URL_LENGTH NUMBER_SPECIAL_CHARACTERS CHARSET SERVER CONTENT_LENGTH WHOIS_COUNTRY WHOIS_STATEPRO WHOIS_REGDATE WHOIS_UPDATED_DATE ... DIST_REMOTE_TCP_PORT REMOTE_IPS APP_BYTES SOURCE_APP_PACKETS REMOTE_APP_PACKETS SOURCE_APP_BYTES REMOTE_APP_BYTES APP_PACKETS DNS_QUERY_TIMES Type
0 M0_109 16 7 iso-8859-1 nginx 263.0 None None 10/10/2015 18:21 None ... 0 2 700 9 10 1153 832 9 2.0 1
1 B0_2314 16 6 UTF-8 Apache/2.4.10 15087.0 None None None None ... 7 4 1230 17 19 1265 1230 17 0.0 0
2 B0_911 16 6 us-ascii Microsoft-HTTPAPI/2.0 324.0 None None None None ... 0 0 0 0 0 0 0 0 0.0 0
3 B0_113 17 6 ISO-8859-1 nginx 162.0 US AK 7/10/1997 4:00 12/09/2013 0:45 ... 22 3 3812 39 37 18784 4380 39 8.0 0
4 B0_403 17 6 UTF-8 None 124140.0 US TX 12/05/1996 0:00 11/04/2017 0:00 ... 2 5 4278 61 62 129889 4586 61 4.0 0

5 rows × 21 columns

The dataset consists of the URL and its generated features, along with a label indicating whether the website is malicious or not.

Let us look at the description of the major attributes in the dataset:
• URL: anonymized URL, which may or may not be malicious.
• URL_LENGTH: the number of characters in the URL.
• NUMBER_SPECIAL_CHARACTERS: the number of special characters in the URL.
• CHARSET: the character encoding standard; a categorical variable.
• SERVER: the operating system of the server, obtained from the packet response; a categorical variable.
• CONTENT_LENGTH: the content size of the HTTP header.
• WHOIS_COUNTRY: the country of the server, obtained via the Whois API; a categorical variable.
• WHOIS_STATEPRO: the state/province of the server, obtained via the Whois API; a categorical variable.
• WHOIS_REGDATE: the Whois server registration date.
• WHOIS_UPDATED_DATE: the date of the last update of the analyzed server.
• TCP_CONVERSATION_EXCHANGE: the number of TCP packets exchanged between the honeypot client and the server.
• DIST_REMOTE_TCP_PORT: the number of distinct ports detected.
• REMOTE_IPS: the total number of IPs connected to the honeypot client.
• APP_BYTES: the number of bytes transferred.
• SOURCE_APP_PACKETS: the packets sent from the honeypot to the server.
• REMOTE_APP_PACKETS: the packets received from the server.
• APP_PACKETS: the total number of IP packets generated during communication between the honeypot and the server.
• DNS_QUERY_TIMES: the number of DNS packets generated during communication between the honeypot and the server.
• TYPE: the outcome variable, indicating whether the website is malicious (1) or benign (0).

Exploratory data analysis and Data cleaning

In [5]:
# Let us handle the null values
df.drop('URL', axis = 1, inplace=True)
df = df.interpolate()
df['SERVER'].fillna('Rare', inplace=True)
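As a quick aside, the two cleaning steps above (linear interpolation of numeric gaps, and a catch-all 'Rare' category for missing servers) can be illustrated on a small toy frame. This is a sketch on made-up data, not the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame (not the real dataset) illustrating the cleaning steps:
# numeric gaps are linearly interpolated, and missing SERVER labels
# are collapsed into a single 'Rare' category.
toy = pd.DataFrame({
    'CONTENT_LENGTH': [100.0, np.nan, 300.0],
    'SERVER': ['nginx', None, 'Apache'],
})
toy['CONTENT_LENGTH'] = toy['CONTENT_LENGTH'].interpolate()  # 200.0 fills the gap
toy['SERVER'] = toy['SERVER'].fillna('Rare')
print(toy)
```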
In [6]:
#Let us look at the distribution of the outcome variable
plt.figure(figsize=(7,5))
sns.countplot(df['Type'])
plt.show()

We can see that most of the websites are benign while some are malicious. Let us see whether there is a relationship between the different attributes and the type of website.

In [8]:
malicious = df[df['Type'] == 1]
benign = df[df['Type'] == 0]
var = ["URL_LENGTH","NUMBER_SPECIAL_CHARACTERS","DNS_QUERY_TIMES","REMOTE_IPS",'APP_PACKETS']
for columns in var:
    plt.figure()
    sns.distplot(malicious[columns], rug= False,label='Malicious', hist = False)
    sns.distplot(benign[columns], rug= False,label='Benign', hist = False)
    plt.legend()
plt.tight_layout()

We can see that some attributes show a clear difference in distribution between the two classes, which makes them useful for identifying whether a website is malicious. A trained model can capture these differences to score how risky a website is, and if that risk exceeds a certain threshold the user can be alerted before proceeding to the website.
Hence we can see that, depending on the attributes, the risk of getting infected can change!
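The alert-on-threshold idea can be sketched as follows. This is a hypothetical illustration on synthetic stand-in data (the variable names and the 0.7 threshold are assumptions, not part of the dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical sketch: instead of the default 0.5 cut-off, alert the user
# whenever the predicted probability of the malicious class exceeds a
# chosen threshold. The data below is synthetic, not the real dataset.
rng = np.random.RandomState(42)
X_demo = rng.rand(200, 4)                                  # stand-in features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1.0).astype(int)   # stand-in labels

clf = RandomForestClassifier(n_estimators=50, random_state=42)
clf.fit(X_demo, y_demo)

risk = clf.predict_proba(X_demo[:5])[:, 1]  # P(malicious) for five sites
THRESHOLD = 0.7                             # tunable alert threshold
alerts = risk > THRESHOLD
print(list(zip(np.round(risk, 2), alerts)))
```

A lower threshold catches more malicious sites at the cost of more false alarms; the right value depends on how costly each kind of error is.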

Let us look at the attributes and their correlations using a correlation plot; this can help in understanding the important attributes and avoiding multicollinearity.

In [9]:
#Let us visualize with a heatmap
plt.figure(figsize=(14,10))
sns.heatmap(df.corr(), cmap='coolwarm', annot=True)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1e434c15518>
In [10]:
# Convert categorical columns to numbered categorical columns
df = pd.get_dummies(df,prefix_sep='--')
X = df.drop('Type',axis=1) #Predictors
y = df['Type']
X.head()
Out[10]:
URL_LENGTH NUMBER_SPECIAL_CHARACTERS CONTENT_LENGTH TCP_CONVERSATION_EXCHANGE DIST_REMOTE_TCP_PORT REMOTE_IPS APP_BYTES SOURCE_APP_PACKETS REMOTE_APP_PACKETS SOURCE_APP_BYTES ... WHOIS_UPDATED_DATE--9/09/2015 0:00 WHOIS_UPDATED_DATE--9/09/2015 20:47 WHOIS_UPDATED_DATE--9/09/2016 0:00 WHOIS_UPDATED_DATE--9/10/2015 0:00 WHOIS_UPDATED_DATE--9/11/2015 0:00 WHOIS_UPDATED_DATE--9/11/2016 0:00 WHOIS_UPDATED_DATE--9/12/2015 0:00 WHOIS_UPDATED_DATE--9/12/2015 14:43 WHOIS_UPDATED_DATE--9/12/2016 0:00 WHOIS_UPDATED_DATE--None
0 16 7 263.0 7 0 2 700 9 10 1153 ... 0 0 0 0 0 0 0 0 0 1
1 16 6 15087.0 17 7 4 1230 17 19 1265 ... 0 0 0 0 0 0 0 0 0 1
2 16 6 324.0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 1
3 17 6 162.0 31 22 3 3812 39 37 18784 ... 0 0 0 0 0 0 0 0 0 0
4 17 6 124140.0 57 2 5 4278 61 62 129889 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 1978 columns

Categorical attributes are handled using dummy variables, since machine learning algorithms require numeric input.
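A minimal illustration on toy data of what `pd.get_dummies` does above: each category of a categorical column becomes its own 0/1 indicator column, joined with the same `--` separator.

```python
import pandas as pd

# Toy example (not the real dataset): CHARSET is expanded into one
# indicator column per category, using the '--' prefix separator.
toy = pd.DataFrame({'URL_LENGTH': [16, 17],
                    'CHARSET': ['UTF-8', 'iso-8859-1']})
encoded = pd.get_dummies(toy, prefix_sep='--')
print(list(encoded.columns))
```

Note that with nearly 2000 resulting columns in the real data, many dummy columns (e.g. individual WHOIS dates) are very sparse.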

Machine Learning

Let us make use of machine learning to predict whether a website is malicious or not.

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

By using a random forest we can obtain feature importances, which help in understanding the most important attributes.

In [12]:
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, max_depth=30, criterion = 'entropy')
rf.fit(X_train, y_train)
print('Training Accuracy Score: {}'.format(rf.score(X_train, y_train)))
Training Accuracy Score: 0.9983948635634029

Let us look at the predictions on the Test dataset.

In [13]:
y_pred = rf.predict(X_test)
In [14]:
# Visualize our results
#print ("{0}. {1} appears {2} times.".format(1, 'b', 3.1415))
prediction_rf=rf.predict(X_test)
print('Accuracy: {0}%'.format(round(accuracy_score(y_test, prediction_rf)*100,2)))
print(classification_report(y_test, prediction_rf))
Accuracy: 95.7%
              precision    recall  f1-score   support

           0       0.95      1.00      0.98       462
           1       0.98      0.70      0.82        73

   micro avg       0.96      0.96      0.96       535
   macro avg       0.97      0.85      0.90       535
weighted avg       0.96      0.96      0.95       535

We have achieved 95.7% accuracy in detecting whether a website is malicious or not. Note, however, that the recall for the malicious class is only 0.70, so the model still misses a fair share of malicious sites.
Let us look at the AUC score.

In [15]:
y_pred_proba = rf.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test,  y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Based on the AUC score, we have a very good model for predicting whether a website is malicious or not.
Let us look at the most important features.

In [16]:
# View our feature importances
feature_importance_zip = zip(list(X), rf.feature_importances_)

# Sort the feature_importance_zip
sorted_importance = sorted(feature_importance_zip, key=lambda x: x[1], reverse=True)

for feature in sorted_importance[:15]:
    print(feature)
('DIST_REMOTE_TCP_PORT', 0.05542248595112565)
('REMOTE_APP_PACKETS', 0.05350555875366626)
('SOURCE_APP_BYTES', 0.05278637609692167)
('NUMBER_SPECIAL_CHARACTERS', 0.03784767672080814)
('WHOIS_UPDATED_DATE--2/09/2016 0:00', 0.035669284782677904)
('WHOIS_STATEPRO--Barcelona', 0.03531835186890351)
('URL_LENGTH', 0.0343964152328995)
('WHOIS_REGDATE--17/09/2008 0:00', 0.03239780891591486)
('WHOIS_COUNTRY--ES', 0.03135052817134378)
('SOURCE_APP_PACKETS', 0.02854442611666951)
('REMOTE_APP_BYTES', 0.02801012481009961)
('CONTENT_LENGTH', 0.027931476567531185)
('REMOTE_IPS', 0.025523627035864523)
('APP_PACKETS', 0.022632579871355966)
('DNS_QUERY_TIMES', 0.022313662212124954)
In [17]:
plt.figure(figsize=(10,10))
(pd.Series(rf.feature_importances_, index=X.columns)
   .nlargest(15)
   .plot(kind='barh')).invert_yaxis() 

Summary

We have performed an analysis of the malicious websites dataset. We can see how a machine interprets and learns from the features: based on the differences in the features, it can classify whether a website is malicious or not.

By making use of machine learning we can train a model to identify whether a website is malicious or safe. This is useful because only the URL is needed as input: the remaining features can be generated automatically and fed to the model to assess how risky the website is.
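To make the "generate features from a URL" idea concrete, here is a hedged sketch showing how the two simplest dataset features (URL_LENGTH and NUMBER_SPECIAL_CHARACTERS) could be regenerated from a raw URL. The character class used is an assumption; the dataset does not document its exact definition of "special characters".

```python
import re

# Hedged sketch: regenerate the two simplest dataset features from a raw
# URL. The [^A-Za-z0-9] class is an assumption about what counts as a
# "special character"; the dataset's exact definition is not documented.
def url_features(url: str) -> dict:
    return {
        'URL_LENGTH': len(url),
        'NUMBER_SPECIAL_CHARACTERS': len(re.findall(r'[^A-Za-z0-9]', url)),
    }

print(url_features('http://example.com/index.html'))
```

The network- and honeypot-based features would of course require actually contacting the server, which is a much heavier step than this string-level sketch.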

1. Data Source: https://archive.ics.uci.edu/ml/datasets/sms+spam+collection