Malware, or malicious software, is any program or file that is harmful to a computer user. Malware includes computer viruses, worms, Trojan horses and spyware. These malicious programs can perform a variety of functions, including stealing, encrypting or deleting sensitive data, altering or hijacking core computing functions and monitoring users’ computer activity without their permission. Once a machine is infected by malware, criminals can hurt consumers and enterprises in many ways. Cybercrime has been increasing, and its global cost has now reached as much as 600 billion dollars, about 0.8 percent of global GDP.
To minimize risk, is it possible to predict the chance of a machine getting infected using machine learning?
Let us look at a dataset provided by Microsoft [1] and see if we can use machine learning to predict the risk of a machine getting infected by malware.
Note: Click the Toggle Code button below if you want to look at the Python script.
[showhide type="post"]
import pandas as pd
from IPython.display import display
from IPython.display import HTML
import IPython.core.display as di  # Example: di.display_html('<h3>%s:</h3>' % str, raw=True)

# This line will hide code by default when the notebook is exported as HTML
di.display_html('<script>jQuery(function() {if (jQuery("body.notebook_app").length == 0) { jQuery(".input_area").toggle(); jQuery(".prompt").toggle();}});</script>', raw=True)

# This line will add a button to toggle visibility of code blocks, for use with the HTML export version
di.display_html('''<button onclick="jQuery('.input_area').toggle(); jQuery('.prompt').toggle();">Toggle Code</button>''', raw=True)

# We set the type of each field in the train set in order to reduce memory usage
dtypes = {
    'MachineIdentifier': 'category',
    'ProductName': 'category',
    'EngineVersion': 'category',
    'AppVersion': 'category',
    'AvSigVersion': 'category',
    'IsBeta': 'int8',
    'RtpStateBitfield': 'float16',
    'IsSxsPassiveMode': 'int8',
    'DefaultBrowsersIdentifier': 'float16',
    'AVProductStatesIdentifier': 'float32',
    'AVProductsInstalled': 'float16',
    'AVProductsEnabled': 'float16',
    'HasTpm': 'int8',
    'CountryIdentifier': 'int16',
    'CityIdentifier': 'float32',
    'OrganizationIdentifier': 'float16',
    'GeoNameIdentifier': 'float16',
    'LocaleEnglishNameIdentifier': 'int8',
    'Platform': 'category',
    'Processor': 'category',
    'OsVer': 'category',
    'OsBuild': 'int16',
    'OsSuite': 'int16',
    'OsPlatformSubRelease': 'category',
    'OsBuildLab': 'category',
    'SkuEdition': 'category',
    'IsProtected': 'float16',
    'AutoSampleOptIn': 'int8',
    'PuaMode': 'category',
    'SMode': 'float16',
    'IeVerIdentifier': 'float16',
    'SmartScreen': 'category',
    'Firewall': 'float16',
    'UacLuaenable': 'float32',
    'Census_MDC2FormFactor': 'category',
    'Census_DeviceFamily': 'category',
    'Census_OEMNameIdentifier': 'float16',
    'Census_OEMModelIdentifier': 'float32',
    'Census_ProcessorCoreCount': 'float16',
    'Census_ProcessorManufacturerIdentifier': 'float16',
    'Census_ProcessorModelIdentifier': 'float16',
    'Census_ProcessorClass': 'category',
    'Census_PrimaryDiskTotalCapacity': 'float32',
    'Census_PrimaryDiskTypeName': 'category',
    'Census_SystemVolumeTotalCapacity': 'float32',
    'Census_HasOpticalDiskDrive': 'int8',
    'Census_TotalPhysicalRAM': 'float32',
    'Census_ChassisTypeName': 'category',
    'Census_InternalPrimaryDiagonalDisplaySizeInInches': 'float16',
    'Census_InternalPrimaryDisplayResolutionHorizontal': 'float16',
    'Census_InternalPrimaryDisplayResolutionVertical': 'float16',
    'Census_PowerPlatformRoleName': 'category',
    'Census_InternalBatteryType': 'category',
    'Census_InternalBatteryNumberOfCharges': 'float32',
    'Census_OSVersion': 'category',
    'Census_OSArchitecture': 'category',
    'Census_OSBranch': 'category',
    'Census_OSBuildNumber': 'int16',
    'Census_OSBuildRevision': 'int32',
    'Census_OSEdition': 'category',
    'Census_OSSkuName': 'category',
    'Census_OSInstallTypeName': 'category',
    'Census_OSInstallLanguageIdentifier': 'float16',
    'Census_OSUILocaleIdentifier': 'int16',
    'Census_OSWUAutoUpdateOptionsName': 'category',
    'Census_IsPortableOperatingSystem': 'int8',
    'Census_GenuineStateName': 'category',
    'Census_ActivationChannel': 'category',
    'Census_IsFlightingInternal': 'float16',
    'Census_IsFlightsDisabled': 'float16',
    'Census_FlightRing': 'category',
    'Census_ThresholdOptIn': 'float16',
    'Census_FirmwareManufacturerIdentifier': 'float16',
    'Census_FirmwareVersionIdentifier': 'float32',
    'Census_IsSecureBootEnabled': 'int8',
    'Census_IsWIMBootEnabled': 'float16',
    'Census_IsVirtualDevice': 'float16',
    'Census_IsTouchEnabled': 'int8',
    'Census_IsPenCapable': 'int8',
    'Census_IsAlwaysOnAlwaysConnectedCapable': 'float16',
    'Wdft_IsGamer': 'float16',
    'Wdft_RegionIdentifier': 'float16',
    'HasDetections': 'int8'
}

# We have two files, train and test; we train first and then check our prediction model on the test file.
train = pd.read_csv('train_sample.csv', dtype=dtypes)
test = pd.read_csv('test_sample.csv', dtype=dtypes)
[/showhide]
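To illustrate why the script declares narrow types up front, here is a minimal sketch that compares the memory used by a 0/1 flag column stored as the default 64-bit integer and as `int8`. The toy frame and its size are assumptions made for illustration; only the column name `IsBeta` comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy 0/1 flag column (hypothetical data); int64 uses 8 bytes per value.
flags = pd.DataFrame({"IsBeta": np.zeros(100_000, dtype="int64")})

# Same values stored as int8, which uses 1 byte per value.
compact = flags.astype({"IsBeta": "int8"})

bytes_default = flags["IsBeta"].memory_usage(index=False)
bytes_compact = compact["IsBeta"].memory_usage(index=False)
print(bytes_default, bytes_compact)
```

Passing the `dtypes` dictionary to `pd.read_csv` applies the same idea while the file is being parsed, so the wide columns never need to be held in memory.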
We have a huge dataset with 84 attributes. The goal is to predict a Windows machine’s probability of getting infected by various families of malware, based on different properties of that machine. The data containing these properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft’s endpoint protection solution, Windows Defender.
Unnamed: 0 | MachineIdentifier | ProductName | EngineVersion | AppVersion | AvSigVersion | IsBeta | RtpStateBitfield | IsSxsPassiveMode | DefaultBrowsersIdentifier | AVProductStatesIdentifier | … | Census_FirmwareVersionIdentifier | Census_IsSecureBootEnabled | Census_IsWIMBootEnabled | Census_IsVirtualDevice | Census_IsTouchEnabled | Census_IsPenCapable | Census_IsAlwaysOnAlwaysConnectedCapable | Wdft_IsGamer | Wdft_RegionIdentifier | HasDetections |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1713700 | 312ee21c55c435e926e8d697463cb3b0 | win8defender | 1.1.15200.1 | 4.18.1806.18062 | 1.275.202.0 | 0 | 7.0 | 0 | NaN | 53447.0 | … | 64689.0 | 1 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 3.0 | 1 |
2218958 | 3fab37ea700827f06691984cc03652eb | win8defender | 1.1.15200.1 | 4.18.1807.18075 | 1.275.1141.0 | 0 | 7.0 | 0 | NaN | 47380.0 | … | 51910.0 | 0 | NaN | 0.0 | 1 | 0 | 0.0 | 0.0 | 15.0 | 1 |
2720229 | 4e0bd7c65da468f9bfb5939582358310 | win8defender | 1.1.15200.1 | 4.18.1807.18075 | 1.275.569.0 | 0 | 7.0 | 0 | NaN | 50188.0 | … | 6899.0 | 0 | NaN | 0.0 | 0 | 0 | 0.0 | 1.0 | 3.0 | 1 |
4079822 | 75114848cf4f7e16a317ba8f15efd1b2 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.1616.0 | 0 | 7.0 | 0 | NaN | 53447.0 | … | 63555.0 | 1 | NaN | 0.0 | 0 | 0 | 0.0 | 0.0 | 3.0 | 1 |
1693213 | 30993d30aa1c94e4e75032884e4d99e2 | win8defender | 1.1.15100.1 | 4.18.1807.18075 | 1.273.778.0 | 0 | 7.0 | 0 | NaN | 53447.0 | … | 19951.0 | 1 | NaN | 0.0 | 0 | 0 | 0.0 | 0.0 | 7.0 | 1 |
5 rows × 83 columns
Each row in this dataset corresponds to a machine, uniquely identified by a MachineIdentifier.
HasDetections is the ground truth and indicates that Malware was detected on the machine.
Let us look at the description of the major attributes in the dataset.
It is a large dataset, but let us see whether we can predict from the various attributes of a machine if it will get hit by malware.
Let us look at the distribution of the outcome variable.
We can see that the dataset is balanced, which means it has been sampled to include a much larger proportion of infected machines than we would otherwise expect to see.
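A quick way to check the balance of the outcome variable is to look at the share of each class. The sketch below uses a small made-up frame as a stand-in for the training data; only the column name `HasDetections` is taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the training frame (hypothetical values).
train_toy = pd.DataFrame({"HasDetections": [1, 0, 1, 0, 0, 1, 1, 0]})

# Share of each class; values near 0.5 indicate a balanced target.
class_share = train_toy["HasDetections"].value_counts(normalize=True).sort_index()
print(class_share)
```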
Let us have a look at the categorical variables.
We can see that the rate of detections is lower for touch devices.
We can see that there is a significant difference in detection levels based on the OS version.
Hence we can see that depending upon the attributes the risks of getting infected can change!
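Comparisons like these boil down to the mean of the 0/1 outcome within each category. Here is a minimal sketch, assuming a pandas frame with the dataset's column names; the rows are invented toy values, and `detection_rate` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy rows (hypothetical values), using two real column names from the dataset.
toy = pd.DataFrame({
    "Census_IsTouchEnabled": [0, 0, 0, 0, 1, 1, 1, 1],
    "HasDetections":         [1, 1, 0, 1, 0, 1, 0, 0],
})

def detection_rate(df, column, target="HasDetections"):
    """Mean of the 0/1 target per category, i.e. the share of machines with detections."""
    return df.groupby(column)[target].mean()

rates = detection_rate(toy, "Census_IsTouchEnabled")
print(rates)
```

The same helper can be pointed at any categorical column, such as the OS version or the number of antivirus products installed.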
This is the number of antivirus products installed. With a single antivirus product, the rate of detection is high. Machines with two antivirus products installed show a lower rate of detection.
Here rs stands for Redstone and th for Threshold, both of which are code names for Windows 10 releases.
We can also see that rs4 has a higher number of detections, which may be because it was a new version.
SmartScreen Filter helps to identify reported phishing and malware websites and also helps you make informed decisions about downloads.
As you browse the web, it analyzes pages and determines if they might be suspicious.
If it finds a match, SmartScreen will show you a warning letting you know that the site has been blocked for your safety.
SmartScreen checks files that you download from the web against a list of reported malicious software sites and programs known to be unsafe.
This is the SmartScreen enabled string value from the registry. We can see that machines where the value exists but is not set can have a large number of detections!
As we will see, it is the most important feature for detection.
We can see the detection levels differ a lot based on country. This could be a good feature.
We can see that these columns have high cardinality, so we frequency-encode them: the categories are ranked with respect to their frequencies, and the resulting variables are then treated as numerical.
We can then use them in our model.
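The frequency encoding described above can be sketched as follows. This is one possible reading of the approach, not the notebook's exact code: `frequency_encode` is a hypothetical helper, and the toy column reuses two `EngineVersion` values from the sample rows shown earlier.

```python
import pandas as pd

def frequency_encode(series):
    """Replace each category with its rank by frequency (0 = most frequent)."""
    values = series.astype(object)          # also handles 'category' dtype columns
    counts = values.value_counts()          # sorted by descending frequency
    ranks = {category: rank for rank, category in enumerate(counts.index)}
    return values.map(ranks)                # missing values stay missing

engine = pd.Series(
    ["1.1.15200.1", "1.1.15200.1", "1.1.15200.1", "1.1.15100.1", "1.1.15100.1"],
    name="EngineVersion",
)
encoded = frequency_encode(engine)
print(encoded.tolist())
```

To keep train and test consistent, the ranks would normally be computed once and the same mapping applied to both files.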
Now that we are done with feature engineering let us move to machine learning and prediction.
Let us make use of LightGBM (lgbm), a fast, distributed, high-performance gradient boosting framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
Following are the cross-validation scores on the training data, which we used in order to create an optimized model. We made use of cross-validation with 5 folds. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once.
fold n°0
Training until validation scores don't improve for 200 rounds.
[100] training's auc: 0.751319 valid_1's auc: 0.720071
[200] training's auc: 0.769667 valid_1's auc: 0.721429
[300] training's auc: 0.781697 valid_1's auc: 0.721092
Early stopping, best iteration is:
[185] training's auc: 0.767497 valid_1's auc: 0.721497
time elapsed: 0.038s
fold n°1
Training until validation scores don't improve for 200 rounds.
[100] training's auc: 0.751309 valid_1's auc: 0.721836
[200] training's auc: 0.769685 valid_1's auc: 0.722959
[300] training's auc: 0.78145 valid_1's auc: 0.722416
[400] training's auc: 0.791111 valid_1's auc: 0.721733
Early stopping, best iteration is:
[214] training's auc: 0.771747 valid_1's auc: 0.723126
time elapsed: 0.079s
fold n°2
Training until validation scores don't improve for 200 rounds.
[100] training's auc: 0.751524 valid_1's auc: 0.720615
[200] training's auc: 0.769441 valid_1's auc: 0.721717
[300] training's auc: 0.781359 valid_1's auc: 0.721405
[400] training's auc: 0.79129 valid_1's auc: 0.721
Early stopping, best iteration is:
[203] training's auc: 0.769872 valid_1's auc: 0.721767
time elapsed: 0.12 s
fold n°3
Training until validation scores don't improve for 200 rounds.
[100] training's auc: 0.7513 valid_1's auc: 0.720716
[200] training's auc: 0.769799 valid_1's auc: 0.721748
[300] training's auc: 0.782024 valid_1's auc: 0.721542
Early stopping, best iteration is:
[190] training's auc: 0.768455 valid_1's auc: 0.72178
time elapsed: 0.16 s
fold n°4
Training until validation scores don't improve for 200 rounds.
[100] training's auc: 0.751782 valid_1's auc: 0.722352
[200] training's auc: 0.769942 valid_1's auc: 0.723223
[300] training's auc: 0.781948 valid_1's auc: 0.722633
[400] training's auc: 0.791381 valid_1's auc: 0.721827
Early stopping, best iteration is:
[202] training's auc: 0.770309 valid_1's auc: 0.72333
time elapsed: 0.2 s
CV score: 0.72230
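The 5-fold scheme behind this log can be sketched in a few lines. This is a hand-rolled illustration using only NumPy, not the notebook's code (scikit-learn's `KFold` offers the same kind of splitting). The function name `k_fold_indices`, the sample size and the seed are assumptions. The loop checks the property mentioned above: each observation lands in a validation fold exactly once.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        valid_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, valid_idx

n = 23  # hypothetical number of rows
times_validated = np.zeros(n, dtype=int)
for train_idx, valid_idx in k_fold_indices(n):
    # A model would be fit on train_idx and scored on valid_idx here.
    assert np.intersect1d(train_idx, valid_idx).size == 0
    times_validated[valid_idx] += 1

print(times_validated)
```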
After submitting the predictions on Kaggle, the model scored 68.1% on the test dataset.
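The fold logs above report AUC (area under the ROC curve), which can be read as the probability that a randomly chosen infected machine receives a higher predicted score than a randomly chosen clean one. Here is a minimal pure-Python sketch of that definition; the labels and scores are made-up toy values, and `roc_auc` is a hypothetical helper written for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0]            # toy ground truth
scores = [0.9, 0.4, 0.35, 0.1]   # toy predicted probabilities
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the cross-validation score of about 0.72 sits between the two.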
These results represent machines and the risk of being affected based on their features.
# | Machine | Risk of Infection |
---|---|---|
0 | bbd32c3d6e6673dab113227a96ea1614 | 33.8% |
1 | e5544db56485780c556ae0faa49a0dda | 46.1% |
2 | 7b5a256ef0a3e28f9cebfa6e05b1427a | 42.3% |
3 | 772afbaa64c2169494d0146a614fafff | 23.3% |
4 | df1cc7ed605e68570ec5cecb2232f5c4 | 46.7% |
5 | e64a52d52410e58b173a770b99ea75e4 | 43.9% |
6 | 23e15d0ab76ed2de40d423b10a691ec8 | 52.8% |
7 | 9da269745fc049f4ec790111aa7c42c8 | 19.9% |
8 | d600da38bf735caad71cd5c9100c172f | 39.8% |
9 | 75d6b6247d558385de5c6e9242d73715 | 25.7% |
Thus we have analyzed the attributes of a machine and found those that are most important for predicting the risk of infection. SmartScreen, CountryIdentifier, AVProductStatesIdentifier, AVProductsInstalled and EngineVersion are some of the most important features that were helpful for the prediction.
SmartScreen was the most important feature. It helps to identify reported phishing and malware websites and also helps you make informed decisions about downloads. We have seen from the analysis that machines where the value exists but is not set can have a large number of detections!
CountryIdentifier indicates the country of the machine. Malware rates vary across countries, which makes it a strong predictor.
AVProductsInstalled indicates the number of Antivirus products installed.
AppVersion and EngineVersion indicate the version of Windows Defender.
This means that if we have information about a machine, we can predict its chances of getting infected, and if the risk is high we can take countermeasures accordingly.
Hence, by making use of machine learning, we can identify the risk of infection and act to protect our machines before they get infected!
More complex models using deep learning could help to improve the accuracy further. With deep learning we can create a model that learns on its own by performing a task repeatedly, each time tweaking itself a little to improve the outcome.
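As a rough sketch of how such predictions could drive countermeasures, the snippet below flags machines whose predicted risk is at or above a cut-off. The risk values are taken from the table above; the 0.45 threshold and the helper `flag_high_risk` are assumptions chosen purely for illustration.

```python
# Predicted risk for a few machines from the results table.
risk = {
    "bbd32c3d6e6673dab113227a96ea1614": 0.338,
    "e5544db56485780c556ae0faa49a0dda": 0.461,
    "7b5a256ef0a3e28f9cebfa6e05b1427a": 0.423,
    "23e15d0ab76ed2de40d423b10a691ec8": 0.528,
    "9da269745fc049f4ec790111aa7c42c8": 0.199,
}

def flag_high_risk(risk_by_machine, threshold=0.45):
    """Return (machine, risk) pairs at or above the threshold, riskiest first."""
    flagged = [(m, r) for m, r in risk_by_machine.items() if r >= threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

high_risk = flag_high_risk(risk)
for machine, value in high_risk:
    print(f"{machine}: {value:.1%}")
```

In practice the threshold would be tuned to balance the cost of a missed infection against the cost of acting on a false alarm.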
1. Data Source: https://www.kaggle.com/c/microsoft-malware-prediction ↩