A botnet is a network of compromised, internet-connected devices that an attacker controls remotely and uses to perform repetitive malicious tasks, such as sending spam or launching distributed denial-of-service attacks. Connected devices play an important role in modern life. From smart home appliances, computers, coffee machines, and cameras to connected cars, this huge shift in our lifestyles has made our lives easier. Unfortunately, these exposed devices can easily be targeted by attackers and cybercriminals, who can later use them to enable larger-scale attacks. Security vendors provide many solutions and products to defend against botnets, but in this tutorial, we are going to learn how to build novel botnet detection systems with Python and machine learning techniques.
You will find all the code discussed, in addition to some other useful scripts, in the following repository: https://github.com/PacktPublishing/Mastering-Machine-Learning-for-Penetration-Testing/tree/master/Chapter05
This article is an excerpt from a book written by Chiheb Chebbi titled Mastering Machine Learning for Penetration Testing
We are going to learn how to build different botnet detection systems with several machine learning algorithms. As a first practical lab, let's build a machine learning-based botnet detector using different classifiers. By now, I hope you have acquired a clear understanding of the major steps of building machine learning systems, so you already know that, as a first step, we need to look for a dataset.
Many educational institutions and organizations publish datasets collected in their internal laboratories. One of the best-known botnet datasets is the CTU-13 dataset. It is a labeled dataset with botnet, normal, and background traffic delivered by CTU University, Czech Republic. During their work, the researchers captured real botnet traffic mixed with normal traffic and background traffic. To download the dataset and check out more information about it, you can visit the following link: https://mcfp.weebly.com/the-ctu-13-dataset-a-labeled-dataset-with-botnet-normal-and-background-traffic.html.
The dataset consists of bidirectional NetFlow files. But what are bidirectional NetFlow files? NetFlow is a network protocol developed by Cisco to collect IP traffic information and monitor network traffic, giving a clearer view of the network traffic flow. The main components of a NetFlow architecture are a NetFlow exporter, a NetFlow collector, and flow storage. The following diagram illustrates the different components of a NetFlow infrastructure:
In unidirectional NetFlow, when host A sends information to host B and host B replies to host A, the request and the reply are recorded as two separate flows. In bidirectional NetFlow, the traffic exchanged between host A and host B is merged into a single flow, as the small sketch below illustrates.
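The following minimal Python sketch (toy values and illustrative field names, not the NetFlow specification) shows how two unidirectional flows could be merged into one bidirectional record; the TotPkts, TotBytes, and SrcBytes columns we will meet in the dataset follow the same idea:

# Toy example: merge two unidirectional flows into one bidirectional flow.
# The field names are illustrative, not taken from the NetFlow specification.
flow_a_to_b = {"src": "10.0.0.1", "dst": "10.0.0.2", "pkts": 12, "bytes": 3400}
flow_b_to_a = {"src": "10.0.0.2", "dst": "10.0.0.1", "pkts": 10, "bytes": 1200}

bidirectional_flow = {
    "src": flow_a_to_b["src"],
    "dst": flow_a_to_b["dst"],
    "tot_pkts": flow_a_to_b["pkts"] + flow_b_to_a["pkts"],     # packets in both directions
    "tot_bytes": flow_a_to_b["bytes"] + flow_b_to_a["bytes"],  # bytes in both directions
    "src_bytes": flow_a_to_b["bytes"],                         # bytes sent by the source only
}
print(bidirectional_flow)

Let's download the dataset by using the following command: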
$ wget --no-check-certificate https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/CTU-13-Dataset.tar.bz2
Extract the downloaded tar.bz2 file by using the following command:
# tar xvjf CTU-13-Dataset.tar.bz2
The file contains all the datasets, with the different scenarios. For the demonstration, we are going to use dataset 8 (scenario 8). You can select any scenario or you can use your own collected data, or any other .binetflow files delivered by other institutions:
Load the data using pandas as usual:
>>> import pandas as pd
>>> data = pd.read_csv("capture20110816-3.binetflow")
>>> data['Label'] = data.Label.str.contains("Botnet")
Exploring the data is essential in any data-centric project. For example, you can start by checking the names of the features or the columns:
>>> data.columns
The command results in the columns of the dataset: StartTime, Dur, Proto, SrcAddr, Sport, Dir, DstAddr, Dport, State, sTos, dTos, TotPkts, TotBytes, SrcBytes, and Label. The columns represent the features used in the dataset; for example, Dur represents duration, Sport represents the source port, and so on. You can find the full list of features in the chapter’s GitHub repository.
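A couple of quick pandas calls (standard DataFrame methods, nothing specific to this dataset beyond the Label column created above) also give a feel for the size of the scenario and the balance between botnet and normal flows:

>>> data.shape
>>> data['Label'].value_counts()
>>> data.head()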
Before training the model, we need to build some scripts to prepare the data. This time, we are going to build a separate Python script to prepare data, and later we can just import it into the main script.
I will call the first script DataPreparation.py. Many approaches have been proposed for extracting features and preparing data to build botnet detectors using machine learning. In our case, I customized two new scripts inspired by the data loading scripts built by NagabhushanS:
from __future__ import division
import os, sys
import threading
import numpy as np
After importing the required Python packages, we created a class called Prepare to select training and testing data:
class Prepare(threading.Thread):
    def __init__(self, X, Y, XT, YT, accLabel=None):
        threading.Thread.__init__(self)
        self.X = X
        self.Y = Y
        self.XT = XT
        self.YT = YT
        self.accLabel = accLabel
    def run(self):
        # Work on copies so the original arrays passed in are left untouched
        X = np.zeros(self.X.shape)
        Y = np.zeros(self.Y.shape)
        XT = np.zeros(self.XT.shape)
        YT = np.zeros(self.YT.shape)
        np.copyto(X, self.X)
        np.copyto(Y, self.Y)
        np.copyto(XT, self.XT)
        np.copyto(YT, self.YT)
        # Standardize the first nine numerical features of the training set
        for i in range(9):
            X[:, i] = (X[:, i] - X[:, i].mean()) / (X[:, i].std())
        # Standardize the first nine numerical features of the test set
        for i in range(9):
            XT[:, i] = (XT[:, i] - XT[:, i].mean()) / (XT[:, i].std())
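Because Prepare subclasses threading.Thread, a typical way to run it is to construct an instance and call start() (or call run() directly for a synchronous call). The following usage is an assumed sketch, not code from the chapter; X, Y, XT, and YT stand for NumPy arrays of training features, training labels, test features, and test labels:

# Hypothetical usage of the Prepare thread: standardize the first nine
# numerical feature columns of the training and test matrices.
preparer = Prepare(X, Y, XT, YT)
preparer.start()   # run() could also be called directly, without a separate thread
preparer.join()    # wait for the standardization loops to finish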
The second script is called LoadData.py. You can find it on GitHub and use it directly in your projects to load data from .binetflow files and generate a pickle file.
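The exact script is in the repository; as a rough, hedged sketch of what such a loader does, the following reads a .binetflow file with pandas, builds a binary label, selects a few numerical features (the real script standardizes nine feature columns, so treat the column choice here as an illustrative assumption), splits the data, and stores the four arrays in flowdata.pickle:

import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the bidirectional NetFlow file and build a binary label
flows = pd.read_csv("capture20110816-3.binetflow")
flows['Label'] = flows.Label.str.contains("Botnet").astype(int)

# Keep a few numerical features (illustrative choice, not the chapter's exact list)
columns = ['Dur', 'TotPkts', 'TotBytes', 'SrcBytes']
X = flows[columns].fillna(0).values
y = flows['Label'].values

# Split into training and testing sets and pickle them in the order
# expected by the main script: Xdata, Ydata, XdataT, YdataT
Xdata, XdataT, Ydata, YdataT = train_test_split(X, y, test_size=0.3, random_state=42)
with open('flowdata.pickle', 'wb') as f:
    pickle.dump([Xdata, Ydata, XdataT, YdataT], f)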
Let’s use what we developed previously to train the models. After building the data loader and preparing the machine learning algorithms that we are going to use, it is time to train and test the models.
First, load the data from the pickle file, which is why we need to import the pickle Python library. Don’t forget to import the previous scripts using:
import LoadData
import DataPreparation
import pickle

file = open('flowdata.pickle', 'rb')
data = pickle.load(file)
Select the data sections:
Xdata = data[0]
Ydata = data[1]
XdataT = data[2]
YdataT = data[3]
As machine learning classifiers, we are going to try several different algorithms so that we can later select the one that performs best for our model. Import the required modules to use four machine learning algorithms from sklearn:
from sklearn.linear_model import *
from sklearn.tree import *
from sklearn.naive_bayes import *
from sklearn.neighbors import *
Prepare the data by using the module we built previously. Don't forget to import DataPreparation by typing import DataPreparation:
>>> DataPreparation.Prepare(Xdata,Ydata,XdataT,YdataT)
Now we can train the models. We are going to train with several different techniques so that, later, we can select the most suitable machine learning technique for our project. The steps are similar to those of previous projects: after preparing the data and selecting the features, define the machine learning algorithm, fit the model, and print out the score. A compact loop that compares the four scikit-learn classifiers appears after the individual models below.
Let's start with a decision tree:
- Decision tree model:
>>> clf = DecisionTreeClassifier()
>>> clf.fit(Xdata, Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT, YdataT)
>>> print("The Score of the Decision Tree Classifier is", Score * 100)
The score of the decision tree classifier is 99%
- Logistic regression model:
>>> clf = LogisticRegression(C=10000)
>>> clf.fit(Xdata, Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT, YdataT)
>>> print("The Score of the Logistic Regression Classifier is", Score * 100)
The score of the logistic regression classifier is 96%
- Gaussian Naive Bayes model:
>>> clf = GaussianNB()
>>> clf.fit(Xdata, Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT, YdataT)
>>> print("The Score of the Gaussian Naive Bayes classifier is", Score * 100)
The score of the Gaussian Naive Bayes classifier is 72%
- k-Nearest Neighbors model:
>>> clf = KNeighborsClassifier()
>>> clf.fit(Xdata, Ydata)
>>> Prediction = clf.predict(XdataT)
>>> Score = clf.score(XdataT, YdataT)
>>> print("The Score of the K-Nearest Neighbours classifier is", Score * 100)
The score of the k-Nearest Neighbors classifier is 96%
- Neural network model:
To build a neural network model, use the following code:
>>> from keras.models import *
>>> from keras.layers import Dense, Activation
>>> from keras.optimizers import *
model = Sequential()
model.add(Dense(10, input_dim=9, activation='sigmoid'))
model.add(Dense(10, activation='sigmoid'))
model.add(Dense(1))
sgd = SGD(lr=0.01, decay=0.000001, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='mse')
model.fit(Xdata, Ydata, nb_epoch=200, batch_size=100)
Score = model.evaluate(XdataT, YdataT, verbose=0)
print("The Score of the Neural Network is", Score * 100)
With this code, we imported the required Keras modules, built the layers, compiled the model with an SGD optimizer, fit the model, and printed out its score. Note that nb_epoch is the Keras 1 argument name; in Keras 2 and later it was renamed to epochs.
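Since the four scikit-learn classifiers share the same fit/score interface, a compact way to reproduce the comparison above (a sketch equivalent to the individual calls, not additional code from the book) is to loop over the instances and print each score:

# Compare the four scikit-learn classifiers trained above with one loop
classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Logistic Regression": LogisticRegression(C=10000),
    "Gaussian Naive Bayes": GaussianNB(),
    "k-Nearest Neighbors": KNeighborsClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(Xdata, Ydata)
    print(name, "score:", clf.score(XdataT, YdataT) * 100)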
How to build a Twitter bot detector
In the previous sections, we saw how to build a machine learning-based botnet detector. In this new project, we are going to deal with a different problem: instead of defending against botnet malware, we are going to detect Twitter bots, which are also dangerous and can perform malicious actions. For the model, we are going to use the NYU Tandon Spring 2017 Machine Learning Competition: Twitter Bot classification dataset. You can download it from this link: https://www.kaggle.com/c/twitter-bot-classification/data. Import the required Python packages:
>>> import pandas as pd
>>> import numpy as np
>>> import seaborn
Let’s load the data using pandas and highlight the bot and non-bot data:
>>> data = pd.read_csv('training_data_2_csv_UTF.csv')
>>> Bots = data[data.bot==1]
>>> NonBots = data[data.bot==0]
Visualization with seaborn
In every project, I want to help you discover new data visualization Python libraries because, as you saw, data engineering and visualization are essential to every modern data-centric project. This time, I chose seaborn to visualize the data and explore it before starting the training phase. Seaborn is a Python library for making statistical visualizations. The following is an example of generating a plot with seaborn:
>>> # use a separate variable so the Twitter data loaded above is not overwritten
>>> demo = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
>>> demo = pd.DataFrame(demo, columns=['x', 'y'])
>>> for col in 'xy':
...     seaborn.kdeplot(demo[col], shade=True)
For example, in our case, if we want to visualize the missing values in the Twitter dataset we loaded earlier:
import matplotlib.pyplot
matplotlib.pyplot.figure(figsize=(10, 6))
seaborn.heatmap(data.isnull(), yticklabels=False, cbar=False, cmap='viridis')
matplotlib.pyplot.tight_layout()
The previous two code snippets were some examples to learn how to visualize data. Visualization helps data scientists to explore and learn more about the data. Now, let’s go back and continue building our model.
Identify a bag of words by selecting some suspicious words commonly used by Twitter bots. The following is an example; of course, you can add more words:
bag_of_words_bot = r'bot|b0t|cannabis|tweet me|mishear|follow me|updates every|gorilla|yes_ofc|forget|' \
                   r'expos|kill|bbb|truthe|fake|anony|free|virus|funky|RNA|jargon|' \
                   r'nerd|swag|jack|chick|prison|paper|pokem|xx|freak|ffd|dunia|clone|genie|bbb|' \
                   r'ffd|onlyman|emoji|joke|troll|droop|free|every|wow|cheese|yeah|bio|magic|wizard|face'
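To see how this regular expression behaves, here is a tiny check on made-up screen names (the names are invented purely for illustration); str.contains flags any value matching one of the alternatives:

>>> sample = pd.Series(['news_b0t_updates', 'alice_smith', 'free_virus_scanner'])
>>> sample.str.contains(bag_of_words_bot, case=False, na=False)

This should flag the first and third names (they contain b0t, free, and virus) and leave the second unflagged.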
- Now, it is time to identify training features:
data['screen_name_binary'] = data.screen_name.str.contains(bag_of_words_bot, case=False, na=False)
data['name_binary'] = data.name.str.contains(bag_of_words_bot, case=False, na=False)
data['description_binary'] = data.description.str.contains(bag_of_words_bot, case=False, na=False)
data['status_binary'] = data.status.str.contains(bag_of_words_bot, case=False, na=False)
- Feature extraction: Let’s select features to use in our model:
data['listed_count_binary'] = (data.listed_count > 20000) == False
features = ['screen_name_binary', 'name_binary', 'description_binary', 'status_binary',
            'verified', 'followers_count', 'friends_count', 'statuses_count',
            'listed_count_binary', 'bot']
- Now, train the model with a decision tree classifier:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc
from sklearn.model_selection import train_test_split
- We select the features and the target (the bot column):
X = data[features].iloc[:, :-1]
y = data[features].iloc[:, -1]
- We define the classifier:
clf = DecisionTreeClassifier(criterion='entropy', min_samples_leaf=50, min_samples_split=10)
- We split the data into training and testing sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
- We fit the model:
clf.fit(X_train, y_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
- We print out the accuracy scores:
print("Training Accuracy: %.5f" %accuracy_score(y_train, y_pred_train)) print("Test Accuracy: %.5f" %accuracy_score(y_test, y_pred_test))
Our model detects Twitter bots with roughly 88% test accuracy, which is a good result.
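The roc_curve and auc functions imported earlier are not used in the snippet above. As a short additional sketch (not part of the original walkthrough), they can complement the accuracy scores by evaluating the fitted classifier's predicted probabilities on the test set:

# Compute the ROC curve and AUC on the test set
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
print("Test AUC: %.5f" % auc(fpr, tpr))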
This technique is not the only possible way to detect Twitter bots. Researchers have proposed many other models based on different machine learning algorithms, such as linear SVMs and decision trees, and many of these techniques achieve accuracies of around 90% or higher. Most studies show that feature engineering is a key contributor to improving machine learning models.
Summary
In this tutorial, we learned how to build a botnet detector and a Twitter bot detector with different machine learning algorithms.
To become a master at penetration testing using machine learning with Python, check out the book Mastering Machine Learning for Penetration Testing.
Read Next
Cisco and Huawei Routers hacked via backdoor attacks and botnets
How to protect yourself from a botnet attack
Tackle trolls with Machine Learning bots: Filtering out inappropriate content just got easy