Deep Learning (DL), a subfield of machine learning, arose to help build algorithms that work like the human mind and are inspired by its structure. Information security professionals are also intrigued by such techniques, as they have provided promising results in defending against major cyber threats and attacks. One of the best-suited candidates for the implementation of DL is malware analysis.
This tutorial is an excerpt taken from the book, Mastering Machine Learning for Penetration Testing written by Chiheb Chebbi. In this book, you will learn to identify ambiguities, extensive techniques to breach an intelligent system, and much more.
In this post, we are going to explore artificial neural network architectures and learn how to use one of them to help malware analysts and information security professionals detect and classify malicious code. Before diving into the technical details and the practical implementation of the DL method, it is worth surveying the main architectures of artificial neural networks.
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks (CNNs) are a deep learning approach to image classification and other computer vision problems. Classic computer programs struggle to identify objects for many reasons, including lighting, viewpoint, deformation, and segmentation. The technique is inspired by how the eye works, especially by the function of the visual cortex in animals. In a CNN, neurons are arranged in three-dimensional structures with width, height, and depth. In the case of images, the height is the image height, the width is the image width, and the depth is the number of color channels (three for RGB). To build a CNN, we need three main types of layer:
- Convolutional layer: The convolution operation extracts features from the input image by sliding a filter over it and multiplying the filter values with the underlying pixel values
- Pooling layer: The pooling operation reduces the dimensionality of each feature map
- Fully-connected layer: The fully-connected layer is a classic multi-layer perceptron with a softmax activation function in the output layer
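In Keras, a small CNN for the 28 x 28 MNIST digit images can be set up as follows:
    import numpy
    from keras.datasets import mnist
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Flatten
    from keras.layers import Conv2D, MaxPooling2D
    from keras.utils import to_categorical  # replaces the older np_utils helper
    from keras import backend
    # Use the channels-first layout, so inputs are shaped (channels, height, width).
    # Older Keras releases spelled this call backend.set_image_dim_ordering('th').
    backend.set_image_data_format('channels_first')
    model = Sequential()
    model.add(Conv2D(32, (5, 5), input_shape=(1, 28, 28), activation='relu'))
    # The excerpt omits the remaining layers; the ones below are an assumption
    # based on the imports above, with illustrative sizes.
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.2))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dense(10, activation='softmax'))  # MNIST has 10 digit classes
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])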
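To make the convolution and pooling operations described above more concrete, here is a minimal pure-Python sketch. It is not how Keras implements these layers; the `convolve2d` and `max_pool` helpers, the 5 x 5 patch, and the 2 x 2 filter are all hypothetical and chosen only for illustration. As in most deep learning libraries, the filter is applied without flipping it.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[r + i][c + j]
                for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]


# Hypothetical 5 x 5 grayscale patch with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
    [0, 0, 9, 9, 9],
]
# Hypothetical 2 x 2 filter that responds to left-to-right intensity increases.
kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, kernel)  # 4 x 4 map, each row is [0, 18, 0, 0]
pooled = max_pool(feature_map)           # 2 x 2 map: [[18, 0], [18, 0]]
print(feature_map)
print(pooled)
```

The filter fires only where the edge sits, and pooling halves each dimension while keeping that strong response.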
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are artificial neural networks that make use of sequential information, such as sentences. In other words, RNNs perform the same task for every element of a sequence, with the output depending on the previous computations. RNNs are widely used in language modeling and text generation (machine translation, speech recognition, and many other applications). However, plain RNNs do not retain information over long sequences.
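The recurrence can be sketched in a few lines of plain Python. This is a toy single-unit network, not a production RNN; the `rnn_forward` helper and the weight values are hypothetical. It applies the same weights at every step and feeds the previous hidden state back in, which also shows how quickly the influence of an early input fades.

```python
import math


def rnn_forward(inputs, w_in, w_rec, bias):
    """Run a one-unit recurrent layer over a sequence of numbers."""
    hidden = 0.0
    states = []
    for x in inputs:
        # Same weights at every step; the previous hidden state feeds back in.
        hidden = math.tanh(w_in * x + w_rec * hidden + bias)
        states.append(hidden)
    return states


# Only the first element of the sequence is non-zero.
states_a = rnn_forward([1.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, bias=0.0)
# An all-zero sequence for comparison.
states_b = rnn_forward([0.0, 0.0, 0.0], w_in=1.0, w_rec=0.5, bias=0.0)
print(states_a)
print(states_b)
```

In `states_a` the hidden state shrinks at every step after the first input, while `states_b` stays at zero throughout.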
Long Short Term Memory networks
Long Short Term Memory (LSTM) solves the short memory issue in recurrent neural networks by building a memory block. This block is sometimes called a memory cell.
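The text does not spell out the cell equations, so the following is only a sketch of the commonly taught formulation with forget, input, and output gates, reduced to a single unit. The `lstm_step` helper and all parameter values are hypothetical. With the forget gate held almost fully open and the input gate almost closed, the memory cell keeps its value nearly unchanged over many steps.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM cell; `p` holds hypothetical weights."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # memory cell: keep part of the old, add part of the new
    h = o * math.tanh(c)     # hidden state passed on to the next step
    return h, c


# Hypothetical parameters: all weights zero, biases chosen so that the
# forget gate is nearly 1 and the input gate is nearly 0.
params = {
    "wf": 0.0, "uf": 0.0, "bf": 10.0,
    "wi": 0.0, "ui": 0.0, "bi": -10.0,
    "wo": 0.0, "uo": 0.0, "bo": 0.0,
    "wg": 0.0, "ug": 0.0, "bg": 0.0,
}

h, c = 0.0, 0.8
for _ in range(50):
    h, c = lstm_step(0.0, h, c, params)
print(c)  # still close to the starting value of 0.8
```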
Hopfield networks
Hopfield networks were developed by John Hopfield in 1982. Their main goals are auto-association and optimization. There are two categories of Hopfield network: discrete and continuous.
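Auto-association means that the network can recover a stored pattern from a corrupted copy of it. The sketch below illustrates this for a discrete network, assuming bipolar (+1/-1) units and Hebbian weights; the `train_hopfield` and `recall` helpers and the six-unit pattern are hypothetical.

```python
def train_hopfield(patterns):
    """Hebbian weights for bipolar (+1/-1) patterns, with a zero diagonal."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def recall(weights, state, sweeps=5):
    """Update units one at a time until the state stops changing."""
    state = list(state)
    for _ in range(sweeps):
        changed = False
        for i in range(len(state)):
            total = sum(w * s for w, s in zip(weights[i], state))
            new = 1 if total >= 0 else -1
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            break
    return state


stored = [1, -1, 1, -1, 1, -1]
weights = train_hopfield([stored])
noisy = [1, 1, 1, -1, 1, -1]        # the second unit has been flipped
recovered = recall(weights, noisy)
print(recovered)                    # [1, -1, 1, -1, 1, -1]
```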
Boltzmann machine networks
Boltzmann machine networks use recurrent structures and rely only on locally available information. They were developed by Geoffrey Hinton and Terry Sejnowski in 1985. Like Hopfield networks, Boltzmann machines are used for optimization.
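"Locally available information" can be illustrated with the standard textbook update rule, in which the probability that a unit switches on depends only on the states of the units it is connected to. The sketch below assumes binary (0/1) units and symmetric weights; the helper names and the three-unit network are hypothetical. It also checks that the local input to a unit equals the drop in global energy when that unit turns on, which is why local updates can serve an optimization goal.

```python
import math


def local_field(unit, states, weights, biases):
    """Weighted input a unit receives from the other units."""
    return biases[unit] + sum(
        weights[unit][j] * states[j] for j in range(len(states)) if j != unit
    )


def on_probability(unit, states, weights, biases, temperature=1.0):
    """Probability that `unit` switches on, given only its neighbors' states."""
    field = local_field(unit, states, weights, biases)
    return 1.0 / (1.0 + math.exp(-field / temperature))


def energy(states, weights, biases):
    """Energy of a binary (0/1) configuration; lower is better."""
    n = len(states)
    pair_term = sum(
        weights[i][j] * states[i] * states[j]
        for i in range(n) for j in range(i + 1, n)
    )
    return -pair_term - sum(b * s for b, s in zip(biases, states))


# Hypothetical symmetric weights for a three-unit network.
weights = [
    [0, 2, -1],
    [2, 0, 0],
    [-1, 0, 0],
]
biases = [0, 0, 0]

field = local_field(0, [0, 1, 1], weights, biases)       # 2 - 1 = 1
p_on = on_probability(0, [0, 1, 1], weights, biases)     # sigmoid(1)
gap = energy([0, 1, 1], weights, biases) - energy([1, 1, 1], weights, biases)
print(field, p_on, gap)
```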
Malware detection with CNNs
For this new model, we are going to discover how to build a malware classifier with CNNs. You may be wondering how we can do that when CNNs take images as inputs. The answer is simple: the trick is to convert the malware into an image. Is this possible? Yes, it is. Malware visualization has been an active research topic over the past few years. One of the proposed solutions comes from a research study called Malware Images: Visualization and Automatic Classification by Lakshmanan Nataraj from the Vision Research Lab, University of California, Santa Barbara.
The following diagram details how to convert malware into an image:
The following is an image of the Alueron.gen!J malware:
This technique also gives us the ability to visualize malware sections in a detailed way:
Once malware can be represented as images, information security professionals can feed it to CNN-based classifiers and use the power of CNNs to train models. One of the malware datasets most often used for this purpose is the Malimg dataset. It contains 9,339 malware samples from 25 different malware families. You can download it from Kaggle (a platform for predictive modeling and analytics competitions) by visiting this link: https://www.kaggle.com/afagarap/malimg-dataset/data.
The malware families include:
- Lolyda.AA 1
- Lolyda.AA 2
- Lolyda.AA 3
- Instantaccess
After converting the malware samples into grayscale images, you get the following representations, which you can use later to feed the machine learning model:
The conversion of each malware to a grayscale image can be done using the following Python script:
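    import os
    import array
    import numpy
    from PIL import Image
    filename = ''  # path to the malware binary; the source leaves this blank
    f = open(filename, 'rb')
    ln = os.path.getsize(filename)  # file length in bytes
    width = 256
    rem = ln % width
    a = array.array("B")            # unsigned 8-bit values, one per byte
    a.fromfile(f, ln - rem)         # skip the trailing bytes that do not fill a row
    f.close()
    g = numpy.reshape(a, (len(a) // width, width))
    g = numpy.uint8(g)
    # scipy.misc.imsave is no longer available in recent SciPy releases, so Pillow
    # is used here instead. The output name is an assumption; the source leaves it blank.
    Image.fromarray(g).save(filename + '.png')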
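The core of the conversion is simply cutting the byte stream into fixed-width rows, where each byte becomes one pixel intensity between 0 and 255. The following sketch isolates that step without any file or image library; the `bytes_to_rows` helper and the 10-byte sample are hypothetical stand-ins, not part of the original script.

```python
def bytes_to_rows(data, width=256):
    """Arrange raw bytes into rows of `width` pixels, dropping the remainder."""
    usable = len(data) - (len(data) % width)
    return [list(data[start:start + width]) for start in range(0, usable, width)]


sample = bytes(range(10))              # hypothetical 10-byte stand-in for a binary
rows = bytes_to_rows(sample, width=4)
print(rows)                            # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

The last two bytes are discarded because they do not fill a complete row, which mirrors the `ln - rem` truncation in the script.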
For feature selection, you can extract any image characteristics, such as texture patterns, frequencies in the image, intensity, or color features, using techniques such as Euclidean distance or mean and standard deviation, and later turn them into feature vectors. In our case, we can use algorithms such as a color layout descriptor, a homogeneous texture descriptor, or a global image descriptor (GIST). Let’s suppose that we selected GIST; pyleargist is a great Python library to compute it. To install it, use pip as usual:
# pip install pyleargist==1.0.1
As a use case, to compute a GIST, you can use the following Python script:
    from PIL import Image
    import leargist
    # The image file name is left blank in the source; put the malware image name before .png
    image = Image.open('.png')
    New_im = image.resize((64, 64))
    des = leargist.color_gist(New_im)
    Feature_Vector = des[0:320]
Here, des[0:320] keeps the first 320 values of the descriptor, which is all we need because we are using grayscale images. Don’t forget to save the feature vectors as NumPy arrays so you can use them later to train the model.
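GIST is only one option. The simpler mean and standard deviation statistics mentioned earlier can also be turned into a feature vector. The sketch below is one hypothetical layout, not a method taken from the book: it records the mean and the population standard deviation of each image row, using only the standard library and a made-up 2 x 4 image.

```python
import statistics


def row_statistics(pixels):
    """Return the mean and population standard deviation of each image row."""
    features = []
    for row in pixels:
        features.append(statistics.mean(row))
        features.append(statistics.pstdev(row))
    return features


# Hypothetical 2 x 4 grayscale image: one flat row, one alternating row.
image = [
    [0, 0, 0, 0],
    [0, 255, 0, 255],
]
features = row_statistics(image)
print(features)  # mean and spread of row 0, then of row 1
```

A flat row yields zero spread, while the alternating row has a spread as large as its mean, so even this crude vector separates the two textures.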
After getting the feature vectors, we can train many different models, including SVM, k-means, and artificial neural networks. One particularly useful choice is the CNN.
Once feature selection and engineering are done, we can build a CNN. For our model, for example, we will build a convolutional network with two convolutional layers and 32 * 32 inputs. To build the model using Python libraries, we can implement it with the previously installed TensorFlow and utils libraries.
So the overall CNN architecture will be as in the following diagram:
This CNN architecture is not the only proposal to build the model, but at the moment we are going to use it for the implementation.
To build the model and CNN in general, I highly recommend Keras. The required imports are the following:
    import keras
    from keras.models import Sequential, Model
    # Recent Keras releases expose Input and these layers directly from keras.layers
    from keras.layers import Input, Dense, Dropout, Flatten
    from keras.layers import Conv2D, MaxPooling2D
    from keras.layers import BatchNormalization, LeakyReLU
As we discussed before, a grayscale image has pixel values that range from 0 to 255, and we need to feed the network with images of dimension 32 * 32 * 1:
    train_X = train_X.reshape(-1, 32, 32, 1)
    test_X = test_X.reshape(-1, 32, 32, 1)
We will train our network with these parameters:
    batch_size = 64
    epochs = 20
    num_classes = 25
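The evaluation code later in this post uses a variable named test_Y_one_hot, which suggests that the family labels are one-hot encoded to match the 25-way softmax output. How the book builds those labels is not shown in this excerpt, so the following is only a sketch of the idea; the `one_hot` helper and the sample labels are hypothetical. In Keras, `keras.utils.to_categorical` performs the equivalent conversion on arrays.

```python
def one_hot(label, num_classes=25):
    """Return a list with a single 1 at the position of the class index."""
    vector = [0] * num_classes
    vector[label] = 1
    return vector


labels = [0, 3, 24]  # hypothetical integer indices of malware families
encoded = [one_hot(label) for label in labels]
print(encoded[1])    # 25 entries, with the 1 at position 3
```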
To build the architecture, use the following:
    Malware_Model = Sequential()
    Malware_Model.add(Conv2D(32, kernel_size=(3, 3), activation='linear', input_shape=(32, 32, 1), padding='same'))
    Malware_Model.add(LeakyReLU(alpha=0.1))
    Malware_Model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
    Malware_Model.add(Conv2D(64, (3, 3), activation='linear', padding='same'))
    Malware_Model.add(LeakyReLU(alpha=0.1))
    # Flatten the feature maps so the dense layers receive one vector per image
    Malware_Model.add(Flatten())
    Malware_Model.add(Dense(1024, activation='linear'))
    Malware_Model.add(LeakyReLU(alpha=0.1))
    Malware_Model.add(Dropout(0.4))
    Malware_Model.add(Dense(num_classes, activation='softmax'))
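It can help to trace how the tensor shape and the number of trainable parameters change from layer to layer. The arithmetic below is a sketch that assumes stride-1 'same' convolutions, 2 x 2 'same' pooling, and a flatten step between the last convolution and the first dense layer; the helper functions are hypothetical and do not call Keras. The LeakyReLU and Dropout layers are left out because they add no parameters and do not change the shape.

```python
import math


def conv_same(h, w, c_in, filters, k=3):
    """Output shape and parameter count of a stride-1 'same' convolution."""
    params = k * k * c_in * filters + filters  # weights plus one bias per filter
    return (h, w, filters), params


def pool_same(h, w, c, size=2):
    """Output shape of 'same' max pooling with stride equal to the pool size."""
    return (math.ceil(h / size), math.ceil(w / size), c)


def dense(n_in, n_out):
    """Output width and parameter count of a fully-connected layer."""
    return n_out, n_in * n_out + n_out


shape, p1 = conv_same(32, 32, 1, 32)    # (32, 32, 32), 320 parameters
shape = pool_same(*shape)               # (16, 16, 32)
shape, p2 = conv_same(*shape, 64)       # (16, 16, 64), 18496 parameters
flat = shape[0] * shape[1] * shape[2]   # 16384 values after flattening
units, p3 = dense(flat, 1024)           # 16778240 parameters
units, p4 = dense(units, 25)            # 25625 parameters
total = p1 + p2 + p3 + p4
print(total)                            # 16822681
```

Almost all of the parameters sit in the first dense layer, which is why the dropout layer is placed right after it.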
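To compile the model, use the following:
    # The compile call is missing from this excerpt; the settings below mirror
    # the MNIST example earlier in the post and are an assumption.
    Malware_Model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Fit and train the model: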
Malware_Model.fit(train_X, train_label, batch_size=batch_size,epochs=epochs,verbose=1,validation_data=(valid_X, valid_label))
As you noticed, we are respecting the flow of training a neural network that was discussed in previous chapters. To evaluate the model, use the following code:
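    test_eval = Malware_Model.evaluate(test_X, test_Y_one_hot, verbose=0)
    # evaluate returns the loss followed by the compiled metrics (assumed here to be accuracy)
    print('The accuracy of the Test is:', test_eval[1])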
Thus, in this post, we discovered how to build malware detectors using different machine learning algorithms, and in particular deep learning techniques. If you’ve enjoyed reading this post, do check out Mastering Machine Learning for Penetration Testing to find loopholes and surpass a self-learning security system.