Machine learning is indeed the tech of present times! Security, which is a growing concern for many organizations today and machine learning is one of the solutions to deal with it. ML can help cybersecurity systems analyze patterns and learn from them to help prevent similar attacks and respond to changing behavior.
To know more about machine learning and its application in Cybersecurity, we had a chat with Emmanuel Tsukerman, a Cybersecurity Data Scientist and the author of Machine Learning for Cybersecurity Cookbook. The book also includes modern AI to create powerful cybersecurity solutions for malware, pentesting, social engineering, data privacy, and intrusion detection. In 2017, Tsukerman’s anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In his interview, Emmanuel talked about how ML algorithms help in solving problems related to cybersecurity, and also gave a brief tour through a few chapters of his book. He also touched upon the rise of deepfakes and malware classifiers.
On using machine learning for cybersecurity
Using Machine learning in Cybersecurity scenarios will enable systems to identify different types of attacks across security layers and also help to take a correct POA. Can you share some examples of the successful use of ML for cybersecurity you have seen recently?
A recent and interesting development in cybersecurity is that the bad guys have started to catch up with technology; in particular, they have started utilizing Deepfake tech to commit crime; for example,they have used AI to imitate the voice of a CEO in order to defraud a company of $243,000. On the other hand, the use of ML in malware classifiers is rapidly becoming an industry standard, due to the incredible number of never-before-seen samples (over 15,000,000) that are generated each year.
On staying updated with developments in technology to defend against attacks
Machine learning technology is not only used by ethical humans, but also by Cybercriminals who use ML for ML-based intrusions. How can organizations counter such scenarios and ensure the safety of confidential organizational/personal data?
The main tools that organizations have at their disposal to defend against attacks are to stay current and to pentest. Staying current, of course, requires getting educated on the latest developments in technology and its applications. For example, it’s important to know that hackers can now use AI-based voice imitation to impersonate anyone they would like. This knowledge should be propagated in the organization so that individuals aren’t caught off-guard.
The other way to improve one’s security is by performing regular pen tests using the latest attack methodology; be it by attempting to avoid the organization’s antivirus, sending phishing communications, or attempting to infiltrate the network. In all cases, it is important to utilize the most dangerous techniques, which are often ML-based
On how ML algorithms and GANs help in solving cybersecurity problems
In your book, you have mentioned various algorithms such as clustering, gradient boosting, random forests, and XGBoost. How do these algorithms help in solving problems related to cybersecurity?
Unless a machine learning model is limited in some way (e.g., in computation, in time or in training data), there are 5 types of algorithms that have historically performed best: neural networks, tree-based methods, clustering, anomaly detection and reinforcement learning (RL). These are not necessarily disjoint, as one can, for example, perform anomaly detection via neural networks. Nonetheless, to keep it simple, let’s stick to these 5 classes.
Neural networks shine with large amounts of data on visual, auditory or textual problems. For that reason, they are used in Deepfakes and their detection, lie detection and speech recognition. Many other applications exist as well. But one of the most interesting applications of neural networks (and deep learning) is in creating data via Generative adversarial networks (GANs). GANs can be used to generate password guesses and evasive malware. For more details, I’ll refer you to the Machine Learning for Cybersecurity Cookbook.
The next class of models that perform well are tree-based. These include Random Forests and gradient boosting trees. These perform well on structured data with many features. For example, the PE header of PE files (including malware) can be featurized, yielding ~70 numerical features. It is convenient and effective to construct an XGBoost model (a gradient-boosting model) or a Random Forest model on this data, and the odds are good that performance will be unbeatable by other algorithms.
Next there is clustering. Clustering shines when you would like to segment a population automatically. For example, you might have a large collection of malware samples, and you would like to classify them into families. Clustering is a natural choice for this problem.
Anomaly detection lets you fight off unseen and unknown threats. For instance, when a hacker utilizes a new tactic to intrude on your network, an anomaly detection algorithm can protect you even if this new tactic has not been documented.
Finally, RL algorithms perform well on dynamic problems. The situation can be, for example, a penetration test on a network. The DeepExploit framework, covered in the book, utilizes an RL agent on top of metasploit to learn from prior pen tests and becomes better and better at finding vulnerabilities.
Generative Adversarial Networks (GANs) are a popular branch of ML used to train systems against counterfeit data. How can these help in malware detection and safeguarding systems to identify correct intrusion?
A good way to think about GANs is as a pair of neural networks, pitted against each other. The loss of one is the objective of the other. As the two networks are trained, each becomes better and better at its job. We can then take whichever side of the “tug of war” battle, separate it from its rival, and use it. In other cases, we might choose to “freeze” one of the networks, meaning that we do not train it, but only use it for scoring. In the case of malware, the book covers how to use MalGAN, which is a GAN for malware evasion. One network, the detector, is frozen. In this case, it is an implementation of MalConv. The other network, the adversarial network, is being trained to modify malware until the detection score of MalConv drops to zero. As it trains, it becomes better and better at this.
In a practical situation, we would want to unfreeze both networks. Then we can take the trained detector, and use it as part of our anti-malware solution. We would then be confident knowing that it is very good at detecting evasive malware. The same ideas can be applied in a range of cybersecurity contexts, such as intrusion and deepfakes.
On how Machine Learning for Cybersecurity Cookbook can help with easy implementation of ML for Cybersecurity problems
What are some of the tools/ recipes mentioned in your book that can help cybersecurity professionals to easily implement machine learning and make it a part of their day-to-day activities?
The Machine Learning for Cybersecurity Cookbook offers an astounding 80+ recipes. Themost applicable recipes will vary between individual professionals, and even for each individual different recipes will be applicable at different times in their careers. For a cybersecurity professional beginning to work with malware, the fundamentals chapter, chapter 2:ML-based Malware Detection, provides a solid and excellent start to creating a malware classifier. For more advanced malware analysts, Chapter 3:Advanced Malware Detection will offer more sophisticated and specialized techniques, such as dealing with obfuscation and script malware.
Every cybersecurity professional would benefit from getting a firm grasp of chapter 4, “ML for Social Engineering”. In fact, anyone at all should have an understanding of how ML can be used to trick unsuspecting users, as part of their cybersecurity education. This chapter really shows that you have to be cautious because machines are becoming better at imitating humans. On the other hand, ML also provides the tools to know when such an attack is being performed.
Chapter 5, “Penetration Testing Using ML” is a technical chapter, and is most appropriate to cybersecurity professionals that are concerned with pen testing. It covers 10 ways in which pen testing can be improved by using ML, including neural network-assisted fuzzing and DeepExploit, a framework that utilizes a reinforcement learning (RL) agent on top of metasploit to perform automatic pen testing.
Chapter 6, “Automatic Intrusion Detection” has a wider appeal, as a lot of cybersecurity professionals have to know how to defend a network from intruders. They would benefit from seeing how to leverage ML to stop zero-day attacks on their network. In addition, the chapter covers many other use cases, such as spam filtering, Botnet detection and Insider Threat detection, which are more useful to some than to others.
Chapter 7, “Securing and Attacking Data with ML” provides great content to cybersecurity professionals interested in utilizing ML for improving their password security and other forms of data security.
Chapter 8, “Secure and Private AI”, is invaluable to data scientists in the field of cybersecurity. Recipes in this chapter include Federated Learning and differential privacy (which allow to train an ML model on clients’ data without compromising their privacy) and testing adversarial robustness (which allows to improve the robustness of ML models to adversarial attacks).
Your book talks about using machine learning to generate custom malware to pentest security. Can you elaborate on how this works and why this matters?
As a general rule, you want to find out your vulnerabilities before someone else does (who might be up to no-good). For that reason, pen testing has always been an important step in providing security. To pen test your Antivirus well, it is important to use the latest techniques in malware evasion, as the bad guys will certainly try them, and these are deep learning-based techniques for modifying malware.
On Emmanuel’s personal achievements in the Cybersecurity domain
Dr. Tsukerman, in 2017, your anti-ransomware product was listed in the ‘Top 10 ransomware products of 2018’ by PC Magazine. In your experience, why are ransomware attacks on the rise and what makes an effective anti-ransomware product? Also, in 2018, you designed an ML-based, instant-verdict malware detection system for Palo Alto Networks’ WildFire service of over 30,000 customers. Can you tell us more about this project?
If you monitor cybersecurity news, you would see that ransomware continues to be a huge threat. The reason is that ransomware offers cybercriminals an extremely attractive weapon. First, it is very difficult to trace the culprit from the malware or from the crypto wallet address. Second, the payoffs can be massive, be it from hitting the right target (e.g., a HIPAA compliant healthcare organization) or a large number of targets (e.g., all traffic to an e-commerce web page). Thirdly, ransomware is offered as a service, which effectively democratizes it!
On the flip side, a lot of the risk of ransomware can be mitigated through common sense tactics. First, backing up one’s data. Second, having an anti-ransomware solution that provides guarantees. A generic antivirus can provide no guarantee – it either catches the ransomware or it doesn’t. If it doesn’t, your data is toast. However, certain anti-ransomware solutions, such as the one I have developed, do offer guarantees (e.g., no more than 0.1% of your files lost). Finally, since millions of new ransomware samples are developed each year, the malware solution must include a machine learning component, to catch the zero-day samples, which is another component of the anti-ransomware solution I developed.
The project at Palo Alto Networks is a similar implementation of ML for malware detection. The one difference is that unlike the anti-ransomware service, which is an endpoint security tool, it offers protection services from the cloud. Since Palo Alto Networks is a firewall-service provider, that makes a lot of sense, since ideally, the malicious sample will be stopped at the firewall, and never even reach the endpoint.
To learn how to implement the techniques discussed in this interview, grab your copy of the Machine Learning for Cybersecurity Cookbook Don’t wait – the bad guys aren’t waiting.
Emmanuel Tsukerman graduated from Stanford University and obtained his Ph.D. from UC Berkeley. In 2017, Dr. Tsukerman’s anti-ransomware product was listed in the Top 10 ransomware products of 2018 by PC Magazine. In 2018, he designed an ML-based, instant-verdict malware detection system for Palo Alto Networks’ WildFire service of over 30,000 customers. In 2019, Dr. Tsukerman launched the first cybersecurity data science course.
About the book
Machine Learning for Cybersecurity Cookbook will guide you through constructing classifiers and features for malware, which you’ll train and test on real samples. You will also learn to build self-learning, reliant systems to handle cybersecurity tasks such as identifying malicious URLs, spam email detection, intrusion detection, network protection, and tracking user and process behavior, and much more!