Machine Learning introduces a huge potential to reduce costs and generate new revenue in an enterprise. Application of machine learning effectively helps in solving practical problems smartly within an organization.
Machine learning automates tasks that would otherwise need to be performed by a live agent. It has made drastic improvements in the past few years, but many a time, a machine needs the assistance of a human to complete its task. This is why it is necessary for organizations to learn best practices in machine learning which you will learn in this article today.
This article is an excerpt from a book written by Chiheb Chebbi titled Mastering Machine Learning for Penetration Testing
Feature engineering in machine learning
Feature engineering and feature selection are essential to every modern data science product, especially machine learning based projects. According to research, over 50% of the time spent building the model is occupied by cleaning, processing, and selecting the data required to train the model. It is your responsibility to design, represent, and select the features.
Most machine learning algorithms cannot work on raw data. They are not smart enough to do so. Thus, feature engineering is needed, to transform data in its raw status into data that can be understood and consumed by algorithms. Professor Andrew Ng once said:
“Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.”
Feature engineering is a process in the data preparation phase, according to the cross-industry standard process for data mining:
The term Feature Engineering itself is not a formally defined term. It groups together all of the tasks for designing features to build intelligent systems. It plays an important role in the system. If you check data science competitions, I bet you have noticed that the competitors all use the same algorithms, but the winners perform the best feature engineering. If you want to enhance your data science and machine learning skills, I highly recommend that you visit and compete at www.kaggle.com:
When searching for machine learning resources, you will face many different terminologies. To avoid any confusion, we need to distinguish between feature selection and feature engineering. Feature engineering transforms raw data into suitable features, while feature selection extracts necessary features from the engineered data. Featuring engineering is selecting the subset of all features, without including redundant or irrelevant features.
Machine learning best practices
Feature engineering enhances the performance of our machine learning system. We discuss some tips and best practices to build robust intelligent systems. Let’s explore some of the best practices in the different aspects of machine learning projects.
Information security datasets
Data is a vital part of every machine learning model. To train models, we need to feed them datasets. While reading the earlier chapters, you will have noticed that to build an accurate and efficient machine learning model, you need a huge volume of data, even after cleaning data. Big companies with great amounts of available data use their internal datasets to build models, but small organizations, like startups, often struggle to acquire such a volume of data. International rules and regulations are making the mission harder because data privacy is an important aspect of information security.
Every modern business must protect its users’ data. To solve this problem, many institutions and organizations are delivering publicly available datasets, so that others can download them and build their models for educational or commercial use. Some information security datasets are as follows:
- The Controller Area Network (CAN) dataset for intrusion detection (OTIDS): http://ocslab.hksecurity.net/Dataset/CAN-intrusion-dataset
- The car-hacking dataset for intrusion detection: http://ocslab.hksecurity.net/Datasets/CAN-intrusion-dataset
- The web-hacking dataset for cyber criminal profiling: http://ocslab.hksecurity.net/Datasets/web-hacking-profiling
- The API-based malware detection system (APIMDS) dataset: http://ocslab.hksecurity.net/apimds-dataset
- The intrusion detection evaluation dataset (CICIDS2017): http://www.unb.ca/cic/datasets/ids-2017.html
- The Tor-nonTor dataset: http://www.unb.ca/cic/datasets/tor.html
- The Android adware and general malware dataset: http://www.unb.ca/cic/datasets/android-adware.html
Use Project Jupyter
The Jupyter Notebook is an open source web application used to create and share coding documents. I highly recommend it, especially for novice data scientists, for many reasons. It will give you the ability to code and visualize output directly. It is great for discovering and playing with data; exploring data is an important step to building machine learning models.
Jupyter’s official website is http://jupyter.org/:
To install it using pip, simply type the following:
python -m pip install --upgrade pip python -m pip install jupyter
Speed up training with GPUs
As you know, even with good feature engineering, training in machine learning is computationally expensive. The quickest way to train learning algorithms is to use graphics processing units (GPUs). Generally, though not in all cases, using GPUs is a wise decision for training models. In order to overcome CPU performance bottlenecks, the gather/scatter GPU architecture is best, performing parallel operations to speed up computing.
TensorFlow supports the use of GPUs to train machine learning models. Hence, the devices are represented as strings; following is an example:
"/device:GPU:0" : Your device GPU "/device:GPU:1" : 2nd GPU device on your Machine
To use a GPU device in TensorFlow, you can add the following line:
You can use a single GPU or multiple GPUs. Don’t forget to install the CUDA toolkit, by using the following commands:
Wget "http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.44-1_amd64.deb" sudo dpkg -i cuda-repo-ubuntu1604_8.0.44-1_amd64.deb sudo apt-get update sudo apt-get install cuda
Install cuDNN as follows:
sudo tar -xvf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local export PATH=/usr/local/cuda/bin:$PATH export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64" export CUDA_HOME=/usr/local/cuda
Selecting models and learning curves
To improve the performance of machine learning models, there are many hyper parameters to adjust. The more data that is used, the more errors that can happen. To work on these parameters, there is a method called GridSearchCV. It performs searches on predefined parameter values, through iterations. GridSearchCV uses the score() function, by default. To use it in scikit-learn, import it by using this line:
from sklearn.grid_search import GridSearchCV
Learning curves are used to understand the performance of a machine learning model. To use a learning curve in scikit-learn, import it to your Python project, as follows:
from sklearn.learning_curve import learning_curve
Machine learning architecture
In the real world, data scientists do not find data to be as clean as the publicly available datasets. Real world data is stored by different means, and the data itself is shaped in different categories. Thus, machine learning practitioners need to build their own systems and pipelines to achieve their goals and train the models. A typical machine learning project respects the following architecture:
Good coding skills are very important to data science and machine learning. In addition to using effective linear algebra, statistics, and mathematics, data scientists should learn how to code properly. As a data scientist, you can choose from many programming languages, like Python, R, Java, and so on.
Respecting coding’s best practices is very helpful and highly recommended. Writing elegant, clean, and understandable code can be done through these tips:
- Comments are very important to understandable code. So, don’t forget to comment your code, all of the time.
- Choose the right names for variables, functions, methods, packages, and modules.
- Use four spaces per indentation level.
- Structure your repository properly.
- Follow common style guidelines.
If you use Python, you can follow this great aphorism, called the The Zen of Python, written by the legend, Tim Peters:
“Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one– and preferably only one –obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea — let’s do more of those!”
Good data handling leads to successfully building machine learning projects. After loading a dataset, please make sure that all of the data has loaded properly, and that the reading process is performing correctly. After performing any operation on the dataset, check over the resulting dataset.
An intelligent system is highly connected to business aspects because, after all, you are using data science and machine learning to solve a business issue or to build a commercial product, or for getting useful insights from the data that is acquired, to make good decisions. Identifying the right problems and asking the right questions are important when building your machine learning model, in order to solve business issues.
In this tutorial, we had a look at somes tips and best practices to build intelligent systems using Machine Learning.
To become a master at penetration testing using machine learning with Python, check out this book Mastering Machine Learning for Penetration Testing