Recently Andreas Mueller gave a talk on changes in scikit-learn 0.20 and future releases. He is an Associate Research Scientist at the Data Science Institute, Columbia University, New York, and a core developer of the scikit-learn library. scikit-learn is a popular machine learning library for the Python programming language. In scikit-learn, data is represented as a 2D NumPy array in which each row is a sample and each column is a feature of your dataset.
scikit-learn 0.20 has now been released.
Here are the highlights from the talk about the future of scikit-learn.
scikit-learn 0.20 aims to simplify things for users, especially preprocessing.
The OneHotEncoder is rewritten to support strings
Previously, the OneHotEncoder in scikit-learn only supported integers, so categorical variables stored as strings had to be converted to integers before they could be one-hot encoded. It now accepts string categories directly.
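As a minimal sketch of what this looks like in practice, the snippet below one-hot encodes a single string column. The color values are made up for illustration, and handle_unknown="ignore" is an optional setting chosen here, not something the talk prescribes.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column stored as strings.
X = np.array([["red"], ["green"], ["blue"], ["green"]])

# Categories are learned from the data and sorted alphabetically.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(X).toarray()  # default output is sparse

categories = encoder.categories_[0].tolist()
print(categories)
print(encoded)
```

Each output column corresponds to one entry of encoder.categories_, so the mapping from strings to columns can always be recovered after fitting.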
Another feature to help preprocessing is the ColumnTransformer. It is similar to the FeatureUnion that already existed. The ColumnTransformer allows developers to apply different transformations or preprocessing steps to different columns of a columnar dataset. make_column_transformer is a shorthand for building one, and the columns can be referred to by name when the input is a pandas DataFrame.
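The following is a small sketch of the idea, assuming a pandas DataFrame as input. The column names and values are invented for illustration, and the choice of StandardScaler for the numeric columns and OneHotEncoder for the string column is just one plausible pairing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical dataset with two numeric columns and one string column.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40.0, 52.0, 88.0, 61.0],
    "city": ["NYC", "Paris", "NYC", "Tokyo"],
})

# Each entry is (name, transformer, columns); columns are selected by name.
preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age", "income"]),
        ("cat", OneHotEncoder(), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array in this sketch
)

Xt = preprocess.fit_transform(df)
print(Xt.shape)  # 2 scaled columns + 3 one-hot columns
```

The explicit ColumnTransformer constructor is used here rather than make_column_transformer, so that every step has a visible name.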
The PowerTransformer is also new. Its basic idea is to apply a power transformation, raising the data to some power, with the goal of making the distribution of the data closer to a normal (Gaussian) distribution.
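To illustrate, here is a hedged sketch that applies PowerTransformer to synthetic, strongly right-skewed data. The log-normal sample and the small skewness helper are assumptions made for this example; the default method is used, which also standardizes the output.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer


def skewness(a):
    """Sample skewness: third standardized moment (illustrative helper)."""
    a = np.asarray(a, dtype=float).ravel()
    return float(np.mean(((a - a.mean()) / a.std()) ** 3))


# Synthetic right-skewed feature (assumption: log-normal data).
rng = np.random.RandomState(0)
X = rng.lognormal(size=(500, 1))

# The exponent is estimated from the data during fit.
pt = PowerTransformer()
Xt = pt.fit_transform(X)

skew_before = skewness(X)
skew_after = skewness(Xt)
print(skew_before, skew_after)
```

Comparing the skewness before and after is a quick way to check whether the transformation made the feature more symmetric.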
Treatment of missing values
Scalers such as StandardScaler, MinMaxScaler and RobustScaler now allow missing values in the data, so you can apply the scikit-learn scalers before filling in or imputing missing values. During fitting, they all ignore the missing values. Imputer has been replaced by SimpleImputer, a simplified version; more complex model-based imputation strategies are planned as additions. MissingIndicator is also added, which lets you record which values were imputed. Often the fact that a value is missing tells you something about the data point.
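The sketch below strings these three pieces together on a tiny made-up array: scale first, record where values were missing, then impute. The data and the mean strategy are illustrative choices only.

```python
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data with one missing value per column.
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
    [np.nan, 40.0],
])

# NaNs are ignored when computing mean/std and passed through unchanged.
scaled = StandardScaler().fit_transform(X)

# Boolean mask marking which entries were missing.
mask = MissingIndicator(features="all").fit_transform(X)

# Fill the remaining NaNs with the column mean of the scaled data.
imputed = SimpleImputer(strategy="mean").fit_transform(scaled)

print(mask)
print(imputed)
```

Because the scaled columns have mean zero, mean imputation after scaling simply places missing entries at zero, and the mask keeps the information that they were originally absent.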
TransformedTargetRegressor is another addition. With it, you can transform the target before fitting the model and invert the transformation after prediction. In the example shown in the talk, this gave a much lower absolute error than not using a target transformation, and the predictions no longer showed a systematic skew.
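Assuming the passage above refers to sklearn.compose.TransformedTargetRegressor (added in 0.20), a rough sketch of target transformation looks like this. The synthetic exponential target and the log/exp pair are assumptions chosen for the illustration and are not the example from the talk.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data: the target grows exponentially with the feature.
rng = np.random.RandomState(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.exp(1.5 * X[:, 0] + rng.normal(scale=0.1, size=200))

# Baseline: fit directly on the raw target.
plain = LinearRegression().fit(X, y)

# Fit on log(y); predictions are mapped back with exp automatically.
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
).fit(X, y)

mae_plain = float(np.mean(np.abs(plain.predict(X) - y)))
mae_wrapped = float(np.mean(np.abs(wrapped.predict(X) - y)))
print(mae_plain, mae_wrapped)
```

On data like this, where the relationship is linear only after taking logs, the wrapped model should show the lower mean absolute error.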
OpenML dataset loader
The new fetch_openml loader replaces the mldata loader, since mldata.org was no longer maintained. OpenML lets you upload data and create tasks on a dataset; you can also upload the results you obtained on a problem.
Loky – a robust and reusable executor
joblib is upgraded and now includes a new tool called Loky. This is an alternative to multiprocessing.pool.Pool and concurrent.futures.ProcessPoolExecutor. The replacement was necessary because the old tools were not very robust. Loky has a deadlock-free implementation and consistent spawn behavior. It also fixes the random crashes that previously happened with BLAS/OpenMP libraries.
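As a small sketch of how this surfaces for users, joblib's Parallel can be asked to use the loky backend by name. The parallel_sqrt helper and its inputs are hypothetical; math.sqrt merely stands in for real work.

```python
from math import sqrt

from joblib import Parallel, delayed


def parallel_sqrt(values, n_jobs=2):
    """Compute square roots in worker processes managed by loky."""
    return Parallel(n_jobs=n_jobs, backend="loky")(
        delayed(sqrt)(v) for v in values
    )


if __name__ == "__main__":
    print(parallel_sqrt([0, 1, 4, 9]))
```

Results come back in the same order as the inputs, regardless of which worker finished first.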
Global config for scikit-learn
A global configuration now exists for scikit-learn. You can use sklearn.set_config to set the global state, or sklearn.config_context to change it temporarily inside a context manager. It supports two options, one to increase speed and one to reduce memory consumption.
- Setting assume_finite to True skips the check that the input contains only finite values, which saves time on large datasets.
- Setting working_memory limits RAM usage. Currently this only applies to distance computation and nearest neighbor computation.
The function signature is: set_config(assume_finite=None, working_memory=None)
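The two entry points can be sketched as follows. The value 512 is an arbitrary illustration (working_memory is expressed in MiB), and the original setting is restored at the end so the sketch leaves no global side effects.

```python
from sklearn import config_context, get_config, set_config

default_assume_finite = get_config()["assume_finite"]
original_working_memory = get_config()["working_memory"]

# Temporary change: finiteness checks are skipped only inside the block.
with config_context(assume_finite=True):
    inside = get_config()["assume_finite"]
after = get_config()["assume_finite"]

# Global change: cap temporary arrays at roughly 512 MiB.
set_config(working_memory=512)
working_memory = get_config()["working_memory"]

# Restore the previous global setting.
set_config(working_memory=original_working_memory)

print(inside, after, working_memory)
```

The context manager is the safer choice when only one section of code is known to receive clean, finite input.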
Early stopping for gradient boosting
You can stop building the model early based on a tolerance and a number of iterations that you set. For example, the model will stop if there was no improvement beyond 0.01% over the last five iterations. There is something similar for stochastic gradient descent too.
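A hedged sketch of early stopping with GradientBoostingClassifier follows. The synthetic dataset and the specific parameter values are assumptions for illustration; note that tol is an absolute threshold on the improvement of the validation score, so 1e-4 only loosely mirrors the figure quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic classification problem (illustrative only).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

gbc = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of boosting stages
    n_iter_no_change=5,       # stop after 5 stages without improvement
    tol=1e-4,                 # minimum improvement that counts
    validation_fraction=0.1,  # held-out share used to monitor the score
    random_state=0,
)
gbc.fit(X, y)

# Number of stages actually fitted.
n_stages = gbc.n_estimators_
print(n_stages)
```

Setting n_estimators generously and letting early stopping pick the final count avoids tuning that number by hand.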
A glossary is added which explains all the terms used in scikit-learn, to make the library easier to use and more welcoming for new users.
There are also better default parameters, since it was found that most people use algorithms with their default parameters. The following changes will be made in future scikit-learn releases; until then you will receive warnings.
- For random forests the number of estimators will change from 10 to 100 (in version 0.22)
- Cross validation will be 5 fold instead of 3 (in version 0.22)
- In grid search iid will be set to False (in version 0.22) and iid will be removed (in version 0.24)
For LogisticRegression defaults, the following changes will happen in scikit-learn 0.22:
- solver='lbfgs' instead of 'liblinear'
- multi_class='auto' instead of 'ovr'
You can avoid warnings in your code by setting the parameters yourself explicitly.
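Setting the parameters explicitly could look like the sketch below. The dataset is synthetic, max_iter=1000 is an extra assumption to help lbfgs converge, and the iid and multi_class parameters are left out so the snippet does not depend on options that are scheduled for removal or may differ between versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# State the future defaults explicitly instead of relying on them.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
logreg = LogisticRegression(solver="lbfgs", max_iter=1000)

# cv=5 makes the number of folds explicit as well.
forest_scores = cross_val_score(forest, X, y, cv=5)
logreg_scores = cross_val_score(logreg, X, y, cv=5)

print(forest_scores.mean(), logreg_scores.mean())
```

Pinning these values also keeps results comparable when the library is upgraded and the defaults change.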
Python 2.7 and 3.4 support will be dropped in scikit-learn 0.21.
If you want to see examples of using the new features and some other useful tips by Dr. Mueller, watch the talk on YouTube.