14 min read

There are over 70 Java-based open source machine learning projects listed on the MLOSS.org website and probably many more unlisted projects live at university servers, GitHub, or Bitbucket. In this article, we will review the major machine learning libraries and platforms in Java, the kind of problems they can solve, the algorithms they support, and the kind of data they can work with.

This article is an excerpt taken from Machine learning in Java, written by Bostjan Kaluza and published by Packt Publishing Ltd.

Weka

Weka, which is short for Waikato Environment for Knowledge Analysis, is a machine learning library developed at the University of Waikato, New Zealand, and is probably the most well-known Java library. It is a general-purpose library that is able to solve a wide variety of machine learning tasks, such as classification, regression, and clustering. It features a rich graphical user interface, command-line interface, and Java API. You can check out Weka at http://www.cs.waikato.ac.nz/ml/weka/.

At the time of writing this book, Weka contains 267 algorithms in total: data pre-processing (82), attribute selection (33), classification and regression (133), clustering (12), and association rules mining (7). Graphical interfaces are well-suited for exploring your data, while Java API allows you to develop new machine learning schemes and use the algorithms in your applications.

Weka


Weka is distributed under GNU General Public License (GNU GPL), which means that you can copy, distribute, and modify it as long as you track changes in source files and keep it under GNU GPL. You can even distribute it commercially, but you must disclose the source code or obtain a commercial license.

In addition to several supported file formats, Weka features its own default data format, ARFF, to describe data by attribute-data pairs. It consists of two parts. The first part contains header, which specifies all the attributes (that is, features) and their type; for instance, nominal, numeric, date, and string. The second part contains data, where each line corresponds to an instance. The last attribute in the header is implicitly considered as the target variable, missing data are marked with a question mark. For example, the Bob instance written in an ARFF file format would be as follows:

@RELATION person_dataset

@ATTRIBUTE `Name`  STRING
@ATTRIBUTE `Height`  NUMERIC
@ATTRIBUTE `Eye color`{blue, brown, green}
@ATTRIBUTE `Hobbies`  STRING

@DATA
'Bob', 185.0, blue, 'climbing, sky diving'
'Anna', 163.0, brown, 'reading'
'Jane', 168.0, ?, ?

The file consists of three sections. The first section starts with the @RELATION keyword, specifying the dataset name. The next section starts with the @ATTRIBUTE keyword, followed by the attribute name and type. The available types are STRING, NUMERIC, DATE, and a set of categorical values. The last attribute is implicitly assumed to be the target variable that we want to predict. The last section starts with the @DATA keyword, followed by one instance per line. Instance values are separated by comma and must follow the same order as attributes in the second section.

Weka’s Java API is organized in the following top-level packages:

  • weka.associations: These are data structures and algorithms for association rules learning, including Apriori, predictive apriori, FilteredAssociator, FP-Growth, Generalized Sequential Patterns (GSP), Hotspot, and Tertius.
  • weka.classifiers: These are supervised learning algorithms, evaluators, and data structures. Thepackage is further split into the following components:
    • weka.classifiers.bayes: This implements Bayesian methods, including naive Bayes, Bayes net, Bayesian logistic regression, and so on
    • weka.classifiers.evaluation: These are supervised evaluation algorithms for nominal and numerical prediction, such as evaluation statistics, confusion matrix, ROC curve, and so on
    • weka.classifiers.functions: These are regression algorithms, including linear regression, isotonic regression, Gaussian processes, support vector machine, multilayer perceptron, voted perceptron, and others
    • weka.classifiers.lazy: These are instance-based algorithms such as k-nearest neighbors, K*, and lazy Bayesian rules
    • weka.classifiers.meta: These are supervised learning meta-algorithms, including AdaBoost, bagging, additive regression, random committee, and so on
    • weka.classifiers.mi: These are multiple-instance learning algorithms, such as citation k-nn, diverse density, MI AdaBoost, and others
    • weka.classifiers.rules: These are decision tables and decision rules based on the separate-and-conquer approach, Ripper, Part, Prism, and so on
    • weka.classifiers.trees: These are various decision trees algorithms, including ID3, C4.5, M5, functional tree, logistic tree, random forest, and so on
  • weka.clusterers: These are clustering algorithms, including k-means, Clope, Cobweb, DBSCAN hierarchical clustering, and farthest.
  • weka.core: These are various utility classes, data presentations, configuration files, and so on.
  • weka.datagenerators: These are data generators for classification, regression, and clustering algorithms.
  • weka.estimators: These are various data distribution estimators for discrete/nominal domains, conditional probability estimations, and so on.
  • weka.experiment: These are a set of classes supporting necessary configuration, datasets, model setups, and statistics to run experiments.
  • weka.filters: These are attribute-based and instance-based selection algorithms for both supervised and unsupervised data preprocessing.
  • weka.gui: These are graphical interface implementing explorer, experimenter, and knowledge flowapplications. Explorer allows you to investigate dataset, algorithms, as well as their parameters, and visualize dataset with scatter plots and other visualizations. Experimenter is used to design batches of experiment, but it can only be used for classification and regression problems. Knowledge flows implements a visual drag-and-drop user interface to build data flows, for example, load data, apply filter, build classifier, and evaluate.

Java-ML for machine learning

Java machine learning library, or Java-ML, is a collection of machine learning algorithms with a common interface for algorithms of the same type. It only features Java API, therefore, it is primarily aimed at software engineers and programmers. Java-ML contains algorithms for data preprocessing, feature selection, classification, and clustering. In addition, it features several Weka bridges to access Weka’s algorithms directly through the Java-ML API. It can be downloaded from http://java-ml.sourceforge.net; where, the latest release was in 2012 (at the time of writing this book).

Java machine learning

Java-ML is also a general-purpose machine learning library. Compared to Weka, it offers more consistent interfaces and implementations of recent algorithms that are not present in other packages, such as an extensive set of state-of-the-art similarity measures and feature-selection techniques, for example, dynamic time warping, random forest attribute evaluation, and so on. Java-ML is also available under the GNU GPL license.

Java-ML supports any type of file as long as it contains one data sample per line and the features are separated by a symbol such as comma, semi-colon, and tab.

The library is organized around the following top-level packages:

  • net.sf.javaml.classification: These are classification algorithms, including naive Bayes, random forests, bagging, self-organizing maps, k-nearest neighbors, and so on
  • net.sf.javaml.clustering: These are clustering algorithms such as k-means, self-organizing maps, spatial clustering, Cobweb, AQBC, and others
  • net.sf.javaml.core: These are classes representing instances and datasets
  • net.sf.javaml.distance: These are algorithms that measure instance distance and similarity, for example, Chebyshev distance, cosine distance/similarity, Euclidian distance, Jaccard distance/similarity, Mahalanobis distance, Manhattan distance, Minkowski distance, Pearson correlation coefficient, Spearman’s footrule distance, dynamic time wrapping (DTW), and so on
  • net.sf.javaml.featureselection: These are algorithms for feature evaluation, scoring, selection, and ranking, for instance, gain ratio, ReliefF, Kullback-Liebler divergence, symmetrical uncertainty, and so on
  • net.sf.javaml.filter: These are methods for manipulating instances by filtering, removing attributes, setting classes or attribute values, and so on
  • net.sf.javaml.matrix: This implements in-memory or file-based array
  • net.sf.javaml.sampling: This implements sampling algorithms to select a subset of dataset
  • net.sf.javaml.tools: These are utility methods on dataset, instance manipulation, serialization, Weka API interface, and so on
  • net.sf.javaml.utils: These are utility methods for algorithms, for example, statistics, math methods, contingency tables, and others

Apache Mahout

The Apache Mahout project aims to build a scalable machine learning library. It is built atop scalable, distributed architectures, such as Hadoop, using the MapReduce paradigm, which is an approach for processing and generating large datasets with a parallel, distributed algorithm using a cluster of servers.

Apache Mahout

Mahout features console interface and Java API to scalable algorithms for clustering, classification, and collaborative filtering. It is able to solve three business problems: item recommendation, for example, recommending items such as people who liked this movie also liked…; clustering, for example, of text documents into groups of topically-related documents; and classification, for example, learning which topic to assign to an unlabeled document.

Mahout is distributed under a commercially-friendly Apache License, which means that you can use it as long as you keep the Apache license included and display it in your program’s copyright notice.

Mahout features the following libraries:

  • org.apache.mahout.cf.taste: These are collaborative filtering algorithms based on user-based and item-based collaborative filtering and matrix factorization with ALS
  • org.apache.mahout.classifier: These are in-memory and distributed implementations, includinglogistic regression, naive Bayes, random forest, hidden Markov models (HMM), and multilayer perceptron
  • org.apache.mahout.clustering: These are clustering algorithms such as canopy clustering, k-means, fuzzy k-means, streaming k-means, and spectral clustering
  • org.apache.mahout.common: These are utility methods for algorithms, including distances, MapReduce operations, iterators, and so on
  • org.apache.mahout.driver: This implements a general-purpose driver to run main methods of other classes
  • org.apache.mahout.ep: This is the evolutionary optimization using the recorded-step mutation
  • org.apache.mahout.math: These are various math utility methods and implementations in Hadoop
  • org.apache.mahout.vectorizer: These are classes for data presentation, manipulation, andMapReduce jobs

Apache Spark

Apache Spark, or simply Spark, is a platform for large-scale data processing builds atop Hadoop, but, in contrast to Mahout, it is not tied to the MapReduce paradigm. Instead, it uses in-memory caches to extract a working set of data, process it, and repeat the query. This is reported to be up to ten times as fast as a Mahout implementation that works directly with disk-stored data. It can be grabbed from https://spark.apache.org.

Apache Spark

There are many modules built atop Spark, for instance, GraphX for graph processing, Spark Streaming for processing real-time data streams, and MLlib for machine learning library featuring classification, regression, collaborative filtering, clustering, dimensionality reduction, and optimization.

Spark’s MLlib can use a Hadoop-based data source, for example, Hadoop Distributed File System (HDFS) or HBase, as well as local files. The supported data types include the following:

  • Local vector is stored on a single machine. Dense vectors are presented as an array of double-typed values, for example, (2.0, 0.0, 1.0, 0.0); while sparse vector is presented by the size of the vector, an array of indices, and an array of values, for example, [4, (0, 2), (2.0, 1.0)].
  • Labeled point is used for supervised learning algorithms and consists of a local vector labeled with a double-typed class values. Label can be class index, binary outcome, or a list of multiple class indices (multiclass classification). For example, a labeled dense vector is presented as [1.0, (2.0, 0.0, 1.0, 0.0)].
  • Local matrix stores a dense matrix on a single machine. It is defined by matrix dimensions and a single double-array arranged in a column-major order.
  • Distributed matrix operates on data stored in Spark’s Resilient Distributed Dataset (RDD), which represents a collection of elements that can be operated on in parallel. There are three presentations: row matrix, where each row is a local vector that can be stored on a single machine, row indices are meaningless; and indexed row matrix, which is similar to row matrix, but the row indices are meaningful, that is, rows can be identified and joins can be executed; and coordinate matrix, which is used when a row cannot be stored on a single machine and the matrix is very sparse.

Spark’s MLlib API library provides interfaces to various learning algorithms and utilities as outlined in the following list:

  • org.apache.spark.mllib.classification: These are binary and multiclass classification algorithms, including linear SVMs, logistic regression, decision trees, and naive Bayes
  • org.apache.spark.mllib.clustering: These are k-means clustering
  • org.apache.spark.mllib.linalg: These are data presentations, including dense vectors, sparse vectors, and matrices
  • org.apache.spark.mllib.optimization: These are the various optimization algorithms used as low-level primitives in MLlib, including gradient descent, stochastic gradient descent, update schemes for distributed SGD, and limited-memory BFGS
  • org.apache.spark.mllib.recommendation: These are model-based collaborative filtering implemented with alternating least squares matrix factorization
  • org.apache.spark.mllib.regression: These are regression learning algorithms, such as linear least squares, decision trees, Lasso, and Ridge regression
  • org.apache.spark.mllib.stat: These are statistical functions for samples in sparse or dense vector format to compute the mean, variance, minimum, maximum, counts, and nonzero counts
  • org.apache.spark.mllib.tree: This implements classification and regression decision tree-learning algorithms
  • org.apache.spark.mllib.util: These are a collection of methods to load, save, preprocess, generate, and validate the data

Deeplearning4j

Deeplearning4j, or DL4J, is a deep-learning library written in Java. It features a distributed as well as a single-machinedeep-learning framework that includes and supports various neural network structures such as feedforward neural networks, RBM, convolutional neural nets, deep belief networks, autoencoders, and others. DL4J can solve distinct problems, such as identifying faces, voices, spam or e-commerce fraud.

Deeplearning4j is also distributed under Apache 2.0 license and can be downloaded from http://deeplearning4j.org. The library is organized as follows:

  • org.deeplearning4j.base: These are loading classes
  • org.deeplearning4j.berkeley: These are math utility methods
  • org.deeplearning4j.clustering: This is the implementation of k-means clustering
  • org.deeplearning4j.datasets: This is dataset manipulation, including import, creation, iterating, and so on
  • org.deeplearning4j.distributions: These are utility methods for distributions
  • org.deeplearning4j.eval: These are evaluation classes, including the confusion matrix
  • org.deeplearning4j.exceptions: This implements exception handlers
  • org.deeplearning4j.models: These are supervised learning algorithms, including deep belief network, stacked autoencoder, stacked denoising autoencoder, and RBM
  • org.deeplearning4j.nn: These are the implementation of components and algorithms based on neural networks, such as neural network, multi-layer network, convolutional multi-layer network, and so on
  • org.deeplearning4j.optimize: These are neural net optimization algorithms, including back propagation, multi-layer optimization, output layer optimization, and so on
  • org.deeplearning4j.plot: These are various methods for rendering data
  • org.deeplearning4j.rng: This is a random data generator
  • org.deeplearning4j.util: These are helper and utility methods

MALLET

Machine Learning for Language Toolkit (MALLET), is a large library of natural language processing algorithms and utilities. It can be used in a variety of tasks such as document classification, document clustering, information extraction, and topic modeling. It features command-line interface as well as Java API for several algorithms such as naive Bayes, HMM, Latent Dirichlet topic models, logistic regression, and conditional random fields.

Mallet

MALLET is available under Common Public License 1.0, which means that you can even use it in commercial applications. It can be downloaded from http://mallet.cs.umass.edu. MALLET instance is represented by name, label, data, and source. However, there are two methods to import data into the MALLET format, as shown in the following list:

  • Instance per file: Each file, that is, document, corresponds to an instance and MALLET accepts the directory name for the input.
  • Instance per line: Each line corresponds to an instance, where the following format is assumed: the instance_name label token. Data will be a feature vector, consisting of distinct words that appear as tokens and their occurrence count.

The library comprises the following packages:

  • cc.mallet.classify: These are algorithms for training and classifying instances, including AdaBoost, bagging, C4.5, as well as other decision tree models, multivariate logistic regression, naive Bayes, and Winnow2.
  • cc.mallet.cluster: These are unsupervised clustering algorithms, including greedy agglomerative, hill climbing, k-best, and k-means clustering.
  • cc.mallet.extract: This implements tokenizers, document extractors, document viewers, cleaners, and so on.
  • cc.mallet.fst: This implements sequence models, including conditional random fields, HMM, maximum entropy Markov models, and corresponding algorithms and evaluators.
  • cc.mallet.grmm: This implements graphical models and factor graphs such as inference algorithms, learning, and testing. For example, loopy belief propagation, Gibbs sampling, and so on.
  • cc.mallet.optimize: These are optimization algorithms for finding the maximum of a function, such as gradient ascent, limited-memory BFGS, stochastic meta ascent, and so on.
  • cc.mallet.pipe: These are methods as pipelines to process data into MALLET instances.
  • cc.mallet.topics: These are topics modeling algorithms, such as Latent Dirichlet allocation, four-level pachinko allocation, hierarchical PAM, DMRT, and so on.
  • cc.mallet.types: This implements fundamental data types such as dataset, feature vector, instance, and label.
  • cc.mallet.util: These are miscellaneous utility functions such as command-line processing, search, math, test, and so on.

To design, build, and deploy your own machine learning applications by leveraging key Java machine learning libraries, check out this book Machine learning in Java, published by Packt Publishing.

Read Next:

5 JavaScript machine learning libraries you need to know

A non programmer’s guide to learning Machine learning

Why use JavaScript for machine learning?

 


Also published on Medium.

Being a Senior Content Marketing Editor at Packt Publishing, I handle vast array of content in the tech space ranging from Data science, Web development, Programming, Cloud & Networking, IoT, Security and Game development. With prior experience and understanding of Marketing I aspire to grow leaps and bounds in the Content & Digital Marketing field. On the personal front I am an ambivert and love to read inspiring articles and books on life and in general.