
In this article, Uday Kamath and Krishna Choppella, authors of the book Mastering Java Machine Learning, discuss the recent revival of interest in artificial intelligence (AI) and machine learning in particular, both in academic circles and in industry. In the last decade, AI has seen dramatic successes that eluded practitioners in the years after the field's original promise gave way to relative decline, until its re-emergence in the last few years.


What made these successes possible, in large part, was the availability of prodigious amounts of data and the inexorable increase in raw computational power. Among the areas of AI leading the resurgence, machine learning has seen spectacular developments and continues to find the widest applicability in an array of domains. The use of machine learning to help in complex decision making at the highest levels of business, and at the same time, its enormous success in improving the accuracy of what are now everyday applications, such as search, speech recognition, and personal assistants on mobile phones, has made its effects commonplace in the family room and the boardroom alike. Articles breathlessly extolling the power of “deep learning” can be found today not only in the popular science and technology press, but also in mainstream outlets such as The New York Times and The Huffington Post. Machine learning has indeed become ubiquitous in a relatively short time.

An ordinary user encounters machine learning in many ways in their day-to-day activities. Interacting with well-known e-mail providers such as Gmail gives the user automated sorting and categorization of e-mails into categories, such as spam, junk, promotions, and so on, which is made possible using text mining, a branch of machine learning. When shopping online for products on e-commerce websites such as https://www.amazon.com/ or watching movies from content providers such as http://netflix.com/, one is offered recommendations for other products and content by so-called recommender systems, another branch of machine learning. Forecasting the weather, estimating real estate prices, predicting voter turnout and even election results—all use some form of machine learning to see into the future, as it were.

The ever-growing availability of data and the promise of systems that can enrich our lives by learning from that data place a growing demand on skills from a limited workforce of professionals in the field of data science. This demand is particularly acute for well-trained experts who know their way around the landscape of machine learning techniques in the more popular languages, including Java, Python, R, and increasingly, Scala. By far, the number and availability of machine learning libraries, tools, APIs, and frameworks in Java outstrip those in other languages. Consequently, mastery of these skills will put any aspiring professional with a desire to enter the field at a distinct advantage in the marketplace.

Perhaps you already apply machine learning techniques in your professional work, or maybe you simply have a hobbyist’s interest in the subject. Clearly, you can bend Java to your will, but now you feel you’re ready to dig deeper and learn how to use the best-of-breed open-source Java ML frameworks in your next data science project.

Mastery of a subject, especially one that has such obvious applicability as machine learning, requires more than an understanding of its core concepts and familiarity with its mathematical underpinnings. Unlike an introductory treatment of the subject, a project that purports to help you master the subject must be heavily focused on practical aspects in addition to introducing more advanced topics that would have stretched the scope of the introductory material. To warm up before we embark on sharpening our instrument, we will devote this article to a quick review of what we already know. For the ambitious novice with little or no prior exposure to the subject (who is nevertheless determined to get the fullest benefit from this article), here’s our advice: make sure you do not skip the rest of this article; instead, use it as a springboard to explore unfamiliar concepts in more depth. Seek out external resources as necessary. Wikipedia it. Then jump right back in.

For the rest of this article, we will review the following:

  • History and definitions
  • What is not machine learning?
  • Concepts and terminology
  • Important branches of machine learning
  • Different data types in machine learning
  • Applications of machine learning
  • Issues faced in machine learning
  • The meta-process used in most machine learning projects
  • Information on some well-known tools, APIs, and resources that we will employ in this article

Machine learning – history and definition

It is difficult to give an exact history, but ideas central to machine learning appear as early as the 17th century. In René Descartes' Discourse on the Method (1637), he refers to automata and says the following:

For we can easily understand a machine’s being constituted so that it can utter words, and even emit some responses to action on it of a corporeal kind, which brings about a change in its organs; for instance, if touched in a particular part it may ask what we wish to say to it; if in another part it may exclaim that it is being hurt, and so on

http://www.earlymoderntexts.com/assets/pdfs/descartes1637.pdf

https://www.marxists.org/reference/archive/descartes/1635/discourse-method.htm

Alan Turing, in his famous publication Computing Machinery and Intelligence, gives basic insights into the goals of machine learning by asking the question “Can machines think?”.

http://csmt.uchicago.edu/annotations/turing.htm

http://www.csee.umbc.edu/courses/471/papers/turing.pdf

Arthur Samuel, in 1959, wrote, “Machine learning is the field of study that gives computers the ability to learn without being explicitly programmed.”

Tom Mitchell, in recent times, gave a more exact definition of machine learning: “A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.”

Machine learning has relationships with several areas:

  • Statistics: This uses elements of data sampling, estimation, hypothesis testing, learning theory, and statistics-based modeling, to name a few
  • Algorithms and computation: This uses the basics of search, traversal, parallelization, distributed computing, and so on from basic computer science
  • Database and knowledge discovery: This provides the ability to store, retrieve, and access information in various formats
  • Pattern recognition: This provides the ability to find interesting patterns in the data, whether to explore, visualize, or predict
  • Artificial intelligence: Though machine learning is considered a branch of artificial intelligence, it also has relationships with other branches, such as heuristics, optimization, evolutionary computing, and so on

What is not machine learning?

It is important to recognize areas that share a connection with machine learning but cannot themselves be considered part of machine learning. Some disciplines may overlap to a greater or lesser extent, yet the principles underlying machine learning are quite distinct:

  • Business intelligence and reporting: Reporting Key Performance Indicators (KPIs), querying OLAP for slicing, dicing, and drilling into the data, dashboards, and so on, which form the central components of BI, are not machine learning.
  • Storage and ETL: Data storage and ETL are key elements needed in any machine learning process, but by themselves, they don’t qualify as machine learning.
  • Information retrieval, search, and queries: The ability to retrieve data or documents based on search criteria or indexes, which forms the basis of information retrieval, is not really machine learning. Many forms of machine learning, such as semi-supervised learning, can rely on searching for similar data for modeling, but that doesn’t qualify search as machine learning.
  • Knowledge representation and reasoning: Representing knowledge for performing complex tasks, for example with ontologies, expert systems, and the Semantic Web, does not qualify as machine learning.

Machine learning – concepts and terminology

In this article, we will describe different concepts and terms normally used in machine learning:

  • Data or dataset: The basics of machine learning rely on understanding the data. The data or dataset normally refers to content available in structured or unstructured format for use in machine learning. Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free-flowing text. Data can be available in various storage types or formats. In structured data, every element, known as an instance or an example or a row, follows a predefined structure. Data can also be categorized by size: a small or medium dataset has a few hundred to a few thousand instances, whereas big data refers to large volumes, mostly in the millions or billions of instances, which cannot be stored or accessed using common devices or fit in the memory of such devices.
  • Features, attributes, variables or dimensions: In structured datasets, as mentioned earlier, there are predefined elements with their own semantic and data type, which are known variously as features, attributes, variables, or dimensions.
  • Data types: The preceding features defined need some form of typing in many machine learning algorithms or techniques. The most commonly used data types are as follows:
    • Categorical or nominal: This indicates well-defined categories or values present in the dataset. For example, eye color, such as black, blue, brown, green, or grey; document content type, such as text, image, or video.
    • Continuous or numeric: This indicates the numeric nature of the data field. For example, a person’s weight measured by a bathroom scale, temperature from a sensor, the monthly balance in dollars on a credit card account.
    • Ordinal: This denotes data that can be ordered in some way. For example, garment size, such as small, medium, or large; boxing weight classes, such as heavyweight, light heavyweight, middleweight, lightweight, and bantamweight.
  • Target or label: A feature or set of features in the dataset that is learned from the training data and predicted for unseen data is known as a target or a label. A label can take any of the forms specified earlier, that is, categorical, continuous, or ordinal.
  • Machine learning model: Each machine learning algorithm, based on what it learned from the dataset, maintains the state of its learning for predicting or giving insights into future or unseen data. This is referred to as the machine learning model.
  • Sampling: Data sampling is an essential step in machine learning. Sampling means choosing a subset of examples from a population with the intent of treating the behavior seen in the (smaller) sample as being representative of the behavior of the (larger) population. In order for the sample to be representative of the population, care must be taken in the way the sample is chosen. Generally, a population consists of every object sharing the properties of interest in the problem domain, for example, all people eligible to vote in the general election, or all potential automobile owners in the next four years. Since it is usually prohibitive (or impossible) to collect data for all the objects in a population, a well-chosen subset is selected for the purposes of analysis. A crucial consideration in the sampling process is that the sample be unbiased with respect to the population. The following are types of probability-based sampling:
    • Uniform random sampling: A sampling method in which each object in the population has an equal probability of being chosen, that is, the sampling is done uniformly over the population.
    • Stratified random sampling: A sampling method used when the data can be categorized into multiple classes. In such cases, in order to ensure all categories are represented in the sample, the population is divided into distinct strata based on these classifications, and each stratum is sampled in proportion to the fraction of its class in the overall population. Stratified sampling is common when the population density varies across categories, and it is important to compare these categories with the same statistical power.
    • Cluster sampling: Sometimes there are natural groups among the population being studied, and each group is representative of the whole population. An example is data that spans many geographical regions. In cluster sampling, you take a random subset of the groups followed by a random sample from within each of those groups to construct the full data sample. This kind of sampling can reduce the cost of data collection without compromising the fidelity of distribution in the population.
    • Systematic sampling: Systematic or interval sampling is used when there is a certain ordering present in the sampling frame (a finite set of objects treated as the population and taken to be the source of data for sampling, for example, the corpus of Wikipedia articles arranged lexicographically by title). If the sample is then selected by starting at a random object and skipping a constant number k of objects before selecting the next one, that is called systematic sampling. Here, k is calculated as the ratio of the population size to the sample size. (A small Java sketch of uniform random and systematic sampling is given after this list.)
  • Model evaluation metrics: Evaluating models for performance is generally based on different evaluation metrics for different types of learning. In classification, it is generally based on accuracy, receiver operating characteristic (ROC) curves, training speed, memory requirements, false positive ratio, and so on. In clustering, the number of clusters found, cohesion, separation, and so on form the general metrics. In stream-based learning, apart from the standard metrics mentioned earlier, adaptability, speed of learning, and robustness to sudden changes are some of the conventional metrics for evaluating the performance of the learner.
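To make two of the sampling strategies above concrete, here is a minimal, self-contained Java sketch (not taken from the book's code) of uniform random and systematic sampling over an in-memory list; the class and method names are illustrative only.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SamplingSketch {

    // Uniform random sampling: shuffle the population and take the first n items,
    // so every object has the same probability of being chosen.
    static <T> List<T> uniformRandomSample(List<T> population, int n, long seed) {
        List<T> copy = new ArrayList<>(population);
        Collections.shuffle(copy, new Random(seed));
        return copy.subList(0, Math.min(n, copy.size()));
    }

    // Systematic sampling: with k = population size / sample size, start at a random
    // offset and then pick every k-th object from the ordered sampling frame.
    static <T> List<T> systematicSample(List<T> population, int n, long seed) {
        int k = Math.max(1, population.size() / n);
        int start = new Random(seed).nextInt(k);
        List<T> sample = new ArrayList<>();
        for (int i = start; i < population.size() && sample.size() < n; i += k) {
            sample.add(population.get(i));
        }
        return sample;
    }

    public static void main(String[] args) {
        List<Integer> population = new ArrayList<>();
        for (int i = 0; i < 100; i++) population.add(i);
        System.out.println(uniformRandomSample(population, 10, 42));
        System.out.println(systematicSample(population, 10, 42));
    }
}

Stratified and cluster sampling follow the same pattern, with the extra step of first grouping the population by class or by natural group before drawing from each group.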

To illustrate these concepts, a concrete example in the form of a well-known weather dataset is given. The data gives a set of weather conditions and a label that indicates whether the subject decided to play a game of tennis on that day or not:

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no

The dataset is in the ARFF (Attribute-Relation File Format) format. It consists of a header giving information about the features or attributes along with their data types, followed by the actual comma-separated data after the @data tag. The dataset has five features: outlook, temperature, humidity, windy, and play. The features outlook and windy are categorical, while humidity and temperature are continuous. The feature play is the target and is categorical.
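As a quick, hedged illustration of handling this dataset in Java, the following minimal sketch uses the open-source Weka library (weka.jar on the classpath) to load the file and print each attribute with its type; the file name weather.arff is assumed to be in the working directory.

import weka.core.Attribute;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WeatherArffSketch {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file; Weka parses the @attribute header and the @data section
        Instances data = new DataSource("weather.arff").getDataSet();

        // The last attribute, play, is the target (label)
        data.setClassIndex(data.numAttributes() - 1);

        System.out.println("Instances: " + data.numInstances());
        for (int i = 0; i < data.numAttributes(); i++) {
            Attribute att = data.attribute(i);
            String type = att.isNominal() ? "categorical" : "continuous";
            System.out.println(att.name() + " -> " + type);
        }
    }
}

Run against the listing above, this would report 14 instances, with outlook, windy, and play as categorical and temperature and humidity as continuous.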

Machine learning – types and subtypes

We will now explore different subtypes or branches of machine learning. Though the following list is not comprehensive, it covers the most well-known types:

  • Supervised learning: This is the most popular branch of machine learning, which is about learning from labeled data. If the data type of the label is categorical, it becomes a classification problem, and if numeric, it becomes a regression problem. For example, if the target of the dataset is detection of fraud, which has categorical values of either true or false, we are dealing with a classification problem. If, on the other hand, the target is to predict the best listing price for the sale of a home, which is a numeric dollar value, the problem is one of regression. The following diagram illustrates labeled data that is conducive to classification techniques suitable for linearly separable data, such as logistic regression:

    Figure: Linearly separable data

    Figure: An example of a dataset that is not linearly separable

    This type of problem calls for classification techniques such as Support Vector Machines.

  • Unsupervised learning: Understanding the data and exploring it in order to build machine learning models when the labels are not given is called unsupervised learning. Clustering, manifold learning, and outlier detection are techniques covered under this topic. Examples of problems that require unsupervised learning are many; grouping customers according to their purchasing behavior is one example. In the case of biological data, tissue samples can be clustered based on similar gene expression values using unsupervised learning techniques. (A short Weka sketch contrasting a supervised classifier with K-Means clustering on the weather dataset is given after this list.)

    The following diagram represents data with inherent structure that can be revealed as distinct clusters using an unsupervised learning technique such as K-Means:

    Figure: Clusters in data

    Different techniques are used to detect global outliers (examples that are anomalous with respect to the entire dataset) and local outliers (examples that are misfits in their neighborhood). In the following diagram, the notion of local and global outliers is illustrated for a two-feature dataset:

    Figure: Local and global outliers

  • Semi-supervised learning: When the dataset contains some labeled data together with a large amount of unlabeled data, learning from such a dataset is called semi-supervised learning. When dealing with financial data with the goal of detecting fraud, for example, there may be a large amount of unlabeled data and only a small number of known fraud and non-fraud transactions. In such cases, semi-supervised learning may be applied.
  • Graph mining: Mining data represented as graph structures is known as graph mining. It is the basis of social network analysis and structure analysis in different bioinformatics, web mining, and community mining applications.
  • Probabilistic graph modeling and inferencing: Learning and exploiting structures present between features to model the data comes under the branch of probabilistic graph modeling.
  • Time-series forecasting: This is a form of learning where data has distinct temporal behavior and the relationship with time is modeled. A common example is in financial forecasting, where the performance of stocks in a certain sector may be the target of the predictive model.
  • Association analysis: This is a form of learning where data is in the form of an item set or market basket, and association rules are modeled to explore and predict the relationships between the items. A common example in association analysis is to learn relationships between the most common items bought by customers when they visit the grocery store.
  • Reinforcement learning: This is a form of learning where machines learn to maximize performance based on feedback in the form of rewards or penalties received from the environment. A recent example that famously used reinforcement learning was AlphaGo, the machine developed by Google that beat the world Go champion Lee Sedol decisively in March 2016. Using a reward and penalty scheme, the model first trained on millions of board positions in the supervised learning stage, then played itself in the reinforcement learning stage to ultimately become good enough to triumph over the best human player.

    http://www.theatlantic.com/technology/archive/2016/03/the-invisible-opponent/475611/

    https://gogameguru.com/i/2016/03/deepmind-mastering-go.pdf

  • Stream learning or incremental learning: Learning in a supervised, unsupervised, or semi-supervised manner from stream data in real time or pseudo-real time is called stream or incremental learning. Learning the behavior of sensors from different types of industrial systems, in order to categorize readings as normal or abnormal, requires a real-time feed and real-time detection.
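The following is a minimal, hedged Java sketch of the two most common branches using the Weka API (one of the open-source frameworks referenced in this article): a J48 decision tree evaluated by cross-validation as the supervised example, and SimpleKMeans as the unsupervised example. It assumes the weather.arff file shown earlier is in the working directory and weka.jar is on the classpath.

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class LearningTypesSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // play is the label

        // Supervised learning: a decision tree classifier evaluated with 10-fold cross-validation
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new java.util.Random(1));
        System.out.println("Classification accuracy: " + eval.pctCorrect() + "%");

        // Unsupervised learning: K-Means clustering on the same instances with the label removed
        Remove removeClass = new Remove();
        removeClass.setAttributeIndices(String.valueOf(data.classIndex() + 1));
        removeClass.setInputFormat(data);
        Instances unlabeled = Filter.useFilter(data, removeClass);

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(2);
        kMeans.buildClusterer(unlabeled);
        System.out.println(kMeans); // prints cluster centroids and sizes
    }
}

With only 14 instances the numbers are not meaningful in themselves, but the same pattern scales to realistic datasets.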

Datasets used in machine learning

To learn from data, we must be able to understand and manage data in all forms. Data originates from many different sources, and consequently, datasets may differ widely in structure or have little or no structure at all. In this section, we present a high-level classification of datasets with commonly occurring examples.

Based on structure, datasets may be classified into the following types:

  • Structured or record data: Structured data is the most common form of dataset available for machine learning. The data is in the form of records or rows following a well-known format, with features that are either columns in a table or fields delimited by separators or tokens. There is no explicit relationship between the records or instances. The dataset is available mostly in flat files or relational databases. The records of financial transactions at a bank shown in the following screenshot are an example of structured data:

Figure: Financial card transaction data with fraud labels

  • Transaction or market data: This is a special form of structured data where each record corresponds to a collection of items. Examples of market datasets are the lists of grocery items purchased by different customers, or movies viewed by customers, as shown in the following screenshot:

    Figure: Market dataset of items bought from a grocery store

  • Unstructured data: Unstructured data is normally not available in well-known formats, unlike structured data. Text, image, and video data are different formats of unstructured data. Normally, a transformation of some form is needed to extract features from these forms of data into a structured dataset like those mentioned earlier, so that traditional machine learning algorithms can be applied:

    Figure: Sample SMS text data with spam and ham labels, from Tiago A. Almeida of the Federal University of Sao Carlos

  • Sequential data: Sequential data has an explicit notion of order. The order can be a relationship between features and a time variable in time series data, or symbols repeating in some form in genomic datasets. Two examples are weather data and genomic sequence data. The following diagram shows the relationship between time and the sensor level for weather:

    Figure: Time series from sensor data

    Three genomic sequences are shown to illustrate the repetition of the subsequences CGGGT and TTGAAAGTGGTG in all three sequences:

    Figure: Genomic sequences of DNA as sequences of symbols

  • Graph data: Graph data is characterized by the presence of relationships between entities in the data that form a graph structure. Graph datasets may be in a structured record format or an unstructured format. Typically, the graph relationship has to be mined from the dataset. Claims in the insurance domain can be considered structured records containing relevant claim details, with claimants related through addresses, phone numbers, and so on; this can be viewed as a graph structure. Using the World Wide Web as an example, we have web pages available as unstructured data containing links, and graphs of relationships between web pages can be built using those web links, producing some of the most mined graph datasets today (a small Java sketch of deriving a claims graph from structured records follows this list):

    Figure: Insurance claim data converted into a graph structure, with relationships between vehicles, drivers, policies, and addresses
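As a simple illustration of turning structured records into graph data, the following hypothetical Java sketch links insurance claims that share an address into an adjacency-list graph; the Claim record and its fields are invented for the example and are not from any real claims schema.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClaimGraphSketch {

    // A hypothetical structured record: one row per insurance claim
    record Claim(String claimId, String claimantName, String address) {}

    // Build an undirected graph (adjacency list) connecting claims that share an address
    static Map<String, List<String>> buildGraph(List<Claim> claims) {
        Map<String, List<String>> byAddress = new HashMap<>();
        for (Claim c : claims) {
            byAddress.computeIfAbsent(c.address(), k -> new ArrayList<>()).add(c.claimId());
        }
        Map<String, List<String>> adjacency = new HashMap<>();
        for (List<String> group : byAddress.values()) {
            for (String a : group) {
                for (String b : group) {
                    if (!a.equals(b)) {
                        adjacency.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
                    }
                }
            }
        }
        return adjacency;
    }

    public static void main(String[] args) {
        List<Claim> claims = List.of(
                new Claim("C1", "Alice", "12 Oak St"),
                new Claim("C2", "Bob", "12 Oak St"),
                new Claim("C3", "Carol", "7 Elm Ave"));
        System.out.println(buildGraph(claims)); // C1 and C2 are linked via the shared address
    }
}

The same idea extends to linking records by phone number, vehicle, or policy, which is how structured claim rows become the kind of graph shown in the figure above.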

Machine learning applications

Given the rapidly growing use of machine learning in diverse areas of human endeavor, any attempt to list typical applications in the different industries where some form of machine learning is in use must necessarily be incomplete. Nevertheless, in this section we list a broad set of machine learning applications by domain, along with typical uses and the types of learning employed:

  • Financial: Credit Risk Scoring, Fraud Detection, Anti-Money Laundering. Learning types: Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning.
  • Web: Online Campaigns, Health Monitoring, Ad Targeting. Learning types: Supervised, Unsupervised, Semi-Supervised.
  • Healthcare: Evidence-based Medicine, Epidemiological Surveillance, Drug Events Prediction, Claim Fraud Detection. Learning types: Supervised, Unsupervised, Graph Models, Time Series, and Stream Learning.
  • Internet of Things (IoT): Cyber Security, Smart Roads, Sensor Health Monitoring. Learning types: Supervised, Unsupervised, Semi-Supervised, and Stream Learning.
  • Environment: Weather Forecasting, Pollution Modeling, Water Quality Measurement. Learning types: Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning.
  • Retail: Inventory, Customer Management and Recommendations, Layout and Forecasting. Learning types: Time Series, Supervised, Unsupervised, Semi-Supervised, and Stream Learning.

Summary:

  • A revival of interest is seen in the area of artificial intelligence (AI) and machine learning in particular, both in academic circles and in industry.
  • Machine learning is used to help in complex decision making at the highest levels of business. It has also achieved enormous success in improving the accuracy of everyday applications, such as search, speech recognition, and personal assistants on mobile phones.
  • The basics of machine learning rely on an understanding of data. Structured datasets have specific formats, and an unstructured dataset is normally in the form of some free-flowing text.
  • Two principal types of machine learning are supervised learning, the most popular branch, which is about learning from labeled data, and unsupervised learning, which is about understanding and exploring the data in order to build machine learning models when labels are not given.
