10 min read

Recently we see computers allow natural forms of interaction and are becoming more ubiquitous, more capable, and more ingrained in our daily lives. They are becoming less like heartless dumb tools and more like friends, able to entertain us, look out for us, and assist us with our work.

This article is an excerpt taken from the book Machine Learning with Core ML authored by Joshua Newnham.

With this shift comes a need for computers to be able to understand our emotional state. For example, you don’t want your social robot cracking a joke after you arrive back from work having lost your job (to an AI bot!). This is a field of computer science known as affective computing (also referred to as artificial emotional intelligence or emotional AI), a field that studies systems that can recognize, interpret, process, and simulate human emotions. The first stage of this is being able to recognize the emotional state. In this article, we will be creating a model that can detect the exact face expression or emotion using CoreML.

Input data and preprocessing

We will implement the preprocessing functionality required to transform images into something the model is expecting. We will build up this functionality in a playground project before migrating it across to our project in the next section.

If you haven’t done so already, pull down the latest code from the accompanying repository: https://github.com/packtpublishing/machine-learning-with-core-ml. Once downloaded, navigate to the directory Chapter4/Start/ and open the Playground project ExploringExpressionRecognition.playground. Once loaded, you will see the playground for this extract, as shown in the following screenshot:

Before starting, to avoid looking at images of me, please replace the test images with either personal photos of your own or royalty free images from the internet, ideally a set expressing a range of emotions.

Along with the test images, this playground includes a compiled Core ML model (we introduced it in the previous image) with its generated set of wrappers for inputs, outputs, and the model itself. Also included are some extensions for UIImage, UIImageView, CGImagePropertyOrientation, and an empty CIImage extension, to which we will return later in the extract. The others provide utility functions to help us visualize the images as we work through this playground.

When developing machine learning applications, you have two broad paths. The first, which is becoming increasingly popular, is to use an end-to-end machine learning model capable of just being fed the raw input and producing adequate results. One particular field that has had great success with end-to-end models is speech recognition. Prior to end-to-end deep learning, speech recognition systems were made up of many smaller modules, each one focusing on extracting specific pieces of data to feed into the next module, which was typically manually engineered. Modern speech recognition systems use end-to-end models that take the raw input and output the result. Both of the described approaches can been seen in the following diagram:

Obviously, this approach is not constrained to speech recognition and we have seen it applied to image recognition tasks, too, along with many others. But there are two things that make this particular case different; the first is that we can simplify the problem by first extracting the face. This means our model has less features to learn and offers a smaller, more specialized model that we can tune. The second thing, which is no doubt obvious, is that our training data consisted of only faces and not natural images. So, we have no other choice but to run our data through two models, the first to extract faces and the second to perform expression recognition on the extracted faces, as shown in this diagram:

Luckily for us, Apple has mostly taken care of our first task of detecting faces through the Vision framework it released with iOS 11. The Vision framework provides performant image analysis and computer vision tools, exposing them through a simple API. This allows for face detection, feature detection and tracking, and classification of scenes in images and video. The latter (expression recognition) is something we will take care of using the Core ML model introduced earlier.

Prior to the introduction of the Vision framework, face detection would typically be performed using the Core Image filter. Going back further, you had to use something like OpenCV. You can learn more about Core Image here: https://developer.apple.com/library/content/documentation/GraphicsImaging/Conceptual/CoreImaging/ci_detect_faces/ci_detect_faces.html.

Now that we have got a bird’s-eye view of the work that needs to be done, let’s turn our attention to the editor and start putting all of this together. Start by loading the images; add the following snippet to your playground:

var images = [UIImage]()
for i in 1...3{
    guard let image = UIImage(named:"images/joshua_newnham_\(i).jpg")
        else{ fatalError("Failed to extract features") }

images.append(image)
}

let faceIdx = 0
let imageView = UIImageView(image: images[faceIdx])
imageView.contentMode = .scaleAspectFit

In the preceding snippet, we are simply loading each of the images we have included in our resources’ Images folder and adding them to an array we can access conveniently throughout the playground. Once all the images are loaded, we set the constant faceIdx, which will ensure that we access the same images throughout our experiments. Finally, we create an ImageView to easily preview it. Once it has finished running, click on the eye icon in the right-hand panel to preview the loaded image, as shown in the following screenshot:

Next, we will take advantage of the functionality available in the Vision framework to detect faces. The typical flow when working with the Vision framework is defining a request, which determines what analysis you want to perform, and defining the handler, which will be responsible for executing the request and providing means of obtaining the results (either through delegation or explicitly queried). The result of the analysis is a collection of observations that you need to cast into the appropriate observation type; concrete examples of each of these can be seen here:

As illustrated in the preceding diagram, the request determines what type of image analysis will be performed; the handler, using a request or multiple requests and an image, performs the actual analysis and generates the results (also known as observations). These are accessible via a property or delegate if one has been assigned. The type of observation is dependent on the request performed; it’s worth highlighting that the Vision framework is tightly integrated into Core ML and provides another layer of abstraction and uniformity between you and the data and process. For example, using a classification Core ML model would return an observation of type VNClassificationObservation. This layer of abstraction not only simplifies things but also provides a consistent way of working with machine learning models.

In the previous figure, we showed a request handler specifically for static images. Vision also provides a specialized request handler for handling sequences of images, which is more appropriate when dealing with requests such as tracking. The following diagram illustrates some concrete examples of the types of requests and observations applicable to this use case:

So, when do you use VNImageRequestHandler and VNSequenceRequestHandler? Though the names provide clues as to when one should be used over the other, it’s worth outlining some differences.

The image request handler is for interactive exploration of an image; it holds a reference to the image for its life cycle and allows optimizations of various request types. The sequence request handler is more appropriate for performing tasks such as tracking and does not optimize for multiple requests on an image.

Let’s see how this all looks in code; add the following snippet to your playground:

let faceDetectionRequest = VNDetectFaceRectanglesRequest()
let faceDetectionRequestHandler = VNSequenceRequestHandler()

Here, we are simply creating the request and handler; as discussed in the preceding code, the request encapsulates the type of image analysis while the handler is responsible for executing the request. Next, we will get faceDetectionRequestHandler to run faceDetectionRequest; add the following code:

try? faceDetectionRequestHandler.perform(
    [faceDetectionRequest],
    on: images[faceIdx].cgImage!,
    orientation: CGImagePropertyOrientation(images[faceIdx].imageOrientation))

The perform function of the handler can throw an error if it fails; for this reason, we wrap the call with try? at the beginning of the statement and can interrogate the error property of the handler to identify the reason for failing. We pass the handler a list of requests (in this case, only our faceDetectionRequest), the image we want to perform the analysis on, and, finally, the orientation of the image that can be used by the request during analysis.

Once the analysis is done, we can inspect the observation obtained through the results property of the request itself, as shown in the following code:

if let faceDetectionResults = faceDetectionRequest.results as? [VNFaceObservation]{
    for face in faceDetectionResults{
        // ADD THE NEXT SNIPPET OF CODE HERE
    }
}

The type of observation is dependent on the analysis; in this case, we’re expecting a VNFaceObservation. Hence, we cast it to the appropriate type and then iterate through all the observations.

Next, we will take each recognized face and extract the bounding box. Then, we’ll proceed to draw it in the image (using an extension method of UIImageView found within the UIImageViewExtension. swift file). Add the following block within the for loop shown in the preceding code:

if let currentImage = imageView.image{
    let bbox = face.boundingBox

let imageSize = CGSize(
width:currentImage.size.width,
height: currentImage.size.height)

let w = bbox.width * imageSize.width
let h = bbox.height * imageSize.height
let x = bbox.origin.x * imageSize.width
let y = bbox.origin.y * imageSize.height

let faceRect = CGRect(
x: x,
y: y,
width: w,
height: h)

let invertedY = imageSize.height – (faceRect.origin.y + faceRect.height)
let invertedFaceRect = CGRect(
x: x,
y: invertedY,
width: w,
height: h)

imageView.drawRect(rect: invertedFaceRect)
}

We can obtain the bounding box of each face via the let boundingBox property; the result is normalized, so we then need to scale this based on the dimensions of the image. For example, you can obtain the width by multiplying boundingBox with the width of the image: bbox.width * imageSize.width.

Next, we invert the axis as the coordinate system of Quartz 2D is inverted with respect to that of UIKit’s coordinate system, as shown in this diagram:

We invert our coordinates by subtracting the bounding box’s origin and height from height of the image and then passing this to our UIImageView to render the rectangle. Click on the eye icon in the right-hand panel in line with the statement imageView.drawRect(rect: invertedFaceRect) to preview the results; if successful, you should see something like the following:

An alternative to inverting the face rectangle would be to use an AfflineTransform, such as:
var transform = CGAffineTransform(scaleX: 1, y: -1)
transform = transform.translatedBy(x: 0, y: -imageSize.height)
let invertedFaceRect = faceRect.apply(transform)

This approach leads to less code and therefore less chances of errors. So, it is the recommended approach. The long approach was taken previously to help illuminate the details.

As a designer and builder of intelligent systems, it is your task to interpret these results and present them to the user. Some questions you’ll want to ask yourself are as follows:

  • What is an acceptable threshold of a probability before setting the class as true?
  • Can this threshold be dependent on probabilities of other classes to remove ambiguity? That is, if Sad and Happy have a probability of 0.3, you can infer that the prediction is inaccurate, or at least not useful.
  • Is there a way to accept multiple probabilities?
  • Is it useful to expose the threshold to the user and have it manually set and/or tune it?

These are only a few questions you should ask. The specific questions and their answers will depend on your use case and users. At this point, we have everything we need to preprocess and perform inference

We briefly explored some use cases showing how emotion recognition could be applied. For a detailed overview of this experiment, check out our book, Machine Learning with Core ML to further implement Core ML for visual-based applications using the principles of transfer learning and neural networks.

Read Next

Amazon Rekognition can now ‘recognize’ faces in a crowd at real-time

5 cool ways Transfer Learning is being used today

My friend, the robot: Artificial Intelligence needs Emotional Intelligence


Subscribe to the weekly Packt Hub newsletter. We'll send you the results of our AI Now Survey, featuring data and insights from across the tech landscape.

* indicates required

LEAVE A REPLY

Please enter your comment!
Please enter your name here