In this article by Leif Henning Larsen, author of the book Learning Microsoft Cognitive Services, we will look into what Microsoft Cognitive Services offer. You will then learn how to utilize one of the APIs by recognizing faces in images.

Microsoft Cognitive Services give developers the ability to add AI-like capabilities to their applications. With a few lines of code, we can take advantage of powerful algorithms that would usually take a lot of time, effort, and hardware to implement ourselves.


Overview of Microsoft Cognitive Services

Using Cognitive Services means you have 21 different APIs at your disposal. These are in turn separated into five top-level domains according to what they do: vision, speech, language, knowledge, and search. Let's look at each of them in the following sections.

Vision

The APIs under the vision flag allow your apps to understand images and video content. They let you retrieve information about faces, feelings, and other visual content. You can stabilize videos and recognize celebrities. You can read text in images and generate thumbnails from videos and images.

There are four APIs contained in the vision area, which we will see now.

Computer Vision

Using the Computer Vision API, you can retrieve actionable information from images. This means you can identify content (such as image format, image size, colors, faces, and more). You can detect whether an image is adult/racy. This API can recognize text in images and extract it to machine-readable words. It can detect celebrities from a variety of areas. Lastly, it can generate storage-efficient thumbnails with smart cropping functionality.
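All of these capabilities are exposed as HTTP endpoints, authenticated with a subscription key sent in the Ocp-Apim-Subscription-Key header. As a rough sketch of the calling pattern (the endpoint region and API version used here are assumptions; use the URL listed with your own subscription), analyzing an image could look like this:

using System.Net.Http;
using System.Net.Http.Headers;
using System.Threading.Tasks;

public static class ComputerVisionSample
{
   // NOTE: The endpoint region and version are assumptions; check your subscription details.
   private const string AnalyzeUrl =
      "https://westus.api.cognitive.microsoft.com/vision/v1.0/analyze?visualFeatures=Description,Adult";

   public static async Task<string> AnalyzeImageAsync(byte[] imageBytes, string subscriptionKey)
   {
      using (var client = new HttpClient())
      using (var content = new ByteArrayContent(imageBytes))
      {
         // Every request is authenticated with the subscription key header.
         client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
         content.Headers.ContentType = new MediaTypeHeaderValue("application/octet-stream");

         HttpResponseMessage response = await client.PostAsync(AnalyzeUrl, content);
         return await response.Content.ReadAsStringAsync(); // JSON describing the image
      }
   }
}

Later in this article, we will use a dedicated client package instead of raw HTTP calls, but the underlying requests follow this pattern.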

Emotion

The Emotion API allows you to recognize emotions, both in images and videos. This can allow for more personalized experiences in applications. The emotions that are detected are cross-cultural emotions: anger, contempt, disgust, fear, happiness, neutral, sadness, and surprise.

Face

We have already seen the very basic example of what the Face API can do. The rest of the API revolves around the same purpose: to detect, identify, organize, and tag faces in photos. Apart from face detection, you can see how likely it is that two faces belong to the same person. You can identify faces and also find similar-looking faces.

Video

The Video API is about analyzing, editing, and processing videos in your app. If you have a shaky video, the API allows you to stabilize it. You can detect and track faces in videos. If a video contains a stationary background, you can detect motion. The API lets you generate thumbnail summaries for videos, which allows users to see previews or snapshots quickly.

Speech

Adding one of the Speech APIs allows your application to hear and speak to your users. The APIs can filter noise and identify speakers. They can drive further actions in your application based on the recognized intent.

Speech contains three APIs, which we will discuss now.

Bing Speech

Adding the Bing Speech API to your application allows you to convert speech to text and vice versa. You can convert spoken audio to text either in real time, by utilizing a microphone or other sources, or by converting audio from files. The API also offers speech intent recognition, which is trained by the Language Understanding Intelligent Service to understand the intent.

Speaker Recognition

The Speaker Recognition API gives your application the ability to know who is talking. Using this API, you can verify that someone speaking is who they claim to be. You can also determine who an unknown speaker is, based on a group of selected speakers.

Custom Recognition

To improve speech recognition, you can use the Custom Recognition API. This allows you to fine-tune speech recognition operations for anyone, anywhere. Using this API, the speech recognition model can be tailored to the vocabulary and speaking style of the user. In addition to this, the model can be customized to match the expected environment of the application.

Language

APIs related to language allow your application to process natural language and learn how to recognize what users want. You can add textual and linguistic analysis to your application as well as natural language understanding.

The following five APIs can be found in the Language area.

Bing Spell Check

The Bing Spell Check API allows you to add advanced spell checking to your application.

Language Understanding Intelligent Service (LUIS)

Language Understanding Intelligent Service, or LUIS, is an API that can help your application understand commands from your users. Using this API, you can create language models that understand intents. Using pre-built models from Bing and Cortana, you can make your models recognize common requests and entities (such as places, times, and numbers). You can add conversational intelligence to your applications.

Linguistic Analysis

The Linguistic Analysis API lets you parse complex text to explore its structure. Using this API, you can find nouns, verbs, and more in a text, which allows your application to understand who is doing what to whom.

Text Analysis

The Text Analysis API will help you extract information from text. You can find the sentiment of a text (whether it is positive or negative). You will be able to detect the language, topics, and key phrases used throughout the text.

Web Language Model

Using the Web Language Model (WebLM) API, you are able to leverage the power of language models trained on web-scale data. You can use this API to predict which words or sequences follow a given sequence or word.

Knowledge

When talking about Knowledge APIs, we are talking about APIs that allow you to tap into rich knowledge. This may be knowledge from the Web, it may be academia, or it may be your own data. Using these APIs, you will be able to explore different nuances of knowledge.

The following four APIs are contained in the Knowledge area.

Academic

Using the Academic API, you can explore relationships among academic papers, journals, and authors. This API allows you to interpret natural language query strings, which allows your application to anticipate what the user is typing. It will evaluate the query expression and return academic knowledge entities.

Entity Linking

Entity Linking is the API you would use to extend knowledge of people, places, and events based on the context. As you may know, a single word may be used differently based on the context. Using this API allows you to recognize and identify each separate entity within a paragraph based on the context.

Knowledge Exploration

The Knowledge Exploration API will let you add the ability to use interactive search for structured data in your projects. It interprets natural language queries and offers auto-completions to minimize user effort. Based on the query expression received, it will retrieve detailed information about matching objects.

Recommendations

The Recommendations API allows you to provide personalized product recommendations for your customers. You can use this API to add frequently-bought-together functionality to your application. Another feature you can add is item-to-item recommendations, which lets customers see what other customers who liked this item also liked. This API also allows you to add recommendations based on the customer's prior activity.

Search

Search APIs give you the ability to make your applications more intelligent with the power of Bing. Using these APIs, you can use a single call to access data from billions of web pages, images, videos, and news articles.

The following five APIs are in the search domain.

Bing Web Search

With Bing Web Search, you can search for details in billions of web documents indexed by Bing. All the results can be arranged and ordered according to the layout you specify, and the results are customized to the location of the end user.
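A search is just another authenticated HTTP call, following the same pattern sketched earlier for Computer Vision. A minimal sketch (the endpoint version is an assumption from the time of writing; verify it against the current Bing Search documentation):

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class BingWebSearchSample
{
   public static async Task<string> SearchAsync(string query, string subscriptionKey)
   {
      using (var client = new HttpClient())
      {
         client.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);

         // NOTE: The endpoint URL/version is an assumption; check the Bing Web Search documentation.
         string url = "https://api.cognitive.microsoft.com/bing/v5.0/search?q=" + Uri.EscapeDataString(query);

         return await client.GetStringAsync(url); // JSON containing web pages, images, videos, and news
      }
   }
}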

Bing Image Search

Using the Bing Image Search API, you can add advanced image and metadata search to your application. Results include URLs to images, thumbnails, and metadata. You will also be able to get machine-generated captions, similar images, and more. This API allows you to filter the results based on image type, layout, freshness (how new the image is), and license.

Bing Video Search

Bing Video Search allows you to search for videos and returns rich results. The results contain metadata from the videos, static or motion-based thumbnails, and the video itself. You can filter the results based on freshness, video length, resolution, and price.

Bing News Search

If you add Bing News Search to your application, you can search for news articles. Results can include authoritative images, related news and categories, information about the provider, URLs, and more. To be more specific, you can filter news by topic.

Bing Autosuggest

The Bing Autosuggest API is a small but powerful one. It allows your users to search faster with search suggestions, letting you connect powerful search to your apps.

Detecting faces with the Face API

We have seen what the different APIs can do. Now we will test the Face API. We will not be doing a whole lot, but we will see how simple it is to detect faces in images.

The steps we need to cover to do this are as follows:

  1. Register for a free Face API preview subscription.
  2. Add necessary NuGet packages to our project.
  3. Add some UI to the test application.
  4. Detect faces on command.

Head over to https://www.microsoft.com/cognitive-services/en-us/face-api to start the process of registering for a free subscription to the Face API. By clicking on the yellow button stating Get started for free, you will be taken to a login page. Log in with your Microsoft account, or if you do not have one, register for one.

Once logged in, you will need to verify that the Face API Preview has been selected in the list and accept the terms and conditions. With that out of the way, you will be presented with a page showing your subscription keys.

You will need one of the two keys later, when we are accessing the API.

In Visual Studio, create a new WPF application. Following the instructions at https://www.codeproject.com/articles/100175/model-view-viewmodel-mvvm-explained, create a base class that implements the INotifyPropertyChanged interface and a class implementing the ICommand interface. The first should be inherited by the ViewModel, the MainViewModel.cs file, while the latter should be used when creating properties to handle button commands.
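If you do not want to work through that article, a minimal sketch of the two helper classes covering exactly what this example needs could look like the following (the ObservableObject class name is an assumption; DelegateCommand and the member names match the calls used later in this article):

using System;
using System.ComponentModel;
using System.Windows.Input;

// Base class for the ViewModel; raises PropertyChanged so bound UI elements refresh.
public abstract class ObservableObject : INotifyPropertyChanged
{
   public event PropertyChangedEventHandler PropertyChanged;

   protected void RaisePropertyChangedEvent(string propertyName)
   {
      PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
   }
}

// ICommand implementation used for the button bindings.
public class DelegateCommand : ICommand
{
   private readonly Action<object> _execute;
   private readonly Predicate<object> _canExecute;

   public DelegateCommand(Action<object> execute, Predicate<object> canExecute = null)
   {
      _execute = execute;
      _canExecute = canExecute;
   }

   public event EventHandler CanExecuteChanged
   {
      add { CommandManager.RequerySuggested += value; }
      remove { CommandManager.RequerySuggested -= value; }
   }

   // With no predicate supplied, the command can always execute.
   public bool CanExecute(object parameter)
   {
      return _canExecute == null || _canExecute(parameter);
   }

   public void Execute(object parameter)
   {
      _execute(parameter);
   }
}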

The Face API has a NuGet package, so we need to add that to our project. Open the NuGet Package Manager for the project we created earlier. On the Browse tab, search for the Microsoft.ProjectOxford.Face package and install the package from Microsoft.

As you will notice, another package will also be installed. This is the Newtonsoft.Json package, which is required by the Face API.

The next step is to add some UI to our application. We will be adding this in the MainView.xaml file.

First, we add a grid and define some rows for the grid:

<Grid>
   <Grid.RowDefinitions>
      <RowDefinition Height="*" />
      <RowDefinition Height="20" />
      <RowDefinition Height="30" />
   </Grid.RowDefinitions>

Three rows are defined. The first is the row where we will show the image. The second is a line for the status message, and the last is where we will place the buttons.

Next, we add our image element:

<Image x:Name="FaceImage" Stretch="Uniform" Source="{Binding ImageSource}" Grid.Row="0" />

We have given it a unique name. By setting the Stretch parameter to Uniform, we ensure that the image keeps its aspect ratio. Further on, we place this element in the first row. Lastly, we bind the image source to a BitmapImage property in the ViewModel, which we will look at in a bit.

The next row will contain a text block with some status text. The text property will be bound to a string property in the ViewModel:

<TextBlock x:Name="StatusTextBlock" Text="{Binding StatusText}" Grid.Row="1" />

The last row will contain one button to browse for an image and one button to be able to detect faces. The command properties of both buttons will be bound to the DelegateCommand properties in the ViewModel:

<Button x_Name="BrowseButton" Content="Browse" Height="20" Width="140" HorizontalAlignment="Left" Command="{Binding BrowseButtonCommand}" Margin="5, 0, 0, 5" Grid.Row="2" />

<Button x_Name="DetectFaceButton" Content="Detect face" Height="20" Width="140" HorizontalAlignment="Right" Command="{Binding DetectFaceCommand}" Margin="0, 0, 5, 5" Grid.Row="2"/>

With the View in place, make sure that the code compiles and run it. You should be presented with a window containing an empty image area, a status line, and the Browse and Detect face buttons.

The last part is to create the binding properties in our ViewModel and make the buttons execute something. Open the MainViewModel.cs file. First, we define two variables:

private string _filePath;
private IFaceServiceClient _faceServiceClient;

The string variable will hold the path to our image, while the IFaceServiceClient variable is the interface we use to access the Face API. Next, we define two properties:

private BitmapImage _imageSource;
public BitmapImage ImageSource
{
   get { return _imageSource; }
   set
   {
      _imageSource = value;
      RaisePropertyChangedEvent("ImageSource");
   }
}

private string _statusText;
public string StatusText
{
   get { return _statusText; }
   set
   {
      _statusText = value;
      RaisePropertyChangedEvent("StatusText");
   }
}

What we have here is a BitmapImage property, mapped to the Image element in the view. We also have a string property for the status text, mapped to the text block element in the view. As you may also notice, when either of the properties is set, we call the RaisePropertyChangedEvent method. This ensures that the UI is updated when either of the properties gets a new value.

Next, we define our two DelegateCommand objects and do some initialization through the constructor:

public ICommand BrowseButtonCommand { get; private set; }
public ICommand DetectFaceCommand { get; private set; }

public MainViewModel()
{
   StatusText = "Status: Waiting for image...";

   _faceServiceClient = new FaceServiceClient("YOUR_API_KEY_HERE");

   BrowseButtonCommand = new DelegateCommand(Browse);
   DetectFaceCommand = new DelegateCommand(DetectFace, CanDetectFace);
}

In our constructor, we start off by setting the status text. Next, we create the Face API client, which needs to be constructed with the API key we obtained earlier.

Lastly, we create the DelegateCommand objects for our command properties. Note how the browse command does not specify a predicate. This means it will always be possible to click the corresponding button. To make this compile, we need to create the functions specified in the DelegateCommand constructors: the Browse, DetectFace, and CanDetectFace functions:

private void Browse(object obj)
{
   var openDialog = new Microsoft.Win32.OpenFileDialog();

   openDialog.Filter = "JPEG Image(*.jpg)|*.jpg";
   bool? result = openDialog.ShowDialog();

   if (result != true) return;

We start the Browse function by creating an OpenFileDialog object. This dialog is assigned a filter for JPEG images, and in turn it is opened. When the dialog is closed, we check the result. If the dialog was cancelled, we simply stop further execution:

_filePath = openDialog.FileName;
Uri fileUri = new Uri(_filePath);

With the dialog closed, we grab the filename of the file selected and create a new URI from it:

BitmapImage image = new BitmapImage();

image.BeginInit();
image.CacheOption = BitmapCacheOption.None;
image.UriSource = fileUri;
image.EndInit();

With the newly created URI, we create a new BitmapImage object. Between BeginInit and EndInit, we specify that it should not use a cache, and we set the URI source to the URI we created:

ImageSource = image;
StatusText = "Status: Image loaded...";
}

The last step we take is to assign the bitmap image to our BitmapImage property, so the image is shown in the UI. We also update the status text to let the user know the image has been loaded.

Before we move on, it is time to make sure that the code compiles and that you are able to load an image into the View:

private bool CanDetectFace(object obj)
{
   return !string.IsNullOrEmpty(ImageSource?.UriSource.ToString());
}

The CanDetectFace function checks whether or not the detect faces button should be enabled. In this case, it checks whether our image property actually has a URI. If it does, by extension that means we have an image, and we should be able to detect faces:

private async void DetectFace(object obj)
{
   FaceRectangle[] faceRects = await UploadAndDetectFacesAsync();

   string textToSpeak = "No faces detected";

   if (faceRects.Length == 1)
      textToSpeak = "1 face detected";
   else if (faceRects.Length > 1)
      textToSpeak = $"{faceRects.Length} faces detected";

   Debug.WriteLine(textToSpeak);
}

Our DetectFace method calls an async method to upload the image and detect faces. The return value is an array of FaceRectangle objects, containing the rectangle area for each face found in the given image. We will look into the function we call in a bit.

After the call has finished executing, we print a line with the number of faces to the debug console window:

private async Task<FaceRectangle[]> UploadAndDetectFacesAsync()
{
   StatusText = "Status: Detecting faces...";

   try
   {
      using (Stream imageFileStream = File.OpenRead(_filePath))
      {

In the UploadAndDetectFacesAsync function, we create a Stream object from the image. This stream will be used as input for the actual call to the Face API service:

Face[] faces = await _faceServiceClient.DetectAsync(imageFileStream, true, true, new List<FaceAttributeType>() { FaceAttributeType.Age });

This line is the actual call to the detection endpoint for the Face API. The first parameter is the file stream we created in the previous step. The rest of the parameters are all optional. The second parameter should be true if you want to get a face ID. The next specifies if you want to receive face landmarks or not. The last parameter takes a list of facial attributes you may want to receive. In our case, we want the age parameter to be returned, so we need to specify that.

The return type of this function call is an array of faces with all the parameters you have specified:

List<double> ages = faces.Select(face => face.FaceAttributes.Age).ToList();
      FaceRectangle[] faceRects = faces.Select(face => face.FaceRectangle).ToArray();

      StatusText = "Status: Finished detecting faces...";

      foreach(var age in ages)
      {
         Debug.WriteLine(age);
      }

      return faceRects;
   }
}

The first line in the previous code iterates over all faces and retrieves the approximate age of all faces. This is later printed to the debug console window, in the following foreach loop.

The second line iterates over all faces and retrieves the face rectangle with the rectangular location of all faces. This is the data we return to the calling function.

Add a catch clause to finish the method, so that any exception thrown by our API call is caught. You will want to show the error message and return an empty FaceRectangle array.
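A sketch of how that catch clause, together with the method's closing brace, might look (here the error is surfaced through the status text, but a message box would work just as well):

   catch (Exception ex)
   {
      StatusText = $"Status: Failed to detect faces - {ex.Message}";
      return new FaceRectangle[0];
   }
}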

With that code in place, you should now be able to run the full example: browse for an image and detect the faces in it.

The resulting debug console window will print the following text:

1 face detected
23,7

Summary

In this article, we looked at what Microsoft Cognitive Services offer. We got a brief description of all the APIs available. From there, we looked into the Face API, where we saw how to detect faces in images.
