
Augmented Reality (AR) filters used in applications such as Snapchat and Instagram have gained worldwide popularity.

This tutorial is an excerpt taken from the book ‘Machine Learning Projects for Mobile Applications’ written by Karthikeyan NG.

In this tutorial, we will look at how you can build your own Augmented Reality (AR) filter using TensorFlow Lite, a platform that allows you to run machine learning models on mobile and embedded devices. With this application, we will place AR filters on top of a real-time camera view.

Using AR filters, we can add a mustache at the facial key points of a male face, and add a relevant emotional expression on top of the eyes. TensorFlow Lite models are used to detect gender and emotion from the camera view. We will look at concepts such as MobileNet models and building the dataset required for model conversion, before seeing how to build the Android application.

MobileNet models

We use the MobileNet model to identify gender, while the AffectNet model is used to detect emotion. Facial key point detection is achieved using Google’s Mobile Vision API.

TensorFlow offers various pre-trained, drop-in models that can identify approximately 1,000 default object classes. Compared with similar models such as Inception, MobileNet offers a better balance of latency, size, and accuracy. In terms of output performance, there is a significant amount of lag with a full-fledged model. However, the trade-off is acceptable when the model needs to be deployed on a mobile device for real-time, offline detection.

The MobileNet architecture handles the 3 x 3 convolution layer differently from a typical CNN: it factorizes it into a 3 x 3 depthwise convolution (one filter per input channel) followed by a 1 x 1 pointwise convolution that combines the channels. For 3 x 3 kernels, this depthwise separable structure uses roughly 8 to 9 times less computation than a standard convolution, at only a small reduction in accuracy.

For a more detailed explanation of the MobileNet architecture, please visit https://arxiv.org/pdf/1704.04861.pdf.

Let’s look at an example of how to use MobileNet. Rather than building yet another generic classifier, we will write a simple one that detects whether Pikachu appears in an image. The following sample pictures show an image with Pikachu and an image without Pikachu:

Building the dataset

To build our own classifier, we need datasets that contain images with and without Pikachu. You can start with 1,000 images in each category, and you can pull down such images from https://search.creativecommons.org/.

Let’s create two folders named pikachu and no-pikachu and drop those images accordingly. Always ensure that you have the appropriate licenses to use any images, especially for commercial purposes.

You can also use this image scraper for the Google and Bing APIs: https://github.com/rushilsrivastava/image_search.

Now we have an image folder, which is structured as follows:

/dataset/
     /pikachu/[image1,..]
     /no-pikachu/[image1,..]

Retraining of images 

We can now start labeling our images. With TensorFlow, this job becomes easier. Assuming that you have installed TensorFlow already, download the following retraining script:

curl -LO https://github.com/tensorflow/hub/blob/master/examples/image_retraining/retrain.py

Let’s retrain the image with the Python script now:

python retrain.py \
 --image_dir ~/MLmobileapps/Chapter5/dataset/ \
 --learning_rate=0.0001 \
 --testing_percentage=20 \
 --validation_percentage=20 \
 --train_batch_size=32 \
 --validation_batch_size=-1 \
 --eval_step_interval=100 \
 --how_many_training_steps=1000 \
 --flip_left_right=True \
 --random_scale=30 \
 --random_brightness=30 \
 --architecture mobilenet_1.0_224 \
 --output_graph=output_graph.pb \
 --output_labels=output_labels.txt

If you set validation_batch_size to -1, it validates on the whole dataset. A learning_rate of 0.0001 works well here; you can adjust it and experiment for yourself. With the architecture flag, we choose which MobileNet variant to use: a width multiplier of 1.0, 0.75, 0.50, or 0.25. The suffix 224 is the input image resolution; you can also specify 192, 160, or 128.

Model conversion from GraphDef to TFLite

TocoConverter is used to convert from a TensorFlow GraphDef file or SavedModel into either a TFLite FlatBuffer or graph visualization. TOCO stands for TensorFlow Lite Optimizing Converter.

We need to pass the data through command-line arguments. A few of the command-line arguments available with TensorFlow 1.10.0 are listed below:

 --output_file OUTPUT_FILE
 Filepath of the output tflite model.
 --graph_def_file GRAPH_DEF_FILE
 Filepath of input TensorFlow GraphDef.
 --saved_model_dir 
 Filepath of directory containing the SavedModel.
 --keras_model_file
 Filepath of HDF5 file containing tf.Keras model.
 --output_format {TFLITE,GRAPHVIZ_DOT}
 Output file format.
 --inference_type {FLOAT,QUANTIZED_UINT8}
 Target data type of arrays in the output file.
 --inference_input_type {FLOAT,QUANTIZED_UINT8}
 Target data type of real-number input arrays.
 --input_arrays INPUT_ARRAYS
 Names of the input arrays, comma-separated.
 --input_shapes INPUT_SHAPES
 Shapes corresponding to --input_arrays, colon-separated.
 --output_arrays OUTPUT_ARRAYS
 Names of the output arrays, comma-separated.

We can now use the toco tool to convert the TensorFlow model into a TensorFlow Lite model:

toco \
 --graph_def_file=/tmp/output_graph.pb \
 --output_file=/tmp/optimized_graph.tflite \
 --input_arrays=Mul \
 --output_arrays=final_result \
 --input_format=TENSORFLOW_GRAPHDEF \
 --output_format=TFLITE \
 --input_shape=1,224,224,3 \
 --inference_type=FLOAT \
 --input_data_type=FLOAT

Similarly, we have two model files used in this application: the gender model and emotion model. These will be explained in the following two sections.

To convert models with TensorFlow 1.9.0 through 1.11.0, use TocoConverter; it is semantically identical to the later TFLiteConverter. To convert models built prior to TensorFlow 1.9, use the toco_convert function. Run help(tf.contrib.lite.toco_convert) to get details about the accepted parameters.

Gender model

This is built on the IMDB WIKI dataset, which contains 500k+ celebrity faces. It uses the MobileNet_V1_224_0.5 version of MobileNet.

The link to the data model project can be found here: https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki/.

It is very rare to find public datasets with hundreds of thousands of face images. This dataset is built on top of a large collection of celebrity faces from two common sources: IMDb and Wikipedia. The details of more than 100,000 celebrities were retrieved from their profiles on both sources through scripts, and the collection was then cleaned by removing noise (irrelevant content).

Emotion model

This is built on the AffectNet model with more than 1 million images. It uses the MobileNet_V2_224_1.4 version of MobileNet.

The link to the data model project can be found here: http://mohammadmahoor.com/affectnet/.

The AffectNet model is built by collecting and annotating facial images of more than 1 million faces from the internet. The images were sourced from three search engines, using around 1,250 related keywords in six different languages.

Comparison of MobileNet versions

In our two models, we use different versions of MobileNet. MobileNet V2 is an updated version of V1 that is even more efficient and delivers better performance. Let's compare the two models on a few factors:

The numbers shown above for MobileNet V1 and V2 refer to the model versions with a 1.0 depth multiplier; lower is better in this table. From these results, we can see that V2 is almost twice as fast as the V1 model. On a mobile device, where memory access is more limited than computational capability, V2 works very well.

MACs (multiply-accumulate operations) measure how many calculations are needed to perform inference on a single 224×224 RGB image; as the image size increases, more MACs are required.

From the number of MACs alone, V2 should be almost twice as fast as V1. However, it's not just about the number of calculations: on mobile devices, memory access is much slower than computation. Here V2 has the advantage too, since it has only about 80% of the parameter count of V1. Now, let's look at performance in terms of accuracy:

The accuracy figures shown above were measured on the ImageNet dataset. These numbers can be misleading, as they depend on the constraints taken into account while deriving them.

The IEEE paper behind the AffectNet model can be found here: http://mohammadmahoor.com/wp-content/uploads/2017/08/AffectNet_oneColumn-2.pdf.

Building the Android application

Now create a new Android project from Android Studio. This should be called ARFilter, or whatever name you prefer:

On the next screen, select the minimum Android OS version that our application supports: choose API 15 (not shown in the image), which covers almost all existing Android phones. When you are ready, press Next. On the next screen, select Add No Activity and click Finish. This creates an empty project:

Once the project is created, let’s add one Empty Activity. We can select different activity styles based on our needs:

Mark the created activity as the Launcher Activity by selecting the corresponding checkbox. This adds an intent filter under that activity in the AndroidManifest.xml file:

<intent-filter>
    <action android:name="android.intent.action.MAIN" />
    <category android:name="android.intent.category.LAUNCHER" />
</intent-filter>

<intent-filter>: To advertise which implicit intents your app can receive, declare one or more intent filters for each of your app components with an <intent-filter> element in your manifest file. Each intent filter specifies the type of intents it accepts based on the intent’s action, data, and category. The system delivers an implicit intent to your app component only if the intent can pass through one of your intent filters. Here, the intent filter marks this activity as the first one launched when the user opens the app.

Next, we will name the launcher activity:

Once the activity is created, let’s start designing the user interface (UI) layout for it. Here, the user selects which model to use in the application. We have two models, for gender and emotion detection, whose details we discussed earlier. In this activity, we add two buttons and wire each one to its corresponding model classifier, shown as follows:
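
The exact layout and onCreate() code are not part of this excerpt, but a minimal sketch of how the two buttons might be wired up in ModelSelectionActivity could look like the following (everything beyond the R.id.genderbtn and R.id.emotionbtn IDs, including the layout name, is an assumption):

@Override
protected void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    // Layout name is an assumption; it should contain the two buttons below
    setContentView(R.layout.activity_model_selection);

    // Route both button clicks to this activity's onClick() handler shown next
    findViewById(R.id.genderbtn).setOnClickListener(this);
    findViewById(R.id.emotionbtn).setOnClickListener(this);
}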

With the selection of the corresponding model, we launch the next activity using a click listener in the ModelSelectionActivity class, as follows. Based on whether the gender identification or the emotion identification button is clicked, we pass that information on to ARFilterActivity so that the corresponding model is loaded into memory:

@Override
public void onClick(View view) {
    int id = view.getId();

    if(id==R.id.genderbtn){
        Intent intent = new Intent(this, ARFilterActivity.class);
        intent.putExtra(ARFilterActivity.MODEL_TYPE,"gender");
        startActivity(intent);
    }
    else if(id==R.id.emotionbtn){
        Intent intent = new Intent(this,ARFilterActivity.class);
        intent.putExtra(ARFilterActivity.MODEL_TYPE,"emotion");
        startActivity(intent);
    }
}

Intent: An Intent is a messaging object you can use to request an action from another app component. Although intents facilitate communication between components in several ways, there are three fundamental use cases: starting an activity, starting a service, and delivering a broadcast.

In ARFilterActivity, we have the real-time camera view classification. The extra that was passed in is received inside the filter activity, where the corresponding classifier type is determined as follows; based on that selection, the matching model is loaded inside ARFilterActivity's onCreate() method:

public static String classifierType(){
    // mn refers to the activity instance; read the "TYPE" extra (MODEL_TYPE) from the launching intent
    String type = mn.getIntent().getExtras().getString("TYPE");
    if(type!=null) {
        if(type.equals("gender"))
            return "gender";
        else
            return "emotion";
    }
    else
        return null;
}

The UI is designed to display the results in the bottom part of the screen via the activity_arfilter layout, as follows. CameraSourcePreview initiates the camera (the Camera2 API on supported devices) for a preview view, inside which we add the GraphicOverlay class. GraphicOverlay is a view that renders a series of custom graphics to be overlaid on top of an associated camera preview. The creator can add graphics objects, update them, and remove them, triggering the appropriate drawing and invalidation within the view.

It supports scaling and mirroring of the graphics relative to the camera's preview properties. The idea is that detection items are expressed in terms of a preview size, but need to be scaled up to the full view size, and also mirrored in the case of the front-facing camera:

<com.mlmobileapps.arfilter.CameraSourcePreview
    android:id="@+id/preview"
    android:layout_width="wrap_content"
    android:layout_height="wrap_content">

    <com.mlmobileapps.arfilter.GraphicOverlay
        android:id="@+id/faceOverlay"
        android:layout_width="match_parent"
        android:layout_height="match_parent" />
</com.mlmobileapps.arfilter.CameraSourcePreview>
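
To actually draw a filter (for example, the mustache bitmap) over a detected face, a Graphic subclass can be registered with the overlay. The following is only a rough sketch based on the GraphicOverlay class from Google's Mobile Vision samples; the FaceFilterGraphic class and the translateX()/translateY()/scaleX() helpers are assumptions taken from that sample, not the book's code:

class FaceFilterGraphic extends GraphicOverlay.Graphic {
    private final Bitmap filterBitmap;   // for example, the mustache image
    private volatile Face face;

    FaceFilterGraphic(GraphicOverlay overlay, Bitmap filterBitmap) {
        super(overlay);
        this.filterBitmap = filterBitmap;
    }

    void updateFace(Face face) {
        this.face = face;
        postInvalidate();   // ask the overlay view to redraw itself
    }

    @Override
    public void draw(Canvas canvas) {
        Face face = this.face;
        if (face == null) return;

        // Convert preview coordinates to view coordinates (scaled and mirrored)
        float centerX = translateX(face.getPosition().x + face.getWidth() / 2);
        float centerY = translateY(face.getPosition().y + face.getHeight() / 2);
        float width = scaleX(face.getWidth());

        RectF dst = new RectF(centerX - width / 2, centerY,
                centerX + width / 2, centerY + width / 2);
        canvas.drawBitmap(filterBitmap, null, dst, null);
    }
}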

We use the CameraPreview class from the Google open source project, and accessing the CAMERA requires user permission, with the exact flow depending on the Android API level:

Once we have the Camera API ready, we need the appropriate permissions from the user to utilize the camera, as shown below. We need the following permissions:

  • Manifest.permission.CAMERA
  • Manifest.permission.WRITE_EXTERNAL_STORAGE

private void requestPermissionThenOpenCamera() {
    if (ContextCompat.checkSelfPermission(context, Manifest.permission.CAMERA) == PackageManager.PERMISSION_GRANTED) {
        if (ContextCompat.checkSelfPermission(context, Manifest.permission.WRITE_EXTERNAL_STORAGE) == PackageManager.PERMISSION_GRANTED) {
            Log.e(TAG, "requestPermissionThenOpenCamera: " + Build.VERSION.SDK_INT);
            // The Camera2 API is available from Android 5.0 (Lollipop) onwards
            useCamera2 = (Build.VERSION.SDK_INT >= Build.VERSION_CODES.LOLLIPOP);
            createCameraSourceFront();
        } else {
            ActivityCompat.requestPermissions(this,
                    new String[]{Manifest.permission.WRITE_EXTERNAL_STORAGE},
                    REQUEST_STORAGE_PERMISSION);
        }
    } else {
        ActivityCompat.requestPermissions(this,
                new String[]{Manifest.permission.CAMERA},
                REQUEST_CAMERA_PERMISSION);
    }
}
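
When the user responds to either permission dialog, Android calls back into the activity. The book's exact handling is not shown in this excerpt, but a minimal sketch of that callback, which simply re-runs the check above, could look like this:

@Override
public void onRequestPermissionsResult(int requestCode, String[] permissions,
                                       int[] grantResults) {
    super.onRequestPermissionsResult(requestCode, permissions, grantResults);
    if (requestCode == REQUEST_CAMERA_PERMISSION || requestCode == REQUEST_STORAGE_PERMISSION) {
        if (grantResults.length > 0 && grantResults[0] == PackageManager.PERMISSION_GRANTED) {
            // Either asks for the remaining permission or opens the camera
            requestPermissionThenOpenCamera();
        } else {
            Log.e(TAG, "Camera/storage permission denied; cannot start the camera view");
        }
    }
}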

With this, we now have an application that has a screen where we can choose which model to load. On the next screen, we have the camera view ready. We now have to load the appropriate model, detect the face on the screen, and apply the filter accordingly.

Face detection on the real-time camera view is done through the Google Mobile Vision API. This can be added to your build.gradle as a dependency, as follows; you should always use the latest available version of the API:

api 'com.google.android.gms:play-services-vision:15.0.0'
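
With this dependency in place, face detection can be wired to the front camera roughly as follows. FaceDetector, MultiProcessor, and CameraSource are real classes from the Mobile Vision library, but the body of createCameraSourceFront() below (including the GraphicFaceTrackerFactory that pushes results onto the GraphicOverlay) is a sketch based on Google's samples, not the book's implementation:

private void createCameraSourceFront() {
    FaceDetector detector = new FaceDetector.Builder(context)
            .setLandmarkType(FaceDetector.ALL_LANDMARKS)   // eyes, nose base, mouth corners, and so on
            .setMode(FaceDetector.FAST_MODE)
            .setTrackingEnabled(true)
            .build();

    // Each detected face gets its own tracker, which updates the GraphicOverlay
    detector.setProcessor(
            new MultiProcessor.Builder<>(new GraphicFaceTrackerFactory()).build());

    mCameraSource = new CameraSource.Builder(context, detector)
            .setFacing(CameraSource.CAMERA_FACING_FRONT)
            .setRequestedPreviewSize(640, 480)
            .setRequestedFps(30.0f)
            .build();
}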

The image classification object is initialized inside the onCreate() method of ARFilterActivity and inside the ImageClassifier class. The corresponding model is loaded based on the user's selection, as follows:

private void initPaths(){
  String type = ARFilterActivity.classifierType();
  if(type!=null)
  {
    if(type.equals("gender")){
      MODEL_PATH = "gender.lite";
      LABEL_PATH = "genderlabels.txt";
    }
    else{
      MODEL_PATH = "emotion.lite";
      LABEL_PATH = "emotionlabels.txt";
    }
  }
}
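
Once MODEL_PATH is set, the .lite file has to be read from the assets folder and loaded into memory. The book covers the actual ImageClassifier implementation; as an illustration only, the standard TensorFlow Lite Java pattern for this step looks roughly like the following, where the result is handed to an org.tensorflow.lite.Interpreter:

// Memory-map the model file stored under assets/ so TFLite can read it directly
private MappedByteBuffer loadModelFile(Activity activity) throws IOException {
    AssetFileDescriptor fileDescriptor = activity.getAssets().openFd(MODEL_PATH);
    FileInputStream inputStream = new FileInputStream(fileDescriptor.getFileDescriptor());
    FileChannel fileChannel = inputStream.getChannel();
    return fileChannel.map(FileChannel.MapMode.READ_ONLY,
            fileDescriptor.getStartOffset(), fileDescriptor.getDeclaredLength());
}

// Later, for example in the ImageClassifier constructor:
// tflite = new Interpreter(loadModelFile(activity));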

Once the model is decided, we read the file and load it into memory, as sketched above. In this article, we looked at concepts such as MobileNet models and building the dataset required for model conversion, and then saw how to start building a Snapchat-like AR filter. If you want to know the further steps to build the AR filter, such as loading the model and so on, be sure to check out the book ‘Machine Learning Projects for Mobile Applications’.

Read Next

Snapchat source code leaked and posted to GitHub

Snapchat is losing users – but revenue is up

15 year old uncovers Snapchat’s secret visual search function
