
MIT’s Transparency by Design Network: A high-performance model that uses visual reasoning for machine interpretability


A team of researchers from MIT Lincoln Laboratory’s Intelligence and Decision Technologies Group has created a neural network named the Transparency by Design Network (TbD-net).

This network is capable of performing human-like reasoning to answer questions about the contents of images. The Transparency by Design model visually renders its thought process as it solves a problem, helping human analysts interpret its decision-making process.

The developers built TbD-net with the aim of making the inner workings of a neural network transparent, that is, of uncovering how the network arrives at what it thinks. One example is answering questions such as “What do the neural networks used in self-driving cars think the difference is between a pedestrian and a stop sign?” and “When was the neural network able to come up with that difference?” Finding these answers helps researchers teach the neural network to correct incorrect assumptions.

Beyond that, the Transparency by Design Network aims to close the gap between performance and interpretability, a common trade-off in today’s neural networks.

“Progress on improving performance in visual reasoning has come at the cost of interpretability,” says Ryan Soklaski, a TbD-net developer, as mentioned in the MIT blog post. TbD-net comprises a collection of “modules,” small neural networks specialized to perform specific subtasks. Whenever TbD-net is asked a visual-reasoning question about an image, it first breaks the question down into subtasks and then assigns the appropriate module to each one.
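To make that idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of what turning a question into a chain of sub-tasks could look like. The question, sub-task names, and hard-coded decomposition below are invented purely for illustration:

```python
# Hypothetical sketch of breaking a question into a "program" of sub-tasks.
# In TbD-net this mapping is produced by a learned component; here it is
# hard-coded only to show what a sub-task chain might look like.
def decompose(question: str) -> list[str]:
    toy_programs = {
        "What color is the large metal cube?": [
            "attend_large", "attend_metal", "attend_cube", "query_color",
        ],
    }
    return toy_programs[question]

print(decompose("What color is the large metal cube?"))
# ['attend_large', 'attend_metal', 'attend_cube', 'query_color']
```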

According to Majumdar, another TbD-net developer, “Breaking a complex chain of reasoning into a series of smaller subproblems, each of which can be solved independently and composed, is a powerful and intuitive means for reasoning.”

Each module then builds on the output of the module before it, eventually producing the final answer. Every module’s output is presented visually as an “attention mask,” which overlays heat-map blobs on the objects in the image that the module identifies as its answer.
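The sketch below shows, under assumptions, how such attention masks might flow from one module to the next: each toy module reads the image features weighted by the incoming mask and emits a new mask. The module class, feature sizes, and chain are illustrative, not the published TbD-net implementation:

```python
import torch
import torch.nn as nn

class AttentionModule(nn.Module):
    """Toy module: refines the previous attention mask using image features."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.conv = nn.Conv2d(feat_dim, 1, kernel_size=3, padding=1)

    def forward(self, image_feats: torch.Tensor, prev_mask: torch.Tensor) -> torch.Tensor:
        # Weight the image features by the incoming mask, then emit a new
        # heat-map-style mask over the image.
        return torch.sigmoid(self.conv(image_feats * prev_mask))

image_feats = torch.randn(1, 64, 28, 28)   # stand-in for CNN features of an image
mask = torch.ones(1, 1, 28, 28)            # initially attend to the whole image

chain = [AttentionModule(), AttentionModule(), AttentionModule()]  # e.g. large -> metal -> cube
for module in chain:
    mask = module(image_feats, mask)       # each module's mask feeds the next

print(mask.shape)  # torch.Size([1, 1, 28, 28]) -- can be rendered as a heat map
```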

Overall, TbD-net draws on standard AI techniques throughout the pipeline: optimization methods such as Adam for training, language-processing steps that interpret the human-language questions and break them into subtasks, and computer vision techniques such as convolutional neural networks to interpret the imagery, with visual reasoning used to share its decision-making process.
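For readers unfamiliar with those ingredients, here is a generic PyTorch pattern showing how a small convolutional network might be trained with the Adam optimizer. This is not the authors' training code; the tiny network, dummy data, and answer count are assumptions made only for illustration:

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(                        # tiny stand-in image encoder
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 10),                      # e.g. 10 possible answers
)
optimizer = torch.optim.Adam(cnn.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 64, 64)          # dummy batch of images
answers = torch.randint(0, 10, (8,))        # dummy answer labels

logits = cnn(images)                        # forward pass
loss = loss_fn(logits, answers)
optimizer.zero_grad()
loss.backward()                             # backpropagate
optimizer.step()                            # Adam update
print(float(loss))
```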

When TbD-net was put to the test, it achieved results surpassing the best-performing visual reasoning models. The model was evaluated on the CLEVR visual question-answering dataset, which consists of 70,000 training images and 700,000 questions, along with test and validation sets of 15,000 images and 150,000 questions. The model achieved a whopping 98.7 percent test accuracy on the dataset, and the developers then improved this result to 99.1 percent accuracy with the help of regularization and by increasing the spatial resolution.

The attention masks produced by the modules helped the researchers figure out what went wrong and refine the model, which is what pushed its accuracy to 99.1 percent.

“Our model provides straightforward, interpretable outputs at every stage of the visual reasoning process,” says Mascharka.

For more information, be sure to check out the official research paper.

Read Next

Optical training of Neural networks is making AI more efficient

Diffractive Deep Neural Network (D2NN): UCLA-developed AI device can identify objects at the speed of light

MIT’s Duckietown Kickstarter project aims to make learning how to program self-driving cars affordable

Natasha Mathur

Tech writer at the Packt Hub. Dreamer, book nerd, lover of scented candles, karaoke, and Gilmore Girls.

