NIPS 2017 Special: How machine learning for genomics is bridging the gap between research and clinical trial success by Brendan Frey

10 min read

Brendan Frey is the founder and CEO of Deep Genomics. He is the professor of engineering and medicine at the University of Toronto. His major work focuses on using machine learning to model genome biology and understand genetic disorders.

This article attempts to bring our readers to Brendan’s Keynote speech at NIPS 2017. It highlights how the human genome can be reprogrammed using Machine Learning and gives a glimpse into some of the significant work going on in this field. After reading this article, head over to the NIPS Facebook page for the complete keynote. All images in this article come from Brendan’s presentation slides and do not belong to us.

65% of people in their lifetime are at a risk of acquiring a disease with a genetic basis. 8 million births per year are estimated to have a serious genetic defect. According to the US healthcare system, the lifetime average cost of such a baby is 5M$ per child. These are just some statistics. If we also add the emotional component to this data, it gives us an alarming picture of the state of the healthcare industry today.

According to a recent study,  investing in pharma is no longer as lucrative as it used to be in the 90s. Funding for this sector is dwindling, which serves as a barrier to drug discovery, trial, and deployment. All of these, in turn, add to the rising cost of healthcare.

Better to stuff your money in a mattress than put it in a pharmaceutical company!

Genomics as a field is rich in data. Experts in genomics strive to determine complete DNA sequences and perform genetic mapping to help understand a disease. However, the main problem confronting Genome Biology and Genomics is the inability to decipher information from the human genome i.e. how to convert the genome into actionable information.

What genes are made of and why sequencing matter

Essentially each gene consists of a promoter region, which basically activates the gene. Following the promoter region, there are alternating Exons and Introns. Introns are almost 10,000 nucleotides long. Exons are relatively short, around 100 nucleotides long. In software terms, you can think of Exons as print statements. Exons are the part that ends up in proteins. Introns get cut out/removed. However, Introns contain crucial control logic. There are words embedded in introns that tells the cells how to cut and paste these exons together and make the gene.

A DNA sequence is transcribed into RNA and then the RNA is processed in various ways to translate into proteins. However, the picture is much more complicated than this. Proteins go back and interact with the DNA. Proteins also interact with RNA. even, RNA interacts with protein. So all these entities are interrelated.

All these technicalities and interrelationships make biology generally complex for researchers or even a group of researchers to fully understand and make sense of the data. Another way to look at this is, that in the recent years our ability to measure biology (fitbits, tabloids, genomes) and the ability to alter biology (DNA editing) has far surpassed our ability to understand biology. In short, in this field, we have become very good at collecting data but not as good with interpreting it.

Machine Learning brought to genomes

Deep Genomics, is a genetic medicine company that uses an AI-driven platform to support geneticists, molecular biologists and chemists in the development of genetic therapies.

In 2010, Deep Genomics used machine learning to understand how words embedded in introns control print statements splicing which puts exons into proteins. They also used machine learning to reverse engineer to infer those code words using datasets.

Another deep genomic research project talks about Protein-DNA binding data. There are datasets which allow you to measure interactions between protein and DNA and understand how that works.  In this research, they took a dataset from Ray et al 2013 which consisted of 240,000 designed sequences and then evaluated which are the proteins that the sequence likes to stick to. Thus generating a big data matrix of proteins and designed sequences.

The machine learning task here was to learn to take a sequence and predict whether the protein will bind to that sequence.

How was this done?

They took batches of data containing the designed sequences and fed it into a convolutional neural network. The CNN swept across those sequences to generate an intermediary representation.  This representation was then fed into different layers of convolutional pooling and fully connected layers produced the output. The output was then compared to the measurement (the data matrix of proteins and designed sequences described earlier ) and backpropagation was used to update the parameters.

One of the challenges was figuring out the right metric. For this, they compared the measured binding affinity (how much protein sticks to the sequence) to the output of the neural network and determined the right cos function for producing a neural network that is useful in practice.


One of the use cases of this neural network is to identify the pathological mutation and fix them.

The above illustration is a sequence of the cholesterol gene. The researchers artificially in silico looked at every possible mutation in the promoter. So for each nucleotide, say if the nucleotide had a value of A, they switched it to G, C, and T and for each of those possibilities, they ran the entire promoter through a neural network and looked at its output.

The neural network then predicted the mutations that will disrupt the protein binding. The heights of the letter showed the measured binding affinity i.e the output of the neural network.  The white box displays how much the mutation changed the output.  Pink or bright red was used in case of positive mutation, blue in case of negative mutation and white for no change. This map was then compared with known results to see the accuracy and also make predictions never seen before in a clinical trial.

As shown in the image, Blues, which are the potential or known harmful mutations have correctly fallen in the white spaces. But there are some unknown mutations as well. Machine learning output such as this can help researchers narrow their focus on learning about new diseases and also in diagnosing existing ones and treating them.

Another group of researchers used a neural network to figure out the 3D structure or the chromatin interaction structure of the DNA. The data used was a matrix form and showed how strongly two parts of a DNA are likely to interact. The researchers trained a multilayer convolutional network to take as input the raw DNA sequence and also a signal called chromatin accessibility( tells how available the DNA is) and fed it into CNN. The output of that system predicted the probability of contact which is crucial for gene expression.

Deep genomics: Using AI to build a new universe of digital medicines

The founding belief at Deep Genomics is that the future of medicine will rely on artificial intelligence, because biology is too complex for humans to understand.

The goal of deep genomics is to build AI platform for detecting and treating genetic disease.

Genome tools

Genome processing tools are tools which help in identification of mutation, e.g DeepVariant. At deep genomics, the tool used is called genomic kit which is 20 to 800 times faster than other existing tools.

Disease mechanism Prediction

This is about figuring whether the disease mechanism is pathological or the mutation which simply changes hair color.

Therapeutic Development

Helping patients by providing them with better medicines.

These are the basics of any drug development procedure.

  • We start with patient genetic data and clinical mutations.
  • Then we find the disease mechanism and figure the mechanism of action( the steps to remediate the problem). However, the disease mechanism and mechanism of action of a potential drug may not be the inverse of one another.
  • The next step is to design a drug. With Digital medicines, if we know the mechanism of action that we are trying to achieve, and we have ML systems, like the ones described earlier, we can simulate the effects of modifying DNA or RNA. Thus we can, in silico design the compound we want to test.
  • Next, we test the experimental work in the wet lab to actually see if it alters the way in which ML systems predicted.
  • The next thing is toxicity or off-target effects. This evaluates if the compound is going to change some other part of the genome or has some unintended consequences.
  • Next, we have clinical trials. In case of clinical trials, the biggest problems facing pharmaceutical companies is patient’s gratification.
  • Then comes the marketing and distribution of that drug which is highly costly. This includes marketing strategies to convince people to buy those drugs, insurance companies to pay for them, and legal companies to deal with litigations.

Here’s how long it took Ionis and Biogen, to develop Spinraza, which is a drug for curing Spinal Muscular Atrophy (SMA).

It is the most effective drug for curing SMA. It has saved hundreds of lives already. However, it costs 750,000$ per child per year. Why does it cost so much? If we look at the timeline of the development of Spinraza, the initial period of testing was quite long.

The goal of deep Genomics is to use ML to accelerate the research period of drugs such as Spinraza from 8 years down to a couple of years. They also aim to use AI to accelerate clinical trials, toxicity studies, and other aspects of drug development. The whole idea is to reduce the amount of time needed to develop the drug.

Deep genomics uses AI to automate and accelerate each of these steps and make it fast and accurate. However, apart from AI, they also test compounds at their wet lab in human cells to see if they work.

They also use Cloud Laboratory. At cloud lab, they upload a python script. Once uploaded, it specifies the experimental protocols and then robots conduct these experiments. These labs rapidly scale up the ability to do experiments, test compounds, and solve other problems.

Earning trust of stakeholders

One of the major issues ML systems face in the genomics industry is earning the trust of the stakeholders. These stakeholders include the patients, the physicians treating the patients, the insurance companies paying for the treatments, different technology providers, and the hospitals.

Machine learning practitioners are also often criticized for producing black boxes, that are not open to interpretation.

The way to gain this trust is to exactly figure out what these stakeholders need.  For this, machine learning systems need to explain the intermediary steps of a prediction. For instance, instead of directly recommending double mastectomy, the system says you have a mutation, the mutation is going to cause splicing to go wrong, leading to malfunctioning protein, which is likely to lead to breast cancer. The likelihood is x%.

The road ahead

Researchers at Deep Genomics pare currently working primarily on Project Saturn. The idea is to use a Machine learning system to scan a vast space of 69 billion molecules all in silico and identify about a thousand active compounds. Active compounds allow us to manipulate cell biology. Think about it as 1000 control switches which we can turn and twist to adjust what is going inside a cell, a toolkit for therapeutic development. They plan to have 3 compounds in clinical trials within the next 3 years.

Sugandha Lahoti
Content Marketing Editor at Packt Hub. I blog about new and upcoming tech trends. In my free time, you can find me humming songs and hunting for new novels to read.

Share this post


12,000+ unsecured MongoDB databases deleted by Unistellar attackers

Over the last three weeks, more than 12,000 unsecured MongoDB databases have been deleted. The cyber-extortionist have left only an email contact, most likely...