Deep Learning is all set to revolutionize the music industry

Isn’t it spooky how Facebook can identify faces of your friends before you manually tag them? Have you been startled by Cortana, Siri or Google Assistant when they instantly recognize and act on your voice as you speak to these virtual assistants? Deep Learning is the driving force behind these uncanny yet innovative applications. The next thing that is all set to dive into deep learning is the music industry. Neural networks not only ease production and generation of songs, but also assist in music recommendation, transcription and classification.

Here are some ways that deep learning will elevate music and the listening experience itself:

Generating melodies with Neural Nets

At the most basic level, a deep learning algorithm follows 3 simple steps for music generation:

First, the neural net is trained with a sample data set of songs which are labelled. The labelling is done based on the emotions you want your song to convey (happy, sad, funny, etc).
For training, the program converts the speech of the data set in text format and then creates vector for each word. The training data can also be in the form of MIDI format which is a standard protocol for encoding musical notes.
After completing the training, the program is fed with a set of emotions as input. It identifies the associated input vectors and compares them to training vectors. The output is a melody or chords that represent the desired emotions.

Long short-term memory (LSTM) architectures are also used for music generation. They take structured input of a music notation. These inputs are then encoded as vectors and fed into an LSTM at each timestep. LSTM then predicts the encoding of the next timestep. Fully connected convolutional layers are utilized to increase the music quality and to represent rich features in the frequency domain.

Magenta, the popular art and music project of Google has launched Performance RNN, which is an LSTM-based recurrent neural network. It is designed to produce multiple sounds with expressive timing and dynamics. In other words, Performance RNN determines which notes to play, when to play them, and how hard to strike each note.

IBM’s Watson Beat uses a neural network to produce complete tracks by understanding music theory, structure, and emotional intent. According to Richard Daskas, a music composer working on the Watson Beat project, “Watson only needs about 20 seconds of musical inspiration to create a song.”

Transcripting music with deep learning

Deep learning methods can also be used for arranging a piece of music for a different instrument. LSTM networks are a popular choice for music transcription and modelling. These networks are trained using a large dataset of pre-labelled music transcriptions (expressed with ABC notation). These transcriptions are then used to generate new music transcriptions.

In fact, transformed audio data can be used to predict the group of notes currently being played. This can be achieved by treating the transcription model as an image classification problem. For this, an image of an audio is used, called as Spectrogram. A spectrogram displays how the spectrum or frequency content changes over time. A Short Time Fourier Transform (STFT) or a constant Q transform is used to create this spectrogram. The spectrogram is then feeded to a Convolutional Neural network(CNN). The CNN estimates current notes from audio data and determines what specific notes are present by analysing 88 output nodes for each of the piano keys. This network is generally trained using large number of examples from MIDI files spanning several different genres of music.

Magenta has developed The NSynth dataset, which is a high-quality multi-note dataset for music transcription. It is inspired by image recognition datasets and has a huge collection of annotated musical notes.

Make better music recommendations

Neural Nets are also used to make intelligent music recommendations and are a step ahead of the traditional Collaborative filtering networks. Using neural networks, the system can analyse the songs saved by the users, and then utilize those songs to make new recommendations. Neural nets can also be used to analyze songs based on musical qualities such as pitch, chord progression, bass, etc. Using the similarities between songs having the same traits as each other, neural networks can detect and predict new songs. Thus providing recommendation based on similar lyrical and musical styles.

Convolutional neural networks (CNNs) are utilized for making music recommendations. A time-frequency representation of the audio signal is fed into the network as the input. 3 second audio clips are randomly chosen from the audio samples to train the neural network.

The CNNs are then used to predict latent factors from music audio by taking the average of the predictions for consecutive clips. The feature extraction layers and pooling layers permits operation on several timescales.

Spotify is working on a music recommendation system with a CNN. This recommendation system, when trained on short clips of songs, can create playlists based on the audio content only.

Classifying music according to genre

Classifying music according to a genre is another achievement of neural nets. At the heart of this application lies the LSTM network.

At the very first stage, convolutional layers are used for feature extraction from the spectrograms of the audio file.
The sequence of features so obtained is given as input to the LSTM layer. LSTM evaluates dependencies of the song across both short time period as well as long term structure.
After the LSTM, the input is fed into a fully connected, time-distributed layer which essentially gives us a sequence of vectors.
These vectors are then used to output the network's evaluation of the genre of the song at the particular point of time.

Deepsound uses GTZAN dataset and an LSTM network to create a model for music genre recognition. On comparing the mean output distribution with the correct genre, the model gives almost 67% of accuracy.

For musical pattern extraction, MFCC feature dataset is used for audio analysis.

First, the audio signal is extracted in the MFCC format. Next, the input song is modified into an MFCC map.
This Map is then split to feed it as the input of the CNN. Supervised learning is used for automatically obtaining musical pattern extractors, considering the song label is provided.
The extractors so acquired, are used for restoring high-order pattern-related features.
After high-order classification, the result is combined and undergoes a voting process to produce the song-level label.

Scientists from Queen Mary University of London trained a neural net with over 6000 songs in a ballad, hip-hop, and dance to develop a neural network that achieves almost 75% accuracy in song classification.

The road ahead

Neural networks have advanced the state of music to whole new level where one would no longer require physical instruments or vocals to compose music. The future would see more complex models and data representations to understand the underlying melodic structure. This would help models create compelling artistic content on their own. Combination of music with technology would also foster a collaborative community consisting of artists, coders and deep learning researchers, leading to a tech-driven, yet artistic future.