How to Build a music recommendation system with PageRank Algorithm

[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering Spark for Data Science written by Andrew Morgan and Antoine Amend. In this book, you will learn about advanced Spark architectures, how to work with geographic data in Spark, and how to tune Spark algorithms to scale them linearly.[/box]

In today’s tutorial, we will learn to build a recommender with PageRank algorithm.

The PageRank algorithm

Instead of recommending a specific song, we will recommend playlists. A playlist would consist of a list of all our songs ranked by relevance, most to least relevant. Let's begin with the assumption that people listen to music in a similar way to how they browse articles on the web, that is, following a logical path from link to link, but occasionally switching direction, or teleporting, and browsing to a totally different website. Continuing with the analogy, while listening to music one can either carry on listening to music of a similar style (and hence follow their most expected journey), or skip to a random song in a totally different genre. It turns out that this is exactly how Google ranks websites by popularity using a PageRank algorithm.

For more details on the PageRank algorithm visit: h t t p ://i l p u b s . s t a n f o r d . e d u :8090/422/1/1999- 66. p d f .

The popularity of a website is measured by the number of links it points to (and is referred from). In our music use case, the popularity is built as the number hashes a given song shares with all its neighbors. Instead of popularity, we introduce the concept of song commonality.

Building a Graph of Frequency Co-occurrence

We start by reading our hash values back from Cassandra and re-establishing the list of song IDs for each distinct hash. Once we have this, we can count the number of hashes for each song using a simple reduceByKey function, and because the audio library is relatively small, we collect and broadcast it to our Spark executors:

val hashSongsRDD = sc.cassandraTable[HashSongsPair]("gzet", "hashes")

val songHashRDD = hashSongsRDD flatMap { hash =>

hash.songs map { song =>

((hash, song), 1)

}

val songTfRDD = songHashRDD map { case ((hash,songId),count) =>

(songId, count)

} reduceByKey(_+_)

val songTf = sc.broadcast(songTfRDD.collectAsMap())

Next, we build a co-occurrence matrix by getting the cross product of every song sharing a same hash value, and count how many times the same tuple is observed. Finally, we wrap the song IDs and the normalized (using the term frequency we just broadcast) frequency count inside of an Edge class from GraphX:

implicit class Crossable[X](xs: Traversable[X]) {

def cross[Y](ys: Traversable[Y]) = for { x <- xs; y <- ys } yield (x,

val crossSongRDD = songHashRDD.keys

.groupByKey()

.values

.flatMap { songIds =>

songIds cross songIds filter { case (from, to) =>

from != to

}.map(_ -> 1)

}.reduceByKey(_+_)

.map { case ((from, to), count) =>

val weight = count.toDouble /

songTfB.value.getOrElse(from, 1)

Edge(from, to, weight)

}.filter { edge =>

edge.attr > minSimilarityB.value

}

val graph = Graph.fromEdges(crossSongRDD, 0L)

We are only keeping edges with a weight (meaning a hash co-occurrence) greater than a predefined threshold in order to build our hash frequency graph.

Running PageRank

Contrary to what one would normally expect when running a PageRank, our graph is undirected. It turns out that for our recommender, the lack of direction does not matter, since we are simply trying to find similarities between Led Zeppelin and Spirit. A possible way of introducing direction could be to look at the song publishing date. In order to find musical influences, we could certainly introduce a chronology from the oldest to newest songs giving directionality to our edges.

In the following pageRank, we define a probability of 15% to skip, or teleport as it is known, to any random song, but this can be obviously tuned for different needs:

val prGraph = graph.pageRank(0.001, 0.15)

Finally, we extract the page ranked vertices and save them as a playlist in Cassandra via an RDD of the Song case class:

case class Song(id: Long, name: String, commonality: Double)

val vertices = prGraph

.vertices

.mapPartitions { vertices =>

val songIds = songIdsB

.value

.vertices

.map { case (songId, pr) =>

val songName = songIds.get(vId).get

Song(songId, songName, pr)

}

vertices.saveAsCassandraTable("gzet", "playlist")

The reader may be pondering the exact purpose of PageRank here, and how it could be used as a recommender? In fact, our use of PageRank means that the highest ranking songs would be the ones that share many frequencies with other songs. This could be due to a common arrangement, key theme, or melody; or maybe because a particular artist was a major influence on a musical trend. However, these songs should be, at least in theory, more popular (by virtue of the fact they occur more often), meaning that they are more likely to have mass appeal.

On the other end of the spectrum, low ranking songs are ones where we did not find any similarity with anything we know. Either these songs are so avant-garde that no one has explored these musical ideas before, or alternatively are so bad that no one ever wanted to copy them! Maybe they were even composed by that up-and-coming artist you were listening to in your rebellious teenage years. Either way, the chance of a random user liking these songs is treated as negligible. Surprisingly, whether it is a pure coincidence or whether this assumption really makes sense, the lowest ranked song from this particular audio library is Daft Punk's–Motherboard it is a title that is quite original (a brilliant one though) and a definite unique sound.

To summarize, we have learnt how to build a complete recommendation system for a song playlist. You can check out the book Mastering Spark for Data Science to deep dive into Spark and deliver other production grade data science solutions.

Read our post on how deep learning is revolutionizing the music industry. And here is how you can analyze big data using the pagerank algorithm.