[box type=”note” align=”” class=”” width=””]This article is an excerpt taken from the book Big Data Analytics with Java, written by Rajat Mehta. In this book, you will learn how to perform real-time streaming analytics on big data using machine learning algorithms and the power of Java. [/box]
In the article below, you will learn why graph analytics is a favourable choice for analyzing complex datasets.
Graph analytics Vs Relational Databases
The biggest advantage of graphs is that they let you analyze complex, highly connected datasets. You might ask what is so special about graph analytics that we can’t do with relational databases. Let’s try to understand this using an example. Suppose we want to analyze your friends network on Facebook and pull information about your friends, such as their names, their birth dates, their recent likes, and so on. If Facebook had a relational database, this would mean firing a query on some table using the foreign key of the user requesting this info. From the perspective of a relational database, this first-level query is easy.
But what if we now ask you to go to the friends at level four in your network and fetch their data (as shown in the following diagram)? In a relational database, each additional level typically means another join, so the query becomes more and more complicated, whereas this is a trivial task on a graph or a graph database (such as Neo4j). Graphs are extremely good at operations where you want to pull information from one node to another, where the other node lies many joins and hops away. As such, graph analytics is good for certain use cases (but not for all of them; relational databases are still a good fit for many others):
As you can see, the preceding diagram depicts a huge social network (though the preceding diagram might just be depicting a network of a few friends only). The dots represent actual people in a social network. So if somebody asks you to pick one user on the left-most side of the diagram, follow that user’s connections to the right-most side, and pull the friends at, say, the 10th level or more, this is very difficult to do in a normal relational database, and writing and maintaining such queries could easily get out of hand.
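To make the multi-hop idea concrete, here is a minimal plain-Java sketch of a breadth-first walk that returns the people exactly N hops away from a starting user. It does not use Spark or any graph database; the class name `FriendsAtLevel`, the method names, and the tiny friendship network are all hypothetical and exist only for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FriendsAtLevel {

    // Returns the people whose shortest friendship distance from `start`
    // is exactly `level` hops (breadth-first search, one hop per iteration).
    static Set<String> friendsAtLevel(Map<String, List<String>> friends, String start, int level) {
        Set<String> visited = new HashSet<>();
        visited.add(start);
        Set<String> frontier = new TreeSet<>();
        frontier.add(start);
        for (int hop = 0; hop < level; hop++) {
            Set<String> next = new TreeSet<>();
            for (String person : frontier) {
                for (String friend : friends.getOrDefault(person, Collections.emptyList())) {
                    // Set.add returns false if the person was already reached earlier.
                    if (visited.add(friend)) {
                        next.add(friend);
                    }
                }
            }
            frontier = next;
        }
        return frontier;
    }

    // Hypothetical network: you - ann - bob - cara - dev, plus you - eli.
    static Map<String, List<String>> sampleNetwork() {
        Map<String, List<String>> friends = new HashMap<>();
        friends.put("you", Arrays.asList("ann", "eli"));
        friends.put("ann", Arrays.asList("you", "bob"));
        friends.put("eli", Arrays.asList("you"));
        friends.put("bob", Arrays.asList("ann", "cara"));
        friends.put("cara", Arrays.asList("bob", "dev"));
        friends.put("dev", Arrays.asList("cara"));
        return friends;
    }

    public static void main(String[] args) {
        Map<String, List<String>> friends = sampleNetwork();
        for (int level = 1; level <= 4; level++) {
            System.out.println("Level " + level + ": " + friendsAtLevel(friends, "you", level));
        }
    }
}
```

Going one level deeper only changes the `level` argument, whereas the equivalent relational query would typically need one more self-join per level.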
There are four particular use cases where graph analytics is extremely useful and used frequently (though there are plenty more use cases too):
- Path analytics: As the name suggests, this approach is used to figure out the paths as you traverse along the nodes of a graph. There are many fields where this can be used, the simplest being road networks, where you figure out details such as the shortest path between cities, and flight analytics, where you find the flight that takes the shortest time or the direct flights.
- Connectivity analytics: As the name suggests, this approach outlines how the nodes within a graph are connected to each other. Using this, you can figure out how many edges are flowing into a node and how many are flowing out of it. This kind of information is very useful in analysis. For example, in a social network, a person who receives just one message but sends out, say, ten messages within their network is very good at responding to messages, so this person could be a good candidate for marketing their favorite products.
- Community analytics: Some graphs on big data are huge. But within these huge graphs there might be nodes that are very close to each other and almost form a cluster of their own. This is useful information, as based on it you can extract communities from your data. For example, in a social network, people who are part of some community, say marathon runners, can be grouped into a single community and tracked further.
- Centrality analytics: This kind of analytical approach is useful for finding the central nodes in a network or graph, that is, the sources that are single-handedly connected to many other sources. It is helpful in figuring out influential people in a social network, or a central computer in a computer network.
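As a small illustration of the path analytics use case in the list above, the following plain-Java sketch computes the shortest road distances from one city to all others. It uses Dijkstra's algorithm, which is a choice made for this example and is not named in the original text; the class `ShortestRoute`, the city names A to D, and the distances are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class ShortestRoute {

    // A hop to city `to`. In the adjacency lists `km` is the length of one road;
    // in the priority queue it is the total distance from the source so far.
    static final class Hop {
        final String to;
        final int km;

        Hop(String to, int km) {
            this.to = to;
            this.km = km;
        }
    }

    // Adds a two-way road between cities a and b.
    static void addRoad(Map<String, List<Hop>> roads, String a, String b, int km) {
        roads.computeIfAbsent(a, k -> new ArrayList<>()).add(new Hop(b, km));
        roads.computeIfAbsent(b, k -> new ArrayList<>()).add(new Hop(a, km));
    }

    // Dijkstra's algorithm: shortest distance from `source` to every reachable city.
    static Map<String, Integer> shortestDistances(Map<String, List<Hop>> roads, String source) {
        Map<String, Integer> settled = new HashMap<>();
        PriorityQueue<Hop> queue = new PriorityQueue<>((x, y) -> Integer.compare(x.km, y.km));
        queue.add(new Hop(source, 0));
        while (!queue.isEmpty()) {
            Hop current = queue.poll();
            if (settled.containsKey(current.to)) {
                continue; // a shorter route to this city was already settled
            }
            settled.put(current.to, current.km);
            for (Hop road : roads.getOrDefault(current.to, Collections.emptyList())) {
                if (!settled.containsKey(road.to)) {
                    queue.add(new Hop(road.to, current.km + road.km));
                }
            }
        }
        return settled;
    }

    // Hypothetical road network with made-up distances.
    static Map<String, List<Hop>> sampleRoads() {
        Map<String, List<Hop>> roads = new HashMap<>();
        addRoad(roads, "A", "B", 4);
        addRoad(roads, "A", "C", 1);
        addRoad(roads, "C", "B", 2);
        addRoad(roads, "B", "D", 5);
        addRoad(roads, "C", "D", 8);
        return roads;
    }

    public static void main(String[] args) {
        // The direct road A-B is 4 km, but going through C takes only 1 + 2 = 3 km.
        System.out.println(shortestDistances(sampleRoads(), "A"));
    }
}
```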
From the perspective of this article, we will be covering some of these use cases in our sample case studies and for this we will be using a library on Apache Spark called GraphFrames.
GraphFrames
The GraphX library is advanced and performs well on massive graphs but, unfortunately, it is currently implemented only in Scala and does not have a direct Java API. GraphFrames is a relatively new library that is built on top of Apache Spark and provides support for dataframe-based (now dataset-based) graphs. It contains a lot of methods that are direct wrappers over the underlying GraphX methods. As such, it provides functionality similar to GraphX, except that GraphX acts on Spark RDDs while GraphFrames works on dataframes, so GraphFrames is more user friendly (as dataframes are simpler to use). The advantages of firing Spark SQL queries, joining datasets, and filtering are all supported here.
To understand GraphFrames and how to represent massive big data graphs, we will take small steps first, building some simple programs using GraphFrames before moving on to full-fledged case studies. First, let’s see how to build a graph using Spark and GraphFrames on a sample dataset.
Building a graph using GraphFrames
Consider that you have a simple graph as shown next. This graph depicts four people, Kai, John, Tina, and Alex, and the relations they share, that is, whether they follow each other or are friends.
We will now try to represent this basic graph using the GraphFrame library on top of Apache Spark and, along the way, we will also start learning the GraphFrame API.
Since GraphFrame is a module on top of Spark, let’s first build the Spark configuration, the Spark context, and the Spark SQL context (their initialization is elided here for brevity):
SparkConf conf = ...
JavaSparkContext sc = ...
SQLContext sqlContext = ...
We will now build the JavaRDD object that will contain the data for our vertices, that is, the people Kai, John, Alex, and Tina in this small network. We will create some sample data using the RowFactory class of the Spark API and provide the attributes (the ID of the person, their name, and their age) that we need per row of the data:
JavaRDD<Row> verRow = sc.parallelize(Arrays.asList(
    RowFactory.create(101L, "Kai", 27),
    RowFactory.create(201L, "John", 45),
    RowFactory.create(301L, "Alex", 32),
    RowFactory.create(401L, "Tina", 23)));
Next, we will define the structure or schema of the attributes used to build the data. The ID of the person is of type long, the name is a string, and the age is an integer, as shown next in the code:
List<StructField> verFields = new ArrayList<StructField>();
verFields.add(DataTypes.createStructField("id", DataTypes.LongType, true));
verFields.add(DataTypes.createStructField("name", DataTypes.StringType, true));
verFields.add(DataTypes.createStructField("age", DataTypes.IntegerType, true));
Now, let’s build the sample data for the relations between these people; these relations will later become the edges of the graph. Each relationship record holds the IDs of the two people that are connected and the type of relationship they share (that is, friends or followers). Again, we will use the Spark-provided RowFactory to build some sample data per row and create the JavaRDD with this data:
JavaRDD<Row> edgRow = sc.parallelize(Arrays.asList(
    RowFactory.create(101L, 301L, "Friends"),
    RowFactory.create(101L, 401L, "Friends"),
    RowFactory.create(401L, 201L, "Follow"),
    RowFactory.create(301L, 201L, "Follow"),
    RowFactory.create(201L, 101L, "Follow")));
Again, define the schema of the attributes added as part of the edges earlier. This schema is later used in building the dataset for the edges. The attributes passed are the ID of the source node, the ID of the destination node, and the relationType, which is a string:
List<StructField> EdgFields = new ArrayList<StructField>();
EdgFields.add(DataTypes.createStructField("src", DataTypes.LongType, true));
EdgFields.add(DataTypes.createStructField("dst", DataTypes.LongType, true));
EdgFields.add(DataTypes.createStructField("relationType", DataTypes.StringType, true));
Using the schemas that we have defined for the vertices and edges, let’s now build the actual datasets for the vertices and the edges. For this, first create the StructType objects that hold the schema details for the vertices and the edges; using these structures and the actual data, we then build the dataset of the vertices (verDF) and the dataset of the edges (edgDF):
StructType verSchema = DataTypes.createStructType(verFields);
StructType edgSchema = DataTypes.createStructType(EdgFields);
Dataset<Row> verDF = sqlContext.createDataFrame(verRow, verSchema);
Dataset<Row> edgDF = sqlContext.createDataFrame(edgRow, edgSchema);
Finally, we will pass the vertices and edges datasets as parameters to the GraphFrame constructor and build the GraphFrame instance:
GraphFrame g = new GraphFrame(verDF, edgDF);
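Since the edges refer to the vertices purely by ID (the src and dst values of each edge are expected to match id values of the vertices), it can be worth checking the sample data for dangling references before building the graph. The following is a minimal plain-Java sketch of such a check on the article's sample IDs; it does not use Spark or GraphFrames, and the class name `EdgeCheck` and the method `danglingEdges` are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class EdgeCheck {

    // Vertex IDs from the sample data: Kai, John, Alex, Tina.
    static final long[] VERTEX_IDS = {101L, 201L, 301L, 401L};

    // (src, dst) pairs from the sample edge data.
    static final long[][] EDGES = {
        {101L, 301L}, {101L, 401L}, {401L, 201L}, {301L, 201L}, {201L, 101L}
    };

    // Counts the edges whose source or destination is not a known vertex ID.
    static int danglingEdges(long[] vertexIds, long[][] edges) {
        Set<Long> known = new HashSet<>();
        for (long id : vertexIds) {
            known.add(id);
        }
        int dangling = 0;
        for (long[] edge : edges) {
            if (!known.contains(edge[0]) || !known.contains(edge[1])) {
                dangling++;
            }
        }
        return dangling;
    }

    public static void main(String[] args) {
        System.out.println("Dangling edges: " + danglingEdges(VERTEX_IDS, EDGES));
    }
}
```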
It is now time to run some simple analytics on the graph we just created.
Let’s first look at the data in the graph, starting with the vertices. For this, we will invoke the vertices method on the GraphFrame instance and then invoke the standard show method on the resulting vertices dataset (the vertices method returns a dataset).
g.vertices().show();
This would print the output as follows:
Let’s also see the data on the edges:
g.edges().show();
This would print the output as follows:
Let’s also see the number of edges and the number of vertices:
System.out.println("Number of Vertices : " + g.vertices().count());
System.out.println("Number of Edges : " + g.edges().count());
This would print the result as follows:
Number of Vertices : 4
Number of Edges : 5
GraphFrame has handy methods to find the in-degree (as well as the out-degree or the overall degree) of every vertex:
g.inDegrees().show();
This would print the in-degrees of all the vertices as shown next:
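As a cross-check that needs no Spark cluster, the same counts can be derived by hand from the five edges defined earlier. The following is a minimal plain-Java sketch; the class name `InDegreeSketch` is hypothetical, and the code only mirrors the idea of counting incoming edges rather than calling the GraphFrames API.

```java
import java.util.Map;
import java.util.TreeMap;

class InDegreeSketch {

    // The five (src, dst) pairs from the sample edge data.
    static final long[][] EDGES = {
        {101L, 301L}, {101L, 401L}, {401L, 201L}, {301L, 201L}, {201L, 101L}
    };

    // Counts how many edges point at each vertex ID (keyed by dst).
    static Map<Long, Integer> inDegrees(long[][] edges) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long[] edge : edges) {
            counts.merge(edge[1], 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(inDegrees(EDGES)); // prints {101=1, 201=2, 301=1, 401=1}
    }
}
```

John (201) is the only vertex with two incoming edges, since both Tina and Alex follow him.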
Finally, let’s see one more small thing on this simple graph. As GraphFrames work on the datasets, all the dataset handy methods such as filtering, map, and so on can be applied on them. We will use the filter method and run it on the vertices dataset to figure out the people in the graph with age greater than thirty:
g.vertices().filter("age > 30").show();
This would print the result as follows:
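For readers who want to confirm which rows satisfy the condition, the same predicate can be applied to the four sample vertices with ordinary Java streams. This is only a sketch under the assumption that the sample data is unchanged; the class `AgeFilterSketch` and its nested `Person` type are hypothetical and do not involve Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class AgeFilterSketch {

    // One vertex row from the sample data: id, name, age.
    static final class Person {
        final long id;
        final String name;
        final int age;

        Person(long id, String name, int age) {
            this.id = id;
            this.name = name;
            this.age = age;
        }
    }

    static final List<Person> PEOPLE = Arrays.asList(
        new Person(101L, "Kai", 27),
        new Person(201L, "John", 45),
        new Person(301L, "Alex", 32),
        new Person(401L, "Tina", 23));

    // Names of the people strictly older than minAge, in input order.
    static List<String> namesOlderThan(List<Person> people, int minAge) {
        return people.stream()
                .filter(p -> p.age > minAge)
                .map(p -> p.name)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(namesOlderThan(PEOPLE, 30)); // prints [John, Alex]
    }
}
```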
In this post, we learned about graph analytics. We saw how graphs can be built from massive datasets in order to derive quick insights. You should now have a better sense of when to reach for graph analytics and when to stay with a relational database as the challenges in your organization grow.
To know more about preparing and refining big data and to perform smart data analytics using machine learning algorithms you can refer to the book Big Data Analytics with Java.