





















































Cypher is a highly efficient language that not only makes querying simpler but also strives to optimize the result-generation process as far as it can. Further performance gains are possible when you use knowledge of your application's data domain to restructure queries.
This article by Sonal Raj, the author of Neo4j High Performance, covers a few tricks that you can implement with Cypher for optimization.
There are certain techniques you can adopt in order to get the maximum performance out of your Cypher queries. Some of them are:
START n = node(*) MATCH (n)-[:KNOWS]-(m) WHERE n.identity = "Batman" RETURN m
Since Cypher is a greedy pattern-matching language, it does not narrow the search unless explicitly told to. Filter data with a start point as early as possible in the execution to speed up result generation.
In Neo4j versions greater than 2.0, the START statement in the preceding query is not required, and unless otherwise specified, the entire graph is searched.
The use of labels in the graphs and in queries can help to optimize the search process for the pattern. For example:
START n = node(*) MATCH (n:superheroes)-[:KNOWS]-(m) WHERE n.identity = "Batman" RETURN m
Using the superheroes label in the preceding query helps to shrink the domain, thereby making the operation faster. This is referred to as a label-based scan.
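You can also create an index on a property that queries look up frequently, such as identity on superheroes nodes:
CREATE INDEX ON :superheroes(identity)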
Otherwise, to create a constraint on the particular property such as making the value of the property unique so that it can be directly referenced, we can use the following:
CREATE CONSTRAINT ON (n:superheroes) ASSERT n.identity IS UNIQUE
We will learn more about indexing, its types, and its utilities in making Neo4j more efficient for large dataset-based operations in the next sections.
MATCH (m:Game), (p:Player)
The preceding pattern matches two disconnected sets of nodes, so it pairs every game with every player (a Cartesian product), which can lead to undesired results and slow queries. Let's use an example to see how to avoid Cartesian products in queries:
MATCH (a:Actor), (m:Movie), (s:Series) RETURN COUNT(DISTINCT a), COUNT(DISTINCT m), COUNT(DISTINCT s)
This statement will find all possible triplets of the Actor, Movie, and Series labels and then filter the results. An optimized form of querying will include successive counting to get a final result as follows:
MATCH (a:Actor) WITH COUNT(a) as actors MATCH (m:Movie) WITH COUNT(m) as movies, actors MATCH (s:Series) RETURN COUNT(s) as series, movies, actors
This gives a 10x improvement in the execution time of this query on the same dataset.
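The difference between the two counting queries above can be modeled in plain Python. This is only a sketch of the row arithmetic, not of how Neo4j executes queries; the list sizes and node names are made up for illustration.

```python
from itertools import product

# Hypothetical stand-ins for the nodes carrying each label.
actors = [f"actor{i}" for i in range(30)]
movies = [f"movie{i}" for i in range(20)]
series = [f"series{i}" for i in range(10)]

# Comma-separated patterns: every (actor, movie, series) triplet is generated
# before the distinct values are counted.
triplets = list(product(actors, movies, series))
rows_cartesian = len(triplets)
counts_cartesian = (
    len({a for a, _, _ in triplets}),
    len({m for _, m, _ in triplets}),
    len({s for _, _, s in triplets}),
)

# Successive counting: each label is scanned once and reduced to a number
# before the next label is touched.
counts_successive = (len(actors), len(movies), len(series))
rows_successive = len(actors) + len(movies) + len(series)

print(rows_cartesian, rows_successive)        # intermediate rows in each approach
print(counts_cartesian == counts_successive)  # both give the same answer
```

Both approaches return the same three counts, but the first builds the product of the label sizes as intermediate rows while the second only ever touches their sum.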
Split MATCH patterns further: Rather than having multiple match patterns in the same MATCH statement in a comma-separated fashion, you can split the patterns into several distinct MATCH statements. This considerably decreases the query time, since each successive match stage then searches a smaller dataset.
When splitting the MATCH statements, you must keep in mind that the best practices include keeping the pattern with labels of the smallest cardinality at the head of the statement. You must also try to keep those patterns generating smaller intermediate result sets at the beginning of the match statements block.
Profiling of queries: You can inspect how a query was processed by looking at its profile, which you obtain by prefixing the query with the PROFILE keyword or by setting the profile parameter to True when making the request. Useful information includes _db_hits, which shows how many times an entity (node, relationship, or property) has been encountered.
Returning data in a Cypher response has substantial overhead, so avoid returning complete nodes or relationships wherever possible; instead, return only the properties you need or values computed from them.
MATCH (p:Player)-[:PLAYED]-(game) WHERE p.id = {pid} RETURN game
When Cypher is building execution plans, it looks at the schema to see whether it can find useful indexes. These index decisions are only valid until the schema changes, so adding or removing indexes leads to the execution plan cache being flushed.
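One way to picture why the {pid} parameter in the preceding query matters for plan caching is a toy cache keyed by query text. The PlanCache class below is hypothetical and greatly simplified; it is not Neo4j's implementation, only a sketch of the idea that identical query text can reuse a plan and that a schema change flushes the cache.

```python
class PlanCache:
    """Toy plan cache keyed by query text (illustrative only)."""

    def __init__(self):
        self.plans = {}
        self.builds = 0  # how many times a plan had to be built

    def plan_for(self, query_text):
        # Build a plan only when this exact text has not been seen before.
        if query_text not in self.plans:
            self.builds += 1
            self.plans[query_text] = f"plan #{self.builds}"
        return self.plans[query_text]

    def flush(self):
        # Models a schema change, such as adding or removing an index.
        self.plans.clear()


# Literal values baked into the text: three different strings, three plans.
literal = PlanCache()
for pid in (1, 2, 3):
    literal.plan_for(
        f"MATCH (p:Player)-[:PLAYED]-(game) WHERE p.id = {pid} RETURN game"
    )

# Parameterized text: the string never changes, so one plan is reused.
parameterized = PlanCache()
for pid in (1, 2, 3):
    parameterized.plan_for(
        "MATCH (p:Player)-[:PLAYED]-(game) WHERE p.id = {pid} RETURN game"
    )

print(literal.builds, parameterized.builds)
```

In this model, flushing the cache forces the next call to build a plan again, which mirrors the schema-change behavior described above.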
Add the direction arrowhead in cases where the graph is to be queried in a directed manner. This avoids a lot of redundant operations.
Query optimization can be a great way to improve the performance of an application that uses Neo4j, but you can also adopt some fundamental practices when you define your database to make it easier and faster to use:
Explicit definition: If the graph model contains implicit relationships between components, queries can be made more efficient by defining these relationships explicitly. This leads to faster comparisons, but the drawback is that the graph needs more storage space, since an additional relationship is stored for every occurrence. Let's see this in action with the help of an example.
In the following diagram, we see that when two players have played in the same game, they are most likely to know each other. So, instead of going through the game entity for every pair of connected players, we can define the KNOWS relationship explicitly between the players.
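The trade-off between going through the game entity and storing KNOWS explicitly can be sketched in plain Python. The player and game names are invented, and the dictionaries only stand in for graph structures; this is a model of the idea rather than Neo4j code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical PLAYED data: game -> players who took part in it.
played = {
    "game1": ["alice", "bob", "carol"],
    "game2": ["bob", "dave"],
}


def knows_via_games(player):
    """Implicit lookup: walk through every game the player appears in."""
    return {
        other
        for members in played.values()
        if player in members
        for other in members
        if other != player
    }


# Explicit definition: materialize KNOWS once, in both directions.
knows = defaultdict(set)
for members in played.values():
    for a, b in combinations(members, 2):
        knows[a].add(b)
        knows[b].add(a)

# Storage cost of the explicit form: one entry per direction of each pair.
stored_entries = sum(len(others) for others in knows.values())

print(knows_via_games("bob") == knows["bob"], stored_entries)
```

The explicit lookup is a single dictionary access, while the implicit one re-scans the games on every call; the price is the extra stored entries, which matches the storage drawback mentioned above.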
MATCH (m:Movie) WHERE m.releaseDate > 1343779201 AND m.releaseDate < 1369094401 RETURN m
This query checks whether a movie was released within a particular period by comparing timestamps. It can be optimized if the release year is stored explicitly in the graph as the year range 2012-2013. With the data in this new format, the query changes to this:
MATCH (m:Movie)-[:CONTAINS]->(d) WHERE d.name = "2012-2013" RETURN m
This gives a marked improvement in the performance of the query in terms of its execution time.
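To see what the two timestamps in the original query stand for, and how a range label could be precomputed when the data is written, here is a small Python sketch. It assumes that releaseDate holds Unix epoch seconds in UTC; the range_label helper and its bucketing rule are hypothetical.

```python
from datetime import datetime, timezone


def to_utc(ts):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)


# The bounds used in the timestamp-based query.
lower = to_utc(1343779201)
upper = to_utc(1369094401)
print(lower.isoformat(), upper.isoformat())


def range_label(ts, start=1343779201, end=1369094401, label="2012-2013"):
    """Hypothetical rule: tag a release timestamp with the range it falls in.

    Uses the same strict comparisons as the original query.
    """
    return label if start < ts < end else None


print(range_label(1350000000))  # a timestamp inside the range
```

Computing the label once at write time means that each query only compares a short string instead of evaluating two numeric comparisons per movie.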
These are the various tricks that can be implemented in Cypher for optimization.