Cypher is a highly efficient language that not only makes querying simpler but also strives to optimize the result-generation process as far as possible. Considerably more performance can be gained by using knowledge of the application's data domain to restructure queries.

This article by Sonal Raj, the author of Neo4j High Performance, covers a few tricks that you can implement with Cypher for optimization.

Query optimizations

There are certain techniques you can adopt in order to get the maximum performance out of your Cypher queries. Some of them are:

  • Avoid global data scans: The manual mode of optimizing the performance of queries depends on the developer’s effort to reduce the traversal domain and to make sure that only the essential data is obtained in results. A global scan searches the entire graph, which is fine for smaller graphs but not for large datasets. For example:
    START n = node(*)
    MATCH (n)-[:KNOWS]-(m)
    WHERE n.identity = "Batman"
    RETURN m

    Since Cypher is a greedy pattern-matching language, it avoids discrimination unless explicitly told to. Filtering data with a start point should be undertaken at the initial stages of execution to speed up the result-generation process.

    In Neo4j versions greater than 2.0, the START statement in the preceding query is not required, and unless otherwise specified, the entire graph is searched.

    The use of labels in the graphs and in queries can help to optimize the search process for the pattern. For example:

    MATCH (n:superheroes)-[:KNOWS]-(m)
    WHERE n.identity = "Batman"
    RETURN m

    Using the superheroes label in the preceding query helps to shrink the domain, thereby making the operation faster. This is referred to as a label-based scan.

  • Indexing and constraints for faster search: Searches in the graph space can be optimized and made faster if the data is indexed, or we apply some sort of constraint on it. In this way, the traversal avoids redundant matches and goes straight to the desired index location. To apply an index on a label, you can use the following:
    CREATE INDEX ON :superheroes(identity)

    Alternatively, to create a constraint on a particular property, such as making its value unique so that it can be directly referenced, we can use the following:

    CREATE CONSTRAINT ON (n:superheroes)
    ASSERT n.identity IS UNIQUE

    We will learn more about indexing, its types, and its utilities in making Neo4j more efficient for large dataset-based operations in the next sections.
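
    As a rough sketch of how the index above might be exercised, the following query looks up a node by the indexed property. The USING INDEX hint is optional and is included here only to make the intended index explicit; whether the planner needs it depends on the Neo4j version and on the query.

    ```cypher
    // Look up a superhero through the index on :superheroes(identity).
    // The hint names the index created earlier in this section.
    MATCH (n:superheroes)
    USING INDEX n:superheroes(identity)
    WHERE n.identity = "Batman"
    RETURN n
    ```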

  • Avoid Cartesian Products Generation: When creating queries, we should include entities that are connected in some way. The use of unspecific or nonrelated entities can end up generating a lot of unused or unintended results. For example:
    MATCH (m:Game), (p:Player)

    This will end up mapping all possible games with all possible players and that can lead to undesired results. Let’s use an example to see how to avoid Cartesian products in queries:

    MATCH (a:Actor), (m:Movie), (s:Series)
    RETURN COUNT(DISTINCT a), COUNT(DISTINCT m), COUNT(DISTINCT s)

    This statement will find all possible triplets of the Actor, Movie, and Series labels and then filter the results. An optimized form of querying will include successive counting to get a final result as follows:

    MATCH (a:Actor)
    WITH COUNT(a) as actors
    MATCH (m:Movie)
    WITH COUNT(m) as movies, actors
    MATCH (s:Series)
    RETURN COUNT(s) as series, movies, actors

    This yields a 10x improvement in the execution time of this query on the same dataset.
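
    For the Game and Player example at the start of this item, a minimal sketch of the connected form could look like the following. The PLAYED relationship and its direction are assumptions borrowed from the parameter example later in this article, not a confirmed part of the data model.

    ```cypher
    // Link the two patterns through a relationship so that only
    // (player, game) pairs that are actually connected are produced,
    // instead of every player combined with every game.
    MATCH (p:Player)-[:PLAYED]->(g:Game)
    RETURN p, g
    ```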

  • Use more patterns in MATCH rather than WHERE: It is advisable to express as much of the pattern as possible in the MATCH clause. The WHERE clause is not exactly meant for pattern matching; rather, it filters results when used with START and WITH. When used with MATCH, however, it adds constraints to the patterns described. Pattern matching is therefore faster when the pattern is placed in the MATCH section. After finding starting points (by using scans, indexes, or already-bound points), the execution engine uses pattern matching to find matching subgraphs. As Cypher is declarative, it can change the order of these operations, so predicates in WHERE clauses can be evaluated before, during, or after pattern matching.
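
    One possible illustration, reusing the superheroes example from earlier, moves a property predicate from WHERE into the MATCH pattern itself. This is only a sketch: the two forms return the same rows, and the planner may well produce the same plan for both, so any gain should be confirmed by profiling.

    ```cypher
    // Form 1: constraint expressed in WHERE.
    MATCH (n:superheroes)-[:KNOWS]-(m)
    WHERE n.identity = "Batman"
    RETURN m

    // Form 2 (run as a separate statement): the same constraint
    // expressed inside the MATCH pattern as a property map.
    MATCH (n:superheroes {identity: "Batman"})-[:KNOWS]-(m)
    RETURN m
    ```
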
  • Split MATCH patterns further: Rather than listing multiple comma-separated patterns in a single MATCH statement, you can split them into several distinct MATCH statements. This can considerably decrease the query time, since each successive match stage then searches a smaller, already reduced dataset.

    When splitting MATCH statements, keep the pattern whose labels have the smallest cardinality at the head of the statement. Also try to place the patterns that generate smaller intermediate result sets at the beginning of the block of MATCH statements.
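
    A hedged sketch of such a split is shown below. The Player and Game labels, the PLAYED relationship with its direction, and the {pid} parameter are illustrative assumptions in the style of the other examples in this article.

    ```cypher
    // Single MATCH with comma-separated patterns:
    MATCH (p:Player)-[:PLAYED]->(g:Game), (g)<-[:PLAYED]-(other:Player)
    WHERE p.id = {pid}
    RETURN DISTINCT other

    // Split form (run as a separate statement): the most selective
    // pattern comes first, and only its results feed the second MATCH.
    MATCH (p:Player)-[:PLAYED]->(g:Game)
    WHERE p.id = {pid}
    WITH p, g
    MATCH (g)<-[:PLAYED]-(other:Player)
    WHERE other <> p
    RETURN DISTINCT other
    ```

    Relationship uniqueness is enforced within a single MATCH clause but not across separate ones, which is why the split form excludes the starting player explicitly with other <> p.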

  • Profiling of queries: You can monitor how a query is processed by prefixing it with the PROFILE keyword, or by setting the profile parameter to true when making the request. The profile in the response includes useful information such as _db_hits, which shows how many times an entity (node, relationship, or property) has been encountered.

    Returning data in a Cypher response has substantial overhead. So, you should avoid returning complete nodes or relationships wherever possible and instead return only the desired properties or values computed from them.
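
    Both points can be sketched on the earlier superheroes query. This example assumes that the nodes bound to m also carry an identity property; the exact shape of the profile output varies between Neo4j versions.

    ```cypher
    // Prefix the query with PROFILE to receive the execution profile
    // together with the result.
    PROFILE
    MATCH (n:superheroes)-[:KNOWS]-(m)
    WHERE n.identity = "Batman"
    // Return a single property rather than the whole node to keep
    // the response small.
    RETURN m.identity
    ```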

  • Parameters in queries: The execution engine of Cypher tries to optimize and transform queries into relevant execution plans. To reduce the resources dedicated to this task, parameters are preferred over literals. With this technique, Cypher can reuse an existing execution plan rather than parsing and compiling each literal-based query to build a fresh one:
    MATCH (p:Player)-[:PLAYED]-(game)
    WHERE p.id = {pid}
    RETURN game

    When Cypher is building execution plans, it looks at the schema to see whether it can find useful indexes. These index decisions are only valid until the schema changes, so adding or removing indexes leads to the execution plan cache being flushed.

    Add the direction arrowhead in cases where the graph is to be queried in a directed manner. This avoids a lot of redundant operations.
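
    Applied to the parameterized query above, a directed version might look like this. The direction from Player to game is an assumption about how PLAYED relationships are stored in the graph.

    ```cypher
    // Directed variant: only outgoing PLAYED relationships from the
    // player are followed, rather than relationships in both directions.
    MATCH (p:Player)-[:PLAYED]->(game)
    WHERE p.id = {pid}
    RETURN game
    ```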

Graph model optimizations

Query optimization is a great way to improve the performance of an application that uses Neo4j, but you can also adopt some fundamental practices when defining your database to make it easier and faster to use:

  • Explicit definition: If the graph model we are working with contains implicit relationships between components, higher query efficiency can be achieved by defining these relationships explicitly. This leads to faster comparisons, but it comes with a drawback: the graph now requires more storage space for an additional entity for every occurrence of the data. Let’s see this in action with the help of an example.

    In the following diagram, we see that when two players have played in the same game, they are most likely to know each other. So, instead of going through the game entity for every pair of connected players, we can define the KNOWS relationship explicitly between the players.
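
    A minimal sketch of materializing that relationship is given below. It assumes Player and Game labels and a PLAYED relationship pointing from player to game; the direction chosen for KNOWS is arbitrary, since it can be queried without a direction afterwards.

    ```cypher
    // For every pair of players who played in the same game, create a
    // single explicit KNOWS relationship if one does not exist yet.
    MATCH (a:Player)-[:PLAYED]->(:Game)<-[:PLAYED]-(b:Player)
    // Comparing internal ids handles each pair once and skips a = b.
    WHERE id(a) < id(b)
    MERGE (a)-[:KNOWS]->(b)
    ```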

  • Property refactoring: This refers to the situation where complex time-consuming operations in the WHERE or MATCH clause can be included directly as properties in the nodes of the graph. This not only saves computation time resulting in much faster queries but it also leads to more organized data storage practices in the graph database for utility. For example:
    MATCH (m:Movie)
    WHERE m.releaseDate > 1343779201 AND m.releaseDate < 1369094401
    RETURN m

    This query checks whether a movie was released within a particular period. It can be optimized if the release period of the movie is stored directly in the graph as the year range 2012-2013. For the new format of the data, the query changes to this:

    MATCH (m:Movie)-[:CONTAINS]->(d)
    WHERE d.name = "2012-2013"
    RETURN m

    This gives a marked improvement in the performance of the query in terms of its execution time.
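
    If the range is kept as a property on the movie nodes rather than on a connected node, the refactoring could be sketched as follows. The releaseRange property name is hypothetical, and any gain would need to be confirmed by profiling, ideally with an index on the new property.

    ```cypher
    // One-off refactoring: tag movies in the range with a precomputed value.
    MATCH (m:Movie)
    WHERE m.releaseDate > 1343779201 AND m.releaseDate < 1369094401
    SET m.releaseRange = "2012-2013"

    // Later queries (run as separate statements) compare one stored
    // value instead of evaluating two numeric bounds.
    MATCH (m:Movie)
    WHERE m.releaseRange = "2012-2013"
    RETURN m
    ```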

Summary

These are the various tricks that can be implemented in Cypher for optimization.
