In the distributed systems verification of Dgraph 1.0.2 through 1.0.6, Jepsen has found 23 issues including multiple deadlocks and crashes in the cluster, duplicate upserted records, snapshot isolation violations, records with missing fields, and in some cases, the loss of all but one inserted record.
Dgraph is an open source, fast, distributed graph database which uses Raft for per-shard replication and a custom transactional protocol, based on Omid, Reloaded, for snapshot-isolated cross-shard transactions.
Dgraph has a custom transaction system to provide transactional isolation across different Raft groups. Storage nodes, called Alpha, are controlled by a supervisory system, called Zero. Zero nodes form a single Raft cluster, which organizes Alpha nodes into shards called groups. Each group runs an independent Raft cluster.
Jepsen test suite design
Jepsen is a framework to analyze distributed systems under stress and verify that the safety properties of a distributed system hold up, given concurrency, non-determinism, and partial failure. It is an effort to improve the safety of distributed databases, queues, consensus systems, and more.
To verify safety properties of Dgraph, a suite of Jepsen tests was designed using a five node cluster with replication factor three. Alpha nodes were organized into two groups: one with three replicas, and one with two. Every node ran an instance of both Zero and Alpha.
Many operations were tested, out of which some are listed here:
- Set: Inserting a sequence of unique numbers into Dgraph, then querying for all extant values. Finally, check if every successfully acknowledged insert is present in the final read.
- Upsert: An upsert is a common database operation in which a record is created if and only if an equivalent record does not already exist.
- Delete: In the delete test, concurrent attempts were made to delete any records for an indexed value. Since deleting can only lower the number of records, not increase it, it was expected to never observe more than one record at any given time.
- Bank: The bank test stresses several invariants provided by snapshot isolation. A set of bank accounts were created, each with three attributes:
- type: It is always account. We use this to query for all accounts.
- key: It is an integer which identifies that account.
- amount: It is the amount of money in that account.
Issues found in the Dgraph
Here are some of the issues found by the test:
- Cluster join issues: Race conditions were discovered in the Dgraph’s cluster join procedure.
- Duplicate upserts: In the bank test, it was discovered that test initialization process concurrently upserts a single initial account resulting in dozens of copies of that account record, rather than one.
- Delete anomalies: With a mix of upserts, deletes, and reads of single records identified by an indexed field key, several unusual behaviors were found. Unusual results were found like values disappeared due to deletion, get stuck in a dangling state, then reappear as full records.
- Read skew: With a more reliable cluster join process, a read skew anomaly in the bank test was discovered.
- Lost inserts with network partitions: In pure insert workloads, Dgraph could lose acknowledged writes during network partitions. While performing set tests, which insert unique integer values and attempt to perform a final read, huge number of acknowledged values could be lost.
- Write loss on node crashes: When Alpha nodes crash and restart, the set test revealed that small windows of successfully acknowledged writes could be lost right around the time the process(es) crashed. Dgraph also constructed records with missing values.
- Unavailability after crashes: Despite every Alpha and Zero node running, and with total network connectivity, nodes could return timeouts for all requests.
- Read skew in healthy clusters: A bank test revealed failures without any migration, or even any failures at all. Dgraph could still return incorrect account totals, or records with missing values.
The identified safety issues were mostly associated with process crashes, restarts, and predicate migration. Out of 23 issues, 4 still remain unresolved, including the corruption of data in healthy clusters.
This analysis was funded by Dgraph and Jepsen has documented the full report on their official website.