GroveDB Secondary Indexes
Secondary indexes are an essential part of most database use cases. They enable massive efficiency gains for almost all queries more complex than single-key retrievals. Yet, until now, there have been essentially no databases which have enabled cryptographic proofs for queries on secondary indexes. The engineers at Dash Core Group, in pursuit of greater decentralization, felt that this was an essential feature for Dash Platform, so we created our own solution in the form of a multilayered, specialized, provable database: GroveDB.
This article will discuss the applications of secondary index query proofs and why they will make Dash Platform very attractive for dApp developers before diving into the background, architecture, and some implementation details of secondary indexes built on top of GroveDB.
Special thanks to Ivan Shumkov, Samuel Westrich, ThePhez, and Virgile Bartolo for their contributions to this article.
Secondary index query proofs will become an invaluable tool for systems that handle sensitive data, as they are the only true method of enabling users to have absolute certainty that the data they receive hasn’t been tampered with, while at the same time allowing for efficient complex queries. Use cases can be imagined for many industries, but in the blockchain space, an industry where trustlessness is a major focus, their need is especially clear.
Many popular blockchain dApps, including Metamask, Uniswap, and Opensea, currently rely on external services like The Graph for indexing. However, these services don’t offer any query proofs to ensure the authenticity of query results. Instead, users are forced to rely on economic incentives and reputation systems which try to ensure nodes supply authentic data. However, they ultimately provide no guarantees or means of verification. GroveDB’s secondary index query proofs can provide such means and guarantees, and it’s therefore likely that many applications will want to use GroveDB for that reason.
While GroveDB was built to be a standalone product which any system using RocksDB can integrate, Dash Platform will be the first to do so. Developers who deploy their dApps on Dash Platform will thus have an edge over dApps deployed on other platforms, as they will have security guarantees and query capabilities enabled by GroveDB on a level that can’t be found elsewhere.
We at DCG are very excited to see the impact GroveDB’s complex query proofs will have on the blockchain database field, and invite any interested readers, developers, and product owners to reach out with questions. For more information, check out the GitHub repository, and keep an eye out for the release of the official website coming soon.
GroveDB stores its data as key-values, which is the default method for storing data on blockchain databases. However, it also contains logic that allows the key-values to be combined into another form of hybrid data somewhere between a relational database table record and a NoSQL document. Because they’re handled as JSON objects and even called documents in Dash Platform, we also usually just refer to them as documents when talking in the context of GroveDB.
“city”: “New York”,
A NoSQL document with five key-value fields.
Documents in document-oriented databases and records in relational databases require unique identifiers so they can be distinguished from other documents/records that may otherwise be identical. Usually, an array of random bytes is used under the “id” field. The ID serves as the primary index, so that specific documents/records can easily be retrieved by querying their ID. However, querying by ID often isn’t very useful. Most applications need to be able to query by specific fields, such as name, city, cuisine, acceptsDash, etc., which are secondary indexes.
With secondary indexes, clients are able to execute queries that do things like, for example, retrieve the names of all the restaurants within a given city, given that a secondary index is created for the city field. The database would simply have to navigate to the city index and iterate over the documents/records for the given city. If a client wanted to perform the same query without secondary indexes, they would need to iterate through every single document/record in the database to check if that restaurant is in that city, and then return the name. Through this example it should be clear that secondary indexes enable massive efficiency gains for most database use cases.
Cryptographic proofs are the lifeblood of trustless databases. If you’re given a cryptographic proof of the results of your query, you can verify with certainty that those results are what’s really stored in the database. This is a very important feature when dealing with, for example, financial applications distributed across a large number of anonymously-hosted servers. Without cryptographic proofs, the hosts could easily manipulate the data and clients would have no way of verifying whether the returned data is authentic. Somewhat shockingly, this is how almost every database in the world works today. Every time a user queries a database without proofs, they are trusting that the hosts returns accurate information. To see exactly how proofs work in GroveDB, see our Merk implementation docs.
While no previous solution has enabled secondary index query proofs (at least that is open source), we can take a look at how secondary indexes alone have been implemented in other key-value databases in order to provide a backdrop of where GroveDB’s secondary index architecture fits into the grand scheme of things. There are two main categories of secondary index structures for key-value databases: inverted indexes and b-trees:
- Inverted indexes: In this approach, each secondary index value is mapped to the primary keys of the documents that contain that value, usually in the form of a list.
Example of an inverted index (source).
- B-Trees: In this approach, b-trees are created for each secondary index and are stored as objects in a hash table. This is the method The Graph uses, which is a popular data indexing service used by many popular blockchain dApps today.
Example of b-tree secondary indexes (source).
Both of these approaches could be extended to support cryptographic proofs, but no production-level implementations exist yet, at least to our knowledge. Anyways, neither are catered to supporting proofs, and it would be a rather inefficient approach. Meanwhile, GroveDB was built with secondary index query proofs in mind, and extra measures were taken to keep them as efficient as possible.
GroveDB is the first production-level implementation of a hierarchical authenticated data structure, at least that we are aware of, largely inspired by the paper Database Outsourcing with Hierarchical Authenticated Data Structures. It’s a graph, or grove, of Merkle trees, where the root hashes of lower-level trees are stored in nodes of higher-level trees. Its innovative structure enables GroveDB to be the first-ever database to support cryptographic proofs for queries on secondary indexes.
The best way to describe the secondary index architecture in GroveDB is by looking at an example, so we’ll walk through how Dash Platform uses GroveDB secondary indexes. As the following is meant to be a high-level walkthrough to give an understanding of the general architecture, certain implementation details have been omitted and shouldn’t be worried about for casual readers, though they are explained in Implementation Details below. The Dash Platform data contract used in the example can be found there as well.
In Dash Platform, all documents of a data contract are stored under their respective document type tree*, of which a data contract may have multiple. The keys in a document type tree are the document IDs (primary keys) and the values are the fully serialized documents (with some meta information). The root hash of each document type tree is stored in the nodes of their respective contract tree**, and the root hash of each contract tree is stored in the Contracts Tree.
Diagram 1: Dash Platform contract documents tree structure (slightly simplified).
As the diagrams will become too complex to represent in the same way as Diagram 1 going forward, we will switch to a more compact scheme. Diagram 2 below shows the same information as Diagram 1 under the new scheme. In this representation, the arrows do not signify parent-child relationships within a single tree. Rather, the arrows stemming from one node point to the nodes which comprise the tree whose root hash is stored in that higher-level node, as illustrated with the dashed lines in Diagram 1.
Diagram 2: A Dash Platform data contract with two document types (slightly simplified).
In addition to storing the documents, GroveDB also creates trees under each document type tree for each secondary index of that document type. For example, there could be a tree under the “Document Type 1” tree for an index on the “City” field. The “City” tree would hold subtrees for every unique value of the “City” field, like New York and Phoenix, and each of those trees would hold references to the documents that share that value***.
Diagram 3: A secondary index for “City” in “Document Type 1” (slightly simplified).
The secondary index subtrees only store references to the nodes holding the actual documents, rather than storing the full documents multiple times.
That’s the basic architecture of secondary indexes in GroveDB. Next we’ll look at two special, yet common, cases: unique indexes and compound indexes.
GroveDB allows data contract creators to designate any of their contract’s secondary indexes to be unique, in which case no documents may share the same value for that index. That way, there is always only one document under a unique index value’s subtree (for example the Phoenix subtree from the diagram above), with its key being “0” and the value being the reference.
GroveDB also allows for compound indexes, which are essentially multiple indexes strung together. For example, you could have an index for city-cuisine, which would sort all the restaurants of the same cuisine within a given city into their own subtrees. Compound indexes may be unique, too, so if the city-cuisine compound index was designated to be unique, there can only be one restaurant per cuisine within the same city.
Diagram 4: A secondary index for “City” and a compound index for “City-Cuisine” in “Document Type 2”. The “City-Cuisine” index shown here could just as well be unique.
* In a document type tree, documents are actually stored in a tree whose hash is stored under key “0”.
** In a contract tree, the serialized contract is stored under key 0, and the document type trees are stored in a tree whose hash is under key 1.
Detailed Diagram 2.
Detailed Diagram 3.
*** For non-unique secondary indexes like those in the example, document references, like documents, are stored under subtrees with key “0” as well. In unique indexes, there is no key “0” subtree, rather the document references themselves are stored with key “0” directly under the index value subtree.
Detailed Diagram 4. Assumes the City index is non-unique but the City-Cuisine index is unique.
The example data contract outlined in the Architecture section is shown below. Also, the document in the Background section could be submitted to this data contract, as it passes all the schema validations.
“description”: “Name of the restaurant.”
“description”: “True if the restaurant accepts Dash as a form of payment.”
An example of a Dash Platform data contract in JSON format.