Indexer

This document outlines how Maru can be used to build a fully-verifiable indexer.

Overview

Maru can be viewed as having three major "components":

  1. Ingress - pulling external data streams into Maru
  2. Dataflow - a graph of operators, each of which takes one or more streams as input and produces one or more streams as output
  3. Egress - pushing operator outputs to external consumers (applications, relayers, etc.) that subscribe to the streams they are interested in.
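
As a rough sketch of how these components fit together (all names here are illustrative stand-ins, not Maru's actual API):

    // Illustrative shapes only; Maru's real interfaces will differ.

    /// A timestamped change to a collection: a record plus a +1/-1 multiplicity.
    pub struct Delta<R> {
        pub record: R,
        pub time: u64,
        pub diff: isize,
    }

    /// Ingress: an external data source (e.g. a chain RPC) turned into a delta stream.
    pub trait Ingress<R> {
        fn poll(&mut self) -> Vec<Delta<R>>;
    }

    /// Dataflow: an operator mapping input deltas to output deltas.
    /// (Real operators may take and produce multiple streams.)
    pub trait Operator<In, Out> {
        fn push(&mut self, input: Vec<Delta<In>>) -> Vec<Delta<Out>>;
    }

    /// Egress: an external consumer subscribed to an operator's output.
    pub trait Egress<Out> {
        fn consume(&mut self, output: Vec<Delta<Out>>);
    }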

To build an indexer, we split the process into two parts:

  1. Index-Construction Operator - We express the construction of an index as a Maru operator that consumes a stream of deltas, consolidates them as they arrive, and outputs a collection of (hash, node_kind, key, value, left_child_hash, right_child_hash) tuples corresponding to a Merkle index over the input stream's data (see the sketch after this list).
  2. Query Service - An external service subscribes to this operator's output and, in turn, provides an API through which clients may interactively query the index.
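
Concretely, each tuple in the operator's output might look something like the following (a hypothetical sketch; the field encodings and hash width are assumptions):

    /// One node of the Merkle index, as emitted by the index-construction
    /// operator. 32-byte hashes are an assumption; the real encoding may differ.
    pub enum NodeKind {
        Leaf,
        Internal,
    }

    pub struct MerkleNode {
        pub hash: [u8; 32],                     // hash of this node's contents
        pub node_kind: NodeKind,                // leaf or internal node
        pub key: Vec<u8>,                       // search key (meaningful for leaves)
        pub value: Vec<u8>,                     // indexed value (meaningful for leaves)
        pub left_child_hash: Option<[u8; 32]>,  // None for leaves
        pub right_child_hash: Option<[u8; 32]>, // None for leaves
    }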

This means that the only queries users may perform dynamically at run-time are, in SQL-speak, SELECT ... WHERE statements over the predefined search keys - developers can't simply point their webapps at a magic API and ask for any query they want if they never defined it up-front.

That said, defining the expensive queries up-front doesn't end up being a blocker for applications. To get any sort of performance, developers typically have to specify their more complex queries (and the indices that back them) ahead of time anyway, and proper tooling can smooth over any remaining difference in developer experience.
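
For instance, supporting the equivalent of SELECT * FROM transfers WHERE recipient = ? just means choosing the recipient as the search key when the dataflow is defined. A minimal sketch, where the Transfer type and key function are hypothetical:

    /// A hypothetical record type being indexed.
    pub struct Transfer {
        pub from: [u8; 20],
        pub recipient: [u8; 20],
        pub amount: u128,
    }

    /// Defining the query "up-front" amounts to fixing the key function when
    /// the dataflow is built; at run-time, clients can only look up by this key.
    pub fn recipient_key(t: &Transfer) -> [u8; 20] {
        t.recipient
    }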

Index-Construction Operator

This operator is more or less an "export" of an arrangement: its output is simply a stream of updates to the arrangement. While arrangements can be expensive, they only need to be built once per stream, and they're updated incrementally. Furthermore, many operators (join, group_by, etc.) require an arrangement anyway, so when the index's search key matches the one used by such an arrangement, this operator is virtually free. In any case, the same index is shared across every application and every downstream query service (the second component) - the cost of updating it is only paid once.

Arrangements are sharded, indexed representations of a stream of deltas. In Maru, arrangements are sharded Merkle trees that are constructed from a stream according to a particular search key, and each tree comes with a proof linking the Merkle root of the indexed stream with the stream's running multiset hash. See the page on arrangements for more information.
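
For intuition, a running multiset hash can be maintained incrementally by hashing each record and summing the digests, so insertions and removals commute and order doesn't matter. The toy sketch below (64-bit, non-cryptographic) is for illustration only and is not the construction Maru uses:

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    /// Toy incremental multiset hash: sum of per-record digests.
    /// Order-independent, and a removal is just a subtraction - which is what
    /// lets a proof relate a tree built at any point to the stream's running digest.
    pub struct MultisetHash(u64);

    impl MultisetHash {
        fn digest<R: Hash>(record: &R) -> u64 {
            let mut h = DefaultHasher::new();
            record.hash(&mut h);
            h.finish()
        }

        pub fn add<R: Hash>(&mut self, record: &R) {
            self.0 = self.0.wrapping_add(Self::digest(record));
        }

        pub fn remove<R: Hash>(&mut self, record: &R) {
            self.0 = self.0.wrapping_sub(Self::digest(record));
        }
    }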

In the general case, where we don't already have an arrangement of the stream by the search key we want, the operator looks something like this (sketched in code after the list):

  1. map each record R in the input collection (the abstraction of a stream) to a tuple (K, R) according to a custom key function f which defines the search key.
  2. create an arrangement of the stream by key (the first element of each tuple).
  3. flatten every batch of updates to the arrangement into a stream of Merkle tree node additions and removals.
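
A minimal in-memory sketch of steps 1 and 2 over a single batch - the real arrangement is a sharded, incrementally-maintained Merkle tree rather than a BTreeMap, and step 3 (emitting node additions and removals) is elided:

    use std::collections::BTreeMap;

    /// Step 1: map each record R to (K, R) via the key function f.
    pub fn by_key<R, K, F: Fn(&R) -> K>(batch: Vec<R>, f: F) -> Vec<(K, R)> {
        batch.into_iter()
            .map(|r| {
                let k = f(&r);
                (k, r)
            })
            .collect()
    }

    /// Step 2 (toy): "arrange" the keyed batch into a key-ordered structure.
    pub fn arrange<K: Ord, R>(keyed: Vec<(K, R)>) -> BTreeMap<K, Vec<R>> {
        let mut arrangement: BTreeMap<K, Vec<R>> = BTreeMap::new();
        for (k, r) in keyed {
            arrangement.entry(k).or_default().push(r);
        }
        arrangement
    }

    // Step 3 would diff the tree nodes touched by the batch and emit
    // MerkleNode additions and removals for downstream consumers.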

Query Service

The query service subscribes to the index-construction operator(s) and provides a verifiable interface for querying the Merkle indices. This can manifest in myriad ways, each with different tradeoffs; these are the two we've considered so far:

  1. A centralized service that keeps the Merkle tree nodes in a key-value store and provides a REST API for querying values. Upon request, the API responds with a Merkle proof for the query (be it a membership, non-membership, or range query) and the fully-accumulated SNARK proof that the Merkle root is correct (a sketch of the response shape appears after this list). Presumably the client has its own RPC node with which it can check whether or not the given block hash is in the canonical chain(s) from which the input data was pulled.
    • In principle, this could be decentralized, with its own protocol for discovering indexed streams to serve and its own tokenomics for balancing bandwidth costs across the network. If MARU were used here, that would present a significant additional value-capture mechanism for Maru.
  2. Snapshot the Merkle node collection periodically, dump its contents into a flat file, and publish it in some cheap, immutable file store (Arweave, S3, IPFS, Filecoin, etc.).
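
For the first option, the body of a query response might carry something like the following (hypothetical field names; the actual proof formats depend on the tree and SNARK constructions used):

    /// Hypothetical shape of a response from the centralized query API.
    pub struct QueryResponse {
        /// Matching (key, value) pairs; empty for a non-membership result.
        pub results: Vec<(Vec<u8>, Vec<u8>)>,
        /// Merkle (non-)membership or range proof for the results
        /// against the index's root.
        pub merkle_proof: Vec<u8>,
        /// Fully-accumulated SNARK proof that the index root is correctly
        /// derived from the source data.
        pub snark_proof: Vec<u8>,
        /// Block hash(es) the client checks against its own RPC node.
        pub source_block_hashes: Vec<[u8; 32]>,
    }

The client verifies the Merkle proof against the root attested to by the SNARK proof, then checks the referenced block hash(es) against its own RPC node.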