Crafting a system for indexing, curating, and deriving blockchain data in a multichain future

The blockchain industry is unusual in that each chain is a single-threaded, single source of truth for all of its data. That makes it a very different data problem from most current architectures. There are certainly many offchain data sources and datasets that integrate into these solutions, but in most true blockchain products that data is merely context for the actual transactions on the blockchain.

  1. The problem
  2. The solution
  3. The method
  4. The technologies

The problem - Endless State Growth

The blockchain trilemma, the tradeoff between scalability, decentralization, and security, goes hand in hand with any discussion of indexing blockchain data. For many businesses in the blockchain space the decentralization aspect can be disregarded, so we can focus on the remaining two.

Let's first address scalability. Like many prescient technologies, blockchains face a paradox: to succeed they must increase TPS, storing more transactions at a faster rate for more users, and this is happening across several high-TVL chains. All of these factors accelerate state growth on each respective chain. So this is not just a problem for node providers but also for the indexers that structure that data. Because indexing logically involves more data duplication, plus the addition of offchain data, that acceleration is even more dire for indexers. But remember that, as a company, we can set aside optimizing for decentralization; we only need to perfect scalability.

Scaling for infinite use cases

Not only is there an ever-increasing number of accounts, transactions, and assets on each chain, there is also an ever-increasing number of use cases. Anticipating those use cases in order to index for them is a fool's errand. We must design a system that can index for any use case and any dataset that must be derived. This is a very difficult problem, and one I believe is best solved by a generalized indexing solution.

The Solution - Derived Datasets

If all blockchain products, businesses, DAOs, etc. originate from the single-threaded source of truth of one or many blockchains, then it is reasonable to assume that in order to scale these products and businesses properly and efficiently, they must become experts at building derived blockchain datasets and then enriching them with offchain data.
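To make the idea concrete, here is a minimal sketch of a derived dataset: per-account token balances derived purely from raw transfer events, then enriched with an offchain price feed. All names here (`TransferLog`, `priceFeed`, the key format) are illustrative assumptions, not a specific product's schema.

```typescript
// A raw onchain event, as an indexer might receive it.
interface TransferLog {
  chainId: number;
  token: string;   // token contract address
  from: string;
  to: string;
  amount: number;  // raw token units (bigint in production; number here for brevity)
}

// Step 1: derive a balances dataset purely from onchain events.
// Key format "chainId:token:account" is an illustrative choice.
function deriveBalances(logs: TransferLog[]): Map<string, number> {
  const balances = new Map<string, number>();
  for (const log of logs) {
    const debit = `${log.chainId}:${log.token}:${log.from}`;
    const credit = `${log.chainId}:${log.token}:${log.to}`;
    balances.set(debit, (balances.get(debit) ?? 0) - log.amount);
    balances.set(credit, (balances.get(credit) ?? 0) + log.amount);
  }
  return balances;
}

// Step 2: enrich the derived dataset with offchain context (a stubbed price feed).
const priceFeed: Record<string, number> = { "0xToken": 2.5 }; // USD per unit

function usdValue(balanceKey: string, balance: number): number | undefined {
  const token = balanceKey.split(":")[1];
  const price = priceFeed[token];
  return price === undefined ? undefined : balance * price;
}

const balances = deriveBalances([
  { chainId: 1, token: "0xToken", from: "0xA", to: "0xB", amount: 10 },
]);
```

The point of the shape: the derivation step depends only on the chain's event stream, while the enrichment step layers offchain data on top without ever being the source of truth.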

The Method - Event Driven Architecture

Event driven architectures in blockchain

There really couldn't be a better-suited use case for event-driven architecture patterns. Block production is inherently event driven and is the trigger for all other events.
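A minimal sketch of that shape: block arrival is the root event, and every derived dataset is a subscriber that fans out from it. The `BlockBus` class and handler names are illustrative; a real system would use a durable message broker rather than an in-process dispatcher.

```typescript
// The root event: a new block.
interface Block { number: number; txHashes: string[]; }

type BlockHandler = (block: Block) => void;

// Minimal in-process pub/sub standing in for a real event bus.
class BlockBus {
  private handlers: BlockHandler[] = [];
  subscribe(h: BlockHandler) { this.handlers.push(h); }
  // Each new block fans out to every downstream consumer.
  publish(block: Block) { for (const h of this.handlers) h(block); }
}

const bus = new BlockBus();
const txIndex: string[] = [];  // e.g. a transaction indexer's store
const blockLog: number[] = []; // e.g. a metrics/audit consumer

bus.subscribe(b => txIndex.push(...b.txHashes));
bus.subscribe(b => blockLog.push(b.number));

// Block production drives everything downstream.
bus.publish({ number: 1, txHashes: ["0xaa", "0xbb"] });
```

Adding a new derived dataset is then just adding a subscriber; the upstream block pipeline never changes, which is exactly the property a generalized indexer needs.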

SQL vs NoSQL

Vertical vs Horizontal scaling

First off, if we can all agree that a multichain future of mass adoption is what we should optimize for, then that is the stage from which we must design. If mass adoption ever occurs, we can and should reasonably expect hundreds of millions, if not billions, of accounts across an ever-growing number of networks, with a growing number of assets (tokens, NFTs, etc.) and a growing amount of activity on each chain. This is firmly big data territory, which to my mind means we must rely only on solutions that can be scaled horizontally, not vertically.

SQL databases are excellent, but in my opinion it is perilous to expect one to be suitable for a generalized indexing solution. Sure, we can use managed DBs like AWS Aurora, add read/write replicas, and apply all the other modern techniques for scaling a SQL database, but it is still not a matter of IF it will fail to scale but WHEN. That is quite a costly gamble. To make matters worse, we have zero control over the growth of the data we are trying to index, so we can neither predict nor optimize around performance degradation and scaling needs. In that future, the success of the blockchain ecosystem would mean the failure of ours. The growth of accounts, networks, protocols, users, transactions, etc. is theoretically limitless, and that growth is the very thing the industry is innovating to make happen.

My simplistic mental model of what we need to index is as follows: Networks x Accounts x Transactions = Exponential growth (not scalable)
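What horizontal scaling buys us can be sketched in a few lines: route each record to a shard by hashing its key, so capacity grows by adding shards rather than by buying a bigger machine. The hash function and key shape below are illustrative assumptions, not any particular database's implementation.

```typescript
// Simple deterministic string hash (illustrative; real systems use
// stronger hashes like MD5/MurmurHur-style partition hashing).
function hashKey(key: string): number {
  let h = 0;
  for (const c of key) h = (h * 31 + c.charCodeAt(0)) >>> 0;
  return h;
}

// Route an (account, chain) record to one of `shards` partitions.
// Growth in accounts or networks spreads across shards instead of
// concentrating on one vertically-scaled machine.
function shardFor(account: string, chainId: number, shards: number): number {
  return hashKey(`${account}#${chainId}`) % shards;
}
```

The key property: the same record always lands on the same shard, and when the data doubles you double the shard count, not the machine size.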

From a purely EVM-centric view there are very few base-level entities (accounts, blocks, transactions, logs) and a couple of pseudo base entities (internal transactions, transfers), from which there are very few well-defined access patterns that a generalized EVM indexing system must optimize for. This can be served much more effectively by a NoSQL solution like DynamoDB, or others like it, where large distributed data partitioning is possible. We need to optimize for exactly what our application needs; we cannot afford the luxury of a database that accommodates OLAP and ad-hoc queries. We require OLTP at scale.

Take the following simple use case, which is arguably the most valuable and frequent access pattern: "Give me all of a user's transactions across all chains." It is a very simple use case, but even for a SQL database it is quite a complex join. The solution would need to join normal, internal, and token transfers with price data, token metadata, transaction receipts, etc. That query will grow slower and more expensive over time, made worse by the very success of mass adoption. With denormalized data in partitions, these queries take no performance hit and remain just as efficient regardless of the size and growth of the ecosystem. Additionally, each block can be processed in isolation.
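A hedged sketch of what that access pattern looks like in a DynamoDB-style single-table layout: every transaction item is denormalized (token metadata and price stored inline, no joins) and keyed so that one account's entire cross-chain history lives in a single partition. The `PK`/`SK` naming follows common single-table conventions; the in-memory `Map` stands in for DynamoDB itself.

```typescript
// One denormalized transaction item. Metadata and price are stored
// inline at write time, so reads never join.
interface TxItem {
  PK: string;          // partition key, e.g. "ACCOUNT#0xabc"
  SK: string;          // sort key, e.g. "CHAIN#1#BLOCK#0000000010#TX#0x01"
  value: string;
  tokenSymbol: string; // denormalized token metadata
  usdPrice: number;    // denormalized price at time of tx
}

// In-memory stand-in for the table: partition key -> sorted items.
const table = new Map<string, TxItem[]>();

function putItem(item: TxItem) {
  const partition = table.get(item.PK) ?? [];
  partition.push(item);
  partition.sort((a, b) => (a.SK < b.SK ? -1 : 1)); // DynamoDB sorts by SK
  table.set(item.PK, partition);
}

// "All of a user's transactions across all chains" is one partition read.
// Cost scales with the result size, not with ecosystem growth.
function queryAccount(address: string): TxItem[] {
  return table.get(`ACCOUNT#${address}`) ?? [];
}

putItem({ PK: "ACCOUNT#0xabc", SK: "CHAIN#1#BLOCK#0000000010#TX#0x01",
          value: "1.0", tokenSymbol: "ETH", usdPrice: 3000 });
putItem({ PK: "ACCOUNT#0xabc", SK: "CHAIN#137#BLOCK#0000000020#TX#0x02",
          value: "5.0", tokenSymbol: "MATIC", usdPrice: 0.8 });
```

Because the items are denormalized at write time and partitioned by account, the "complex join" from the SQL version disappears: the read path is a single key lookup no matter how many chains, tokens, or users exist.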