As the Solana Name Service grew, we wanted domain names to be easy to resell on third-party marketplaces. To achieve this, we made domains tokenizable as NFTs, which enabled our users to sell their registered domains on secondary markets such as MagicEden or Hyperspace. This created a new problem: with a variety of platforms offering domain names, it became harder for users to discover domains they might be interested in. What we needed was a bigger, combined picture of what was going on across the various marketplaces.
We rolled out SNS-Optics a while back to solve this problem. Integrated within the main naming.bonfida.org website, it offers an aggregated picture of all major marketplaces which trade Solana Name Service domains, allowing users to find out seamlessly whether and where a domain is available for sale.
Running this service requires constant collection of information from three sources: the MagicEden API, the Hyperspace API, and Solana itself. At first, we designed the system so that every data source had its own custom collector, with very little shared code. This meant that maintenance, optimisation and the addition of new features proved quite labour-intensive. We chose this naive solution to roll out SNS-Optics as fast as possible, but it incurred a lot of technical debt. The services themselves were quite unstable, requiring constant monitoring and restarts.
After a few months, we finally set aside the time to completely rewrite the collector infrastructure, with three goals in mind: maximize code sharing between data sources, make the collectors easier to maintain and extend, and improve their stability.
The key to maximizing code sharing in the new collector infrastructure is to isolate what is specific about each data source in an object with minimal functionality which implements a common trait. Thinking about what a collector actually does: it retrieves raw information from an API, parses it, sequentially retrieves any additional information it needs, and then commits the resulting data to the database. The entire operation runs in a loop, with each iteration parametrized by an offset, which can be a number or even a transaction signature. For instance, when looking at a Solana program's activity, we retrieve transactions starting from the latest one already in our database, then work our way forward to ever more recent transactions; the most recent transaction's signature is the offset here. Once this is done, we start the cycle again.
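As a rough sketch of that cycle (the helper functions and the integer offset below are hypothetical placeholders, not the actual SNS-Optics code), the outer loop looks something like this:
struct RawRecord;
struct Record;

// Retrieve one batch of raw records starting from an offset. In the real
// system this could be a transaction signature rather than an integer.
async fn fetch_batch(_offset: Option<u64>) -> (Option<u64>, Vec<RawRecord>) {
    // Call the external API here and return the new offset plus the raw batch
    (None, Vec::new())
}

// Parse a raw record and retrieve any additional information it needs
async fn process(_raw: RawRecord) -> Vec<Record> {
    Vec::new()
}

// Commit the fully resolved records to the database
async fn commit(_records: Vec<Record>) {}

async fn run_collection_cycle(mut offset: Option<u64>) {
    loop {
        let (new_offset, raw_batch) = fetch_batch(offset).await;
        for raw in raw_batch {
            let records = process(raw).await;
            commit(records).await;
        }
        // The next iteration resumes from the most recent offset we have seen
        offset = new_offset.or(offset);
    }
}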
The first thing to notice is that we're really looking at very different kinds of information, indexed by very different things. Transactions are indexed by a signature, whereas listings in the MagicEden API are indexed by an integer. In the case of program accounts, there isn't even an offset, since we retrieve all the accounts we're interested in with a single request. Our software architecture has to account for all of these cases. Let's take a whirlwind tour of how we structured our code to handle every data source at once.
The DataSource trait
All the different data sources have to implement one generic trait, which we call DataSource. Here is a trimmed-down version of the trait definition.
#[async_trait]
pub trait DataSource<RawRecord> {
    type RawOffset;
    type StopOffset;
    async fn get_batch(
        &self,
        offset: Option<Self::RawOffset>,
    ) -> Result<
        (
            Option<Self::RawOffset>,
            Vec<WithOffset<RawRecord, Self::RawOffset>>,
        ),
        crate::ErrorWithTrace,
    >;
    async fn process_raw_record(
        &self,
        record: RawRecord,
    ) -> Result<Vec<Record>, crate::ErrorWithTrace>;
}
The trick here is to decouple raw batch retrieval from the processing of the individual records it contains. This allows us to design a common infrastructure that handles DataSource objects without worrying about their specifics. The get_batch method handles all calls to the APIs or Solana RPC nodes, whereas the process_raw_record method handles each record. The collector itself then handles as many records in parallel as possible using the tokio asynchronous runtime.
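To make this more concrete, here is roughly what an implementation for a paginated listings API could look like. ListingsApiSource, RawListing and the endpoint are illustrative assumptions, WithOffset is assumed to pair a record with its offset via a new constructor, and the From conversions into Record and ErrorWithTrace are taken for granted:
pub struct ListingsApiSource {
    http: reqwest::Client,
    base_url: String,
}

#[async_trait]
impl DataSource<RawListing> for ListingsApiSource {
    // Listings are paginated by an integer offset
    type RawOffset = u64;
    type StopOffset = u64;

    async fn get_batch(
        &self,
        offset: Option<Self::RawOffset>,
    ) -> Result<
        (
            Option<Self::RawOffset>,
            Vec<WithOffset<RawListing, Self::RawOffset>>,
        ),
        crate::ErrorWithTrace,
    > {
        let start = offset.unwrap_or(0);
        // One HTTP round-trip retrieves a whole page of raw listings
        let listings: Vec<RawListing> = self
            .http
            .get(format!("{}/listings?offset={}", self.base_url, start))
            .send()
            .await?
            .json()
            .await?;
        let next_offset = Some(start + listings.len() as u64);
        let batch = listings
            .into_iter()
            .enumerate()
            .map(|(i, record)| WithOffset::new(record, start + i as u64))
            .collect();
        Ok((next_offset, batch))
    }

    async fn process_raw_record(
        &self,
        record: RawListing,
    ) -> Result<Vec<Record>, crate::ErrorWithTrace> {
        // Parse the raw listing, fetch any missing on-chain data,
        // and map it to the database schema
        Ok(vec![Record::from(record)])
    }
}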
When no offset logic is required, we can define the RawOffset type as ().
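For instance, a hypothetical program-accounts source that pulls every account of interest in a single request has no pagination state at all. With the same caveats as above about the assumed WithOffset constructor and conversions, and with ProgramAccountsSource and its fetch_all_accounts helper standing in for the real thing, it could look like this:
#[async_trait]
impl DataSource<RawAccount> for ProgramAccountsSource {
    // Everything is fetched at once, so there is nothing to paginate on
    type RawOffset = ();
    type StopOffset = ();

    async fn get_batch(
        &self,
        _offset: Option<()>,
    ) -> Result<(Option<()>, Vec<WithOffset<RawAccount, ()>>), crate::ErrorWithTrace> {
        // A single RPC request returns every account we are interested in
        let accounts: Vec<RawAccount> = self.fetch_all_accounts().await?;
        let batch = accounts
            .into_iter()
            .map(|account| WithOffset::new(account, ()))
            .collect();
        Ok((None, batch))
    }

    async fn process_raw_record(
        &self,
        record: RawAccount,
    ) -> Result<Vec<Record>, crate::ErrorWithTrace> {
        Ok(vec![Record::from(record)])
    }
}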
The CollectorRunner generic and the Collector trait
In order to handle the collection itself, we use the following CollectorRunner:
pub struct CollectorRunner<RawRecord, T: DataSource<RawRecord>> {
    // The data_source is in an `Arc` to facilitate spawning parallel `process_raw_record` and `get_batch` tasks
    data_source: Arc<T>,
    // A task feeds a queue of `RawRecord` objects which can then be processed in parallel
    raw_queue: SyncedReceiver<WithOffset<RawRecord, T::RawOffset>>,
    // ..
}
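Under the hood, the runner spawns a task that keeps calling get_batch and pushes every raw record into the queue. Here is a simplified sketch of that task, using a plain tokio mpsc channel where the real code uses its SyncedReceiver wrapper, and leaving stop offsets and proper error reporting out:
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::mpsc;

// Simplified: fetch batches in a loop and feed them to the processing side
fn spawn_batch_fetcher<RawRecord, T>(
    data_source: Arc<T>,
    sender: mpsc::Sender<WithOffset<RawRecord, T::RawOffset>>,
) where
    T: DataSource<RawRecord> + Send + Sync + 'static,
    RawRecord: Send + 'static,
    T::RawOffset: Clone + Send + 'static,
    WithOffset<RawRecord, T::RawOffset>: Send + 'static,
{
    tokio::spawn(async move {
        let mut offset: Option<T::RawOffset> = None;
        loop {
            match data_source.get_batch(offset.clone()).await {
                Ok((new_offset, batch)) => {
                    if new_offset.is_some() {
                        offset = new_offset;
                    }
                    for record in batch {
                        // Applies back-pressure when the processing side lags behind
                        if sender.send(record).await.is_err() {
                            // The receiving end was dropped, stop fetching
                            return;
                        }
                    }
                }
                Err(_) => {
                    // Wait a little and retry from the same offset
                    tokio::time::sleep(Duration::from_secs(1)).await;
                }
            }
        }
    });
}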
Now, if we want to be able to handle dynamically typed objects in Rust, we need a common trait, the Collector trait, which is implemented by our CollectorRunner:
#[async_trait]
pub trait Collector {
    async fn get_next_records(&self) -> Result<Option<Vec<Record>>, crate::ErrorWithTrace>;
}
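Here is a sketch of how CollectorRunner might satisfy this trait: pull whatever the fetching task has queued up and process the records concurrently. The try_drain method and the record field on WithOffset are assumptions made for the example, not the actual API:
use futures::future::try_join_all;

#[async_trait]
impl<RawRecord, T> Collector for CollectorRunner<RawRecord, T>
where
    T: DataSource<RawRecord> + Send + Sync,
    RawRecord: Send + Sync,
    T::RawOffset: Send + Sync,
{
    async fn get_next_records(&self) -> Result<Option<Vec<Record>>, crate::ErrorWithTrace> {
        // Grab the raw records queued up by the fetching task, if any
        // (`try_drain` is a placeholder for the actual queue API)
        let raw_batch = match self.raw_queue.try_drain() {
            Some(batch) => batch,
            None => return Ok(None),
        };
        // Process every record of the batch concurrently on the tokio runtime
        let processed = try_join_all(
            raw_batch
                .into_iter()
                .map(|item| self.data_source.process_raw_record(item.record)),
        )
        .await?;
        Ok(Some(processed.into_iter().flatten().collect()))
    }
}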
Finally, we can use the same logic to handle all our different CollectorRunner instances with different type parameters by boxing them behind the common trait, a classic Rust trick to enable dynamic typing: Box<dyn Collector>. This means that we can gather all our collectors in a Vec<Box<dyn Collector>> and launch them simultaneously.
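In practice, the launch code can look something like the sketch below. The source and runner constructors, as well as write_to_database, are hypothetical, but the pattern of collecting Box<dyn Collector> trait objects and driving each one in its own tokio task is the point:
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() {
    // One heterogeneous list of collectors, erased behind the `Collector` trait
    let collectors: Vec<Box<dyn Collector + Send + Sync>> = vec![
        Box::new(CollectorRunner::new(Arc::new(MagicEdenSource::new()))),
        Box::new(CollectorRunner::new(Arc::new(HyperspaceSource::new()))),
        Box::new(CollectorRunner::new(Arc::new(SolanaSource::new()))),
    ];

    let mut handles = Vec::new();
    for collector in collectors {
        // Each collector gets its own long-running task
        handles.push(tokio::spawn(async move {
            loop {
                match collector.get_next_records().await {
                    Ok(Some(records)) => write_to_database(records).await,
                    // Nothing queued yet, back off briefly
                    Ok(None) => tokio::time::sleep(Duration::from_millis(500)).await,
                    // Log the error and keep the collector alive
                    Err(_error) => {}
                }
            }
        }));
    }

    for handle in handles {
        let _ = handle.await;
    }
}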
Dynamic typing has a minimal runtime cost, but it can build up quickly for performance-critical logic. In this case, the call is sufficiently low down the stack that we shouldn't worry about it. However, as a general rule of thumb, we should always be thinking critically about whether dynamic typing is an appropriate solution for the problem at hand. The last thing we want is to coerce Rust into generating slow code!
One of the things that started us down this refactoring journey was the observation that our previous collector implementation was leaking memory. After a couple of days of running, the host instance would run out of memory and lock up, requiring a complete system restart.
As an interim solution, we forced the process to restart a couple of times a day. This solved the instability issues, but didn’t sit quite right with us. We investigated the memory leak for a while, but couldn’t find the root cause. We moved on, expecting to come back to the problem once the refactor came about.
Our first intuition was that by rewriting the entire business logic, the memory leak would either disappear or become easier to diagnose and fix. No such thing happened, and the new version continued leaking memory for no apparent reason. This suggested that the problem came from further upstream, in one of our dependencies. After doing some research, we found reports that the reqwest crate had a known issue with memory leaks when using the default memory allocator, reports which suggested using the excellent jemalloc in its stead. Funnily enough, jemalloc used to be Rust's default memory allocator before being replaced by the system allocator (glibc's malloc on Linux), whose main advantage is that it can be dynamically linked from the system libraries already available, reducing binary size.
We therefore recommend the use of jemalloc when writing for the tokio asynchronous runtime, especially when dealing with a long-running process. It's as easy as adding this line to your Cargo.toml dependencies:
tikv-jemallocator = { version = "0.5.0", features = ["unprefixed_malloc_on_supported_platforms"] }
And this to your main.rs:
use tikv_jemallocator::Jemalloc;
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
jemalloc not only reduces the occurrence of memory leaks, but also comes with a powerful suite of profiling and introspection tools.
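Those tools are also accessible from within the process itself. For example, with the companion tikv-jemalloc-ctl crate (an extra dependency, and a sketch rather than something the collectors strictly need), a long-running service can periodically log its own heap statistics:
use tikv_jemalloc_ctl::{epoch, stats};

// jemalloc caches its statistics, so advance the epoch before reading them
fn log_heap_stats() {
    if epoch::advance().is_ok() {
        let allocated = stats::allocated::read().unwrap_or(0);
        let resident = stats::resident::read().unwrap_or(0);
        println!("heap: {allocated} bytes allocated, {resident} bytes resident");
    }
}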