As the Solana Name Service grew, we wanted domain names to be easy to resell on third-party marketplaces. To achieve this, we made domains tokenizable as NFTs, which enabled our users to sell their registered domains on secondary markets such as MagicEden or Hyperspace. This created a new problem: with a variety of platforms offering domain names, it became harder for users to discover domains they might be interested in. What we needed was a bigger, combined picture of what was going on across the various marketplaces.
We rolled out SNS-Optics a while back to solve this problem. Integrated within the main naming.bonfida.org website, it offers an aggregated picture of all major marketplaces which trade Solana Name Service domains, allowing users to find out seamlessly whether and where a domain is available for sale.
Running this service requires constant collection of information from three sources: the MagicEden API, the Hyperspace API, and Solana itself. At first, we designed the system so that every data source had its own custom collector, with very little shared code. This meant that maintenance, optimisation and the addition of new features proved quite labour-intensive. We chose this naive solution to roll out SNS-Optics as fast as possible, but it incurred a lot of technical debt. The services themselves were quite unstable, requiring constant monitoring and restarts.
After a few months, we finally set aside the time to completely rewrite the collector infrastructure, with three goals in mind: maximize code sharing between data sources, make the collectors easier to maintain and extend, and improve their stability.
The key to maximizing code sharing in the new collector infrastructure is to isolate what is specific about each data source in an object with minimal functionality which implements a common trait. Thinking about what a collector actually does: it retrieves raw information from an API, parses it, sequentially retrieves any additional information it needs, and then commits the resulting data to the database. The entire operation runs in a loop, with each iteration parametrized by an offset, which can be a number or even a transaction signature. For instance, when looking at a Solana program's activity, we retrieve transactions starting from the latest one already in our database, then work our way forward to ever more recent transactions; the most recent transaction's signature is the offset here. Once this is done, we start the cycle again.
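As a rough sketch of that cycle (the helper functions and the integer offset below are hypothetical placeholders, not the actual SNS-Optics code), the outer loop looks something like this:
struct RawRecord;
struct Record;

// Retrieve one batch of raw records starting from an offset. In the real
// system this could be a transaction signature rather than an integer.
async fn fetch_batch(_offset: Option<u64>) -> (Option<u64>, Vec<RawRecord>) {
    // Call the external API here and return the new offset plus the raw batch
    (None, Vec::new())
}

// Parse a raw record and retrieve any additional information it needs
async fn process(_raw: RawRecord) -> Vec<Record> {
    Vec::new()
}

// Commit the fully resolved records to the database
async fn commit(_records: Vec<Record>) {}

async fn run_collection_cycle(mut offset: Option<u64>) {
    loop {
        let (new_offset, raw_batch) = fetch_batch(offset).await;
        for raw in raw_batch {
            let records = process(raw).await;
            commit(records).await;
        }
        // The next iteration resumes from the most recent offset we have seen
        offset = new_offset.or(offset);
    }
}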
The first thing to notice is that we're really looking at very different kinds of information, indexed by very different things. Transactions are indexed by a signature, whereas listings in the MagicEden API are indexed by an integer. In the case of program accounts, there isn't even an offset, since we retrieve all the accounts we're interested in with a single request. Our software architecture has to account for all of these cases. Let's take a whirlwind tour of how we structured our code to handle every data source at once.
The DataSource trait
All the different data sources have to implement one generic trait, which we call DataSource. Here is a trimmed-down version of the trait definition.
#[async_trait]
pub trait DataSource<RawRecord> {
    type RawOffset;
    type StopOffset;
    async fn get_batch(
        &self,
        offset: Option<Self::RawOffset>,
    ) -> Result<
        (
            Option<Self::RawOffset>,
            Vec<WithOffset<RawRecord, Self::RawOffset>>,
        ),
        crate::ErrorWithTrace,
    >;
    async fn process_raw_record(
        &self,
        record: RawRecord,
    ) -> Result<Vec<Record>, crate::ErrorWithTrace>;
}
The trick here is to decouple raw batch retrieval from the processing of the individual records it contains. This allows us to design a common infrastructure that handles DataSource objects without worrying about their specifics. The get_batch method handles all calls to the APIs or Solana RPC nodes, whereas the process_raw_record method handles each record. The collector itself then handles as many records in parallel as possible using the tokio asynchronous runtime.
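To make this more concrete, here is roughly what an implementation for a paginated listings API could look like. ListingsApiSource, RawListing and the endpoint are illustrative assumptions, WithOffset is assumed to pair a record with its offset via a new constructor, and the From conversions into Record and ErrorWithTrace are taken for granted:
pub struct ListingsApiSource {
    http: reqwest::Client,
    base_url: String,
}

#[async_trait]
impl DataSource<RawListing> for ListingsApiSource {
    // Listings are paginated by an integer offset
    type RawOffset = u64;
    type StopOffset = u64;

    async fn get_batch(
        &self,
        offset: Option<Self::RawOffset>,
    ) -> Result<
        (
            Option<Self::RawOffset>,
            Vec<WithOffset<RawListing, Self::RawOffset>>,
        ),
        crate::ErrorWithTrace,
    > {
        let start = offset.unwrap_or(0);
        // One HTTP round-trip retrieves a whole page of raw listings
        let listings: Vec<RawListing> = self
            .http
            .get(format!("{}/listings?offset={}", self.base_url, start))
            .send()
            .await?
            .json()
            .await?;
        let next_offset = Some(start + listings.len() as u64);
        let batch = listings
            .into_iter()
            .enumerate()
            .map(|(i, record)| WithOffset::new(record, start + i as u64))
            .collect();
        Ok((next_offset, batch))
    }

    async fn process_raw_record(
        &self,
        record: RawListing,
    ) -> Result<Vec<Record>, crate::ErrorWithTrace> {
        // Parse the raw listing, fetch any missing on-chain data,
        // and map it to the database schema
        Ok(vec![Record::from(record)])
    }
}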
When no offset logic is required, we can define the RawOffset type as ().
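For instance, a hypothetical program-accounts source that pulls every account of interest in a single request has no pagination state at all. With the same caveats as above about the assumed WithOffset constructor and conversions, and with ProgramAccountsSource and its fetch_all_accounts helper standing in for the real thing, it could look like this:
#[async_trait]
impl DataSource<RawAccount> for ProgramAccountsSource {
    // Everything is fetched at once, so there is nothing to paginate on
    type RawOffset = ();
    type StopOffset = ();

    async fn get_batch(
        &self,
        _offset: Option<()>,
    ) -> Result<(Option<()>, Vec<WithOffset<RawAccount, ()>>), crate::ErrorWithTrace> {
        // A single RPC request returns every account we are interested in
        let accounts: Vec<RawAccount> = self.fetch_all_accounts().await?;
        let batch = accounts
            .into_iter()
            .map(|account| WithOffset::new(account, ()))
            .collect();
        Ok((None, batch))
    }

    async fn process_raw_record(
        &self,
        record: RawAccount,
    ) -> Result<Vec<Record>, crate::ErrorWithTrace> {
        Ok(vec![Record::from(record)])
    }
}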
The CollectorRunner generic and the Collector trait
In order to handle the collection itself, we use the following CollectorRunner:
pub struct CollectorRunner<RawRecord, T: DataSource<RawRecord>> {
    // The data_source is in an `Arc` to facilitate spawning parallel `process_raw_record` and `get_batch` tasks
    data_source: Arc<T>,
    // A task feeds a queue of `RawRecord` objects which can then be processed in parallel
    raw_queue: SyncedReceiver<WithOffset<RawRecord, T::RawOffset>>,
    // ..
}
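Under the hood, the runner spawns a task that keeps calling get_batch and pushes every raw record into the queue. Here is a simplified sketch of that task, using a plain tokio mpsc channel where the real code uses its SyncedReceiver wrapper, and leaving stop offsets and proper error reporting out:
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::mpsc;

// Simplified: fetch batches in a loop and feed them to the processing side
fn spawn_batch_fetcher<RawRecord, T>(
    data_source: Arc<T>,
    sender: mpsc::Sender<WithOffset<RawRecord, T::RawOffset>>,
) where
    T: DataSource<RawRecord> + Send + Sync + 'static,
    RawRecord: Send + 'static,
    T::RawOffset: Clone + Send + 'static,
    WithOffset<RawRecord, T::RawOffset>: Send + 'static,
{
    tokio::spawn(async move {
        let mut offset: Option<T::RawOffset> = None;
        loop {
            match data_source.get_batch(offset.clone()).await {
                Ok((new_offset, batch)) => {
                    if new_offset.is_some() {
                        offset = new_offset;
                    }
                    for record in batch {
                        // Applies back-pressure when the processing side lags behind
                        if sender.send(record).await.is_err() {
                            // The receiving end was dropped, stop fetching
                            return;
                        }
                    }
                }
                Err(_) => {
                    // Wait a little and retry from the same offset
                    tokio::time::sleep(Duration::from_secs(1)).await;
                }
            }
        }
    });
}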
Now, if we want to be able to handle dynamically typed objects in Rust, we need a common trait, the Collector trait, which is implemented by our CollectorRunner:
#[async_trait]
pub trait Collector {
    async fn get_next_records(&self) -> Result<Option<Vec<Record>>, crate::ErrorWithTrace>;
}
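Here is a sketch of how CollectorRunner might satisfy this trait: pull whatever the fetching task has queued up and process the records concurrently. The try_drain method and the record field on WithOffset are assumptions made for the example, not the actual API:
use futures::future::try_join_all;

#[async_trait]
impl<RawRecord, T> Collector for CollectorRunner<RawRecord, T>
where
    T: DataSource<RawRecord> + Send + Sync,
    RawRecord: Send + Sync,
    T::RawOffset: Send + Sync,
{
    async fn get_next_records(&self) -> Result<Option<Vec<Record>>, crate::ErrorWithTrace> {
        // Grab the raw records queued up by the fetching task, if any
        // (`try_drain` is a placeholder for the actual queue API)
        let raw_batch = match self.raw_queue.try_drain() {
            Some(batch) => batch,
            None => return Ok(None),
        };
        // Process every record of the batch concurrently on the tokio runtime
        let processed = try_join_all(
            raw_batch
                .into_iter()
                .map(|item| self.data_source.process_raw_record(item.record)),
        )
        .await?;
        Ok(Some(processed.into_iter().flatten().collect()))
    }
}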
Finally, we can use the same logic to handle all our different CollectorRunner instances with different type parameters by boxing them behind the common trait, a classic Rust trick to enable dynamic typing: Box<dyn Collector>. This means that we can gather all our collectors in a Vec<Box<dyn Collector>> and launch them simultaneously.
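In practice, the launch code can look something like the sketch below. The source and runner constructors, as well as write_to_database, are hypothetical, but the pattern of collecting Box<dyn Collector> trait objects and driving each one in its own tokio task is the point:
use std::sync::Arc;
use std::time::Duration;

#[tokio::main]
async fn main() {
    // One heterogeneous list of collectors, erased behind the `Collector` trait
    let collectors: Vec<Box<dyn Collector + Send + Sync>> = vec![
        Box::new(CollectorRunner::new(Arc::new(MagicEdenSource::new()))),
        Box::new(CollectorRunner::new(Arc::new(HyperspaceSource::new()))),
        Box::new(CollectorRunner::new(Arc::new(SolanaSource::new()))),
    ];

    let mut handles = Vec::new();
    for collector in collectors {
        // Each collector gets its own long-running task
        handles.push(tokio::spawn(async move {
            loop {
                match collector.get_next_records().await {
                    Ok(Some(records)) => write_to_database(records).await,
                    // Nothing queued yet, back off briefly
                    Ok(None) => tokio::time::sleep(Duration::from_millis(500)).await,
                    // Log the error and keep the collector alive
                    Err(_error) => {}
                }
            }
        }));
    }

    for handle in handles {
        let _ = handle.await;
    }
}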
Dynamic typing has a minimal runtime cost, but it can build up quickly for performance-critical logic. In this case, the call is sufficiently low down the stack that we shouldn't worry about it. However, as a general rule of thumb, we should always be thinking critically about whether dynamic typing is an appropriate solution for the problem at hand. The last thing we want is to coerce Rust into generating slow code!
One of the things that started us down this refactoring journey was the observation that our previous collector implementation was leaking memory. After a couple of days of running, the host instance would run out of memory and lock up, requiring a complete system restart.
As an interim solution, we forced the process to restart a couple of times a day. This solved the instability issues, but didn’t sit quite right with us. We investigated the memory leak for a while, but couldn’t find the root cause. We moved on, expecting to come back to the problem once the refactor came about.
Our first intuition was that by rewriting the entire business logic, the memory leak would either disappear or become easier to diagnose and fix. No such thing happened, and the new version continued leaking memory for no apparent reason. This suggested that the problem came from further upstream, in one of our dependencies. After doing some research, we found reports that the reqwest crate had a known issue with memory leaks when using the default memory allocator, reports which suggested using the excellent jemalloc in its stead. Funnily enough, jemalloc used to be Rust's default memory allocator before being replaced by the system allocator (glibc's malloc on Linux), whose main advantage is that it can be dynamically linked from the system libraries already available, reducing binary size.
We therefore recommend the use of jemalloc when writing for the tokio asynchronous runtime, especially when dealing with a long-running process. It's as easy as adding this line to your Cargo.toml dependencies:
tikv-jemallocator = { version = "0.5.0", features = ["unprefixed_malloc_on_supported_platforms"] }
And this to your main.rs:
use tikv_jemallocator::Jemalloc;
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;
jemalloc not only reduces the occurrence of memory leaks, but also comes with a powerful suite of profiling and introspection tools.
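Those tools are also accessible from within the process itself. For example, with the companion tikv-jemalloc-ctl crate (an extra dependency, and a sketch rather than something the collectors strictly need), a long-running service can periodically log its own heap statistics:
use tikv_jemalloc_ctl::{epoch, stats};

// jemalloc caches its statistics, so advance the epoch before reading them
fn log_heap_stats() {
    if epoch::advance().is_ok() {
        let allocated = stats::allocated::read().unwrap_or(0);
        let resident = stats::resident::read().unwrap_or(0);
        println!("heap: {allocated} bytes allocated, {resident} bytes resident");
    }
}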