Satori Storage Architecture Overview
This document walks through the current storage engine blueprint implemented in src/. It assumes no prior familiarity with the codebase and covers the core data structures, caches, metadata, and execution flows that already exist in the repository.
High-Level Topology
At runtime, main.rs wires together the storage layers into the following pipeline:
┌─────────────────────────────────────────────────────────────┐
│ TableMetaStore │
│ - column ranges + MVCC versions │
│ - owns Arc<PageMetadata> entries │
└──────────────┬──────────────────────────────────────────────┘
│ Arc<PageMetadata> (id, disk path, offset)
▼
┌─────────────────────────────────────────────────────────────┐
│ PageHandler │
│ ┌────────────────────────┬───────────────────────────────┐ │
│ │ Uncompressed Page Cache│ Compressed Page Cache │ │
│ │ (UPC / hot Pages) │ (CPC / cold blobs) │ │
│ │ RwLock<LruCache> │ RwLock<LruCache> │ │
│ └──────────────┬─────────┴──────────────┬────────────────┘ │
│ │ │ │
│ UPC hit? CPC hit? │
│ │ │ │
│ ▼ ▼ │
│ Arc<PageCacheEntry> Arc<PageCacheEntry> │
│ │ │ │
│ └─────────────┬──────────┘ │
│ ▼ │
│ Compressor │
│ (lz4_flex + bincode) │
│ │ │
│ ▼ │
│ PageIO │
│ (64B metadata + blob on disk) │
└───────────────────────────────┬─────────────────────────────┘
│
▼
Persistent Storage
All public operations (ops_handler.rs) rely on TableMetaStore to map logical column ranges to physical pages, then ask PageHandler to materialize the requested pages through the cache strata down to disk.
Data Model
Entry (src/entry.rs)
- Entry represents a single logical row fragment belonging to a column.
- Metadata fields (prefix_meta / suffix_meta) are placeholders for future per-entry bookkeeping.
- current_epoch_millis() timestamps are reused for both cache LRU and MVCC bookkeeping.
Page (src/page.rs)
- A Page groups an ordered list of Entry instances plus optional page-level metadata.
- Pages serialize via Serde, making them compatible with bincode and LZ4 compression.
- Page::add_entry currently appends in memory; flushing to disk is deferred to cache eviction and IO paths.
Page
├─ page_metadata : String (placeholder)
└─ entries : Vec<Entry>
├─ prefix_meta : String
├─ data : String
└─ suffix_meta : String
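For orientation, the shapes above translate into roughly the following Rust. This is a sketch only: the derives are assumed from the Serde note, and the real definitions live in src/entry.rs and src/page.rs.

    use serde::{Deserialize, Serialize};

    // Mirrors the diagram above; field names come from the layout shown.
    #[derive(Clone, Serialize, Deserialize)]
    pub struct Entry {
        pub prefix_meta: String, // placeholder for future per-entry bookkeeping
        pub data: String,
        pub suffix_meta: String, // placeholder for future per-entry bookkeeping
    }

    #[derive(Clone, Serialize, Deserialize)]
    pub struct Page {
        pub page_metadata: String, // placeholder
        pub entries: Vec<Entry>,
    }

    impl Page {
        // Appends in memory only; flushing is deferred to the eviction/IO paths.
        pub fn add_entry(&mut self, entry: Entry) {
            self.entries.push(entry);
        }
    }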
Metadata Catalog (src/metadata_store.rs)
TableMetaStore is the authoritative mapping from logical column ranges to physical pages and their on-disk location. It has two maps:
col_data : HashMap<ColumnName, Arc<RwLock<Vec<TableMetaStoreEntry>>>>
page_data : HashMap<PageId, Arc<PageMetadata>>
- PageMetadata holds immutable page IDs, disk paths, and file offsets.
- TableMetaStoreEntry spans a contiguous (start_idx, end_idx) range and stores the MVCC history for that slice.
- MVCCKeeperEntry captures a particular version (page_id, commit_time, locked_by placeholder).
- Range lookups use binary search over the (start_idx, end_idx) boundaries, then pick the newest version whose commit_time is ≤ the supplied upper bound.
Metadata Lifecycles
- Adding a new page:
  - add_new_page_meta fabricates a PageMetadata (ID TODO: currently stubbed as "1111111").
  - add_new_page_to_col appends the MVCC entry to the column's newest TableMetaStoreEntry.
- Lookup helpers:
  - get_latest_page_meta(column) returns the head of the most recent range.
  - get_ranged_pages_meta(column, l, r, ts) returns Arc<PageMetadata> handles for each qualifying range.
Locks are held briefly: read locks cover the binary search, after which the code drops guards and clones Arc<PageMetadata> values to unblock writers quickly.
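A simplified sketch of that lookup pattern is below. Types and field names mirror the diagram that follows, but the code is illustrative only; for brevity it returns a cloned page ID rather than an Arc<PageMetadata>, and the real logic lives in metadata_store.rs.

    use std::sync::{Arc, RwLock};

    // Simplified stand-ins for the real metadata types.
    struct MVCCKeeperEntry { page_id: String, commit_time: u64 }
    struct TableMetaStoreEntry { start_idx: u64, end_idx: u64, page_metas: Vec<MVCCKeeperEntry> }

    // Find the page covering `row` that is visible at timestamp `ts`:
    // binary-search the range boundaries, then take the newest version whose
    // commit_time is <= ts. The read guard drops as soon as the function returns.
    fn visible_page_id(
        ranges: &Arc<RwLock<Vec<TableMetaStoreEntry>>>,
        row: u64,
        ts: u64,
    ) -> Option<String> {
        let guard = ranges.read().unwrap();
        let idx = guard
            .binary_search_by(|r| {
                if row < r.start_idx {
                    std::cmp::Ordering::Greater
                } else if row >= r.end_idx {
                    std::cmp::Ordering::Less
                } else {
                    std::cmp::Ordering::Equal
                }
            })
            .ok()?;
        guard[idx]
            .page_metas
            .iter()
            .filter(|v| v.commit_time <= ts)
            .max_by_key(|v| v.commit_time)
            .map(|v| v.page_id.clone())
    }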
Metadata Diagram
col_data["users.age"] ───────────┐
▼
Arc<RwLock<Vec<TableMetaStoreEntry>>> (per-column ranges)
│
├─ TableMetaStoreEntry #0
│ start_idx = 0
│ end_idx = 1024
│ page_metas:
│ [ MVCCKeeperEntry(page_id="p1", commit=100),
│ MVCCKeeperEntry(page_id="p2", commit=120) ]
│
└─ TableMetaStoreEntry #1
start_idx = 1024
end_idx = 2048
page_metas:
[ MVCCKeeperEntry(page_id="p9", commit=180) ]
page_data:
"p1" -> Arc<PageMetadata { id="p1", disk_path="/data/...", offset=... }>
"p2" -> Arc<PageMetadata { ... }>
"p9" -> Arc<PageMetadata { ... }>
Cache Hierarchy (src/page_cache.rs)
There are two symmetric caches layered by temperature:
- Uncompressed Page Cache (UPC) stores hot Page structs for CPU-side mutations and reads.
- Compressed Page Cache (CPC) stores compressed byte blobs mirroring disk layout, ready to be decompressed back into the UPC.
Each cache is a PageCache<T> with:
- store: HashMap<PageId, PageCacheEntry<T>> for fast lookups.
- lru_queue: BTreeSet<(used_time, PageId)> implementing an LRU eviction policy capped at LRUsize = 10.
- PageCacheEntry<T> wraps the payload in an Arc so multiple consumers can reuse a single cached page without copying.
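In shape, each cache looks roughly like the sketch below. Types are simplified, and the eviction here simply drops the victim, whereas the planned behavior described next is to promote it into the other tier.

    use std::collections::{BTreeSet, HashMap};
    use std::sync::Arc;

    const LRU_SIZE: usize = 10; // the LRUsize = 10 cap mentioned above

    type PageId = String;

    // Payload is wrapped in Arc so consumers share one cached copy.
    struct PageCacheEntry<T> {
        payload: Arc<T>,
        used_time: u128, // current_epoch_millis() at last access
    }

    struct PageCache<T> {
        store: HashMap<PageId, PageCacheEntry<T>>,
        // Ordered by (used_time, id); the smallest element is the LRU victim.
        lru_queue: BTreeSet<(u128, PageId)>,
    }

    impl<T> PageCache<T> {
        fn insert(&mut self, id: PageId, payload: Arc<T>, now: u128) {
            if self.store.len() >= LRU_SIZE {
                // Evict the least recently used page. TODO in the real code:
                // promote into the next tier instead of dropping it.
                if let Some((_, victim)) = self.lru_queue.pop_first() {
                    self.store.remove(&victim);
                }
            }
            self.lru_queue.insert((now, id.clone()));
            self.store.insert(id, PageCacheEntry { payload, used_time: now });
        }
    }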
Planned improvements (documented in comments but not yet implemented):
- Drop handlers for cache entries should trigger cross-cache promotion (UPC eviction → compress into CPC; CPC eviction → flush to disk). These are currently TODOs.
Cache Flow
Request page "p42"
│
├─ UPC (HashMap) hit? ──┐
│ │ return Arc<PageCacheEntryUncompressed>
│
├─ CPC hit? ────────────┤
│ decompress │
│ insert into │
│ UPC ┘
│
└─ Fetch from disk
│
├─ PageIO::read_from_path (path, offset)
├─ insert compressed bytes into CPC
└─ decompress + seed UPC
Compressor (src/compressor.rs)
- compress: Serializes Page via bincode, compresses with lz4_flex, and returns a PageCacheEntryCompressed (round trip sketched after this list).
- decompress: Reverses the process, producing a PageCacheEntryUncompressed.
- Accepts/returns Arc<PageCacheEntry*> so the compressor can operate without extra cloning.
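A minimal sketch of that round trip, assuming bincode 1.x-style helpers and lz4_flex's size-prepended convenience functions; the real code wraps the results in the PageCacheEntry types and may use different lz4_flex entry points.

    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize)]
    struct Page { page_metadata: String, entries: Vec<String> } // simplified stand-in

    // Page -> bincode bytes -> LZ4 blob (roughly what Compressor::compress does).
    fn compress(page: &Page) -> Vec<u8> {
        let bytes = bincode::serialize(page).expect("serialize page");
        lz4_flex::compress_prepend_size(&bytes)
    }

    // LZ4 blob -> bincode bytes -> Page (roughly what Compressor::decompress does).
    fn decompress(blob: &[u8]) -> Page {
        let bytes = lz4_flex::decompress_size_prepended(blob).expect("decompress page");
        bincode::deserialize(&bytes).expect("deserialize page")
    }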
Persistent Storage Layout (src/page_io.rs)
Pages on disk are stored as a 64-byte metadata prefix followed by the compressed payload:
┌───────────────────────────────┬───────────────────────────────┐
│ 64-byte Metadata (rkyv) │ LZ4 Compressed Page bytes │
│ - read_size : usize │ (output of Compressor::compress)│
└───────────────────────────────┴───────────────────────────────┘
- read_from_path(path, offset):
  - Seeks to offset, reads the 64-byte prefix.
  - Uses rkyv::archived_root for zero-copy access to metadata.
  - Reads the exact number of bytes for the compressed page and returns PageCacheEntryCompressed.
- write_to_path(path, offset, data):
  - Serializes the metadata structure with rkyv.
  - Pads the prefix to exactly 64 bytes.
  - Writes prefix + page bytes in a single call, then sync_all() to persist.
NOTE: The file is currently created/truncated with File::create; future work should open existing files without truncating them and seek carefully when appending new pages.
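The read path reduces to the sketch below. For brevity it treats read_size as a little-endian u64 at the start of the prefix; the real code instead interprets the 64-byte prefix through rkyv::archived_root.

    use std::fs::File;
    use std::io::{Read, Seek, SeekFrom};

    fn read_page_blob(path: &str, offset: u64) -> std::io::Result<Vec<u8>> {
        let mut file = File::open(path)?;
        file.seek(SeekFrom::Start(offset))?;

        // 1. Read the fixed 64-byte metadata prefix.
        let mut prefix = [0u8; 64];
        file.read_exact(&mut prefix)?;
        let read_size = u64::from_le_bytes(prefix[..8].try_into().unwrap()) as usize;

        // 2. Read exactly `read_size` bytes of LZ4-compressed page data.
        let mut blob = vec![0u8; read_size];
        file.read_exact(&mut blob)?;
        Ok(blob)
    }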
PageHandler (src/page_handler.rs)
PageHandler orchestrates the cache hierarchy and IO. It exposes:
- get_page(PageMetadata) → single-page access with UPC → CPC → disk fallback.
- get_pages(Vec<PageMetadata>) → batch access preserving original order and minimizing lock contention.
Batch Retrieval Flow
Input: [PageMetadata{id="p1"}, PageMetadata{id="p7"}, ...]
1. UPC sweep (read lock):
collect hits in request order, track missing IDs in `meta_map`.
2. CPC sweep (read lock):
collect compressed hits; remove from `meta_map`.
3. Decompress CPC hits:
for each (id, Arc<PageCacheEntryCompressed>):
decompress → PageCacheEntryUncompressed
insert into UPC (write lock)
4. Remaining misses:
for each metadata still in `meta_map`:
PageIO::read_from_path → insert into CPC
decompress → insert into UPC
5. Final UPC pass (read lock):
gather pages in original order for any IDs not emitted in step 1.
Locks are held only around direct cache map access; decompression happens outside of locks to prevent blocking other threads.
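That lock-scoping pattern looks roughly like the sketch below; byte payloads stand in for Page structs so the sketch stays self-contained, and the disk fallback is elided.

    use std::collections::HashMap;
    use std::sync::{Arc, RwLock};

    fn get_page(
        upc: &RwLock<HashMap<String, Arc<Vec<u8>>>>, // stand-in for the uncompressed cache
        cpc: &RwLock<HashMap<String, Arc<Vec<u8>>>>, // stand-in for the compressed cache
        id: &str,
    ) -> Option<Arc<Vec<u8>>> {
        // 1. UPC probe under a short read lock.
        if let Some(hit) = upc.read().unwrap().get(id).cloned() {
            return Some(hit);
        }
        // 2. CPC probe: clone the Arc out, then let the read lock drop.
        let compressed = cpc.read().unwrap().get(id).cloned();
        if let Some(blob) = compressed {
            // Decompress with no lock held.
            let page = Arc::new(lz4_flex::decompress_size_prepended(&blob).ok()?);
            // 3. Seed the UPC under a short write lock.
            upc.write().unwrap().insert(id.to_string(), page.clone());
            return Some(page);
        }
        // 4. Disk fallback (PageIO::read_from_path -> CPC insert -> decompress) omitted.
        None
    }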
Operation Layer (src/ops_handler.rs)
External API surface for column operations:
- upsert_data_into_column(meta_store, page_handler, col, data)
  - Reads latest page metadata via TableMetaStore.
  - Gets the page through PageHandler.
  - Clones, mutates (Page::add_entry), and reinserts into UPC (see the sketch after this list).
  - TODOs remain for updating metadata ranges and persisting dirty pages.
- update_column_entry(meta_store, page_handler, col, data, row)
  - Similar to upsert but updates a specific row index if it exists.
- range_scan_column_entry(meta_store, page_handler, col, l_row, r_row, ts)
  - Fetches MVCC-aware page metadata for the range.
  - Calls PageHandler::get_pages to materialize entries.
  - Returns concatenated Entry vectors (no row slicing yet).
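The clone-mutate-reinsert step never mutates the cached Arc in place; a fresh Page copy goes back into the UPC. A minimal illustration, using a simplified Page type:

    use std::sync::Arc;

    #[derive(Clone)]
    struct Page { entries: Vec<String> } // simplified stand-in

    // Deep-copy the cached page, mutate the copy, and hand it back for
    // reinsertion into the UPC (the cached Arc itself stays untouched).
    fn upsert_into_cached(cached: &Arc<Page>, data: String) -> Page {
        let mut page = (**cached).clone();
        page.entries.push(data); // Page::add_entry equivalent
        page
    }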
Scheduler Stub (src/scheduler.rs)
Contains a minimal thread pool (ThreadPool) backed by a channel and worker threads. It is currently unused but illustrates the intended direction for coordinating WAL/ops/background work.
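For readers unfamiliar with the pattern, a channel-backed pool typically looks like the sketch below; this is a generic illustration, not the exact code in scheduler.rs.

    use std::sync::{mpsc, Arc, Mutex};
    use std::thread;

    type Job = Box<dyn FnOnce() + Send + 'static>;

    struct ThreadPool {
        sender: mpsc::Sender<Job>,
        workers: Vec<thread::JoinHandle<()>>,
    }

    impl ThreadPool {
        fn new(size: usize) -> Self {
            let (sender, receiver) = mpsc::channel::<Job>();
            let receiver = Arc::new(Mutex::new(receiver));
            let workers = (0..size)
                .map(|_| {
                    let receiver = Arc::clone(&receiver);
                    thread::spawn(move || loop {
                        // Hold the lock only long enough to take one job.
                        let job = receiver.lock().unwrap().recv();
                        match job {
                            Ok(job) => job(),
                            Err(_) => break, // channel closed: pool shutting down
                        }
                    })
                })
                .collect();
            ThreadPool { sender, workers }
        }

        fn execute<F: FnOnce() + Send + 'static>(&self, f: F) {
            self.sender.send(Box::new(f)).expect("worker threads alive");
        }
    }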
WAL & Latest Modules
src/wal.rs and src/latest.rs are placeholders containing design notes for future write-ahead logging and in-place “latest view” storage. They document intent but contribute no runtime behavior yet.
Putting It All Together
Current execution path for a write request:
Client upsert
│
▼
ops_handler::upsert_data_into_column
│ read lock
▼
TableMetaStore::get_latest_page_meta
│
▼
PageHandler::get_page
├─ UPC? -> return Arc (hot path)
├─ CPC? -> decompress + seed UPC
└─ disk -> PageIO::read + CPC insert + decompress + seed UPC
│
▼
Clone Arc<PageCacheEntryUncompressed>
│ mutate Page entries
▼
UPC write lock
│ add updated page back into cache
▼
Return success (metadata update TODO)
Current execution path for a range scan:
Client range query
│
▼
ops_handler::range_scan_column_entry
│ read lock
▼
TableMetaStore::get_ranged_pages_meta
│
▼
PageHandler::get_pages (batch)
├─ UPC sweep -> immediate hits
├─ CPC sweep -> decompress + UPC
└─ Disk fetch -> CPC insert -> decompress + UPC
│
▼
Gather Pages in order
│
▼
Collect Entry vectors (no row slicing yet)
Open TODOs Highlighted in Code
- Real page ID generation and persistence of metadata updates.
- UPC eviction should compress into CPC; CPC eviction should flush to disk.
- get_page_path_and_offset should guard against missing IDs (currently unwraps).
- WAL integration, scheduling, and compaction logic are future work but already have design notes.
- Page::add_entry needs a disk persistence hook once the eviction path is implemented.
This blueprint should give new contributors a map of the current system and the intended direction. Use it alongside inline comments in src/ to trace the exact code paths.