Component Deep Dive: src/page_io.rs

PageIO handles disk I/O for compressed pages. It defines the on-disk format, reads and writes each page behind a fixed-size metadata prefix, and (on macOS) disables kernel readahead and caching for the files it touches.

Source Outline

src/page_io.rs
 13  const PREFIX_META_SIZE: usize = 64;
 15  #[derive(Archive, Deserialize, Serialize, Debug)]
 17  struct Metadata { read_size: usize }
 21  pub struct PageIO {}
 24  impl PageIO {
        pub fn read_from_path(&self, path, offset) -> PageCacheEntryCompressed
        pub fn write_to_path(&self, path, offset, data: Vec<u8>) -> io::Result<()>
     }

On-Disk Layout

Offset: N
┌─────────────────────────────────────────────────────────────┐
│ 64-byte Metadata Prefix (rkyv serialized)                    │
│   - read_size : usize (size of compressed payload)           │
└─────────────────────────────────────────────────────────────┘
Offset: N + 64
┌─────────────────────────────────────────────────────────────┐
│ Compressed Page Bytes (output of Compressor::compress)       │
└─────────────────────────────────────────────────────────────┘

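Metadata is serialized with rkyv, a zero-copy serialization framework. On reads, the code calls rkyv::archived_root to view the metadata in place, with no allocation and no separate deserialization step. In rkyv 0.7 this function is unsafe and does not validate its input, so the read path trusts that the prefix bytes hold a well-formed archive.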
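
The layout arithmetic can be illustrated without rkyv. The following is a minimal sketch, not the actual encoding: it assumes a hypothetical stand-in format in which read_size is stored as a little-endian u64 in the first 8 bytes of the prefix and the remaining bytes are zero. The names encode_prefix, decode_prefix, and payload_range are illustrative and do not appear in src/page_io.rs.

```rust
/// Size of the fixed metadata prefix, matching PREFIX_META_SIZE in the outline.
const PREFIX_META_SIZE: usize = 64;

/// Hypothetical stand-in for the rkyv encoding: store `read_size` as a
/// little-endian u64 in the first 8 bytes and leave the rest zeroed.
fn encode_prefix(read_size: u64) -> [u8; PREFIX_META_SIZE] {
    let mut prefix = [0u8; PREFIX_META_SIZE];
    prefix[..8].copy_from_slice(&read_size.to_le_bytes());
    prefix
}

/// Recover `read_size` from a prefix produced by `encode_prefix`.
fn decode_prefix(prefix: &[u8; PREFIX_META_SIZE]) -> u64 {
    let mut len = [0u8; 8];
    len.copy_from_slice(&prefix[..8]);
    u64::from_le_bytes(len)
}

/// Byte range of the compressed payload for a record whose prefix starts
/// at `offset`: the payload begins exactly PREFIX_META_SIZE bytes later.
fn payload_range(offset: u64, read_size: u64) -> std::ops::Range<u64> {
    let start = offset + PREFIX_META_SIZE as u64;
    start..start + read_size
}
```

For a record at offset 128 with a 300-byte payload, the prefix occupies bytes 128..192 and payload_range(128, 300) covers bytes 192..492.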

Read Path (read_from_path)

1. Open file at `path`.
2. (macOS only) Disable readahead and kernel caching via fcntl F_RDAHEAD/F_NOCACHE.
3. Seek to requested `offset`.
4. Read 64 bytes into `meta_buffer`.
5. Interpret buffer as archived Metadata; extract `read_size`.
6. Seek to `offset + 64`.
7. Read `read_size` bytes into `ret_buffer`.
8. Return PageCacheEntryCompressed { page: ret_buffer }.

ASCII Flow

File descriptor ──► seek(offset)
                   │
                   ├─ read 64 bytes → meta_buffer
                   │
                   ├─ rkyv::archived_root(meta_buffer)
                   │         │
                   │         └─ read_size
                   │
                   ├─ seek(offset + 64)
                   │
                   └─ read read_size bytes → Vec<u8>
                               │
                               ▼
                     PageCacheEntryCompressed
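
The read steps above can be sketched with the standard library alone. This is an illustration under stated assumptions rather than the code in src/page_io.rs: the prefix is assumed to hold read_size as a little-endian u64 in its first 8 bytes (the real code reads an rkyv archive), the function name read_page is hypothetical, the macOS fcntl step is omitted, and errors are returned as io::Result instead of being unwrapped.

```rust
use std::convert::TryFrom;
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::Path;

const PREFIX_META_SIZE: usize = 64;

/// Mirrors the shape of the return type named in the outline.
struct PageCacheEntryCompressed {
    page: Vec<u8>,
}

/// Read one record (prefix + payload) that starts at `offset`.
/// Assumption: `read_size` is a little-endian u64 in the first 8 prefix bytes.
fn read_page(path: &Path, offset: u64) -> io::Result<PageCacheEntryCompressed> {
    // Step 1: open the file read-only.
    let mut file = File::open(path)?;

    // Steps 3-4: seek to the record and read the fixed-size prefix.
    file.seek(SeekFrom::Start(offset))?;
    let mut meta_buffer = [0u8; PREFIX_META_SIZE];
    file.read_exact(&mut meta_buffer)?;

    // Step 5: extract the payload length from the prefix.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&meta_buffer[..8]);
    let read_size = usize::try_from(u64::from_le_bytes(len_bytes))
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;

    // Step 6: seek to the payload. After read_exact the cursor is already
    // at offset + 64, so this seek only mirrors the documented step.
    file.seek(SeekFrom::Start(offset + PREFIX_META_SIZE as u64))?;

    // Step 7: read exactly `read_size` bytes of compressed data.
    let mut ret_buffer = vec![0u8; read_size];
    file.read_exact(&mut ret_buffer)?;

    // Step 8: hand the compressed bytes back to the caller.
    Ok(PageCacheEntryCompressed { page: ret_buffer })
}
```

Because read_size comes straight from disk, a corrupted prefix can request an arbitrarily large allocation; a production reader would likely cap the value before allocating.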

Write Path (write_to_path)

1. Create/truncate file at `path`.
2. (macOS) Disable readahead/caching.
3. Seek to `offset`.
4. Encode Metadata { read_size = data.len() } using rkyv AllocSerializer.
5. Copy serialized bytes into 64-byte buffer (zero-padded).
6. Allocate combined buffer (64 + data.len()).
7. Append metadata prefix + compressed payload.
8. Write the combined buffer with one write_all call.
9. Call fd.sync_all() to ensure durability.

ASCII Flow

Vec<u8> (compressed page)
       │
       ├─ rkyv serialize Metadata { read_size = len }
       │
       ├─ pad to 64 bytes → meta_buffer[64]
       │
       ├─ combined = meta_buffer || data
       │
       ├─ write_all(combined) at offset
       │
       └─ sync_all()
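
The write steps above can be sketched the same way. Again this is a sketch under assumptions, not the actual implementation: the rkyv serialization is replaced by a hypothetical little-endian u64 length in the first 8 bytes of the prefix, the name write_page is illustrative, and the macOS fcntl step is left out.

```rust
use std::fs::File;
use std::io::{self, Seek, SeekFrom, Write};
use std::path::Path;

const PREFIX_META_SIZE: usize = 64;

/// Write one record (prefix + payload) at `offset`.
/// Assumption: the prefix stores `data.len()` as a little-endian u64 and
/// zero-fills the remaining bytes; the real code stores an rkyv archive.
fn write_page(path: &Path, offset: u64, data: &[u8]) -> io::Result<()> {
    // Step 1: File::create truncates any existing file at `path`.
    let mut file = File::create(path)?;

    // Step 3: position the cursor where the record should start.
    file.seek(SeekFrom::Start(offset))?;

    // Steps 4-7: build prefix and payload in one contiguous buffer.
    let mut combined = Vec::with_capacity(PREFIX_META_SIZE + data.len());
    combined.extend_from_slice(&(data.len() as u64).to_le_bytes());
    combined.resize(PREFIX_META_SIZE, 0); // zero-pad the prefix to 64 bytes
    combined.extend_from_slice(data);

    // Step 8: one write_all call (which may still loop over write internally).
    file.write_all(&combined)?;

    // Step 9: flush file data and metadata to the storage device.
    file.sync_all()
}
```

Since the file is truncated first, writing at a nonzero offset leaves a gap before the record that reads back as zeros.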

Design Rationale

  • Fixed Metadata Size: The payload always starts exactly 64 bytes after the record offset, so the reader needs only simple seeks. The unused bytes in the prefix also leave room for future fields (e.g., compression algorithm identifiers) without changing the layout.

  • Zero-Copy Metadata: rkyv::archived_root avoids heap allocations when reading metadata, which matters for tight IO loops.

  • Kernel Cache Controls: Disabling caching on macOS prevents the OS page cache from double-buffering data already managed by the user-space caches.

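  • Single Syscall Writes: Combining metadata and payload into one buffer reduces syscall overhead and narrows the window in which a prefix exists on disk without its payload. It does not make the write atomic: write_all may issue several write calls, and a crash can still leave a partially written record.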
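
The kernel cache controls can be sketched as follows. This is a dependency-free illustration, not the code in src/page_io.rs, which may well use the libc crate instead. The function name disable_kernel_caching is hypothetical, fcntl is declared by hand, and the constants 45 and 48 are the F_RDAHEAD and F_NOCACHE values from the macOS sys/fcntl.h header. On other platforms the sketch compiles to a no-op.

```rust
use std::fs::File;
use std::io;

/// macOS: turn off readahead and data caching for this file descriptor.
#[cfg(target_os = "macos")]
fn disable_kernel_caching(file: &File) -> io::Result<()> {
    use std::os::raw::c_int;
    use std::os::unix::io::AsRawFd;

    // Hand-written declaration of the C library's variadic fcntl.
    // `unsafe extern` needs Rust 1.82 or newer; older toolchains spell
    // this block as plain `extern "C"`.
    unsafe extern "C" {
        fn fcntl(fd: c_int, cmd: c_int, ...) -> c_int;
    }

    // Command values taken from macOS <sys/fcntl.h>.
    const F_RDAHEAD: c_int = 45;
    const F_NOCACHE: c_int = 48;

    let fd = file.as_raw_fd();
    // SAFETY: `fd` is a valid open descriptor for the lifetime of `file`,
    // and both commands take a single int argument.
    unsafe {
        // A zero argument disables readahead.
        if fcntl(fd, F_RDAHEAD, 0 as c_int) == -1 {
            return Err(io::Error::last_os_error());
        }
        // A non-zero argument disables data caching.
        if fcntl(fd, F_NOCACHE, 1 as c_int) == -1 {
            return Err(io::Error::last_os_error());
        }
    }
    Ok(())
}

/// Other platforms: nothing to do in this sketch.
#[cfg(not(target_os = "macos"))]
fn disable_kernel_caching(_file: &File) -> io::Result<()> {
    Ok(())
}
```

Both settings apply per file descriptor, which is why the read and write paths each repeat the step after opening the file.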

Integration Points

  • PageHandler::fetch_from_fs uses PageIO::read_from_path to populate the CPC when a page is missing from caches.
  • Future flush logic will call write_to_path when evicting dirty pages or writing WAL checkpoints.
  • Metadata offsets recorded in TableMetaStore::PageMetadata correspond directly to the offsets passed here.

Considerations & TODOs

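  • Truncation: The current write logic uses File::create, which truncates the file, so each call discards earlier contents and a nonzero offset leaves a zero-filled gap before the record. Writing several pages into one file will require OpenOptions::new().write(true).create(true) without truncation, plus careful offset management (and synchronization if writers run concurrently).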
  • Checksums: Adding checksums or version stamps to the metadata prefix would help detect corruption.
  • Error Propagation: read_from_path unwraps I/O results, so any I/O failure (missing file, short read) panics. Production code should return io::Result and let callers handle the failure.
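
One possible direction for these three items is sketched below. None of it exists in src/page_io.rs today. The names write_checked, read_checked, and fnv1a64 are hypothetical, the prefix is assumed to hold a little-endian u64 length followed by a little-endian u64 checksum, and FNV-1a is only a placeholder for whatever checksum (for example CRC32) the project eventually picks.

```rust
use std::convert::TryFrom;
use std::fs::{File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

const PREFIX_META_SIZE: usize = 64;

/// 64-bit FNV-1a: a simple non-cryptographic hash used as a placeholder checksum.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Write a record without truncating the file, storing length and checksum
/// in the first 16 bytes of the prefix.
fn write_checked(path: &Path, offset: u64, data: &[u8]) -> io::Result<()> {
    let mut file = OpenOptions::new()
        .write(true)
        .create(true)
        .truncate(false) // keep records already present in the file
        .open(path)?;
    file.seek(SeekFrom::Start(offset))?;

    let mut record = Vec::with_capacity(PREFIX_META_SIZE + data.len());
    record.extend_from_slice(&(data.len() as u64).to_le_bytes());
    record.extend_from_slice(&fnv1a64(data).to_le_bytes());
    record.resize(PREFIX_META_SIZE, 0);
    record.extend_from_slice(data);

    file.write_all(&record)?;
    file.sync_all()
}

/// Read a record and verify its checksum, returning an error instead of panicking.
fn read_checked(path: &Path, offset: u64) -> io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    file.seek(SeekFrom::Start(offset))?;

    let mut prefix = [0u8; PREFIX_META_SIZE];
    file.read_exact(&mut prefix)?;

    let mut word = [0u8; 8];
    word.copy_from_slice(&prefix[..8]);
    let read_size = usize::try_from(u64::from_le_bytes(word))
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
    word.copy_from_slice(&prefix[8..16]);
    let expected = u64::from_le_bytes(word);

    let mut data = vec![0u8; read_size];
    file.read_exact(&mut data)?;

    if fnv1a64(&data) != expected {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "page checksum mismatch",
        ));
    }
    Ok(data)
}
```

With this shape, two records written at offsets 0 and 128 of the same file both remain readable, and a flipped payload byte surfaces as an InvalidData error rather than as silently corrupt page contents.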