A storage-tier-aware LLM inference scheduler for Apple Silicon

Article URL: https://github.com/t8/hypura
Comments URL: https://news.ycombinator.com/item?id=47504695
Points: 125 | Comments: 58


Run a 31 GB Mixtral 8x7B on a 32 GB Mac Mini at 2.2 tok/s. A 40 GB Llama 70B at 0.3 tok/s. Vanilla llama.cpp crashes on both.

Consumer hardware (MacBook Pro, Mac Studio) ships with fast unified memory
and NVMe storage, but limited capacity.

A 32 GB M1 Max cannot naively load
a 40 GB model — the OS will swap-thrash until the OOM killer intervenes.

Hypura solves this by understanding the model architecture. The result: models that would crash your machine under naive mmap become runnable.

Models that fit in memory run at full Metal GPU speed with zero overhead.

Hypura reads the GGUF file, profiles your hardware (GPU working set, RAM, NVMe bandwidth), and solves a placement optimization that assigns every tensor to a tier. It selects the best inference mode automatically based on model size, architecture, and available memory. Pool buffer size, prefetch depth, and memory budgets are computed automatically from your hardware profile — no manual tuning needed.
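The placement step can be pictured as a small optimization problem. Here is a minimal Rust sketch — the names, the hotness scoring, and the greedy fill strategy are all illustrative assumptions, not Hypura's actual solver: sort tensors by how often they are touched, then assign each to the fastest tier that still has budget.

```rust
// Hypothetical sketch of tensor-to-tier placement under memory budgets.
// Tier ordering mirrors the article: GPU working set, then RAM pool,
// then NVMe cold storage.
#[derive(Debug, PartialEq)]
enum Tier { Gpu, Ram, Nvme }

struct Tensor {
    name: &'static str,
    bytes: u64,
    hotness: f64, // assumed access-frequency score; how it is derived is not specified
}

/// Greedy placement: hottest tensors land on the fastest tier with budget left.
fn place(tensors: &mut Vec<Tensor>, gpu_budget: u64, ram_budget: u64) -> Vec<(&'static str, Tier)> {
    tensors.sort_by(|a, b| b.hotness.partial_cmp(&a.hotness).unwrap());
    let (mut gpu_used, mut ram_used) = (0u64, 0u64);
    tensors
        .iter()
        .map(|t| {
            let tier = if gpu_used + t.bytes <= gpu_budget {
                gpu_used += t.bytes;
                Tier::Gpu
            } else if ram_used + t.bytes <= ram_budget {
                ram_used += t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme // everything else streams from disk on demand
            };
            (t.name, tier)
        })
        .collect()
}
```

A real solver would also weigh NVMe bandwidth and tensor co-activation, but the budgeted-greedy shape is the intuition.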

All benchmarks on M1 Max, 32 GB unified memory, ~5.1 GB/s NVMe sequential read.

Key takeaway: For models that fit in memory, Hypura adds zero overhead.

For models that don't fit, Hypura is the difference between "runs" and "crashes." Expert-streaming on Mixtral achieves usable interactive speeds by keeping only non-expert tensors on GPU and exploiting MoE sparsity (only 2/8 experts fire per token).
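The MoE arithmetic behind that claim is easy to sketch. Assuming top-2 routing over 8 equally sized experts (the functions and numbers below are illustrative, not Hypura's measurements):

```rust
// With top-k MoE routing, only k of n expert FFNs must be read per token,
// so the bytes streamed from NVMe per token shrink by a factor of k/n.
fn active_expert_bytes(total_expert_bytes: u64, n_experts: u64, top_k: u64) -> u64 {
    (total_expert_bytes / n_experts) * top_k
}

// The NVMe link then bounds decode speed at roughly bandwidth / bytes-per-token.
fn streaming_tok_s(bandwidth_bytes_per_s: f64, bytes_per_token: u64) -> f64 {
    bandwidth_bytes_per_s / bytes_per_token as f64
}
```

With, say, 24 GB of expert weights, top-2-of-8 routing reads only 6 GB per token's worth of experts — which is why exploiting sparsity turns "crashes" into "usable interactive speeds."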

Dense FFN-streaming extends this to non-MoE models like Llama 70B.

Pool sizes and prefetch depth scale automatically with available memory.
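A toy version of that scaling might look like the following — this is my own heuristic for illustration, not Hypura's formula: derive a pool budget from free memory with OS headroom, and let prefetch depth follow from how many expert-sized chunks fit in the budget.

```rust
// Hypothetical auto-tuning sketch: pool budget and prefetch depth derived
// from the hardware profile, so no manual flags are needed.
fn pool_config(free_bytes: u64, expert_bytes: u64) -> (u64, u64) {
    let budget = free_bytes * 3 / 4; // keep ~25% headroom for the OS (assumed ratio)
    let depth = (budget / expert_bytes).clamp(1, 8); // experts to prefetch ahead
    (budget, depth)
}
```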

Hypura builds from source with Cargo.

You'll need Rust 1.75+ and CMake (for the vendored llama.cpp).

The binary is at target/release/hypura.

Start with --max-tokens 10 on untested models before scaling up.

Hypura exposes an Ollama-compatible HTTP API, making it a drop-in replacement for any tool that talks to Ollama — including OpenClaw.

Point OpenClaw at Hypura by setting the Ollama base URL in ~/.openclaw/openclaw.json. Hypura speaks the native Ollama protocol (/api/chat with NDJSON streaming), so no compatibility shims are needed.
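To make the NDJSON framing concrete: the stream is one JSON object per line, ending with a final `"done":true` object. Here is a toy Rust parser — plain string matching stands in for a real JSON library, and the field names follow the Ollama /api/chat response shape:

```rust
// Toy NDJSON stream reassembly. A real client would use serde_json;
// this string-matching version breaks on escaped quotes and is only
// meant to show the one-object-per-line framing.
fn content_field(line: &str) -> Option<&str> {
    let key = "\"content\":\"";
    let start = line.find(key)? + key.len();
    let end = line[start..].find('"')? + start;
    Some(&line[start..end])
}

/// Concatenate the "content" deltas from every streamed line;
/// lines without a content field (like the final done marker) are skipped.
fn collect_stream(body: &str) -> String {
    body.lines().filter_map(content_field).collect()
}
```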

Hypura is a Cargo workspace with two crates.

Will constantly streaming weights wear out your SSD? No.

Hypura only reads from your SSD during inference — it never writes to it.

SSD wear is caused by write cycles (program/erase cycles on NAND flash cells).

Reads do not degrade flash cells.

Hypura's entire NVMe I/O path uses read-only pread() calls with F_NOCACHE to stream tensor weights from the GGUF file into RAM/GPU memory pools, where all computation happens.
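A minimal sketch of that read path, using std's `read_exact_at` (which maps to pread(2) on Unix). The offset and length here are illustrative, and the F_NOCACHE step is omitted to keep the example std-only — on macOS it would be set separately via an fcntl call:

```rust
use std::fs::File;
use std::os::unix::fs::FileExt; // read_exact_at == positioned, read-only pread(2)

/// Read `len` bytes of tensor data at `offset` without touching any shared
/// file cursor — the same read-only access pattern described above.
/// (A real reader would also bypass the page cache, e.g. F_NOCACHE on macOS.)
fn read_tensor(f: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    f.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```

Nothing in this path opens the file for writing, which is why inference-time SSD wear is a non-issue.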

The SSD is used as cold storage, not as working memory.

The only writes Hypura performs are negligible: benchmark result JSON files (~KB), co-activation statistics (~KB to ~/.hypura/), and the one-time hypura optimize command if you choose to run it.

Normal inference generates zero SSD writes.

I feel morally obligated to say I did not write the code in this repository myself.

This project is an exploration of using LLMs to carry out tasks based on my direction.

The majority of prompts I used to get here were derived using the Socratic method, genuine curiosity, and a hunch that NVMe-backed inference is underutilized, despite NVMe being a (slow but) perfectly valid tier of memory.
