Marmot

LLM inference engine
written in C23

High-performance tensor computation with dual-backend execution on CPU and Metal. Runs GGUF models out of the box.

$ brew install marmot-lm
107 operations · 12 GGUF quant formats · 5 model architectures · 90 tests (C + Rust)

Built different

Modern C23 from the ground up. No legacy code, no compromise.

Modern C23

_Generic type dispatch, constexpr, nullptr, and _BitInt(N) for 4-bit quantization. Type-safe throughout.

Dual Backend

CPU with Accelerate, AVX2, and NEON. Metal with simdgroup matmul on Apple Silicon. One API, one enum switch.

Bytecode Dispatch

Graphs compile once to bytecode programs, then execute many times. No per-token interpretation overhead. Kernel fusion baked in.

GGUF Native

Load Llama, Qwen2, Qwen3, Phi-3, and Gemma directly from GGUF files. Tokenizers included — BPE, WordPiece, Unigram.

Paged Attention

KV cache pooling with block allocation, watermark management, and swap support. Continuous batching for concurrent requests.

12 Quant Formats

Q4_0 through Q8_K block quantization on both CPU and Metal. Quantize and dequantize as first-class operations.

Run popular models

Download and run GGUF models with a single command.

Terminal
$ marmot-lm pull bartowski/Qwen3-0.6B-GGUF --quantization Q4_K_M
Downloading Qwen3-0.6B-Q4_K_M.gguf... done (420 MB)
$ marmot-lm run Qwen3-0.6B -p "What is the capital of France?"
Backend: Metal
The capital of France is Paris.
Architecture  Status   Models
Llama         Working  TinyLlama, Llama 2, Llama 3
Qwen2         Working  Qwen 2 family
Qwen3         Working  Qwen 3 family
Phi-3         Working  Phi-3 Mini, Phi-3 Small
Gemma         Partial  Gemma 2B

Simple C API

Initialize a backend, load a model, run inference. That's it.

llama_generate.c
// 1. Initialize backend
marmot_context_t *ctx = marmot_init(MARMOT_BACKEND_METAL);

// 2. Load model + build graph from GGUF
marmot_gguf_model_t *model = nullptr;
marmot_gguf_model_load("model.gguf", MARMOT_BACKEND_METAL, &model);

marmot_graph_t *graph = nullptr;
marmot_graph_from_gguf("model.gguf", MARMOT_BACKEND_METAL, &graph);

// 3. Load tokenizer from the same file
//    (tok_opts: tokenizer options, declared in the full example)
marmot_tokenizer_t *tokenizer = nullptr;
marmot_tokenizer_create_from_gguf_file("model.gguf", &tok_opts, &tokenizer);

// 4. Tokenize -> embed -> execute graph -> argmax -> decode
//    Full loop in examples/llama_generate.c

// 5. Cleanup
marmot_graph_destroy(graph);
marmot_gguf_model_destroy(model);
marmot_tokenizer_destroy(tokenizer);
marmot_destroy(ctx);

View full example →

How it works

1. Load: GGUF model weights & graph topology
2. Compile: graph → bytecode with fusion passes
3. Select: match ops to backend kernels
4. Execute: CPU (SIMD) or Metal (GPU)

Start building with Marmot

MIT licensed. Runs on macOS and Linux. Apple Silicon optimized.