High-performance tensor computation with dual-backend execution on CPU and Metal. Runs GGUF models out of the box.
Modern C23 from the ground up. No legacy code, no compromise:

- `_Generic` type dispatch
- `constexpr`
- `nullptr`
- `_BitInt(N)` for 4-bit quantization

Type-safe throughout.
CPU with Accelerate, AVX2, and NEON. Metal with simdgroup matmul on Apple Silicon. One API, one enum switch.
Graphs compile once to bytecode programs, then execute many times. No per-token interpretation overhead. Kernel fusion baked in.
Load Llama, Qwen2, Qwen3, Phi-3, and Gemma directly from GGUF files. Tokenizers included — BPE, WordPiece, Unigram.
KV cache pooling with block allocation, watermark management, and swap support. Continuous batching for concurrent requests.
Q4_0 through Q8_K block quantization on both CPU and Metal. Quantize and dequantize as first-class operations.
Download and run GGUF models with a single command.
| Architecture | Status | Models |
|---|---|---|
| Llama | Working | TinyLlama, Llama 2, Llama 3 |
| Qwen2 | Working | Qwen 2 family |
| Qwen3 | Working | Qwen 3 family |
| Phi-3 | Working | Phi-3 Mini, Phi-3 Small |
| Gemma | Partial | Gemma 2B |
Initialize a backend, load a model, run inference. That's it.
```c
// 1. Initialize a backend
marmot_context_t *ctx = marmot_init(MARMOT_BACKEND_METAL);

// 2. Load the model and build its graph from GGUF
marmot_gguf_model_t *model = nullptr;
marmot_gguf_model_load("model.gguf", MARMOT_BACKEND_METAL, &model);

marmot_graph_t *graph = nullptr;
marmot_graph_from_gguf("model.gguf", MARMOT_BACKEND_METAL, &graph);

// 3. Load the tokenizer from the same file
// (tok_opts is a tokenizer options struct initialized earlier)
marmot_tokenizer_t *tokenizer = nullptr;
marmot_tokenizer_create_from_gguf_file("model.gguf", &tok_opts, &tokenizer);

// 4. Tokenize -> embed -> execute graph -> argmax -> decode
// Full loop in examples/llama_generate.c

// 5. Clean up
marmot_graph_destroy(graph);
marmot_gguf_model_destroy(model);
marmot_tokenizer_destroy(tokenizer);
marmot_destroy(ctx);
```
MIT licensed. Runs on macOS and Linux. Apple Silicon optimized.