High-performance tensor computation with dual-backend execution on CPU and Metal. Runs GGUF models out of the box.
Modern C23 from the ground up. No legacy code, no compromise:

- `_Generic` type dispatch
- `constexpr`
- `nullptr`
- `_BitInt(N)` for 4-bit quantization

Type-safe throughout.
CPU with Accelerate, AVX2, and NEON. Metal with simdgroup matmul on Apple Silicon. One API, one enum switch.
Graphs compile once to bytecode programs, then execute many times. No per-token interpretation overhead. Kernel fusion baked in.
Load Llama, Qwen2, Qwen3, Phi-3, and Gemma directly from GGUF files. Tokenizers included — BPE, WordPiece, Unigram.
KV cache pooling with block allocation, watermark management, and swap support. Continuous batching for concurrent requests.
Q4_0 through Q8_K block quantization on both CPU and Metal. Quantize and dequantize as first-class operations.
Download and run GGUF models with a single command.
| Architecture | Status | Models |
|---|---|---|
| Llama | Working | TinyLlama, Llama 2, Llama 3 |
| Qwen2 | Working | Qwen 2 family |
| Qwen3 | Working | Qwen 3 family |
| Phi-3 | Working | Phi-3 Mini, Phi-3 Small |
| Gemma | Partial | Gemma 2B |
Initialize a backend, load a model, run inference. That's it.
```c
// 1. Initialize a backend
marmot_context_t *ctx = marmot_init(MARMOT_BACKEND_METAL);

// 2. Load the model and build its graph from GGUF
marmot_gguf_model_t *model = nullptr;
marmot_gguf_model_load("model.gguf", MARMOT_BACKEND_METAL, &model);

marmot_graph_t *graph = nullptr;
marmot_graph_from_gguf("model.gguf", MARMOT_BACKEND_METAL, &graph);

// 3. Load the tokenizer from the same file
// (tok_opts is a tokenizer options struct initialized earlier)
marmot_tokenizer_t *tokenizer = nullptr;
marmot_tokenizer_create_from_gguf_file("model.gguf", &tok_opts, &tokenizer);

// 4. Tokenize -> embed -> execute graph -> argmax -> decode
// Full loop in examples/llama_generate.c

// 5. Clean up
marmot_graph_destroy(graph);
marmot_gguf_model_destroy(model);
marmot_tokenizer_destroy(tokenizer);
marmot_destroy(ctx);
```
MIT licensed. Runs on macOS and Linux. Apple Silicon optimized.