v0.1.0 — Open Source · MIT

Run LLMs natively
on Apple Silicon

MLX-native inference. 47–125% faster than GGUF translation layers. OpenAI-compatible API. 168+ models. Zero config.

$ uv tool install ppmlx
View on GitHub

Five commands, zero config

Every workflow fits in one command. See the real CLI in action.

$ ppmlx launch
TUI launcher: pick an action and a model, launch in one step

Numbers don't lie

MacBook Pro M4 Pro, 48 GB. Same models, same prompts, 3 runs averaged.

GLM-4.7-Flash · 58B · 4-bit
Throughput in tok/s, higher is better:

Workload   ppmlx (MLX native)   Ollama (GGUF)    Speedup
Simple     61.6 ±0.2            41.8 ±0.3        +47%
Complex    56.1 ±0.3            38.7 ±0.5        +45%
Agentic    46.7 ±0.8            20.8 ±3.8        +125%
Time To First Token (TTFT), ppmlx vs Ollama:
Simple: 395 ms vs 358 ms
Complex: 495 ms vs 412 ms
Agentic: 465 ms vs 377 ms
TTFT is comparable: ppmlx trades roughly 80 ms of prefill for 47–125% higher throughput.

Drop-in replacement

OpenAI API, Anthropic Messages API, Responses API. If it speaks HTTP, it works.

OpenAI Chat API

Streaming, tools, vision. Drop-in for any SDK — Python, Node, Go, Rust.
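A minimal stdlib sketch of talking to the server, assuming it listens on localhost:8080 (the port and the model alias are placeholders, not ppmlx defaults; check the URL ppmlx prints at startup):

```python
import json
import urllib.request

# Assumed local endpoint -- replace with the address ppmlx actually prints.
BASE_URL = "http://localhost:8080/v1"

def build_chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Standard Chat Completions body -- any OpenAI SDK sends this same shape."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(model: str, prompt: str) -> dict:
    """POST the payload to the local server (requires ppmlx to be running)."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Pointing an official SDK at the same base URL works identically; only the base URL and (usually ignored) API key change.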

Anthropic Messages API

Claude Code runs on your local GPU. One command to launch.
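The Messages API speaks a slightly different dialect than Chat Completions. A sketch of the request body (the `/v1/messages` path on a local server is an assumption; consult the ppmlx docs for the launch command that wires up Claude Code):

```python
def build_messages_payload(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Anthropic Messages body: max_tokens is required, and requests go to
    /v1/messages rather than /v1/chat/completions."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
```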

Tool Calling

Function calling in XML and JSON. Powers coding agents like Codex.
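Tools are declared as JSON Schema. A hedged sketch in the OpenAI-style wire format (the weather tool and the model alias are made-up illustrations, not part of ppmlx):

```python
# A function definition in the OpenAI "tools" schema.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Attach the tool list to an ordinary chat request; the model replies
# with a tool_call instead of text when it decides to use the function.
payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Weather in Cupertino?"}],
    "tools": [get_weather],
}
```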

Vision + Embeddings

Images via mlx-vlm. Vectors for RAG. Same server, same API.
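Embeddings go through the standard OpenAI endpoint shape; a one-function sketch (the model alias is a placeholder):

```python
def build_embeddings_payload(model: str, texts: list[str]) -> dict:
    """Body for POST /v1/embeddings; `input` accepts one string or a list."""
    return {"model": model, "input": texts}

payload = build_embeddings_payload("embed-model", ["chunk one", "chunk two"])
```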

168+ Models

Llama, Qwen, Mistral, Phi, Gemma, DeepSeek. Curated registry with aliases.

Auto Memory & Logging

LRU model cache, lazy loading. Every request logged to SQLite.
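The caching idea can be sketched in a few lines of plain Python (illustrative only, not ppmlx's actual implementation; the capacity and loader callable are invented for the example):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` loaded models; evict the least-recently-used."""

    def __init__(self, capacity: int, loader):
        self.capacity = capacity
        self.loader = loader            # callable: model name -> loaded model
        self._cache = OrderedDict()

    def get(self, name: str):
        if name in self._cache:
            self._cache.move_to_end(name)        # mark as recently used
        else:
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)  # evict least-recently-used
            self._cache[name] = self.loader(name)  # lazy load on first use
        return self._cache[name]
```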

Works with Claude Code · Codex · Open WebUI · LangChain · LlamaIndex · Any OpenAI SDK
Coming soon: Model Garden · ppmlx bench · MCP Server · Speculative Decoding — follow progress →

Your Mac is faster
than you think

Stop paying per token for local tasks.

$ uv tool install ppmlx
GitHub