v0.2.0 — Open Source · MIT

Run LLMs natively on Apple Silicon

MLX-native inference. 43–112% faster than GGUF translation layers. OpenAI-compatible API. 168+ models. Zero config.

$ uv tool install ppmlx
View on GitHub

Five commands, zero config

Every workflow fits in one command. See the real CLI in action.

One command to spin up a coding agent on your local GPU. No API keys, no cloud costs, no latency.

Pick from 5 agent launchers: Claude Code, Codex, Opencode, Pi — or start a plain chat.

Model picker built in. Select, launch, code. Under 10 seconds from cold start.

Drop-in replacement for any OpenAI-compatible tool. Just point to localhost:6767.

Supports Chat Completions, Responses API, Anthropic Messages — one server, every protocol.
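Because the server speaks the standard OpenAI wire format, any HTTP client can talk to it. A minimal sketch of a Chat Completions request using only the Python standard library (the model alias "llama-3.2-3b" is hypothetical; substitute any installed model):

```python
import json
import urllib.request

# Build a Chat Completions request for the local ppmlx server.
payload = {
    "model": "llama-3.2-3b",
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:6767/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Existing OpenAI SDKs work the same way: change the base URL to http://localhost:6767/v1 and leave the rest of your code untouched.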

Hot-swap models without restarting. LRU cache keeps your most-used models in memory.
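The hot-swap behaviour is easiest to picture as a small LRU cache keyed by model name; a minimal Python sketch of the idea (the capacity and loader here are illustrative, not ppmlx internals):

```python
from collections import OrderedDict

# Sketch of an LRU model cache: keep at most `capacity` loaded models,
# evicting the least recently used one when a new model is requested.
class ModelCache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.models = OrderedDict()

    def get(self, name, loader):
        if name in self.models:
            self.models.move_to_end(name)        # mark as most recently used
        else:
            if len(self.models) >= self.capacity:
                self.models.popitem(last=False)  # evict least recently used
            self.models[name] = loader(name)     # lazy-load on first request
        return self.models[name]

cache = ModelCache(capacity=2)
cache.get("llama", lambda n: f"<{n} weights>")
cache.get("qwen", lambda n: f"<{n} weights>")
cache.get("mistral", lambda n: f"<{n} weights>")  # evicts "llama"
print(list(cache.models))  # → ['qwen', 'mistral']
```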

168+ models from the curated MLX registry. Human-friendly aliases, Apple Silicon optimised weights.

Multi-select download — queue up several models and grab them in one go.

Models are stored locally in ~/.ppmlx/models. No Docker, no containers, no overhead.

Auto-downloads missing models on first use. Just type the name and start chatting.

Streaming REPL with token stats, timing, and slash commands built in.

Switch models mid-session with /model. No need to restart anything.

Bring any HuggingFace model. Convert to MLX format and quantize to 4-bit in one step.

Models come out ~69% smaller with minimal quality loss, so even large models fit in unified memory.

Run your quantized model immediately — no extra setup, no config files.
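The size reduction follows directly from the bit widths. A back-of-the-envelope sketch, assuming 4-bit group quantization of 16-bit weights with a per-group fp16 scale and bias (group size 64 is an assumed value, not a ppmlx setting):

```python
# Back-of-the-envelope: 4-bit group-quantized weights vs fp16.
GROUP_SIZE = 64
bits_fp16 = 16
bits_q4 = 4 + (16 + 16) / GROUP_SIZE   # weight bits + amortized scale/bias

params = 7e9  # e.g. a 7B-parameter model
size_fp16_gb = params * bits_fp16 / 8 / 1e9
size_q4_gb = params * bits_q4 / 8 / 1e9
savings = 1 - size_q4_gb / size_fp16_gb
print(f"{size_fp16_gb:.1f} GB -> {size_q4_gb:.1f} GB ({savings:.0%} smaller)")
# → 14.0 GB -> 3.9 GB (72% smaller)
```

Real-world savings land a little lower (the quoted ~69%) once embeddings, metadata, and any layers kept at higher precision are counted.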

$ ppmlx launch
TUI launcher — pick action & model, launch in one step

Numbers don't lie

MacBook Pro M4 Pro, 48 GB. Same prompts, 3 runs averaged. All models 4-bit quantized.

[Chart: Time To First Token — ppmlx (MLX native) vs Ollama (GGUF) · MacBook Pro M4 Pro · 48 GB · 3 runs averaged · ± std dev · Reproduce →]

Drop-in replacement

OpenAI API, Anthropic Messages API, Responses API. If it speaks HTTP, it works.

OpenAI Chat API

Streaming, tools, vision. Drop-in for any SDK — Python, Node, Go, Rust.

Anthropic Messages API

Claude Code runs on your local GPU. One command to launch.

Tool Calling

Function calling in XML and JSON. Powers coding agents like Codex.

Vision + Embeddings

Images via mlx-vlm. Vectors for RAG. Same server, same API.
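Embeddings go through the standard OpenAI-style endpoint on the same server; a minimal sketch (the model alias "bge-small" is hypothetical):

```python
import json
import urllib.request

# Build an embeddings request for RAG-style retrieval.
payload = {
    "model": "bge-small",
    "input": ["What is MLX?", "Apple Silicon unified memory"],
}
req = urllib.request.Request(
    "http://localhost:6767/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, each item in the response carries one vector:
# with urllib.request.urlopen(req) as r:
#     vectors = [d["embedding"] for d in json.load(r)["data"]]
```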

168+ Models

Llama, Qwen, Mistral, Phi, Gemma, DeepSeek. Curated registry with aliases.

Auto Memory & Logging

LRU model cache, lazy loading. Every request logged to SQLite.

Works with Claude Code · Codex · Open WebUI · LangChain · LlamaIndex · Any OpenAI SDK
Coming soon: Model Garden · ppmlx bench · MCP Server · Speculative Decoding — follow progress →

Your Mac is faster than you think

Stop paying per token for local tasks.

$ uv tool install ppmlx
GitHub