v0.2.0 — Open Source · MIT

Run LLMs natively on Apple Silicon

MLX-native inference. 43–112% faster than GGUF translation layers. OpenAI-compatible API. 168+ models. Zero config.

$ uv tool install ppmlx
View on GitHub

Five commands, zero config

Every workflow fits in one command. See the real CLI in action.

One command to spin up a coding agent on your local GPU. No API keys, no cloud costs, no latency.

Pick from 5 agent launchers: Claude Code, Codex, Opencode, Pi — or start a plain chat.

Model picker built in. Select, launch, code. Under 10 seconds from cold start.

Drop-in replacement for any OpenAI-compatible tool. Just point to localhost:6767.

Supports Chat Completions, Responses API, Anthropic Messages — one server, every protocol.
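Because the server speaks the standard OpenAI wire format, any HTTP client can talk to it. A minimal sketch of a Chat Completions request using only the Python standard library (the model alias "llama-3.2-3b" is hypothetical; substitute any installed model):

```python
import json
import urllib.request

# Build a Chat Completions request for the local ppmlx server.
payload = {
    "model": "llama-3.2-3b",
    "messages": [{"role": "user", "content": "Say hello."}],
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:6767/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```

Existing OpenAI SDKs work the same way: change the base URL to http://localhost:6767/v1 and leave the rest of your code untouched.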

Hot-swap models without restarting. LRU cache keeps your most-used models in memory.
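The hot-swap behaviour is easiest to picture as a small LRU cache keyed by model name; a minimal Python sketch of the idea (the capacity and loader here are illustrative, not ppmlx internals):

```python
from collections import OrderedDict

# Sketch of an LRU model cache: keep at most `capacity` loaded models,
# evicting the least recently used one when a new model is requested.
class ModelCache:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.models = OrderedDict()

    def get(self, name, loader):
        if name in self.models:
            self.models.move_to_end(name)        # mark as most recently used
        else:
            if len(self.models) >= self.capacity:
                self.models.popitem(last=False)  # evict least recently used
            self.models[name] = loader(name)     # lazy-load on first request
        return self.models[name]

cache = ModelCache(capacity=2)
cache.get("llama", lambda n: f"<{n} weights>")
cache.get("qwen", lambda n: f"<{n} weights>")
cache.get("mistral", lambda n: f"<{n} weights>")  # evicts "llama"
print(list(cache.models))  # → ['qwen', 'mistral']
```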

168+ models from the curated MLX registry. Human-friendly aliases, Apple Silicon optimised weights.

Multi-select download — queue up several models and grab them in one go.

Models are stored locally in ~/.ppmlx/models. No Docker, no containers, no overhead.

Auto-downloads missing models on first use. Just type the name and start chatting.

Streaming REPL with token stats, timing, and slash commands built in.

Switch models mid-session with /model. No need to restart anything.

Bring any HuggingFace model. Convert to MLX format and quantize to 4-bit in one step.

Models come out ~69% smaller with minimal quality loss, so even large models fit in unified memory.

Run your quantized model immediately — no extra setup, no config files.
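The size reduction follows directly from the bit widths. A back-of-the-envelope sketch, assuming 4-bit group quantization of 16-bit weights with a per-group fp16 scale and bias (group size 64 is an assumed value, not a ppmlx setting):

```python
# Back-of-the-envelope: 4-bit group-quantized weights vs fp16.
GROUP_SIZE = 64
bits_fp16 = 16
bits_q4 = 4 + (16 + 16) / GROUP_SIZE   # weight bits + amortized scale/bias

params = 7e9  # e.g. a 7B-parameter model
size_fp16_gb = params * bits_fp16 / 8 / 1e9
size_q4_gb = params * bits_q4 / 8 / 1e9
savings = 1 - size_q4_gb / size_fp16_gb
print(f"{size_fp16_gb:.1f} GB -> {size_q4_gb:.1f} GB ({savings:.0%} smaller)")
# → 14.0 GB -> 3.9 GB (72% smaller)
```

Real-world savings land a little lower (the quoted ~69%) once embeddings, metadata, and any layers kept at higher precision are counted.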

$ ppmlx launch
TUI launcher — pick action & model, launch in one step

Numbers don't lie

MacBook Pro M4 Pro, 48 GB. Same prompts, 3 runs averaged. All models 4-bit quantized.

[Chart: Time To First Token — ppmlx (MLX native) vs Ollama (GGUF) · MacBook Pro M4 Pro · 48 GB · 3 runs averaged · ± std dev · Reproduce →]

Drop-in replacement

OpenAI API, Anthropic Messages API, Responses API. If it speaks HTTP, it works.

OpenAI Chat API

Streaming, tools, vision. Drop-in for any SDK — Python, Node, Go, Rust.

Anthropic Messages API

Claude Code runs on your local GPU. One command to launch.

Tool Calling

Function calling in XML and JSON. Powers coding agents like Codex.

Vision + Embeddings

Images via mlx-vlm. Vectors for RAG. Same server, same API.
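Embeddings go through the standard OpenAI-style endpoint on the same server; a minimal sketch (the model alias "bge-small" is hypothetical):

```python
import json
import urllib.request

# Build an embeddings request for RAG-style retrieval.
payload = {
    "model": "bge-small",
    "input": ["What is MLX?", "Apple Silicon unified memory"],
}
req = urllib.request.Request(
    "http://localhost:6767/v1/embeddings",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With the server running, each item in the response carries one vector:
# with urllib.request.urlopen(req) as r:
#     vectors = [d["embedding"] for d in json.load(r)["data"]]
```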

168+ Models

Llama, Qwen, Mistral, Phi, Gemma, DeepSeek. Curated registry with aliases.

Auto Memory & Logging

LRU model cache, lazy loading. Every request logged to SQLite.

Works with Claude Code · Codex · Open WebUI · LangChain · LlamaIndex · Any OpenAI SDK
Coming soon: Model Garden · ppmlx bench · MCP Server · Speculative Decoding — follow progress →

Your Mac is faster than you think

Stop paying per token for local tasks.

$ uv tool install ppmlx
GitHub