v0.1.0 — Open Source · MIT

Run LLMs natively
on Apple Silicon

MLX-native inference. 47–125% faster than GGUF translation layers. OpenAI-compatible API. 168+ models. Zero config.

$ uv tool install ppmlx
View on GitHub

Five commands, zero config

Every workflow fits in one command. See the real CLI in action.

$ ppmlx launch
TUI launcher: pick an action and a model, launch in one step

Numbers don't lie

MacBook Pro M4 Pro, 48 GB. Same models, same prompts, 3 runs averaged.

GLM-4.7-Flash · 58B · 4-bit
Throughput in tok/s, higher is better:

Workload   ppmlx (MLX native)   Ollama (GGUF)    Speedup
Simple     61.6 ±0.2            41.8 ±0.3        +47%
Complex    56.1 ±0.3            38.7 ±0.5        +45%
Agentic    46.7 ±0.8            20.8 ±3.8        +125%
Time To First Token (TTFT), ppmlx vs Ollama:
Simple: 395 ms vs 358 ms
Complex: 495 ms vs 412 ms
Agentic: 465 ms vs 377 ms
TTFT is comparable: ppmlx trades roughly 80 ms of prefill for 47–125% higher throughput.

Drop-in replacement

OpenAI API, Anthropic Messages API, Responses API. If it speaks HTTP, it works.

OpenAI Chat API

Streaming, tools, vision. Drop-in for any SDK — Python, Node, Go, Rust.
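A minimal stdlib sketch of talking to the server, assuming it listens on localhost:8080 (the port and the model alias are placeholders, not ppmlx defaults; check the URL ppmlx prints at startup):

```python
import json
import urllib.request

# Assumed local endpoint -- replace with the address ppmlx actually prints.
BASE_URL = "http://localhost:8080/v1"

def build_chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Standard Chat Completions body -- any OpenAI SDK sends this same shape."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(model: str, prompt: str) -> dict:
    """POST the payload to the local server (requires ppmlx to be running)."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Pointing an official SDK at the same base URL works identically; only the base URL and (usually ignored) API key change.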

Anthropic Messages API

Claude Code runs on your local GPU. One command to launch.
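The Messages API speaks a slightly different dialect than Chat Completions. A sketch of the request body (the `/v1/messages` path on a local server is an assumption; consult the ppmlx docs for the launch command that wires up Claude Code):

```python
def build_messages_payload(model: str, prompt: str, max_tokens: int = 1024) -> dict:
    """Anthropic Messages body: max_tokens is required, and requests go to
    /v1/messages rather than /v1/chat/completions."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
```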

Tool Calling

Function calling in XML and JSON. Powers coding agents like Codex.
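Tools are declared as JSON Schema. A hedged sketch in the OpenAI-style wire format (the weather tool and the model alias are made-up illustrations, not part of ppmlx):

```python
# A function definition in the OpenAI "tools" schema.
get_weather = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

# Attach the tool list to an ordinary chat request; the model replies
# with a tool_call instead of text when it decides to use the function.
payload = {
    "model": "local-model",
    "messages": [{"role": "user", "content": "Weather in Cupertino?"}],
    "tools": [get_weather],
}
```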

Vision + Embeddings

Images via mlx-vlm. Vectors for RAG. Same server, same API.
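Embeddings go through the standard OpenAI endpoint shape; a one-function sketch (the model alias is a placeholder):

```python
def build_embeddings_payload(model: str, texts: list[str]) -> dict:
    """Body for POST /v1/embeddings; `input` accepts one string or a list."""
    return {"model": model, "input": texts}

payload = build_embeddings_payload("embed-model", ["chunk one", "chunk two"])
```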

168+ Models

Llama, Qwen, Mistral, Phi, Gemma, DeepSeek. Curated registry with aliases.

Auto Memory & Logging

LRU model cache, lazy loading. Every request logged to SQLite.
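The caching idea can be sketched in a few lines of plain Python (illustrative only, not ppmlx's actual implementation; the capacity and loader callable are invented for the example):

```python
from collections import OrderedDict

class ModelCache:
    """Keep at most `capacity` loaded models; evict the least-recently-used."""

    def __init__(self, capacity: int, loader):
        self.capacity = capacity
        self.loader = loader            # callable: model name -> loaded model
        self._cache = OrderedDict()

    def get(self, name: str):
        if name in self._cache:
            self._cache.move_to_end(name)        # mark as recently used
        else:
            if len(self._cache) >= self.capacity:
                self._cache.popitem(last=False)  # evict least-recently-used
            self._cache[name] = self.loader(name)  # lazy load on first use
        return self._cache[name]
```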

Works with Claude Code · Codex · Open WebUI · LangChain · LlamaIndex · Any OpenAI SDK
Coming soon: Model Garden · ppmlx bench · MCP Server · Speculative Decoding — follow progress →

Your Mac is faster
than you think

Stop paying per token for local tasks.

$ uv tool install ppmlx
GitHub