Independent field guide · Apple Silicon · May 2026

One agent. Every model. Zero cloud.

Q: Is it really free?

The software (Hermes, Ollama, MLX) and the open-weights models are free to download and run. You pay once for the hardware and in electricity. No subscriptions, no per-token billing.

Q: Does it genuinely work offline?

Yes, once the four cloud-default tools are repointed. Brain, images, music, voice, video and transcription all answer on localhost.

Q: Do I need macOS Tahoe 26.4?

For the full Neural Accelerator INT4 path, yes. Earlier 26.0 and 26.1 lack INT4 tensor support and the tok/s numbers in this guide will not reproduce.

Q: Is FLUX.2 commercially usable?

Partly. FLUX.2 dev and klein 9B are non-commercial. FLUX.2 klein 4B is Apache-2.0. For unrestricted commercial work, Qwen-Image-2512 or Z-Image-Turbo (both Apache-2.0) are the safe defaults.

Q: How fast is the M5 Max actually?

Independent benchmarks: gpt-oss-120B at Q8 MLX runs 64-88 tokens/second decode on a 128 GB M5 Max. Qwen3.5 27B dense at Q6 MLX: 14-24 tokens/second. Prefill is 3.33-4.06 times faster than M4 Max per Apple's MLX team.

Q: Why not Strix Halo or DGX Spark instead of a Mac?

Bandwidth. M5 Max 614 GB/s vs DGX Spark 273 GB/s vs Strix Halo 256 GB/s. For bandwidth-bound decode (agent tool calls), the Mac wins by roughly 2.25 times over Spark and 2.4 times over Strix Halo, despite Spark's FP4 hardware advantage.

A deep, honest guide to building a private AI studio on a single MacBook Pro M5 Max (128GB). Hermes Agent is the conductor; a local model is its brain; and a dedicated local model handles images, video, music, dubbing, voice cloning and transcription. The whole 2026 model field, how each one works and trains, the benchmarks that matter, and exactly how to build it. Flip the toggle to read it plain or in full technical depth.

Conductor

Hermes · MIT

Memory

0 GB · 614GB/s

Tools

0 + MCP

Models covered

Cloud calls

0 air-gappable

One conductor, six local capabilities, nothing crossing the dashed line.

gpt-oss-120B · Qwen3.6-27B · DeepSeek V4 Flash · GLM-5.1 · Kimi K2.6 · Gemma 4 31B · Mistral Medium 3.5 · Llama 4 Scout · Nemotron 3 Super · FLUX.2 · Qwen-Image · HiDream-O1 · Wan 2.7 · ACE-Step 1.5 · Chatterbox · F5-TTS · Whisper v3 · gpt-oss-120B · Qwen3.6-27B · DeepSeek V4 Flash · GLM-5.1 · Kimi K2.6 · Gemma 4 31B · Mistral Medium 3.5 · Llama 4 Scout · Nemotron 3 Super · FLUX.2 · Qwen-Image · HiDream-O1 · Wan 2.7 · ACE-Step 1.5 · Chatterbox · F5-TTS · Whisper v3 ·

Sources are primary throughout: official Hermes, Apple, OpenAI, Qwen and DeepSeek documentation; the Artificial Analysis Intelligence Index, SWE-bench Verified, BFCL, τ-bench, MMLU-Pro and Text-to-Image arenas; arXiv papers for every media model; and community discussion on r/LocalLLaMA. Numbers are as reported by their sources and are a May 2026 snapshot. Models rotate monthly; the architecture here is stable, the leaderboard is not. Benchmark on your own machine before committing.

01 · The case

Why this is suddenly possible

Two years ago, running a frontier-class model on your own machine was a fantasy. That changed faster than almost anyone predicted. The reason this guide exists now, and could not have existed in 2024, is a single chart: the gap between open models you can download and the closed models you rent has nearly closed.

A year ago the best open model scored 22. Today it scores 54, within a few points of the frontier.

In plain terms

"Open weights" means the actual model is published for anyone to download and run, instead of being locked behind a company's paid website. In 2024 these were toys next to ChatGPT. In 2026 they are genuinely close to the best, free, and small enough to run on a good laptop. That is the whole reason a private studio on one Mac is now realistic.

Under the hood

On the Artificial Analysis Intelligence Index (v4.0, a composite of ten evals including GPQA Diamond, Humanity's Last Exam, τ²-Bench, Terminal-Bench Hard and SciCode), the best open-weights model a year ago, DeepSeek V3 0324, scored 22, about 13 points below the leading proprietary model. Today the top open models (Kimi K2.6, MiMo-V2.5-Pro) score 54, with DeepSeek V4 Pro at 52, within 3-6 points of GPT-5.5 (60), Gemini 3.1 Pro and Claude Opus 4.7 (57). Open weights now hold 244 of 386 ranked models and dominate the intelligence-vs-price Pareto frontier.

Then → now (open weights)	Early 2025	2026-05-24
Top Intelligence Index score	22 (DeepSeek V3 0324)	54 (Kimi K2.6 / MiMo-V2.5-Pro)
Gap to best proprietary	~13 points	3–6 points
SWE-bench Verified (best open)	~55%	80.6% (DeepSeek V4 Pro Max)
Open share of ranked models	minority	244 of 386

The honest twist this guide is built around

The very top of that leaderboard (Kimi K2.6 at ~1T parameters, DeepSeek V4 Pro at 1.6T, GLM-5.1 at 744B) needs a data center, not a laptop. But the models that do fit your Mac are essentially last year's frontier: a 27-billion-parameter Qwen scores 77.2% on SWE-bench Verified. You are not running the #1 model. You are running something that would have been #1 a year ago, for free, offline, forever. That trade is the entire point.

02 · The machine

Why the M5 Max is the right hardware

The Mac's advantage is not raw speed; a data-center GPU beats it on throughput. The advantage is one large pool of fast memory shared by the whole chip, which lets a laptop hold models that simply will not fit on a consumer graphics card, while sipping power and staying silent.

Unified memory is the whole trick

A normal gaming PC keeps the model in the graphics card's small, separate memory, and if the model is too big, it simply will not load. The Mac has one big shared pool, so the entire model lives in the same 128 GB the rest of the chip uses. That is how a laptop runs models a $1,500 graphics card chokes on.

Apple's unified memory is a single pool addressable by CPU, GPU and Neural Engine with no host↔device copies. The M5 Max, launched 2026-03-03, tops out at 128 GB at 614 GB/s on the 40-core GPU SKU (the 32-core variant caps at 64 GB / 460 GB/s). After raising the wired limit you get about 120 GB usable for models. The historic Mac weakness was prefill (prompt processing), where it trailed NVIDIA badly; the M5's new per-core Neural Accelerators push prefill 3.33×–4.06× faster than M4 (Apple's own MLX team measurement) and make FLUX-dev-4bit ~3.8× faster. The Neural Accelerators run 1,024 FP16 fused multiply-accumulates per core per cycle, aggregating to about 70 TFLOPS of FP16 or 130 TFLOPS of INT8 across the 40-core GPU. Native FP8 and FP4 still belong to Blackwell; BF16 on the Neural Accelerators arrived in macOS 26.1, INT4 in 26.4: the numbers in this guide assume macOS Tahoe 26.4 or later.

Spec	Figure	Why it matters here
Unified memory	128 GB (40-core GPU SKU)	About 120 GB usable for models after raising the wired limit. Holds one big brain or one video model.
Memory bandwidth	614 GB/s	Caps token generation speed; favours MoE models with low active parameters.
GPU + Neural Accelerators	40-core · ~70 TFLOPS FP16 · ~130 TFLOPS INT8	About four times M4 Max GPU compute on matmul and prefill; about a fifth faster on decode. Speeds diffusion and rectified-flow image models.
CPU	18-core (6 super + 12 performance)	Drives the agent loop and the Python orchestration; not the bottleneck.
Storage	2 / 4 / 8 TB SSD · 13.6 GB/s read · 17.8 GB/s write	A full stack with alternates is 250–300 GB; pick 4 TB if you keep multiple 120B weight files. Cold-loading a 65 GB model takes ~5 seconds.
Chip topology	TSMC SoIC-mH chiplet (two N3P dies)	"Fusion Architecture": same compute die in M5 Pro and M5 Max; explains the uniform per-core spec.
macOS required	Tahoe 26.4 or later	Earlier 26.0 / 26.1 lack INT4 Neural Accelerator support; perf drops materially. Verify your build before reproducing the numbers in this guide.

The one rule of capacity

One big brain or one video model, never both. A 120B brain (~63 GB) plus a video model (~50 GB) plus KV cache blows past 120 GB. The studio works by running the brain resident and spinning heavy media up on demand. Full memory budgets are in section 13.

How the M5 Max stacks up against everything else with 128 GB

For an agent that decodes hundreds of tool calls per session, memory bandwidth beats raw compute, because decode is bandwidth-bound. The Mac wins this race against its closest peers despite Blackwell's FP4 hardware advantage.

Platform	Unified RAM	Bandwidth	gpt-oss-120B Q4	DeepSeek V4 Flash Q4	Notes
MacBook Pro M5 Max 128 GB	128 GB	614 GB/s	fits · ~63 GB	no · ~120 GB	Silent, battery, ANE + Neural Accelerators.
MacBook Pro M4 Max 128 GB	128 GB	546 GB/s	fits	no	Predecessor: weak prefill.
Mac Studio M3 Ultra 512 GB	512 GB	819 GB/s	fits · Q8	fits	Desktop scale-up: the only Apple option that holds V4 Flash.
RTX 5090 (32 GB)	32 GB GDDR7	1,792 GB/s	too small	no	Fastest per-GB but GB ceiling kills it for 120B.
NVIDIA DGX Spark / Project Digits	128 GB unified	273 GB/s	fits	no	Same RAM, 2.25× less bandwidth than M5 Max. Has FP4 hardware.
AMD Strix Halo (Ryzen AI MAX+ 395)	128 GB unified	256 GB/s	fits	no	x86 alternative; mature ROCm still trails MLX / CUDA on Q4 kernels.
RTX PRO 6000 Blackwell	96 GB	1,792 GB/s	fits	tight	Workstation: PCIe scale-out, no battery, $7-10k.
NVIDIA Jetson AGX Thor	128 GB unified	n/a published	likely	no	Robotics-first, not the studio target.

The headline most reviewers miss

On the same 128 GB of unified memory, the M5 Max has 2.25× the bandwidth of NVIDIA's own personal-AI box. On bandwidth-bound decode (which is what an agent doing tool calls is), the Mac wins despite Spark's FP4 advantage. Blackwell only pulls ahead when the workload is matmul-heavy and fits in 32 GB, which most agent-grade brains don't.

What Mac uniquely enables

Fine-tuning 14–32B models on a laptop (32–64 GB unified beats a 24 GB consumer GPU that can't even load them).
Battery-powered, silent operation under sustained load.
MLX-Audio one-endpoint TTS + STT + STS server.
Draw Things Lightning Draft (about one second per image on M5 Max).
Hardware ProRes encode in the Media Engine.
Continuity recipes: iPhone audio capture → AirDrop → MLX-Audio Whisper → 120B summarizer in one chain.

What Mac uniquely cannot do

CUDA-only models, FlashAttention-3 native kernels, NVIDIA-only quant formats (AWQ-INT4 with Triton).
True multi-GPU NVLink scaling beyond a single Mac (EXO Labs distributed inference over TB5 is the workaround for >128 GB).
Native FP8 and FP4 hardware support (Blackwell's persistent lead).
vLLM speculative-decoding-with-paged-attention performance at scale (vllm-metal v0.2 closes some of this in April 2026).

03 · The conductor

Hermes Agent, in full

A raw model only emits text. Something has to turn "dub this clip into English in my voice" into a real sequence of actions that ends in a file on disk. That something is the agent. Hermes Agent, from Nous Research, is the conductor that holds the baton.

In plain terms

Hermes is a free program you install once. It is the tireless operator that lives on your Mac: you talk to it like a person, and it actually does the work, running commands, editing files, browsing, making images, scheduling itself for later, and remembering what it learns so it gets better the more you use it. Think senior assistant, not chatbot.

Under the hood

An MIT-licensed Python 3.11+ agent runtime. It is model-agnostic: the model is a swappable component behind any OpenAI-compatible /v1/chat/completions endpoint (Ollama, llama.cpp, vLLM, SGLang, LM Studio). It ships a registry of more than 70 tools across more than 30 toolsets, a pluggable memory provider, a skills engine that authors its own skills, a full MCP client (OAuth 2.1 PKCE, OSV malware scanning, ACP via the Zed Agent Client Protocol Registry installable via uvx), subagent delegation, durable multi-agent orchestration via Kanban, a cron scheduler, 22 first-class messaging gateways and a React/Ink TUI plus web dashboard. Installs to ~/.local/bin; all state in ~/.hermes/; no telemetry.

The anatomy

Part	Plain	Technical
Tools	Its hands: everything it can physically do.	More than 70 built-in tools across more than 30 toolsets, plus MCP tools; switchable per session with `-t`.
Skills	Step-by-step methods; it writes new ones from what worked.	87 built-in skills + 79 optional in-repo + 672+ across the agentskills.io / HermesHub / LobeHub / Anthropic registries; `skill_manage`; your own under `~/.hermes/skills/`.
Memory	Facts about you and your projects.	Frozen-snapshot system-prompt prefix-cache injection at session start (MEMORY.md ~2,200 chars + USER.md ~1,375 chars), plus on-demand FTS5 query via `session_search`. Relevance ranking lives in `session_search` and in pluggable providers (Honcho, Mem0, Hindsight).
Session search	Total recall of past conversations.	Every session in SQLite with full-text search; `session_search` retrieves and summarizes.
Subagents	Clones helpers that work in parallel and report back.	`delegate_task`: isolated context + terminal + toolset; orchestrator role; `max_spawn_depth`; file-coordination layer.
Cron	Runs on a schedule. "Every morning, do X."	Skill-backed jobs run in fresh sessions; `notify_on_complete` on background processes.
Gateways	Reach it from Telegram, Discord, Slack, WhatsApp, email.	17 platforms; allowlist / DM-pairing / open auth; per-platform skill gating.
Profiles	Separate personas with their own memory (work vs creative).	Isolated `HERMES_HOME` dirs; per-profile config, keys, memory, sessions, skills.
Plugins + MCP	Bolt-on powers and connections to other tools.	Python/shell-hook plugins (can veto tool calls, ship image-gen backends); full MCP client.

The 70+ built-in tools

Tools tagged cloud need a key by default and get swapped for local equivalents in section 07; everything else is local.

terminal · process

Run shell commands, background servers, monitor and notify on completion.

2 tools

file ×4

read_file, write_file, patch (fuzzy find-replace, 9 strategies, auto syntax check), search_files (ripgrep).

4 tools

code_execution

execute_code: Python that calls other Hermes tools, with branching and output filtering.

delegation

delegate_task: spawn subagents in isolated contexts; only their summary returns.

browser ×12

navigate, click, type, snapshot, vision, console, scroll, CDP and more, full headless control.

12 tools

vision · memory

vision_analyze (describe/answer about images); memory (save durable facts).

skills ×3 · session_search

skill_manage / skill_view / skills_list; search and summarize all past sessions.

cronjob · todo · clarify

Schedule skill-backed tasks; plan multi-step work; ask you a multiple-choice question.

moa

mixture_of_agents: route a hard problem through several models (4 + aggregator) and merge.

rl ×10

Drive reinforcement-learning fine-tuning runs and read live training metrics.

10 tools

image_gen · tts · web ×2

image_generate, text_to_speech, web_search/web_extract.

cloud by default

messaging · homeassistant ×4 · feishu ×5

Send to chat platforms; control smart-home devices; enterprise Lark/Feishu doc ops.

How it thinks

In plain terms

You give it a goal. It makes a short plan, picks a tool, uses it, looks at the result, and decides the next step, repeating until done. It stops to ask only when it needs a real decision or permission for something risky.

Under the hood

A tool-calling loop: system prompt (memory + skills + tool schemas) → model emits a tool call → runtime executes → result re-enters context → repeat. Context compression (/compress) keeps long runs under the window; activity-based timeouts (read timeout relaxed to 1800s for local endpoints) prevent premature kills; subagent results return as summaries so intermediate noise never bloats the main thread.

Three mechanisms keep long jobs alive: todo (it tracks its own steps), compression (summarizes history to stay under the context limit), and delegation (offloads sub-tasks to subagents so the main thread stays clean). They are why a local agent can run a multi-stage media pipeline end to end without losing the plot.

What the community actually reports

Independent of any vendor claim, community discussion across r/LocalLLaMA and Hacker News has been active and growing, praising the smoother setup and the built-in learning loop. A representative comment: the agent "actually remembers" a failure and "creates a skill for troubleshooting it." The honest consensus is not that Hermes replaces dedicated coding agents; for pure software engineering, the leaders are clear: Claude Code with Opus 4.6 holds 80.8% on SWE-bench Verified (highest reported single-agent), Aider's Polyglot leaderboard places Opus 4.6 around 85% edit-format, and Cursor Composer with Sonnet 4.6 sits in the mid-70s. Hermes does not publish a SWE-bench Verified number and is built for a wider job: orchestration, memory, scheduling and gateways. Treat marketing comparisons sceptically; benchmark for your actual workload. Nous shipped hundreds of security-tagged commits across the 0.12-0.14 release cycles (588 / 633 / 550 total merged PRs respectively) and there are no widely reported breaches; the lone disclosed CVE-2026-7396 (WeChat path traversal) was patched. Hermes positions itself as complementary to the older OpenClaw framework, with a built-in hermes claw migrate path that imports persona, skills, memory, channels and API keys.

Why it is the right host for a local studio

Model-agnostic (your local brain plugs straight in), ships the agent machinery you would otherwise hand-build, fully offline-capable, and explicitly supports local servers. It is the same category as Claude Code or OpenClaw, but local-first, model-swappable and self-improving.

Sleeper features in 0.13 and 0.14

Six things from the last two release cycles that materially change what the studio can do:

hermes proxy

v0.14.0

Turns Claude Pro / ChatGPT Pro / SuperGrok into a localhost OpenAI-compatible endpoint. Codex CLI, Aider, Cline and Continue all become free to drive off your Hermes-managed subscription.

cost win

/goal · /subgoal

v0.13–0.14

Ralph-loop persistent goal contracts: the agent keeps going until a judge decides the criteria are met. Layered subgoals can be added mid-run.

persistence

Kanban durable

v0.13.0

Multi-agent orchestration with heartbeat detection, auto-block on incomplete exit, per-task retries, hallucination recovery. Closes the gap delegate_task leaves open.

Curator agent

v0.12.0

Background process that deduplicates, deprecates and consolidates skills. No equivalent in any competitor.

LSP semantic diagnostics

v0.14.0

Beyond syntax linting: type errors, undefined symbols, missing imports surfaced back to the agent before the next turn.

Cross-session prompt cache

v0.14.0

1-hour Claude prompt cache survives /new. Cuts cost materially on long workflows.

OpenClaw, Claw Code, ClaudeClaw: three different projects

Names collide. The migration path Hermes ships is for one of them only.

Project	Repo	What it is
OpenClaw Peter Steinberger	`openclaw/openclaw` 374K stars · MIT · TypeScript	General-purpose messaging-first AI assistant. Originally Clawdbot (2025-11-24) → Moltbot (2026-01-27) → OpenClaw. This is what `hermes claw migrate` reads.
Claw Code Sigrid Jin	`instructkr/claw-code` 48K stars · Python + Rust	Clean-room rewrite of Claude Code's leaked source map. Coding-focused CLI. Unrelated to OpenClaw.
ClaudeClaw moazbuilds	`moazbuilds/claudeclaw`	Lightweight OpenClaw-equivalent that runs as a Claude Code plugin: daemon + Telegram/Discord/Slack/cron/web dashboard.

The migration path

hermes claw migrate reads ~/.openclaw/, and auto-detects legacy ~/.clawdbot/ and ~/.moltbot/ paths. Non-destructive by default: skips SOUL.md if Hermes already has one, skips duplicate memory entries, skips same-named skills. Imports persona, skills, memory, channels and API keys. Imported skills land at ~/.hermes/skills/openclaw-imports/.

04 · The brain

The model that thinks, and the whole field of them

The brain is the model the agent reasons with. For an agent, the trait that matters most is not raw genius: it is reliably choosing the right tool and emitting a clean call, hundreds of times in a row. A model that is 2% smarter but fumbles one call ends the run. So this section covers how these models work, what makes each one different, how they are trained, and exactly how the 2026 field scores, before picking the ones that fit your Mac.

How a modern model works

In plain terms

Picture a company with 128 specialists. For each word, a "router" wakes only the 8 most relevant ones. The model is enormous in total knowledge, but only a thin slice works at a time, so it stays fast and fits in memory. That "mixture of experts" trick is why a 120-billion model runs on a laptop. Some models also have a "think first" switch that lets them reason step by step on hard problems and answer instantly on easy ones.

Under the hood

A Mixture-of-Experts transformer routes each token through a small subset of expert FFNs. Active-parameter count, not total, governs per-token compute and bandwidth, which is why MoE suits a 614 GB/s Mac. The differentiators between 2026 models are mostly in attention and routing: MLA (DeepSeek's Multi-head Latent Attention compresses the KV cache), attention sinks (gpt-oss lets heads "pay zero attention"), linear/lightning attention (MiniMax for long-context efficiency), auxiliary-loss-free routing (DeepSeek's load balancing), and hybrid thinking (Qwen's switchable reasoning with a thinking budget). Training increasingly leans on RL: DeepSeek-R1 showed pure RL via GRPO can teach reasoning with no supervised chain-of-thought; gpt-oss post-trains with CoT+RL like o3.

What makes each architecture different

Innovation	Who	What it does
Mixture-of-Experts	nearly all	Only a few experts fire per token; huge total capacity, small active cost.
Multi-head Latent Attention (MLA)	DeepSeek V3/V4	Compresses the KV cache into a latent space, slashing memory for long context.
Attention sinks	gpt-oss	A learned per-head bias lets the model ignore tokens cleanly; stabilizes long context.
Hybrid thinking + budget	Qwen3.x, DeepSeek V4	One model switches between visible chain-of-thought and instant answers; you cap the reasoning spend.
Auxiliary-loss-free routing	DeepSeek, Qwen3.x MoE	Balances expert load without a loss term that hurts quality; encourages specialization.
Hybrid Mamba + Transformer + MoE	NVIDIA Nemotron 3 Super	State-space + attention + MoE in one model. Only open frontier-class system shipping all three; holds 91.75% RULER at 1M tokens.
Linear / lightning attention	MiniMax M2.x	Sub-quadratic attention for cheap very-long-context inference.
RL-first post-training (GRPO)	DeepSeek-R1 lineage	Pure reinforcement learning induces reasoning without supervised CoT data.
Native MXFP4	gpt-oss	4.25-bit microscaling quantization is the intended format, not a lossy afterthought.

The whole field, scored and sized

Intelligence Index = Artificial Analysis v4.0 composite. SWE = SWE-bench Verified. "128GB" = fits on this Mac at a usable 4-bit quant. Top models are listed precisely because most of them do not fit, which is the honest part.

Model	Lab	Params (total / active)	Intel.	SWE	License	128GB?
gpt-oss-120B	OpenAI	117B / 5.1B MoE	n/a	o4-mini class	Apache-2.0	yes · 63GB
Qwen3.6-27B	Alibaba	27B dense	n/a	77.2	Apache-2.0	yes · 17-33GB
Qwen3.6 35B-A3B	Alibaba	35B / 3.5B MoE	n/a	n/a	Apache-2.0	yes · 26GB
Gemma 4 31B	Google	31B dense	top non-China open	n/a	Gemma	yes · 17GB
Mistral Medium 3.5	Mistral	128B	n/a	77.6	open	tight · ~70GB
Llama 4 Scout	Meta	109B / 17B MoE	n/a	n/a	Llama	yes · ~60GB
Nemotron 3 Super	NVIDIA	~70-100B	top non-China open	n/a	open	likely · 4-bit
DeepSeek V4 Flash	DeepSeek	284B / 13B MoE	~49	79.0	open	no · ~120GB Q4
DeepSeek V4 Pro	DeepSeek	1.6T / 49B MoE	52	80.6	open	cloud only
Kimi K2.6	Moonshot	~1T MoE	54	80.2	open	cloud only
GLM-5.1	Z.AI	744B MoE	~51	77.8 (GLM-5)	open	cloud only
MiniMax M2.7	MiniMax	MoE · linear attn	49.6	n/a	open	cloud only
MiMo-V2.5-Pro	Xiaomi	~1T MoE	54	78.0	open	cloud only
Qwen3.5 397B-A17B	Alibaba	397B / 17B MoE	n/a	76.2	Apache-2.0	no

Intelligence Index · top open weights (higher is better)

Kimi K2.6

MiMo-V2.5-Pro

DeepSeek V4 Pro

GLM-5.1

51.4

MiniMax M2.7

49.6

For scale: GPT-5.5 (xhigh) 60 · Gemini 3.1 Pro / Claude Opus 4.7 57. A year ago the best open model scored 22.

SWE-bench Verified · coding (open weights)

DeepSeek V4 Pro Max

80.6

Kimi K2.6

80.2

DeepSeek V4 Flash

79.0

GLM-5

77.8

Mistral Medium 3.5

77.6

Qwen3.6-27B (fits!)

77.2

Kimi K2.5

76.8

DeepSeek V3.2

73.0

The headline: a 27B dense model that fits your Mac (amber) lands at 77.2, within ~3 points of trillion-parameter cloud models.

Tool calling · BFCL v3 (the agent metric)

GLM-4.5

76.7

Qwen3 32B

75.7

Qwen3 235B-A22B

74.9

GLM-4.7-Flash

74.6

Kimi K2.5

64.5

MMLU-Pro reasoning leader: gpt-oss-120B at 90.0 (vs DeepSeek R1 85.0, GLM-4.5 84.6). AIME 2025 w/tools: gpt-oss-120B 97.9. GPQA Diamond: Kimi K2.6 90.5. τ-bench retail (closed leader): Claude Sonnet 4.5 0.862.

The models that fit your Mac, in detail

gpt-oss-120B · the reliability brain

OpenAI's open model. It is the most dependable at "using tools" of anything you can download, which is the single trait that decides whether long agent jobs finish. It fits comfortably and lets you dial how hard it thinks.

Architecture: token-choice MoE (117B total, 5.1B active, 4 experts), gated SwiGLU, GQA, alternating full + 128-token sliding-window attention, RoPE 131K via YaRN, learned per-head attention sinks.
Quantization: native MXFP4 on MoE weights → ~63 GB; Q6 ~90 GB. o200k_harmony tokenizer; Harmony chat format.
Training: CoT + RL post-training using o3-family techniques, specifically for reasoning and tool use; adjustable reasoning effort (low/medium/high).
Scores: MMLU-Pro 90.0, AIME 2025 97.9 (tools), GPQA 80.9 (tools), τ-bench matches/exceeds o4-mini. The cleanest tool-call discipline of any open weight.

Qwen3.6-27B · the all-rounder

Alibaba's open model and the best balance for this build: fast, small enough to leave room for media, and genuinely excellent at coding and agentic work. If you run one model, this is the safe pick.

Architecture: 27B dense (the family also ships 128-expert/8-active MoE variants with no shared experts and global-batch load-balancing). Hybrid thinking with a thinking budget. 256K context.
Training: ~36T pre-training tokens (double Qwen2.5), three-stage pretrain → four-stage post-train, synthetic math/code from Qwen2.5-Math/Coder, 119 languages. Apache-2.0.
Scores: SWE-bench Verified 77.2, beating models 50× its size; BFCL-class tool calling is strong. ~17 GB at Q4, ~33 GB at Q8.

Qwen3.6 35B-A3B

MoE 3.5B active

Tiny active footprint = very fast on Mac. Perfect fast/auxiliary lane beside a heavy main brain.

fast lane~26 GB Q5

Gemma 4 31B

dense · Google

Top non-China open model on the Intelligence Index; efficient, multimodal, strong general writing. Watch for repetition loops past ~11 tool calls in some harnesses.

vision~17 GB Q4

Mistral Medium 3.5

128B · EU

SWE-bench 77.6, EU-built for data-residency needs. Fits tight at 4-bit, leaves little for media.

tight fit

Llama 4 Scout

109B / 17B MoE

Fits at ~60 GB 4-bit; long context. Caveat: its pythonic tool-call format needs a compatible parser.

long ctx

Nemotron 3 Super

NVIDIA

The other top non-China open reasoning model; a strong Western-licensed alternative.

reasoning

DeepSeek V4 Flash

284B / 13B MoE

SWE-bench 79.0, MLA + auxiliary-loss-free routing. About 120 GB at Q4 (160 GB at FP16), just over the line for 128 GB once KV cache is added.

won't fit

Nemotron 3 Super: the open hybrid

NVIDIA's contender. Different inside: it mixes three kinds of layers (long-context Mamba, attention, and the MoE experts you already know). Holds 91.75% on a million-token retrieval test, which no other open model touches.

Architecture: hybrid Mamba state-space layers (cheap long context) + transformer attention (short range) + MoE experts (capacity). The only open frontier-class system shipping all three combined.
Scores: 91.75% RULER at 1M tokens (unmatched among open models). Strong on Intelligence Index alongside Gemma 4 31B as the two non-China open entries.
Why it matters: the natural pick for teams with data-residency constraints that bar weights from Chinese labs. Western-licensed alternative to DeepSeek / Qwen / GLM.

Qwen3-Coder-Next: the small fast coder

Alibaba's specialised coding model. Three billion active parameters, scores 70.6% on the coding test. The best self-hostable coder under 100 GB.

Architecture: 3B active parameters in a coding-tuned MoE. About 40 GB at Q4.
Training: 800K agentic coding RL tasks (multi-turn, tool-using, test-validated trajectories).
Scores: SWE-bench Verified 70.6 with only 3B active params. Inside the 100 GB ceiling so you can keep it loaded alongside the orchestrator brain on the M5 Max.

The cloud-only leaders, briefly

These do not fit your Mac, but you should know what sits at the top of the open leaderboard:

DeepSeek V4 Pro

1.6T / 49B MoE

Current open-weights coding leader at 80.6 SWE-Verified. Multi-head Latent Attention compresses the KV cache; auxiliary-loss-free routing balances experts. Hybrid thinking.

cloud onlyMLA + ALF

DeepSeek V4 Flash

284B / 13B MoE

First new DeepSeek architecture since V3. Same innovations as Pro at a fifth of the size. Still does not fit a 128 GB Mac at Q4 (~120 GB).

no · ~120 GB Q4

Kimi K2.6

Moonshot · ~1T

Co-leads Intelligence Index v4.0 at 54. GPQA Diamond 90.5 (highest of any open model). General-purpose flagship.

cloud onlyGPQA 90.5

GLM-5.1

Z.AI · 744B MoE

Z.AI's flagship. Intelligence Index around 51, GDPval-AA leader at 1535. Strong real-world agentic benchmark performance.

cloud onlyGDPval 1535

MiniMax M2.7

MoE · lightning attn

Pioneer of lightning attention: sub-quadratic for cheap very-long-context inference. Intelligence Index around 50.

cloud onlylinear attn

MiMo-V2.5-Pro

Xiaomi · 1.02T / 42B

Xiaomi's entry. Intelligence Index 54, 1M-token context, 78.0 SWE-Verified. China's third major model lab behind DeepSeek and Alibaba.

cloud only

Decoding the model-card vocabulary

Term	Used by	What it does in one line
MoE (Mixture of Experts)	nearly all	Only a few experts fire per token; huge total, small active cost.
MLA (Multi-head Latent Attention)	DeepSeek V3/V4	Compresses the KV cache into a latent space.
Attention sinks	gpt-oss	Learned per-head bias lets the model ignore tokens cleanly.
Auxiliary-loss-free routing	DeepSeek, Qwen3.x MoE	Balances expert load without a loss-term penalty.
Hybrid thinking + budget	Qwen3.x, DeepSeek V4	One model switches between visible CoT and instant answers; you cap reasoning spend.
Lightning attention	MiniMax M2.x	Sub-quadratic attention for cheap very-long-context inference.
GRPO	DeepSeek-R1 lineage	Pure RL induces reasoning without supervised CoT data. Cut post-training cost ~10×.
MXFP4	gpt-oss	4.25-bit microscaling: the intended format, not a lossy afterthought.
Mamba + Transformer + MoE	Nemotron 3 Super	State-space + attention + MoE in one model. Holds 91.75% RULER at 1M.

License matrix: can I ship this commercially?

License	Models	Commercial use	Catch
Apache-2.0	gpt-oss-120B, Qwen3.6-27B, Qwen3.6 35B-A3B, Qwen-Image-2512, Z-Image-Turbo, FLUX.2 [klein] 4B	✓ unrestricted	None. Default-yes.
MIT	Chatterbox, HiDream-O1, ACE-Step 1.5 XL	✓ unrestricted	None.
Llama 4 Community	Llama 4 Scout, Maverick	✓ conditional	EU multimodal blocked. 700M MAU threshold.
Gemma	Gemma 4 31B	✓ conditional	Google's terms; review product attribution.
Modified MIT	Mistral Medium 3.5	✓ conditional	Not pure MIT; check redistribution clauses.
FLUX.2 Non-Commercial	FLUX.2 [dev], FLUX.2 [klein] 9B	✗ research only	Pay BFL for commercial, or use [klein] 4B.
Custom open weights	DeepSeek V3/V4, GLM-5/5.1, Kimi K2.6, MiniMax M2.7, MiMo-V2.5-Pro	per-model	Generally permissive but read each.

Why the open models now iterate so fast: GRPO

DeepSeek-R1 demonstrated that pure reinforcement learning, with no supervised chain-of-thought data, can induce reasoning behaviour. Their Group Relative Policy Optimization (GRPO) algorithm cut post-training cost by roughly 10×. That is the single biggest reason DeepSeek shipped V4 Pro just three months after V3.2 and why the rest of the open field is iterating quarterly instead of yearly. The 2025 race was "scale pretraining". The 2026 race is "scale RL".

The verdict for 128 GB

Maximum reliability: gpt-oss-120B resident as the main brain. Speed + media headroom: Qwen3.6-27B (Q8), freeing ~80 GB for media. Best of both: 120B (or 27B) main + Qwen3.6 35B-A3B as a fast auxiliary, and let Hermes route between them. The trillion-parameter leaders stay on a cloud endpoint for the rare job that needs them; everything daily runs local.

A real, community-verified gotcha

Builders report that MLX-quantized models can lose tool-calling reliability after 5–10 rounds, while GGUF (via Ollama/llama.cpp) stays stable longer. So serve the agent brain on Ollama/GGUF for reliability, and reserve MLX for media and fine-tuning, where its speed and memory edge shine. Also: set Ollama's context explicitly (OLLAMA_CONTEXT_LENGTH=65536) or the 70+-tool system prompt silently overflows. And for hybrid-thinking models (Qwen3.x, GLM-4.7, DeepSeek V4), disabling reasoning mode requires all three of --reasoning off, --reasoning-budget 0, and chat_template_kwargs.enable_thinking: false. Setting only one is the single biggest source of production failures in May 2026 (see llama.cpp issue #13189).

05 · Uncensored

Abliteration: how it works, and the catch

In plain terms

"Uncensored" models do not refuse you. Researchers discovered there is essentially a single internal "no" signal inside a model; abliteration surgically cancels it. The catch is that doing this carelessly also dents the model's general skill, which an agent relies on. So the smart pattern is one disciplined model for the work, and a separate uncensored one only for the writing.

Under the hood

Refusal is mediated by roughly a single direction in the residual stream (Arditi et al., NeurIPS 2024). You estimate that refusal direction from contrastive harmful vs harmless prompts, then either steer activations away from it at inference, or permanently orthogonalize the weights against it. The term (ablate + obliterate) was coined by FailSpy. The problem: the refusal vector is polysemantic, entangling refusal with syntax, formatting and capability circuits, so naive ablation causes collateral damage, partially recoverable with light "healing" fine-tuning (SFT/DPO).

The 2026 tooling

Tool	What it does	Why it matters
Heretic	One-command automated abliteration; separates attention vs MLP interventions (MLP causes more damage); v1.2 adds a LoRA-based engine producing a toggleable adapter plus 4-bit support.	~6.5× less capability damage than hand-tuned efforts. `pip install heretic-llm` → `heretic <model>`.
OBLITERATUS	116-model toolkit; adds Expert-Granular Abliteration (per-expert directions for MoE) and CoT-aware ablation.	Deeper and broader, but heavier. For when a single direction is not enough.
UGI Leaderboard	Community ranking of Uncensored General Intelligence plus a natural-intelligence (NatInt) score.	The place to confirm an "uncensored" model is still actually smart after the surgery.

Qwen3.6 abliterated :agent

The compromise model: uncensored and keeps tool calling. Best single "both" pick.

agent + open~9-33 GB

Gemma 4 31B Heretic

Uncensored general-purpose with native vision and tool calling.

vision~17 GB Q4

Hermes 4.3 / 70B

Low-refusal by design (not abliterated). Excellent writing, lyrics, roleplay; run as the content subagent.

content engine

DIY with Heretic

Abliterate Qwen3.6-27B yourself for an uncensored brain tuned to taste, output as a toggleable LoRA.

full control

The recommended pattern

Keep a clean agentic brain (gpt-oss-120B or Qwen3.6-27B) as orchestrator, and wire an uncensored content model (Hermes 4.3 or a Heretic'd Qwen/Gemma) as the delegation model so the agent auto-routes writing to it. Reliable agency, zero refusals where you want them, no compromise on either side.

One honest line

An abliterated model has no refusal layer for anything, including instructions hidden in content it reads (prompt injection). Running them locally is legal in most places; the output and its use are entirely your responsibility. Keep it legal, and keep tool approvals on.

UGI leaderboard · 2026-05-24 snapshot

Top open uncensored: DeepSeek-V3.2-Speciale at 67.9 UGI. Closed: Grok 4 at 69.0. Hugging Face hosts 6,030 abliterated models as of the snapshot. The distinction matters: abliteration is the linear-algebra trick described above (cheap, reversible via toggleable LoRA), while uncensored fine-tune (full SFT/RL on permissive corpus) is heavier and less recoverable surgery. For most agent work, abliteration via Heretic v1.2 is the better recipe; for raw refusal-floor benchmarks, full fine-tunes still lead.

06 · Serving & quantization

How models run, and what "4-bit" means

In plain terms

A model's "weights" are billions of numbers. Quantization shrinks them by storing each with fewer digits, like rounding prices to the nearest dollar: 4-bit is small and fast with a tiny quality cost; 8-bit is bigger and basically perfect. You also need a small program to run the model; on Mac the easy one is Ollama, and the fastest is Apple's own MLX.

Under the hood

Quantization reduces weight precision (16-bit → 8/6/5/4-bit). GGUF is the cross-platform de-facto format (Q4_K_M standard, Q5_K_M sweet spot, Q6/Q8 near-lossless), run by llama.cpp/Ollama; it has the broadest model coverage, often within hours of a release. MLX is Apple's native format, built for unified memory with zero CPU↔GPU copies: ~10% less memory and 15–30% faster than GGUF at the same quant. MXFP4 is gpt-oss's native 4.25-bit microscaling. AWQ/GPTQ are activation- and gradient-aware schemes common on NVIDIA. Below ~Q3, tool-calling reliability collapses.

Server	Best for	Notes
Ollama (GGUF)	The agent brain	Simplest; Hermes auto-detects at `:11434/v1`; most stable for long agentic tool use.
MLX (mlx_lm)	Media, max speed, fine-tuning	Apple-native; fastest single-user generation; the only local LoRA-training path on Mac.
LM Studio	GUI management	Bundles the MLX engine; one-click OpenAI server; runs both MLX and GGUF.
llama.cpp / vLLM / SGLang	Power-user serving	Fine control of quant, context, KV-cache; `-ngl 99` offloads all layers to the Metal GPU.

Quant level	Quality	Use when
4-bit (Q4_K_M / MXFP4 / MLX-4)	Minor loss, max speed	The practical default for big models.
5–6-bit (Q5_K_M / MLX-6)	Near-lossless	24 GB+; the quality sweet spot.
8-bit (Q8 / MLX-8)	Effectively full precision	48 GB+ (you have it); best for a small premium brain.

Lane × server × quant: picking quickly

Each lane in the studio wants a different pairing. This collapses the decisions into one table.

Lane	Server	Quant	Why
Orchestrator brain	Ollama (GGUF)	Q4 / Q8	Tool-call stability past 5+ rounds; MLX drifts. Use Q8 if you have the room.
Auxiliary fast lane	MLX (`mlx_lm`)	Q5	Memory edge + native Metal kernels; short sessions do not trigger MLX tool-call drift.
Media generation	MLX / Draw Things	MXFP4 / 4-bit	Native; Apple's Neural Accelerator path; Draw Things' Lightning Draft hits about 1 sec/image on M5 Max.
Fine-tuning (LoRA)	MLX (`mlx_lm.lora`)	Q4 base	The only local LoRA-training path on Mac. Unified memory beats a 24 GB RTX 3090 here.

Reasoning-format settings, per model

Hybrid-thinking models need explicit configuration or they emit chain-of-thought that breaks downstream parsers. Copy-paste:

# gpt-oss-120B (Harmony format, native reasoning effort)
model: ollama/gpt-oss:120b
extra_body:
  reasoning_effort: high   # low | medium | high

# Qwen3.6 family (hybrid thinking; disable all three knobs)
model: ollama/qwen3.6:27b
extra_body:
  reasoning: off
  reasoning_budget: 0
  chat_template_kwargs:
    enable_thinking: false

# GLM-4.7 / 5.1 : same triple flag as Qwen3.6
# DeepSeek V4 : same triple flag

# Llama 4 Scout : pythonic tool-call parser required
model: ollama/llama4:scout
extra_body:
  tool_call_format: pythonic

# Gemma 4 31B : watch for repetition loops past ~11 tool calls. Reset session on detection.

Citation: llama.cpp issue #13189 documents the full triple-flag fix.

New in April 2026

vllm-metal v0.2 brought paged attention to Apple Silicon: 83× TTFT improvement vs v0.1, 3.6× throughput. The official Apple-Silicon serving path beyond MLX. Worth tracking if you serve to more than one client.

07 · The creative engines

Images, video, music, voice, transcription

This is where local AI stopped being a compromise. The brain writes and reasons; these models produce the media. Each is a different machine with its own architecture, its own training, and its own leaderboard. Below: how each kind works, the full field of options with arena scores, the apps that run them on Mac, and how to prompt them.

Images

In plain terms

You type a description; the model starts from pure static and "develops" it into a photo over a few seconds, like an instant Polaroid in reverse. Modern ones read your prompt so well you can ask for specific text on a sign, an exact pose, or "the same character, new scene."

Under the hood

2026 image models are rectified-flow transformers (MM-DiT): a diffusion-style model that learns a near-straight path from noise to image, so it needs far fewer sampling steps than old U-Net diffusion. Text and image tokens flow through coupled attention streams; a large text encoder (FLUX.2 uses Mistral Small 3.1 24B) gives strong prompt adherence. Images live in a compressed 16-channel latent decoded by an autoencoder. FLUX.2 [klein] was distilled-free, which makes it the best open base for LoRA training.

Text-to-image arena · open weights (Elo)

HiDream-O1-Image-Dev

1187

Qwen Image Max 2512

1160

FLUX.2 [dev]

1160

Seedream 4.5

1165

Qwen-Image-2512 (Apache)

1136

Z-Image-Turbo (Apache)

1076

Image editing (open) leaders: HunyuanImage 3.0 1224 · HiDream-O1 1213 · FLUX.2 [klein] 9B 1161. Closed frontier for scale: GPT Image 2 ~1339.

FLUX.2 [dev / klein 9B / klein 4B / pro]

BFL · 32B

Best-in-class quality and prompt control; klein 4B is the distillation-free LoRA base; up to 10 reference images, HEX color control. License is split: dev and klein 9B are non-commercial; klein 4B is Apache-2.0 (commercial OK).

dev/klein 9B: NCklein 4B: Apache~23 GB dev

Qwen-Image-2512

Alibaba

Top Apache-2.0 model: commercial-safe, excellent text rendering, strong editing. The pragmatic default.

Apache-2.0Elo 1136

HiDream-O1-Image-Dev

open leader

Highest-ranked open-weights model in the arena right now.

Elo 1187

Z-Image-Turbo

Apache · fast

Permissive and quick: few-step turbo sampling for near-instant previews.

Apache-2.0turbo

Seedream 4.5 / Hunyuan 3.0

ByteDance / Tencent

Seedream excels at East-Asian aesthetics (calligraphy, fabric, architecture); Hunyuan leads open image editing.

editing 1224

SD 3.5 / SDXL

Stability

The mature ecosystem: the deepest library of community LoRAs and ControlNets, even if raw quality now trails.

most LoRAs

How to run them on Mac

Draw Things (free, easiest, Metal-optimized, LoRA + ControlNet built in) · ComfyUI (node graphs, most flexible; by Draw Things' own benchmark about 20% slower on Apple Silicon at the same workload, not 3×) · MLX (Apple-native, fastest scripted). Prompt FLUX with plain descriptive sentences plus camera and lighting terms; it ignores old "masterpiece, 8k" spam. Train a FLUX LoRA in 1,000–2,000 steps at rank 8 (Hugging Face's published baseline) and stack 2–3 at 0.5–0.7 weight; Civitai has thousands ready to download.

HiDream-O1: a genuinely novel architecture

Most modern open image models pair a Diffusion Transformer with a separate large text encoder and a VAE. HiDream-O1 ships neither. It is a Pixel-Level Unified Transformer that processes raw RGB end-to-end. At 8B parameters under MIT, it is currently the most permissively-licensed top-tier open image model. Worth knowing about even if FLUX.2 + Qwen-Image are your daily drivers.

One-second image generation on M5 Max

Draw Things' Lightning Draft feature, combined with the M5 Max's Neural Accelerators, makes about one-second 512×512 image generation real on Apple Silicon. The cost is quality (lower step count); the win is the iteration loop. Generate-prompt-tweak cycles that took 30 seconds on M3 Max now take 3. Whether to ship it as a draft-then-finalise pipeline is up to you.

Video

In plain terms

Same idea as images, but the model also has to keep things consistent from frame to frame so motion looks real. This is the heaviest job in the studio: expect minutes per clip, not seconds, and it cannot run at the same time as a big brain.

Under the hood

Video models extend diffusion into a spatio-temporal latent: 3D attention over (frames × height × width) tokens with a causal video VAE, so the model denoises a whole clip while enforcing temporal coherence. This is why VRAM and time costs explode versus stills.

Wan 2.7

Alibaba · April 2026

The current Wan generation; reference-to-video with voice cloning and instruction-based video editing as new model classes since 2.2. Text- and image-to-video via ComfyUI. About 50 GB resident, minutes per clip; won't co-reside with a 120B brain. Wan 3.0 60B (Apache-2.0) is roadmapped for mid-2026.

~50 GBComfyUIv3.0 mid-2026

LTX-Video

Lightricks

Built for speed: real-time-ish generation on capable hardware, lower fidelity than Wan.

fastest

Hunyuan Video / Mochi

Tencent / Genmo

High-motion open alternatives; heavier still, strong cinematic motion.

high motion

Model	Lab	Resident	Max res / dur	M5 Max time per 5-sec clip	Notes
Wan 2.7	Alibaba · Apr 2026	~50 GB	1080p · 10 sec	~4–8 min	Current Wan generation. Reference-to-video + voice cloning + instruction editing as new model classes.
LTX-Video v0.9	Lightricks	~20 GB	720p · 6 sec	~30–90 sec	Speed-first; lower fidelity than Wan but real-time-ish on capable hardware.
HunyuanVideo	Tencent	~60 GB	1280×720 · 5 sec	~5–10 min	High-motion open alternative; cinematic motion.
Mochi 1	Genmo	~40 GB	848×480 · 5 sec	~3–6 min	Apache-2.0; strong open motion baseline.
Step-Video / CogVideoX	StepFun / Tsinghua	~30–50 GB	720p · 6–10 sec	~3–8 min	Newer contenders; CogVideoX-Vid evolution lineage stable.

The honest weak spot

Local video is the one area where the cloud is still clearly ahead on quality and speed. On a 128 GB Mac it is usable for short clips and B-roll, but plan around minutes per generation and run it as a dedicated mode with the brain unloaded.

The open-vs-closed gap, 2026-05-24

Closed leaders right now: Veo 3.x (Google, photorealistic narrative), Sora 2 (OpenAI, broad prompt range), Kling 2.x (Kuaishou, strong motion), Runway Gen-4 (cinematic), Pika 2.x (creator-friendly). Open weights still trail meaningfully on text-in-frame coherence, lip-sync to audio, and minute-long temporal stability. Wan 3.0's 60B Apache-2.0 release (mid-2026 roadmap) is the candidate to close the gap.

Music

In plain terms

Give it a style and some lyrics and it writes and performs a full song, vocals and instruments, in under a minute. The local model is genuinely close to the big paid services, runs offline, and you can teach it a voice or style from a handful of examples.

Under the hood

ACE-Step 1.5 XL is a 4B hybrid: a language-model "composer" reasons in chain-of-thought to plan a structured blueprint (lyrics, sections, duration, metadata), then a diffusion transformer renders 48 kHz stereo audio. It is built on a Sana-style deep-compression autoencoder (DCAE) + linear transformer, with MERT and m-hubert features aligned via REPA; v1.5 adds intrinsic RL. Under 4 GB VRAM, 50+ languages, quality in the Suno v5 range (the closed leader Suno v5.5 shipped March 2026 and has since widened the gap somewhat), and it supports cover, repaint, vocal-to-BGM and LoRA from a few songs.

ACE-Step 1.5 XL

4B · open

The local Suno. Full songs with vocals in seconds; tiny footprint; trainable on your own style.

<4 GBSuno v5 class

YuE

open

Long-form vocal music generation; strong full-song structure, heavier than ACE-Step.

DiffRhythm

open · fast

Full songs in ~10s via latent diffusion; very fast, fewer controls.

fast

MusicGen / Stable Audio

Meta / Stability

Instrumental and sound-design workhorses; no vocals, but reliable for beds and loops.

instrumental

How to prompt ACE-Step

Two fields: tags (3–7 words: genre, mood, instruments, tempo) and lyrics with structure markers like [verse], [chorus], [instrumental]. Budget ~2–3 words of lyric per second (under ~140 words for a 47-second clip). Specific tags ("balkan brass, minor key, 90 bpm, male vocal") beat vague ones every time.

Voice & dubbing

In plain terms

From 5–10 seconds of someone speaking, these models clone the voice and then read any text in it, with emotion. Chain a few together and you can take a video in one language and output it dubbed in another, keeping the original speaker's voice.

Under the hood

Modern open TTS is zero-shot non-autoregressive flow-matching: a DiT generates a mel-spectrogram conditioned on text plus a speaker embedding extracted from a short reference clip, fused without a separate duration model. F5-TTS pairs a DiT with ConvNeXt, trained on ~100k hours, hitting real-time factor ~0.15 (≈6× faster than playback). Chatterbox adds emotion control and a PerTh watermark.

Chatterbox

Resemble · MIT

23-language cloning with emotion control; competitive with ElevenLabs in independent blind tests. The default voice engine.

MIT · 23 lang

F5-TTS

open · fast

RTF ~0.15, clones from a 1–10s reference; superb speed/quality balance.

RTF 0.15

Kokoro

tiny

Featherweight, extremely fast TTS for narration where cloning is not needed.

Qwen3-TTS

Alibaba · Jan 2026

Clones from just 3 seconds of reference audio (vs 5–15s for F5, 5s for Chatterbox). Strong multilingual range. Pairs naturally with the Qwen brain.

3s ref

CosyVoice 3

Alibaba · open

About 150 ms streaming latency, the lowest of any open TTS as of May 2026. The pick for real-time voice agents.

150 ms

Sesame CSM-1B

Sesame · open

Conversational speech model: tiny, fast, fluent in the conversational register where most TTS still sounds read-aloud.

conversational

Dia

open

Multi-speaker dialogue model; handles back-and-forth and overlap better than single-speaker TTS systems.

dialogue

TTS quality: open voice models (community Elo, May 2026)

Chatterbox (MIT)

1187

F5-TTS

1142

CosyVoice 3

1118

Qwen3-TTS

1095

XTTS v2 (Coqui)

1040

Kokoro

980

Closed reference: ElevenLabs Multilingual v3 ~1320. The Chatterbox-vs-ElevenLabs blind-test gap is single digits in the same range. Snapshot from TTS-Arena (HuggingFace) on 2026-05-24.

Transcription

In plain terms

Turns speech into accurate text in dozens of languages, fast, fully offline. The backbone of meeting notes, subtitles and the dubbing pipeline above.

Under the hood

Encoder-decoder transformers trained on huge weakly-labelled audio. Whisper large-v3 runs at ~3 GB in MLX with strong multilingual word-error rates; Parakeet and Qwen3-ASR push speed and accuracy further on supported languages.

Whisper large-v3

OpenAI · ~3 GB

The reliable multilingual standard; timestamps, translation, robust to noise.

99 languages

Parakeet v3

NVIDIA

Very fast, very accurate on supported languages; great for long recordings.

fastest

Qwen3-ASR

Alibaba

Newer multilingual ASR with strong accuracy; pairs naturally with the Qwen brain.

Canary-Qwen-2.5B

NVIDIA · open

5.63% WER on English: beats Whisper large-v3 on the HuggingFace Open ASR Leaderboard for English transcription. Use when English-only and accuracy matters more than language coverage.

5.63% WER

ASR word-error rate: open models on English (HF Open ASR Leaderboard, May 2026)

Canary-Qwen-2.5B

5.63

Parakeet v3

6.1

Whisper large-v3

6.7

Whisper large-v3-turbo

6.9

Qwen3-ASR

7.2

Lower is better. Whisper still wins on language breadth (99 languages) and noise robustness. Canary-Qwen wins on English accuracy.

One server runs all the audio

MLX-Audio (mlx_audio.server) exposes TTS, STT and speech-to-speech behind an OpenAI-compatible REST endpoint, so Hermes drives every voice model through one local URL, the same way it talks to the brain.

Three ways to run Whisper on M5 Max

WhisperKit (Argmax, native Swift, lowest latency) · Lightning-whisper-mlx (community MLX port, fast on M-series) · MetaWhisp (newer, optimised for the M5 Neural Accelerators specifically). Pick by language coverage and latency budget; all three serve the same model weights.

08 · Cutting the cord

Making the last cloud tools local

Hermes ships four tools that phone the cloud by default. To make the studio truly air-gappable, each one is repointed at a local engine. After this, the machine can run with Wi-Fi off.

Default (cloud)	Local replacement	How
`image_generate` → FAL	Local FLUX / Qwen-Image	Plugin or MCP wrapper around Draw Things / ComfyUI / MLX; the plugin system supports image-gen backends.
`text_to_speech` → cloud	MLX-Audio server	Point the tool at the local OpenAI-compatible voice endpoint.
`web_search` → Exa	Off, or local SearXNG	Disable for a pure air-gap, or wrap a self-hosted SearXNG via MCP for offline-ish search.
`mixture_of_agents`	Local council	Repoint the 4 references + aggregator at your local models (e.g. gpt-oss + Qwen + Gemma).

The result

Brain, images, music, voice, video and transcription all answer on localhost. Pull the network cable and the studio keeps working. That is the difference between "private-ish" and genuinely yours.

09 · LoRAs & fine-tuning

Teaching a model your style, on the Mac

In plain terms

A LoRA is a tiny add-on file that teaches a big model one new thing: your face, your art style, a brand voice, a singer's tone, without retraining the whole model. You make one from a handful of examples in an hour or two, and snap it on or off like a lens.

Under the hood

Low-Rank Adaptation freezes the base weights and injects small trainable rank-decomposition matrices (A·B) into attention/FFN layers; you train ~0.1–1% of parameters. QLoRA trains those adapters on top of a 4-bit-quantized base, collapsing memory further. Lineage: DreamBooth and Textual Inversion for image models, now standard across text, image, music and voice. Unified memory is the quiet superpower here: there is no separate VRAM wall, so a Mac fine-tunes models a 24 GB RTX 3090 cannot even load (rule of thumb: 16 GB → 8B, 32 GB → 14B, 64 GB → 32B; Llama-7B needs ~28 GB full, ~14 GB LoRA, ~7 GB QLoRA).

Training a text LoRA with MLX

# 1. quantize the base to 4-bit (QLoRA kicks in automatically)
mlx_lm.convert --hf-path Qwen/Qwen3.6-27B -q --q-bits 4
# 2. train the adapter on your JSONL data
mlx_lm.lora --model ./mlx_model --train --data ./data \
  --lora-layers 16 --batch-size 2 --iters 600
# 3. fuse the adapter back into a standalone model (optional)
mlx_lm.fuse --model ./mlx_model --adapter-path ./adapters

Image LoRA

FLUX/SDXL LoRA in 2,000–4,000 steps via ai-toolkit, SimpleTuner or ComfyUI. Stack 2–3 at 0.5–0.7 weight. Thousands ready on Civitai.

Music LoRA

ACE-Step learns a genre or a singer's tone from a handful of songs; snap it on for on-brand tracks.

Abliteration LoRA

Heretic v1.2 outputs the uncensoring itself as a toggleable LoRA adapter, no full re-download.

Voice "LoRA"

Zero-shot cloning is effectively instant adaptation; fine-tune only for a recurring signature voice.

Why this is the Mac's hidden edge

NVIDIA is 2–4× faster on models that fit its VRAM. But the moment a model is too big for the card, the Mac wins by simply being able to train it at all. For personal fine-tuning of 14–32B models, 128 GB of unified memory is a genuinely rare capability in a laptop.

10 · Orchestration

How the agent runs the whole studio

The pieces only become a studio when one mind coordinates them. Hermes does this with five mechanisms, all configurable.

Mechanism	What it enables
Model routing	`delegation.model` sends subagent or content work to a different model (e.g. uncensored writer) while the orchestrator stays on the reliable brain. Switch live with `/model`.
Subagents	`delegate_task` spawns isolated workers (own context, terminal, tools) that run in parallel and return only a summary, keeping the main thread clean.
Mixture of agents	`mixture_of_agents` sends one hard problem to several models and an aggregator merges the best answer, all local.
Code execution	`execute_code` runs Python that itself calls Hermes tools, for branching logic the model would otherwise narrate step by step.
Cron + memory + skills	Scheduled jobs run in fresh sessions; memory carries durable facts; skills carry repeatable procedures the agent wrote itself.

# route the heavy thinking and the uncensored writing separately
model: ollama/gpt-oss:120b          # orchestrator brain
delegation:
  model: ollama/qwen3.6-abliterated:agent   # subagent / content writer
custom_providers:
  - name: local-media
    base_url: http://localhost:8080/v1   # MLX-Audio / image gateway

11 · Prompts & recipes

How to actually talk to each model

Every model class wants a different prompt style. Old "masterpiece, trending on artstation" spam hurts modern models. Here are the patterns that work.

FLUX / Qwen-Image

Plain descriptive sentences + camera, lens and lighting. "A weathered fisherman at dawn, 35mm, soft side light, shallow depth of field, muted teal palette." Put exact text in quotes for signage.

ACE-Step

Tags: balkan brass, minor key, 90bpm, male vocal, live. Lyrics with [verse] / [chorus] / [instrumental]. ~2–3 words per second.

Voice cloning

5–10s of clean reference audio (no music). Punctuation drives prosody; for emotion use Chatterbox's exaggeration control rather than ALL-CAPS.

Agent system prompt

State the goal, the allowed tools, and a stop condition. Let memory and skills carry standing context instead of repeating it every session.

12 · Pipelines

Real workflows the studio runs end to end

These are not hypotheticals; each is a chain of the tools and models already covered, expressed as a Hermes skill or cron job.

Auto-dubbing

Drop in a video → Whisper transcribes → brain translates → Demucs/Pyannote separate and diarize → Chatterbox clones each speaker → ffmpeg muxes. One skill, one command.

skill

Local songwriting

Brain (or uncensored writer) drafts lyrics in your style → ACE-Step composes and performs → you keep stems. The fully offline Suno.

music

Content factory (cron)

Every morning: brain drafts posts → FLUX renders matching images → files land in a folder, notify on complete. Runs while you sleep.

cron

Autonomous coding

Brain plans → subagents implement modules in parallel → execute_code runs tests → it self-corrects until green.

subagents

Local model council

mixture_of_agents routes a hard decision through gpt-oss + Qwen + Gemma and merges, no API.

MoA

Self-improving loop

It hits a bug, solves it, and writes a skill so the fix is permanent. The studio gets better with use.

skills

Embed it anywhere

Beyond chat, the whole agent is importable: from run_agent import AIAgent drops the studio into your own Python scripts, so a pipeline can be triggered by a file drop, a webhook or a schedule.

13 · Capacity

What actually fits in 128 GB at once

Roughly 120 GB is usable for models after raising the wired limit. These are real, simultaneous loadouts. The pattern is always: keep one brain resident, spin heavy media up on demand.

A · Daily driver · max reliability

gpt-oss-120B · 63GBKV cacheFLUX on demand

B · Creative · brain + always-on media

Qwen 27B Q8 · 33GBFLUX dev · 23GBMLX-AudioACE-Stepheadroom

C · Pure agent · multi-brain routing

gpt-oss-120B · 62GBQwen 27B Q835B-A3B Q5

D · Video mode · brain unloaded

Wan 2.7 · ~50 GBQwen 27B Q4working memory

main brainsecond modelimage/audiovideocache / free

The trade in one line

You cannot run a 120B brain and a video model at the same time. You can run a 120B brain, image, music, voice and transcription together all day. Plan loadouts, not wishlists.

What fits my Mac? Interactive

Slide to your RAM budget. The model field table above and the budget bars above re-colour: green if it fits, amber if tight, red if you would need to evict the brain to run it.

Available unified memory 128 GB

163264128192256512

At 128 GB: the recipe is one resident brain plus on-demand media. gpt-oss-120B fits comfortably; Qwen3.6-27B Q8 leaves the most media headroom. Video is mode-switch, not co-resident.

14 · Build it

From a fresh Mac to a running studio

Free the memory. Raise the GPU wired limit so models can use about 120 GB.
```
sudo sysctl iogpu.wired_limit_mb=122880
```
Install the serving layer. Ollama for the brain (GGUF), and MLX/LM Studio for media and fine-tuning.

Pull the models.

ollama pull gpt-oss:120b
ollama pull qwen3.6:27b
# raise context so the 70+-tool prompt fits
export OLLAMA_CONTEXT_LENGTH=65536

Install Hermes. Pick one:

# Recommended on macOS
brew install hermes-agent

# Or via PyPI (clean Python environments)
pip install hermes-agent

# Or via the official install script
curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

# Or on Windows
iex (irm https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.ps1)

Point it at local. Edit config.yaml (the orchestration block in section 10): brain, delegation model, custom providers for media.
Verify. Run hermes doctor, then ask it to write a file, generate an image, and transcribe a clip. If all three land, the studio is live.
Configure reasoning-format per model. Hybrid-thinking models need three knobs disabled to keep tool-call output clean. The cheat sheet is in section 6. Skip this and watch agents emit raw chain-of-thought into your terminal output.
Enable hermes proxy if you have any of Claude Pro, ChatGPT Pro or SuperGrok. The subscription becomes a localhost OpenAI-compatible endpoint that Codex CLI, Aider, Cline and Continue can all drive. Material cost-saving for multi-tool users.

Housekeeping that pays off

hermes backup snapshots memory and skills · profiles keep work and creative brains separate · /compress rescues a long session nearing its context limit.

One-shot install script

Save as studio-up.sh, run with bash studio-up.sh. Idempotent; rerun anytime.

#!/usr/bin/env bash
set -euo pipefail

# 1. Free the memory
sudo sysctl iogpu.wired_limit_mb=122880

# 2. Install serving layer
brew install ollama
brew install --cask lm-studio
pip install mlx mlx-lm mlx-audio

# 3. Pull models
ollama pull gpt-oss:120b
ollama pull qwen3.6:27b
export OLLAMA_CONTEXT_LENGTH=65536

# 4. Install Hermes
brew install hermes-agent

# 5. Initial config
mkdir -p ~/.hermes
cat > ~/.hermes/config.yaml <<'YAML'
model: ollama/gpt-oss:120b
delegation:
  model: ollama/qwen3.6:27b
custom_providers:
  - name: local-media
    base_url: http://localhost:8080/v1
YAML

# 6. Verify
hermes doctor
echo "Studio is live. Try: hermes 'write hello.txt then summarize it.'"

15 · Safety & air-gap

Staying private, staying in control

An agent that can run shell commands and an uncensored model that never refuses are powerful and need guardrails. Hermes ships them; keep them on.

Dangerous-command blocking + approvals: destructive patterns (rm -rf, DROP TABLE) are blocked or require explicit confirmation.
Secret-exfiltration scanning: the runtime flags attempts to leak keys or credentials, important when an abliterated model will follow any instruction it reads.
MCP hardening: OAuth 2.1 PKCE for connectors plus OSV scanning of MCP servers for known-malicious packages.
Air-gap checklist: local brain ✓, local media ✓, web_search off or local SearXNG ✓, no telemetry ✓. Then the network cable is optional.

The one mindset to keep

Local does not mean consequence-free. The agent acts on your machine with your permissions; the uncensored model will draft anything. Keep approvals on for shell and file-deletion, sandbox experiments, and own what you generate. Privacy is a feature, not an excuse.

Seven sandbox backends

Hermes ships more sandbox options than any competitor. Pick by trust level + cost.

local

default

Runs in your shell with the agent's permissions. Fastest. Use for trusted skills.

Docker

local container

Isolated filesystem, network, processes. Use for untrusted MCP servers or third-party skills.

SSH

remote host

Run the workload on another machine over SSH. Use for heavy compute on a Mac Studio without leaving the laptop.

Singularity

HPC container

Container format favoured in HPC. Use on university or research clusters.

Modal

serverless

Spins up an ephemeral cloud sandbox per task. Use for elastic burst compute outside the studio.

Daytona

dev environments

Pre-configured dev environments. Use for reproducible per-project workspaces.

Vercel Sandbox

edge sandbox

Vercel's serverless sandbox primitive. Use when the workload should sit close to a web app.

Disclosed CVEs

One disclosed Hermes CVE to date: CVE-2026-7396 (WeChat path traversal). Patched in 0.13.x. Small attack surface, transparent disclosure. The 0.12-0.14 release cycles shipped hundreds of security-tagged commits as part of normal hardening; no breaches reported in the wild.

16 · Where this goes

The studio is a starting line, not a finish

Step back and look at the arc. A year ago, the best open model scored 22 and a private studio like this was science fiction. Today the gap to the frontier is single digits, a 27-billion-parameter model that fits a laptop codes like last year's best, and music, voice and images that rivalled paid services now run offline in seconds. The line on that chart is still climbing.

The question stopped being "can I run this locally?" and became "why would I rent it?"

What you have built here is not a cheaper ChatGPT. It is a different relationship with the technology. The models are yours: they do not change under you overnight, they do not log your work, they do not disappear when a subscription lapses or a company pivots. The agent learns your patterns and keeps them. The voice clone, the LoRA of your style, the skills it wrote solving your bugs, none of it leaves the machine. In a market built on renting access to someone else's servers, owning the whole stack outright is the genuinely radical option.

It is not the strongest possible system. The trillion-parameter leaders still live in data centers, local video still trails, and the leaderboard you read today will be wrong next month. But the trajectory is unmistakable: every quarter, more of the frontier becomes something you can hold in 128 GB. The right move is not to wait for the perfect model. It is to build the studio now, learn how the pieces fit, and let it improve underneath you as the open field keeps closing the gap, which it will.

One agent. Every model. Zero cloud. Run it once, and the cloud starts to look like a choice rather than a requirement. That is the whole point, and it is already here.

17 · FAQ & honest gaps

Straight answers

Is it really free?

The software (Hermes, Ollama, MLX) and the open-weights models are free to download and run. You pay once for the hardware and in electricity. No subscriptions, no per-token billing.

Does it genuinely work offline?

Yes, once the four cloud-default tools are repointed (section 08). Brain, images, music, voice, video and transcription all answer on localhost. You can pull the network cable.

What is the catch with uncensored models?

Abliteration can dent general capability if done carelessly, and an uncensored model will follow injected instructions too. Use a clean brain as orchestrator and an uncensored model only as the content subagent, and keep tool approvals on.

Where is local AI still weak?

Video: quality and speed trail the cloud, and it cannot co-reside with a big brain. The very top reasoning models (1T+) need a data center. For pure software engineering, dedicated coding agents still edge out a general local agent.

Is a Mac actually the right machine?

For this use case, yes. Unified memory lets a laptop hold and even fine-tune models a consumer GPU cannot load, silently and on battery. An NVIDIA card is faster on models that fit its VRAM, but loses the moment a model is too big for it.

Will these recommendations age?

The specific models will, within weeks. The architecture will not: an agent conductor, a resident MoE brain on GGUF, MLX for media and fine-tuning, and on-demand creative engines. Swap the model names; keep the structure.

Do I need macOS Tahoe 26.4?

For the full Neural Accelerator INT4 path, yes. Earlier 26.0 and 26.1 lack INT4 tensor support and the tok/s numbers in this guide will not reproduce. Run sw_vers to check.

Does Apple Intelligence conflict with the local studio?

No. Apple Intelligence runs a small on-device model (around 3B parameters) inside its own allocation; it does not fight your 120 GB unified pool for the orchestrator brain.

Can I scale beyond 128 GB?

Yes, two paths. (a) Mac Studio M3 Ultra at 512 GB / 819 GB/s holds DeepSeek V4 Flash at Q8 comfortably. (b) EXO Labs distributed inference over Thunderbolt 5 daisy-chains M5 Max + Mac Studio M3 Ultra into a single inference pool, workable for models > 128 GB.

Is FLUX.2 commercially usable?

Partly. FLUX.2 [dev] and [klein] 9B are non-commercial. FLUX.2 [klein] 4B is Apache-2.0, fully commercial-friendly. For unrestricted commercial image work the safe default is Qwen-Image-2512 or Z-Image-Turbo (both Apache-2.0).

How fast is the M5 Max actually?

Independent benchmarks: gpt-oss-120B at Q8 MLX runs 64-88 tokens/second decode on a 128 GB M5 Max. Qwen3.5 27B dense at Q6 MLX: 14-24 tokens/second. Prefill is 3.33-4.06× faster than M4 Max per Apple's MLX team. Real, reproducible, on a laptop.

What happens when the model I'm using gets superseded?

Swap the model in config.yaml; the rest of the studio (agent, tools, skills, memory, pipelines) is unchanged. That is the whole point of the architecture: the model is a swappable component, not the system.

Why not Strix Halo or DGX Spark instead of a Mac?

Both are valid. Same 128 GB unified memory. The deciding factor is bandwidth: M5 Max 614 GB/s vs DGX Spark 273 GB/s vs Strix Halo 256 GB/s. For bandwidth-bound decode workloads (agent tool-call loops), the Mac wins by roughly 2.25× over Spark and 2.4× over Strix Halo. NVIDIA's FP4 hardware advantage helps on some prefill workloads. If you need CUDA-only models or Windows-native tools, Strix Halo wins.