One agent. Every model. Zero cloud.
A deep, honest guide to building a private AI studio on a single MacBook Pro M5 Max (128GB). Hermes Agent is the conductor; a local model is its brain; and a dedicated local model handles images, video, music, dubbing, voice cloning and transcription. The whole 2026 model field, how each one works and trains, the benchmarks that matter, and exactly how to build it. Flip the toggle to read it plain or in full technical depth.
Sources are primary throughout: official Hermes, Apple, OpenAI, Qwen and DeepSeek documentation; the Artificial Analysis Intelligence Index, SWE-bench Verified, BFCL, τ-bench, MMLU-Pro and Text-to-Image arenas; arXiv papers for every media model; and community discussion on r/LocalLLaMA. Numbers are as reported by their sources and are a May 2026 snapshot. Models rotate monthly; the architecture here is stable, the leaderboard is not. Benchmark on your own machine before committing.
Why this is suddenly possible
Two years ago, running a frontier-class model on your own machine was a fantasy. That changed faster than almost anyone predicted. The reason this guide exists now, and could not have existed in 2024, is a single chart: the gap between open models you can download and the closed models you rent has nearly closed.
"Open weights" means the actual model is published for anyone to download and run, instead of being locked behind a company's paid website. In 2024 these were toys next to ChatGPT. In 2026 they are genuinely close to the best, free, and small enough to run on a good laptop. That is the whole reason a private studio on one Mac is now realistic.
On the Artificial Analysis Intelligence Index (v4.0, a composite of ten evals including GPQA Diamond, Humanity's Last Exam, τ²-Bench, Terminal-Bench Hard and SciCode), the best open-weights model a year ago, DeepSeek V3 0324, scored 22, about 13 points below the leading proprietary model. Today the top open models (Kimi K2.6, MiMo-V2.5-Pro) score 54, with DeepSeek V4 Pro at 52, within 3-6 points of GPT-5.5 (60), Gemini 3.1 Pro and Claude Opus 4.7 (57). Open weights now hold 244 of 386 ranked models and dominate the intelligence-vs-price Pareto frontier.
| Then → now (open weights) | Early 2025 | 2026-05-24 |
|---|---|---|
| Top Intelligence Index score | 22 (DeepSeek V3 0324) | 54 (Kimi K2.6 / MiMo-V2.5-Pro) |
| Gap to best proprietary | ~13 points | 3–6 points |
| SWE-bench Verified (best open) | ~55% | 80.6% (DeepSeek V4 Pro Max) |
| Open share of ranked models | minority | 244 of 386 |
Why the M5 Max is the right hardware
The Mac's advantage is not raw speed; a data-center GPU beats it on throughput. The advantage is one large pool of fast memory shared by the whole chip, which lets a laptop hold models that simply will not fit on a consumer graphics card, while sipping power and staying silent.
Unified memory is the whole trick
A normal gaming PC keeps the model in the graphics card's small, separate memory, and if the model is too big, it simply will not load. The Mac has one big shared pool, so the entire model lives in the same 128 GB the rest of the chip uses. That is how a laptop runs models a $1,500 graphics card chokes on.
Apple's unified memory is a single pool addressable by CPU, GPU and Neural Engine with no host↔device copies. The M5 Max, launched 2026-03-03, tops out at 128 GB at 614 GB/s on the 40-core GPU SKU (the 32-core variant caps at 64 GB / 460 GB/s). After raising the wired limit you get about 120 GB usable for models. The historic Mac weakness was prefill (prompt processing), where it trailed NVIDIA badly; the M5's new per-core Neural Accelerators push prefill 3.33×–4.06× faster than M4 (Apple's own MLX team measurement) and make FLUX-dev-4bit ~3.8× faster. The Neural Accelerators run 1,024 FP16 fused multiply-accumulates per core per cycle, aggregating to about 70 TFLOPS of FP16 or 130 TFLOPS of INT8 across the 40-core GPU. Native FP8 and FP4 still belong to Blackwell; BF16 on the Neural Accelerators arrived in macOS 26.1, INT4 in 26.4: the numbers in this guide assume macOS Tahoe 26.4 or later.
| Spec | Figure | Why it matters here |
|---|---|---|
| Unified memory | 128 GB (40-core GPU SKU) | About 120 GB usable for models after raising the wired limit. Holds one big brain or one video model. |
| Memory bandwidth | 614 GB/s | Caps token generation speed; favours MoE models with low active parameters. |
| GPU + Neural Accelerators | 40-core · ~70 TFLOPS FP16 · ~130 TFLOPS INT8 | About four times M4 Max GPU compute on matmul and prefill; about a fifth faster on decode. Speeds diffusion and rectified-flow image models. |
| CPU | 18-core (6 super + 12 performance) | Drives the agent loop and the Python orchestration; not the bottleneck. |
| Storage | 2 / 4 / 8 TB SSD · 13.6 GB/s read · 17.8 GB/s write | A full stack with alternates is 250–300 GB; pick 4 TB if you keep multiple 120B weight files. Cold-loading a 65 GB model takes ~5 seconds. |
| Chip topology | TSMC SoIC-mH chiplet (two N3P dies) | "Fusion Architecture": same compute die in M5 Pro and M5 Max; explains the uniform per-core spec. |
| macOS required | Tahoe 26.4 or later | Earlier 26.0 / 26.1 lack INT4 Neural Accelerator support; perf drops materially. Verify your build before reproducing the numbers in this guide. |
How the M5 Max stacks up against everything else with 128 GB
For an agent that decodes hundreds of tool calls per session, memory bandwidth beats raw compute, because decode is bandwidth-bound. The Mac wins this race against its closest peers despite Blackwell's FP4 hardware advantage.
| Platform | Unified RAM | Bandwidth | gpt-oss-120B Q4 | DeepSeek V4 Flash Q4 | Notes |
|---|---|---|---|---|---|
| MacBook Pro M5 Max 128 GB | 128 GB | 614 GB/s | fits · ~63 GB | no · ~120 GB | Silent, battery, ANE + Neural Accelerators. |
| MacBook Pro M4 Max 128 GB | 128 GB | 546 GB/s | fits | no | Predecessor: weak prefill. |
| Mac Studio M3 Ultra 512 GB | 512 GB | 819 GB/s | fits · Q8 | fits | Desktop scale-up: the only Apple option that holds V4 Flash. |
| RTX 5090 (32 GB) | 32 GB GDDR7 | 1,792 GB/s | too small | no | Fastest per-GB but GB ceiling kills it for 120B. |
| NVIDIA DGX Spark / Project Digits | 128 GB unified | 273 GB/s | fits | no | Same RAM, 2.25× less bandwidth than M5 Max. Has FP4 hardware. |
| AMD Strix Halo (Ryzen AI MAX+ 395) | 128 GB unified | 256 GB/s | fits | no | x86 alternative; mature ROCm still trails MLX / CUDA on Q4 kernels. |
| RTX PRO 6000 Blackwell | 96 GB | 1,792 GB/s | fits | tight | Workstation: PCIe scale-out, no battery, $7-10k. |
| NVIDIA Jetson AGX Thor | 128 GB unified | n/a published | likely | no | Robotics-first, not the studio target. |
What Mac uniquely enables
- Fine-tuning 14–32B models on a laptop (32–64 GB unified beats a 24 GB consumer GPU that can't even load them).
- Battery-powered, silent operation under sustained load.
- MLX-Audio one-endpoint TTS + STT + STS server.
- Draw Things Lightning Draft (about one second per image on M5 Max).
- Hardware ProRes encode in the Media Engine.
- Continuity recipes: iPhone audio capture → AirDrop → MLX-Audio Whisper → 120B summarizer in one chain.
What Mac uniquely cannot do
- CUDA-only models, FlashAttention-3 native kernels, NVIDIA-only quant formats (AWQ-INT4 with Triton).
- True multi-GPU NVLink scaling beyond a single Mac (EXO Labs distributed inference over TB5 is the workaround for >128 GB).
- Native FP8 and FP4 hardware support (Blackwell's persistent lead).
- vLLM speculative-decoding-with-paged-attention performance at scale (vllm-metal v0.2 closes some of this in April 2026).
Hermes Agent, in full
A raw model only emits text. Something has to turn "dub this clip into English in my voice" into a real sequence of actions that ends in a file on disk. That something is the agent. Hermes Agent, from Nous Research, is the conductor that holds the baton.
Hermes is a free program you install once. It is the tireless operator that lives on your Mac: you talk to it like a person, and it actually does the work, running commands, editing files, browsing, making images, scheduling itself for later, and remembering what it learns so it gets better the more you use it. Think senior assistant, not chatbot.
An MIT-licensed Python 3.11+ agent runtime. It is model-agnostic: the model is a swappable component behind any OpenAI-compatible /v1/chat/completions endpoint (Ollama, llama.cpp, vLLM, SGLang, LM Studio). It ships a registry of more than 70 tools across more than 30 toolsets, a pluggable memory provider, a skills engine that authors its own skills, a full MCP client (OAuth 2.1 PKCE, OSV malware scanning, ACP via the Zed Agent Client Protocol Registry installable via uvx), subagent delegation, durable multi-agent orchestration via Kanban, a cron scheduler, 22 first-class messaging gateways and a React/Ink TUI plus web dashboard. Installs to ~/.local/bin; all state in ~/.hermes/; no telemetry.
The anatomy
| Part | Plain | Technical |
|---|---|---|
| Tools | Its hands: everything it can physically do. | More than 70 built-in tools across more than 30 toolsets, plus MCP tools; switchable per session with -t. |
| Skills | Step-by-step methods; it writes new ones from what worked. | 87 built-in skills + 79 optional in-repo + 672+ across the agentskills.io / HermesHub / LobeHub / Anthropic registries; skill_manage; your own under ~/.hermes/skills/. |
| Memory | Facts about you and your projects. | Frozen-snapshot system-prompt prefix-cache injection at session start (MEMORY.md ~2,200 chars + USER.md ~1,375 chars), plus on-demand FTS5 query via session_search. Relevance ranking lives in session_search and in pluggable providers (Honcho, Mem0, Hindsight). |
| Session search | Total recall of past conversations. | Every session in SQLite with full-text search; session_search retrieves and summarizes. |
| Subagents | Clones helpers that work in parallel and report back. | delegate_task: isolated context + terminal + toolset; orchestrator role; max_spawn_depth; file-coordination layer. |
| Cron | Runs on a schedule. "Every morning, do X." | Skill-backed jobs run in fresh sessions; notify_on_complete on background processes. |
| Gateways | Reach it from Telegram, Discord, Slack, WhatsApp, email. | 17 platforms; allowlist / DM-pairing / open auth; per-platform skill gating. |
| Profiles | Separate personas with their own memory (work vs creative). | Isolated HERMES_HOME dirs; per-profile config, keys, memory, sessions, skills. |
| Plugins + MCP | Bolt-on powers and connections to other tools. | Python/shell-hook plugins (can veto tool calls, ship image-gen backends); full MCP client. |
The 70+ built-in tools
Tools tagged cloud need a key by default and get swapped for local equivalents in section 07; everything else is local.
terminal · process
Run shell commands, background servers, monitor and notify on completion.
file ×4
read_file, write_file, patch (fuzzy find-replace, 9 strategies, auto syntax check), search_files (ripgrep).
code_execution
execute_code: Python that calls other Hermes tools, with branching and output filtering.
delegation
delegate_task: spawn subagents in isolated contexts; only their summary returns.
browser ×12
navigate, click, type, snapshot, vision, console, scroll, CDP and more, full headless control.
vision · memory
vision_analyze (describe/answer about images); memory (save durable facts).
skills ×3 · session_search
skill_manage / skill_view / skills_list; search and summarize all past sessions.
cronjob · todo · clarify
Schedule skill-backed tasks; plan multi-step work; ask you a multiple-choice question.
moa
mixture_of_agents: route a hard problem through several models (4 + aggregator) and merge.
rl ×10
Drive reinforcement-learning fine-tuning runs and read live training metrics.
image_gen · tts · web ×2
image_generate, text_to_speech, web_search/web_extract.
messaging · homeassistant ×4 · feishu ×5
Send to chat platforms; control smart-home devices; enterprise Lark/Feishu doc ops.
How it thinks
You give it a goal. It makes a short plan, picks a tool, uses it, looks at the result, and decides the next step, repeating until done. It stops to ask only when it needs a real decision or permission for something risky.
A tool-calling loop: system prompt (memory + skills + tool schemas) → model emits a tool call → runtime executes → result re-enters context → repeat. Context compression (/compress) keeps long runs under the window; activity-based timeouts (read timeout relaxed to 1800s for local endpoints) prevent premature kills; subagent results return as summaries so intermediate noise never bloats the main thread.
Three mechanisms keep long jobs alive: todo (it tracks its own steps), compression (summarizes history to stay under the context limit), and delegation (offloads sub-tasks to subagents so the main thread stays clean). They are why a local agent can run a multi-stage media pipeline end to end without losing the plot.
What the community actually reports
Independent of any vendor claim, community discussion across r/LocalLLaMA and Hacker News has been active and growing, praising the smoother setup and the built-in learning loop. A representative comment: the agent "actually remembers" a failure and "creates a skill for troubleshooting it." The honest consensus is not that Hermes replaces dedicated coding agents; for pure software engineering, the leaders are clear: Claude Code with Opus 4.6 holds 80.8% on SWE-bench Verified (highest reported single-agent), Aider's Polyglot leaderboard places Opus 4.6 around 85% edit-format, and Cursor Composer with Sonnet 4.6 sits in the mid-70s. Hermes does not publish a SWE-bench Verified number and is built for a wider job: orchestration, memory, scheduling and gateways. Treat marketing comparisons sceptically; benchmark for your actual workload. Nous shipped hundreds of security-tagged commits across the 0.12-0.14 release cycles (588 / 633 / 550 total merged PRs respectively) and there are no widely reported breaches; the lone disclosed CVE-2026-7396 (WeChat path traversal) was patched. Hermes positions itself as complementary to the older OpenClaw framework, with a built-in hermes claw migrate path that imports persona, skills, memory, channels and API keys.
Sleeper features in 0.13 and 0.14
Six things from the last two release cycles that materially change what the studio can do:
hermes proxy
Turns Claude Pro / ChatGPT Pro / SuperGrok into a localhost OpenAI-compatible endpoint. Codex CLI, Aider, Cline and Continue all become free to drive off your Hermes-managed subscription.
/goal · /subgoal
Ralph-loop persistent goal contracts: the agent keeps going until a judge decides the criteria are met. Layered subgoals can be added mid-run.
Kanban durable
Multi-agent orchestration with heartbeat detection, auto-block on incomplete exit, per-task retries, hallucination recovery. Closes the gap delegate_task leaves open.
Curator agent
Background process that deduplicates, deprecates and consolidates skills. No equivalent in any competitor.
LSP semantic diagnostics
Beyond syntax linting: type errors, undefined symbols, missing imports surfaced back to the agent before the next turn.
Cross-session prompt cache
1-hour Claude prompt cache survives /new. Cuts cost materially on long workflows.
OpenClaw, Claw Code, ClaudeClaw: three different projects
Names collide. The migration path Hermes ships is for one of them only.
| Project | Repo | What it is |
|---|---|---|
| OpenClaw Peter Steinberger | openclaw/openclaw374K stars · MIT · TypeScript | General-purpose messaging-first AI assistant. Originally Clawdbot (2025-11-24) → Moltbot (2026-01-27) → OpenClaw. This is what hermes claw migrate reads. |
| Claw Code Sigrid Jin | instructkr/claw-code48K stars · Python + Rust | Clean-room rewrite of Claude Code's leaked source map. Coding-focused CLI. Unrelated to OpenClaw. |
| ClaudeClaw moazbuilds | moazbuilds/claudeclaw | Lightweight OpenClaw-equivalent that runs as a Claude Code plugin: daemon + Telegram/Discord/Slack/cron/web dashboard. |
hermes claw migrate reads ~/.openclaw/, and auto-detects legacy ~/.clawdbot/ and ~/.moltbot/ paths. Non-destructive by default: skips SOUL.md if Hermes already has one, skips duplicate memory entries, skips same-named skills. Imports persona, skills, memory, channels and API keys. Imported skills land at ~/.hermes/skills/openclaw-imports/.The model that thinks, and the whole field of them
The brain is the model the agent reasons with. For an agent, the trait that matters most is not raw genius: it is reliably choosing the right tool and emitting a clean call, hundreds of times in a row. A model that is 2% smarter but fumbles one call ends the run. So this section covers how these models work, what makes each one different, how they are trained, and exactly how the 2026 field scores, before picking the ones that fit your Mac.
How a modern model works
Picture a company with 128 specialists. For each word, a "router" wakes only the 8 most relevant ones. The model is enormous in total knowledge, but only a thin slice works at a time, so it stays fast and fits in memory. That "mixture of experts" trick is why a 120-billion model runs on a laptop. Some models also have a "think first" switch that lets them reason step by step on hard problems and answer instantly on easy ones.
A Mixture-of-Experts transformer routes each token through a small subset of expert FFNs. Active-parameter count, not total, governs per-token compute and bandwidth, which is why MoE suits a 614 GB/s Mac. The differentiators between 2026 models are mostly in attention and routing: MLA (DeepSeek's Multi-head Latent Attention compresses the KV cache), attention sinks (gpt-oss lets heads "pay zero attention"), linear/lightning attention (MiniMax for long-context efficiency), auxiliary-loss-free routing (DeepSeek's load balancing), and hybrid thinking (Qwen's switchable reasoning with a thinking budget). Training increasingly leans on RL: DeepSeek-R1 showed pure RL via GRPO can teach reasoning with no supervised chain-of-thought; gpt-oss post-trains with CoT+RL like o3.
What makes each architecture different
| Innovation | Who | What it does |
|---|---|---|
| Mixture-of-Experts | nearly all | Only a few experts fire per token; huge total capacity, small active cost. |
| Multi-head Latent Attention (MLA) | DeepSeek V3/V4 | Compresses the KV cache into a latent space, slashing memory for long context. |
| Attention sinks | gpt-oss | A learned per-head bias lets the model ignore tokens cleanly; stabilizes long context. |
| Hybrid thinking + budget | Qwen3.x, DeepSeek V4 | One model switches between visible chain-of-thought and instant answers; you cap the reasoning spend. |
| Auxiliary-loss-free routing | DeepSeek, Qwen3.x MoE | Balances expert load without a loss term that hurts quality; encourages specialization. |
| Hybrid Mamba + Transformer + MoE | NVIDIA Nemotron 3 Super | State-space + attention + MoE in one model. Only open frontier-class system shipping all three; holds 91.75% RULER at 1M tokens. |
| Linear / lightning attention | MiniMax M2.x | Sub-quadratic attention for cheap very-long-context inference. |
| RL-first post-training (GRPO) | DeepSeek-R1 lineage | Pure reinforcement learning induces reasoning without supervised CoT data. |
| Native MXFP4 | gpt-oss | 4.25-bit microscaling quantization is the intended format, not a lossy afterthought. |
The whole field, scored and sized
Intelligence Index = Artificial Analysis v4.0 composite. SWE = SWE-bench Verified. "128GB" = fits on this Mac at a usable 4-bit quant. Top models are listed precisely because most of them do not fit, which is the honest part.
| Model | Lab | Params (total / active) | Intel. | SWE | License | 128GB? |
|---|---|---|---|---|---|---|
| gpt-oss-120B | OpenAI | 117B / 5.1B MoE | n/a | o4-mini class | Apache-2.0 | yes · 63GB |
| Qwen3.6-27B | Alibaba | 27B dense | n/a | 77.2 | Apache-2.0 | yes · 17-33GB |
| Qwen3.6 35B-A3B | Alibaba | 35B / 3.5B MoE | n/a | n/a | Apache-2.0 | yes · 26GB |
| Gemma 4 31B | 31B dense | top non-China open | n/a | Gemma | yes · 17GB | |
| Mistral Medium 3.5 | Mistral | 128B | n/a | 77.6 | open | tight · ~70GB |
| Llama 4 Scout | Meta | 109B / 17B MoE | n/a | n/a | Llama | yes · ~60GB |
| Nemotron 3 Super | NVIDIA | ~70-100B | top non-China open | n/a | open | likely · 4-bit |
| DeepSeek V4 Flash | DeepSeek | 284B / 13B MoE | ~49 | 79.0 | open | no · ~120GB Q4 |
| DeepSeek V4 Pro | DeepSeek | 1.6T / 49B MoE | 52 | 80.6 | open | cloud only |
| Kimi K2.6 | Moonshot | ~1T MoE | 54 | 80.2 | open | cloud only |
| GLM-5.1 | Z.AI | 744B MoE | ~51 | 77.8 (GLM-5) | open | cloud only |
| MiniMax M2.7 | MiniMax | MoE · linear attn | 49.6 | n/a | open | cloud only |
| MiMo-V2.5-Pro | Xiaomi | ~1T MoE | 54 | 78.0 | open | cloud only |
| Qwen3.5 397B-A17B | Alibaba | 397B / 17B MoE | n/a | 76.2 | Apache-2.0 | no |
Intelligence Index · top open weights (higher is better)
SWE-bench Verified · coding (open weights)
Tool calling · BFCL v3 (the agent metric)
The models that fit your Mac, in detail
gpt-oss-120B · the reliability brain
OpenAI's open model. It is the most dependable at "using tools" of anything you can download, which is the single trait that decides whether long agent jobs finish. It fits comfortably and lets you dial how hard it thinks.
- Architecture: token-choice MoE (117B total, 5.1B active, 4 experts), gated SwiGLU, GQA, alternating full + 128-token sliding-window attention, RoPE 131K via YaRN, learned per-head attention sinks.
- Quantization: native MXFP4 on MoE weights → ~63 GB; Q6 ~90 GB. o200k_harmony tokenizer; Harmony chat format.
- Training: CoT + RL post-training using o3-family techniques, specifically for reasoning and tool use; adjustable reasoning effort (low/medium/high).
- Scores: MMLU-Pro 90.0, AIME 2025 97.9 (tools), GPQA 80.9 (tools), τ-bench matches/exceeds o4-mini. The cleanest tool-call discipline of any open weight.
Qwen3.6-27B · the all-rounder
Alibaba's open model and the best balance for this build: fast, small enough to leave room for media, and genuinely excellent at coding and agentic work. If you run one model, this is the safe pick.
- Architecture: 27B dense (the family also ships 128-expert/8-active MoE variants with no shared experts and global-batch load-balancing). Hybrid thinking with a thinking budget. 256K context.
- Training: ~36T pre-training tokens (double Qwen2.5), three-stage pretrain → four-stage post-train, synthetic math/code from Qwen2.5-Math/Coder, 119 languages. Apache-2.0.
- Scores: SWE-bench Verified 77.2, beating models 50× its size; BFCL-class tool calling is strong. ~17 GB at Q4, ~33 GB at Q8.
Qwen3.6 35B-A3B
Tiny active footprint = very fast on Mac. Perfect fast/auxiliary lane beside a heavy main brain.
Gemma 4 31B
Top non-China open model on the Intelligence Index; efficient, multimodal, strong general writing. Watch for repetition loops past ~11 tool calls in some harnesses.
Mistral Medium 3.5
SWE-bench 77.6, EU-built for data-residency needs. Fits tight at 4-bit, leaves little for media.
Llama 4 Scout
Fits at ~60 GB 4-bit; long context. Caveat: its pythonic tool-call format needs a compatible parser.
Nemotron 3 Super
The other top non-China open reasoning model; a strong Western-licensed alternative.
DeepSeek V4 Flash
SWE-bench 79.0, MLA + auxiliary-loss-free routing. About 120 GB at Q4 (160 GB at FP16), just over the line for 128 GB once KV cache is added.
Nemotron 3 Super: the open hybrid
NVIDIA's contender. Different inside: it mixes three kinds of layers (long-context Mamba, attention, and the MoE experts you already know). Holds 91.75% on a million-token retrieval test, which no other open model touches.
- Architecture: hybrid Mamba state-space layers (cheap long context) + transformer attention (short range) + MoE experts (capacity). The only open frontier-class system shipping all three combined.
- Scores: 91.75% RULER at 1M tokens (unmatched among open models). Strong on Intelligence Index alongside Gemma 4 31B as the two non-China open entries.
- Why it matters: the natural pick for teams with data-residency constraints that bar weights from Chinese labs. Western-licensed alternative to DeepSeek / Qwen / GLM.
Qwen3-Coder-Next: the small fast coder
Alibaba's specialised coding model. Three billion active parameters, scores 70.6% on the coding test. The best self-hostable coder under 100 GB.
- Architecture: 3B active parameters in a coding-tuned MoE. About 40 GB at Q4.
- Training: 800K agentic coding RL tasks (multi-turn, tool-using, test-validated trajectories).
- Scores: SWE-bench Verified 70.6 with only 3B active params. Inside the 100 GB ceiling so you can keep it loaded alongside the orchestrator brain on the M5 Max.
The cloud-only leaders, briefly
These do not fit your Mac, but you should know what sits at the top of the open leaderboard:
DeepSeek V4 Pro
Current open-weights coding leader at 80.6 SWE-Verified. Multi-head Latent Attention compresses the KV cache; auxiliary-loss-free routing balances experts. Hybrid thinking.
DeepSeek V4 Flash
First new DeepSeek architecture since V3. Same innovations as Pro at a fifth of the size. Still does not fit a 128 GB Mac at Q4 (~120 GB).
Kimi K2.6
Co-leads Intelligence Index v4.0 at 54. GPQA Diamond 90.5 (highest of any open model). General-purpose flagship.
GLM-5.1
Z.AI's flagship. Intelligence Index around 51, GDPval-AA leader at 1535. Strong real-world agentic benchmark performance.
MiniMax M2.7
Pioneer of lightning attention: sub-quadratic for cheap very-long-context inference. Intelligence Index around 50.
MiMo-V2.5-Pro
Xiaomi's entry. Intelligence Index 54, 1M-token context, 78.0 SWE-Verified. China's third major model lab behind DeepSeek and Alibaba.
| Term | Used by | What it does in one line |
|---|---|---|
| MoE (Mixture of Experts) | nearly all | Only a few experts fire per token; huge total, small active cost. |
| MLA (Multi-head Latent Attention) | DeepSeek V3/V4 | Compresses the KV cache into a latent space. |
| Attention sinks | gpt-oss | Learned per-head bias lets the model ignore tokens cleanly. |
| Auxiliary-loss-free routing | DeepSeek, Qwen3.x MoE | Balances expert load without a loss-term penalty. |
| Hybrid thinking + budget | Qwen3.x, DeepSeek V4 | One model switches between visible CoT and instant answers; you cap reasoning spend. |
| Lightning attention | MiniMax M2.x | Sub-quadratic attention for cheap very-long-context inference. |
| GRPO | DeepSeek-R1 lineage | Pure RL induces reasoning without supervised CoT data. Cut post-training cost ~10×. |
| MXFP4 | gpt-oss | 4.25-bit microscaling: the intended format, not a lossy afterthought. |
| Mamba + Transformer + MoE | Nemotron 3 Super | State-space + attention + MoE in one model. Holds 91.75% RULER at 1M. |
| License | Models | Commercial use | Catch |
|---|---|---|---|
| Apache-2.0 | gpt-oss-120B, Qwen3.6-27B, Qwen3.6 35B-A3B, Qwen-Image-2512, Z-Image-Turbo, FLUX.2 [klein] 4B | ✓ unrestricted | None. Default-yes. |
| MIT | Chatterbox, HiDream-O1, ACE-Step 1.5 XL | ✓ unrestricted | None. |
| Llama 4 Community | Llama 4 Scout, Maverick | ✓ conditional | EU multimodal blocked. 700M MAU threshold. |
| Gemma | Gemma 4 31B | ✓ conditional | Google's terms; review product attribution. |
| Modified MIT | Mistral Medium 3.5 | ✓ conditional | Not pure MIT; check redistribution clauses. |
| FLUX.2 Non-Commercial | FLUX.2 [dev], FLUX.2 [klein] 9B | ✗ research only | Pay BFL for commercial, or use [klein] 4B. |
| Custom open weights | DeepSeek V3/V4, GLM-5/5.1, Kimi K2.6, MiniMax M2.7, MiMo-V2.5-Pro | per-model | Generally permissive but read each. |
OLLAMA_CONTEXT_LENGTH=65536) or the 70+-tool system prompt silently overflows. And for hybrid-thinking models (Qwen3.x, GLM-4.7, DeepSeek V4), disabling reasoning mode requires all three of --reasoning off, --reasoning-budget 0, and chat_template_kwargs.enable_thinking: false. Setting only one is the single biggest source of production failures in May 2026 (see llama.cpp issue #13189).Abliteration: how it works, and the catch
"Uncensored" models do not refuse you. Researchers discovered there is essentially a single internal "no" signal inside a model; abliteration surgically cancels it. The catch is that doing this carelessly also dents the model's general skill, which an agent relies on. So the smart pattern is one disciplined model for the work, and a separate uncensored one only for the writing.
Refusal is mediated by roughly a single direction in the residual stream (Arditi et al., NeurIPS 2024). You estimate that refusal direction from contrastive harmful vs harmless prompts, then either steer activations away from it at inference, or permanently orthogonalize the weights against it. The term (ablate + obliterate) was coined by FailSpy. The problem: the refusal vector is polysemantic, entangling refusal with syntax, formatting and capability circuits, so naive ablation causes collateral damage, partially recoverable with light "healing" fine-tuning (SFT/DPO).
The 2026 tooling
| Tool | What it does | Why it matters |
|---|---|---|
| Heretic | One-command automated abliteration; separates attention vs MLP interventions (MLP causes more damage); v1.2 adds a LoRA-based engine producing a toggleable adapter plus 4-bit support. | ~6.5× less capability damage than hand-tuned efforts. pip install heretic-llm → heretic <model>. |
| OBLITERATUS | 116-model toolkit; adds Expert-Granular Abliteration (per-expert directions for MoE) and CoT-aware ablation. | Deeper and broader, but heavier. For when a single direction is not enough. |
| UGI Leaderboard | Community ranking of Uncensored General Intelligence plus a natural-intelligence (NatInt) score. | The place to confirm an "uncensored" model is still actually smart after the surgery. |
Qwen3.6 abliterated :agent
The compromise model: uncensored and keeps tool calling. Best single "both" pick.
Gemma 4 31B Heretic
Uncensored general-purpose with native vision and tool calling.
Hermes 4.3 / 70B
Low-refusal by design (not abliterated). Excellent writing, lyrics, roleplay; run as the content subagent.
DIY with Heretic
Abliterate Qwen3.6-27B yourself for an uncensored brain tuned to taste, output as a toggleable LoRA.
delegation model so the agent auto-routes writing to it. Reliable agency, zero refusals where you want them, no compromise on either side.How models run, and what "4-bit" means
A model's "weights" are billions of numbers. Quantization shrinks them by storing each with fewer digits, like rounding prices to the nearest dollar: 4-bit is small and fast with a tiny quality cost; 8-bit is bigger and basically perfect. You also need a small program to run the model; on Mac the easy one is Ollama, and the fastest is Apple's own MLX.
Quantization reduces weight precision (16-bit → 8/6/5/4-bit). GGUF is the cross-platform de-facto format (Q4_K_M standard, Q5_K_M sweet spot, Q6/Q8 near-lossless), run by llama.cpp/Ollama; it has the broadest model coverage, often within hours of a release. MLX is Apple's native format, built for unified memory with zero CPU↔GPU copies: ~10% less memory and 15–30% faster than GGUF at the same quant. MXFP4 is gpt-oss's native 4.25-bit microscaling. AWQ/GPTQ are activation- and gradient-aware schemes common on NVIDIA. Below ~Q3, tool-calling reliability collapses.
| Server | Best for | Notes |
|---|---|---|
| Ollama (GGUF) | The agent brain | Simplest; Hermes auto-detects at :11434/v1; most stable for long agentic tool use. |
| MLX (mlx_lm) | Media, max speed, fine-tuning | Apple-native; fastest single-user generation; the only local LoRA-training path on Mac. |
| LM Studio | GUI management | Bundles the MLX engine; one-click OpenAI server; runs both MLX and GGUF. |
| llama.cpp / vLLM / SGLang | Power-user serving | Fine control of quant, context, KV-cache; -ngl 99 offloads all layers to the Metal GPU. |
| Quant level | Quality | Use when |
|---|---|---|
| 4-bit (Q4_K_M / MXFP4 / MLX-4) | Minor loss, max speed | The practical default for big models. |
| 5–6-bit (Q5_K_M / MLX-6) | Near-lossless | 24 GB+; the quality sweet spot. |
| 8-bit (Q8 / MLX-8) | Effectively full precision | 48 GB+ (you have it); best for a small premium brain. |
Lane × server × quant: picking quickly
Each lane in the studio wants a different pairing. This collapses the decisions into one table.
| Lane | Server | Quant | Why |
|---|---|---|---|
| Orchestrator brain | Ollama (GGUF) | Q4 / Q8 | Tool-call stability past 5+ rounds; MLX drifts. Use Q8 if you have the room. |
| Auxiliary fast lane | MLX (mlx_lm) | Q5 | Memory edge + native Metal kernels; short sessions do not trigger MLX tool-call drift. |
| Media generation | MLX / Draw Things | MXFP4 / 4-bit | Native; Apple's Neural Accelerator path; Draw Things' Lightning Draft hits about 1 sec/image on M5 Max. |
| Fine-tuning (LoRA) | MLX (mlx_lm.lora) | Q4 base | The only local LoRA-training path on Mac. Unified memory beats a 24 GB RTX 3090 here. |
Reasoning-format settings, per model
Hybrid-thinking models need explicit configuration or they emit chain-of-thought that breaks downstream parsers. Copy-paste:
# gpt-oss-120B (Harmony format, native reasoning effort)
model: ollama/gpt-oss:120b
extra_body:
reasoning_effort: high # low | medium | high
# Qwen3.6 family (hybrid thinking; disable all three knobs)
model: ollama/qwen3.6:27b
extra_body:
reasoning: off
reasoning_budget: 0
chat_template_kwargs:
enable_thinking: false
# GLM-4.7 / 5.1 : same triple flag as Qwen3.6
# DeepSeek V4 : same triple flag
# Llama 4 Scout : pythonic tool-call parser required
model: ollama/llama4:scout
extra_body:
tool_call_format: pythonic
# Gemma 4 31B : watch for repetition loops past ~11 tool calls. Reset session on detection.
Citation: llama.cpp issue #13189 documents the full triple-flag fix.
Images, video, music, voice, transcription
This is where local AI stopped being a compromise. The brain writes and reasons; these models produce the media. Each is a different machine with its own architecture, its own training, and its own leaderboard. Below: how each kind works, the full field of options with arena scores, the apps that run them on Mac, and how to prompt them.
Images
You type a description; the model starts from pure static and "develops" it into a photo over a few seconds, like an instant Polaroid in reverse. Modern ones read your prompt so well you can ask for specific text on a sign, an exact pose, or "the same character, new scene."
2026 image models are rectified-flow transformers (MM-DiT): a diffusion-style model that learns a near-straight path from noise to image, so it needs far fewer sampling steps than old U-Net diffusion. Text and image tokens flow through coupled attention streams; a large text encoder (FLUX.2 uses Mistral Small 3.1 24B) gives strong prompt adherence. Images live in a compressed 16-channel latent decoded by an autoencoder. FLUX.2 [klein] was distilled-free, which makes it the best open base for LoRA training.
Text-to-image arena · open weights (Elo)
FLUX.2 [dev / klein 9B / klein 4B / pro]
Best-in-class quality and prompt control; klein 4B is the distillation-free LoRA base; up to 10 reference images, HEX color control. License is split: dev and klein 9B are non-commercial; klein 4B is Apache-2.0 (commercial OK).
Qwen-Image-2512
Top Apache-2.0 model: commercial-safe, excellent text rendering, strong editing. The pragmatic default.
HiDream-O1-Image-Dev
Highest-ranked open-weights model in the arena right now.
Z-Image-Turbo
Permissive and quick: few-step turbo sampling for near-instant previews.
Seedream 4.5 / Hunyuan 3.0
Seedream excels at East-Asian aesthetics (calligraphy, fabric, architecture); Hunyuan leads open image editing.
SD 3.5 / SDXL
The mature ecosystem: the deepest library of community LoRAs and ControlNets, even if raw quality now trails.
Video
Same idea as images, but the model also has to keep things consistent from frame to frame so motion looks real. This is the heaviest job in the studio: expect minutes per clip, not seconds, and it cannot run at the same time as a big brain.
Video models extend diffusion into a spatio-temporal latent: 3D attention over (frames × height × width) tokens with a causal video VAE, so the model denoises a whole clip while enforcing temporal coherence. This is why VRAM and time costs explode versus stills.
Wan 2.7
The current Wan generation; reference-to-video with voice cloning and instruction-based video editing as new model classes since 2.2. Text- and image-to-video via ComfyUI. About 50 GB resident, minutes per clip; won't co-reside with a 120B brain. Wan 3.0 60B (Apache-2.0) is roadmapped for mid-2026.
LTX-Video
Built for speed: real-time-ish generation on capable hardware, lower fidelity than Wan.
Hunyuan Video / Mochi
High-motion open alternatives; heavier still, strong cinematic motion.
| Model | Lab | Resident | Max res / dur | M5 Max time per 5-sec clip | Notes |
|---|---|---|---|---|---|
| Wan 2.7 | Alibaba · Apr 2026 | ~50 GB | 1080p · 10 sec | ~4–8 min | Current Wan generation. Reference-to-video + voice cloning + instruction editing as new model classes. |
| LTX-Video v0.9 | Lightricks | ~20 GB | 720p · 6 sec | ~30–90 sec | Speed-first; lower fidelity than Wan but real-time-ish on capable hardware. |
| HunyuanVideo | Tencent | ~60 GB | 1280×720 · 5 sec | ~5–10 min | High-motion open alternative; cinematic motion. |
| Mochi 1 | Genmo | ~40 GB | 848×480 · 5 sec | ~3–6 min | Apache-2.0; strong open motion baseline. |
| Step-Video / CogVideoX | StepFun / Tsinghua | ~30–50 GB | 720p · 6–10 sec | ~3–8 min | Newer contenders; CogVideoX-Vid evolution lineage stable. |
Music
Give it a style and some lyrics and it writes and performs a full song, vocals and instruments, in under a minute. The local model is genuinely close to the big paid services, runs offline, and you can teach it a voice or style from a handful of examples.
ACE-Step 1.5 XL is a 4B hybrid: a language-model "composer" reasons in chain-of-thought to plan a structured blueprint (lyrics, sections, duration, metadata), then a diffusion transformer renders 48 kHz stereo audio. It is built on a Sana-style deep-compression autoencoder (DCAE) + linear transformer, with MERT and m-hubert features aligned via REPA; v1.5 adds intrinsic RL. Under 4 GB VRAM, 50+ languages, quality in the Suno v5 range (the closed leader Suno v5.5 shipped March 2026 and has since widened the gap somewhat), and it supports cover, repaint, vocal-to-BGM and LoRA from a few songs.
ACE-Step 1.5 XL
The local Suno. Full songs with vocals in seconds; tiny footprint; trainable on your own style.
YuE
Long-form vocal music generation; strong full-song structure, heavier than ACE-Step.
DiffRhythm
Full songs in ~10s via latent diffusion; very fast, fewer controls.
MusicGen / Stable Audio
Instrumental and sound-design workhorses; no vocals, but reliable for beds and loops.
[verse], [chorus], [instrumental]. Budget ~2–3 words of lyric per second (under ~140 words for a 47-second clip). Specific tags ("balkan brass, minor key, 90 bpm, male vocal") beat vague ones every time.Voice & dubbing
From 5–10 seconds of someone speaking, these models clone the voice and then read any text in it, with emotion. Chain a few together and you can take a video in one language and output it dubbed in another, keeping the original speaker's voice.
Modern open TTS is zero-shot non-autoregressive flow-matching: a DiT generates a mel-spectrogram conditioned on text plus a speaker embedding extracted from a short reference clip, fused without a separate duration model. F5-TTS pairs a DiT with ConvNeXt, trained on ~100k hours, hitting real-time factor ~0.15 (≈6× faster than playback). Chatterbox adds emotion control and a PerTh watermark.
Chatterbox
23-language cloning with emotion control; competitive with ElevenLabs in independent blind tests. The default voice engine.
F5-TTS
RTF ~0.15, clones from a 1–10s reference; superb speed/quality balance.
Kokoro
Featherweight, extremely fast TTS for narration where cloning is not needed.
Qwen3-TTS
Clones from just 3 seconds of reference audio (vs 5–15s for F5, 5s for Chatterbox). Strong multilingual range. Pairs naturally with the Qwen brain.
CosyVoice 3
About 150 ms streaming latency, the lowest of any open TTS as of May 2026. The pick for real-time voice agents.
Sesame CSM-1B
Conversational speech model: tiny, fast, fluent in the conversational register where most TTS still sounds read-aloud.
Dia
Multi-speaker dialogue model; handles back-and-forth and overlap better than single-speaker TTS systems.
TTS quality: open voice models (community Elo, May 2026)
Transcription
Turns speech into accurate text in dozens of languages, fast, fully offline. The backbone of meeting notes, subtitles and the dubbing pipeline above.
Encoder-decoder transformers trained on huge weakly-labelled audio. Whisper large-v3 runs at ~3 GB in MLX with strong multilingual word-error rates; Parakeet and Qwen3-ASR push speed and accuracy further on supported languages.
Whisper large-v3
The reliable multilingual standard; timestamps, translation, robust to noise.
Parakeet v3
Very fast, very accurate on supported languages; great for long recordings.
Qwen3-ASR
Newer multilingual ASR with strong accuracy; pairs naturally with the Qwen brain.
Canary-Qwen-2.5B
5.63% WER on English: beats Whisper large-v3 on the HuggingFace Open ASR Leaderboard for English transcription. Use when English-only and accuracy matters more than language coverage.
ASR word-error rate: open models on English (HF Open ASR Leaderboard, May 2026)
mlx_audio.server) exposes TTS, STT and speech-to-speech behind an OpenAI-compatible REST endpoint, so Hermes drives every voice model through one local URL, the same way it talks to the brain.Making the last cloud tools local
Hermes ships four tools that phone the cloud by default. To make the studio truly air-gappable, each one is repointed at a local engine. After this, the machine can run with Wi-Fi off.
| Default (cloud) | Local replacement | How |
|---|---|---|
image_generate → FAL | Local FLUX / Qwen-Image | Plugin or MCP wrapper around Draw Things / ComfyUI / MLX; the plugin system supports image-gen backends. |
text_to_speech → cloud | MLX-Audio server | Point the tool at the local OpenAI-compatible voice endpoint. |
web_search → Exa | Off, or local SearXNG | Disable for a pure air-gap, or wrap a self-hosted SearXNG via MCP for offline-ish search. |
mixture_of_agents | Local council | Repoint the 4 references + aggregator at your local models (e.g. gpt-oss + Qwen + Gemma). |
localhost. Pull the network cable and the studio keeps working. That is the difference between "private-ish" and genuinely yours.Teaching a model your style, on the Mac
A LoRA is a tiny add-on file that teaches a big model one new thing: your face, your art style, a brand voice, a singer's tone, without retraining the whole model. You make one from a handful of examples in an hour or two, and snap it on or off like a lens.
Low-Rank Adaptation freezes the base weights and injects small trainable rank-decomposition matrices (A·B) into attention/FFN layers; you train ~0.1–1% of parameters. QLoRA trains those adapters on top of a 4-bit-quantized base, collapsing memory further. Lineage: DreamBooth and Textual Inversion for image models, now standard across text, image, music and voice. Unified memory is the quiet superpower here: there is no separate VRAM wall, so a Mac fine-tunes models a 24 GB RTX 3090 cannot even load (rule of thumb: 16 GB → 8B, 32 GB → 14B, 64 GB → 32B; Llama-7B needs ~28 GB full, ~14 GB LoRA, ~7 GB QLoRA).
Training a text LoRA with MLX
# 1. quantize the base to 4-bit (QLoRA kicks in automatically)
mlx_lm.convert --hf-path Qwen/Qwen3.6-27B -q --q-bits 4
# 2. train the adapter on your JSONL data
mlx_lm.lora --model ./mlx_model --train --data ./data \
--lora-layers 16 --batch-size 2 --iters 600
# 3. fuse the adapter back into a standalone model (optional)
mlx_lm.fuse --model ./mlx_model --adapter-path ./adapters
Image LoRA
FLUX/SDXL LoRA in 2,000–4,000 steps via ai-toolkit, SimpleTuner or ComfyUI. Stack 2–3 at 0.5–0.7 weight. Thousands ready on Civitai.
Music LoRA
ACE-Step learns a genre or a singer's tone from a handful of songs; snap it on for on-brand tracks.
Abliteration LoRA
Heretic v1.2 outputs the uncensoring itself as a toggleable LoRA adapter, no full re-download.
Voice "LoRA"
Zero-shot cloning is effectively instant adaptation; fine-tune only for a recurring signature voice.
How the agent runs the whole studio
The pieces only become a studio when one mind coordinates them. Hermes does this with five mechanisms, all configurable.
| Mechanism | What it enables |
|---|---|
| Model routing | delegation.model sends subagent or content work to a different model (e.g. uncensored writer) while the orchestrator stays on the reliable brain. Switch live with /model. |
| Subagents | delegate_task spawns isolated workers (own context, terminal, tools) that run in parallel and return only a summary, keeping the main thread clean. |
| Mixture of agents | mixture_of_agents sends one hard problem to several models and an aggregator merges the best answer, all local. |
| Code execution | execute_code runs Python that itself calls Hermes tools, for branching logic the model would otherwise narrate step by step. |
| Cron + memory + skills | Scheduled jobs run in fresh sessions; memory carries durable facts; skills carry repeatable procedures the agent wrote itself. |
# route the heavy thinking and the uncensored writing separately
model: ollama/gpt-oss:120b # orchestrator brain
delegation:
model: ollama/qwen3.6-abliterated:agent # subagent / content writer
custom_providers:
- name: local-media
base_url: http://localhost:8080/v1 # MLX-Audio / image gateway
How to actually talk to each model
Every model class wants a different prompt style. Old "masterpiece, trending on artstation" spam hurts modern models. Here are the patterns that work.
FLUX / Qwen-Image
Plain descriptive sentences + camera, lens and lighting. "A weathered fisherman at dawn, 35mm, soft side light, shallow depth of field, muted teal palette." Put exact text in quotes for signage.
ACE-Step
Tags: balkan brass, minor key, 90bpm, male vocal, live. Lyrics with [verse] / [chorus] / [instrumental]. ~2–3 words per second.
Voice cloning
5–10s of clean reference audio (no music). Punctuation drives prosody; for emotion use Chatterbox's exaggeration control rather than ALL-CAPS.
Agent system prompt
State the goal, the allowed tools, and a stop condition. Let memory and skills carry standing context instead of repeating it every session.
Real workflows the studio runs end to end
These are not hypotheticals; each is a chain of the tools and models already covered, expressed as a Hermes skill or cron job.
Auto-dubbing
Drop in a video → Whisper transcribes → brain translates → Demucs/Pyannote separate and diarize → Chatterbox clones each speaker → ffmpeg muxes. One skill, one command.
Local songwriting
Brain (or uncensored writer) drafts lyrics in your style → ACE-Step composes and performs → you keep stems. The fully offline Suno.
Content factory (cron)
Every morning: brain drafts posts → FLUX renders matching images → files land in a folder, notify on complete. Runs while you sleep.
Autonomous coding
Brain plans → subagents implement modules in parallel → execute_code runs tests → it self-corrects until green.
Local model council
mixture_of_agents routes a hard decision through gpt-oss + Qwen + Gemma and merges, no API.
Self-improving loop
It hits a bug, solves it, and writes a skill so the fix is permanent. The studio gets better with use.
from run_agent import AIAgent drops the studio into your own Python scripts, so a pipeline can be triggered by a file drop, a webhook or a schedule.What actually fits in 128 GB at once
Roughly 120 GB is usable for models after raising the wired limit. These are real, simultaneous loadouts. The pattern is always: keep one brain resident, spin heavy media up on demand.
What fits my Mac? Interactive
Slide to your RAM budget. The model field table above and the budget bars above re-colour: green if it fits, amber if tight, red if you would need to evict the brain to run it.
From a fresh Mac to a running studio
- Free the memory. Raise the GPU wired limit so models can use about 120 GB.
sudo sysctl iogpu.wired_limit_mb=122880 - Install the serving layer. Ollama for the brain (GGUF), and MLX/LM Studio for media and fine-tuning.
- Pull the models.
ollama pull gpt-oss:120b ollama pull qwen3.6:27b # raise context so the 70+-tool prompt fits export OLLAMA_CONTEXT_LENGTH=65536 - Install Hermes. Pick one:
# Recommended on macOS brew install hermes-agent # Or via PyPI (clean Python environments) pip install hermes-agent # Or via the official install script curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash # Or on Windows iex (irm https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.ps1) - Point it at local. Edit
config.yaml(the orchestration block in section 10): brain, delegation model, custom providers for media. - Verify. Run
hermes doctor, then ask it to write a file, generate an image, and transcribe a clip. If all three land, the studio is live. - Configure reasoning-format per model. Hybrid-thinking models need three knobs disabled to keep tool-call output clean. The cheat sheet is in section 6. Skip this and watch agents emit raw chain-of-thought into your terminal output.
- Enable
hermes proxyif you have any of Claude Pro, ChatGPT Pro or SuperGrok. The subscription becomes a localhost OpenAI-compatible endpoint that Codex CLI, Aider, Cline and Continue can all drive. Material cost-saving for multi-tool users.
hermes backup snapshots memory and skills · profiles keep work and creative brains separate · /compress rescues a long session nearing its context limit.studio-up.sh, run with bash studio-up.sh. Idempotent; rerun anytime.
#!/usr/bin/env bash
set -euo pipefail
# 1. Free the memory
sudo sysctl iogpu.wired_limit_mb=122880
# 2. Install serving layer
brew install ollama
brew install --cask lm-studio
pip install mlx mlx-lm mlx-audio
# 3. Pull models
ollama pull gpt-oss:120b
ollama pull qwen3.6:27b
export OLLAMA_CONTEXT_LENGTH=65536
# 4. Install Hermes
brew install hermes-agent
# 5. Initial config
mkdir -p ~/.hermes
cat > ~/.hermes/config.yaml <<'YAML'
model: ollama/gpt-oss:120b
delegation:
model: ollama/qwen3.6:27b
custom_providers:
- name: local-media
base_url: http://localhost:8080/v1
YAML
# 6. Verify
hermes doctor
echo "Studio is live. Try: hermes 'write hello.txt then summarize it.'"
Staying private, staying in control
An agent that can run shell commands and an uncensored model that never refuses are powerful and need guardrails. Hermes ships them; keep them on.
- Dangerous-command blocking + approvals: destructive patterns (
rm -rf,DROP TABLE) are blocked or require explicit confirmation. - Secret-exfiltration scanning: the runtime flags attempts to leak keys or credentials, important when an abliterated model will follow any instruction it reads.
- MCP hardening: OAuth 2.1 PKCE for connectors plus OSV scanning of MCP servers for known-malicious packages.
- Air-gap checklist: local brain ✓, local media ✓,
web_searchoff or local SearXNG ✓, no telemetry ✓. Then the network cable is optional.
Seven sandbox backends
Hermes ships more sandbox options than any competitor. Pick by trust level + cost.
local
Runs in your shell with the agent's permissions. Fastest. Use for trusted skills.
Docker
Isolated filesystem, network, processes. Use for untrusted MCP servers or third-party skills.
SSH
Run the workload on another machine over SSH. Use for heavy compute on a Mac Studio without leaving the laptop.
Singularity
Container format favoured in HPC. Use on university or research clusters.
Modal
Spins up an ephemeral cloud sandbox per task. Use for elastic burst compute outside the studio.
Daytona
Pre-configured dev environments. Use for reproducible per-project workspaces.
Vercel Sandbox
Vercel's serverless sandbox primitive. Use when the workload should sit close to a web app.
The studio is a starting line, not a finish
Step back and look at the arc. A year ago, the best open model scored 22 and a private studio like this was science fiction. Today the gap to the frontier is single digits, a 27-billion-parameter model that fits a laptop codes like last year's best, and music, voice and images that rivalled paid services now run offline in seconds. The line on that chart is still climbing.
What you have built here is not a cheaper ChatGPT. It is a different relationship with the technology. The models are yours: they do not change under you overnight, they do not log your work, they do not disappear when a subscription lapses or a company pivots. The agent learns your patterns and keeps them. The voice clone, the LoRA of your style, the skills it wrote solving your bugs, none of it leaves the machine. In a market built on renting access to someone else's servers, owning the whole stack outright is the genuinely radical option.
It is not the strongest possible system. The trillion-parameter leaders still live in data centers, local video still trails, and the leaderboard you read today will be wrong next month. But the trajectory is unmistakable: every quarter, more of the frontier becomes something you can hold in 128 GB. The right move is not to wait for the perfect model. It is to build the studio now, learn how the pieces fit, and let it improve underneath you as the open field keeps closing the gap, which it will.
One agent. Every model. Zero cloud. Run it once, and the cloud starts to look like a choice rather than a requirement. That is the whole point, and it is already here.
Straight answers
Is it really free?
Does it genuinely work offline?
What is the catch with uncensored models?
Where is local AI still weak?
Is a Mac actually the right machine?
Will these recommendations age?
Do I need macOS Tahoe 26.4?
sw_vers to check.Does Apple Intelligence conflict with the local studio?
Can I scale beyond 128 GB?
Is FLUX.2 commercially usable?
How fast is the M5 Max actually?
What happens when the model I'm using gets superseded?
config.yaml; the rest of the studio (agent, tools, skills, memory, pipelines) is unchanged. That is the whole point of the architecture: the model is a swappable component, not the system.