VoiceBox

A free, open-source text-to-speech application running entirely on local hardware. No cloud, no subscriptions, no email signup. Available on Mac, Windows, and Linux. Positioned as the open-source alternative to ElevenLabs.

Models

Qwen-3 (0.7B parameters): Higher audio quality, best overall model for voice generation
Chatterbox: Lower audio quality but supports slash-command effects: /laugh, /chuckle, /gasp, /cough, /sigh, /groan, /sniff, /shush, /clear_throat

Key Features

Voice cloning: Record ~30 seconds of audio with reference text → creates a voice profile. Sensitive to audio quality — needs good mic and gain staging.
Sound effects: Built-in presets (robotic, radio, echo chamber, deep voice). Effects are combinations of: low pass, high pass, chorus, reverb, delay, compressor, pitch shifter.
Stories: Folder system for organizing and generating longer-form audio content.
API: Local server with optional network access for automation integration.
Multi-language: Depends on the TTS model selected (Chatterbox Turbo is English-only).

Hardware

Runs on Apple Silicon (M4 Max demonstrated) and other capable machines. Speed depends on local hardware. Models in the 7B range run well.

Rough Edges (as of March 2026)

Custom effect presets can’t be saved (bug)
Voice cloning fails on clipped audio
No input device selector (uses system default)
Chatterbox audio quality noticeably lower than Qwen-3

Security Note

Voice cloning quality is not yet good enough to fool family members, but improving rapidly. Dave Swift recommends setting up verbal passwords with loved ones as a precaution against voice spoofing attacks.

AI For Dev

Explorer

voicebox

VoiceBox

Models

Key Features

Hardware

Rough Edges (as of March 2026)

Security Note

See Also

Graph View

Table of Contents

Backlinks