Voice: Speak and Listen
Want the AI to read responses aloud? Or dictate messages instead of typing? Caiioo offers voice input and output—all configurable, some running locally on your device.

Voice Output (Text-to-Speech)
Have the AI read its responses aloud. Choose from:
| Option | Type | Quality | Setup |
|---|---|---|---|
| Browser Voices | Local | Basic | Free, no setup |
| Kokoro | Local | High | Free, runs on your device |
| Gemini 3.1 Flash TTS | Cloud | Natural | Add OpenRouter API key |
| ElevenLabs | Cloud | Premium | Add your API key |
| Cartesia (Sonic 3.5) | Cloud | Premium | Add your API key |
| Resemble.ai | Cloud | Excellent (voice cloning) | Add your API key |
Kokoro download size: The Kokoro model ships in two variants, and which one downloads depends on your platform. macOS and iOS load the smaller INT8-quantized model (~88 MB), while the extension/browser uses the larger full-precision WebGPU build (~330 MB). It's a one-time download.
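The platform split above can be pictured as a simple lookup. This is an illustrative sketch only, not Caiioo's actual loader: the `KokoroPlatform` type, the `kokoroVariant` function, and the field names are hypothetical, and the sizes are the approximate figures quoted above.

```typescript
// Hypothetical platform labels; only the platforms named above are covered.
type KokoroPlatform = "macos" | "ios" | "browser";

interface KokoroVariant {
  precision: "int8" | "full";
  approxSizeMb: number; // approximate one-time download size
}

// Illustrative lookup: macOS and iOS get the INT8 model, the browser gets the WebGPU build.
function kokoroVariant(platform: KokoroPlatform): KokoroVariant {
  if (platform === "macos" || platform === "ios") {
    return { precision: "int8", approxSizeMb: 88 };
  }
  return { precision: "full", approxSizeMb: 330 };
}
```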
Platform notes:
- iOS native Kokoro (v0.9.720+): Runs in the iOS host process via ONNX Runtime instead of the WebView, which fixes crashes on iPhone 13 and 14.
- macOS Kokoro: Streams sentence by sentence through the desktop helper process, so audio starts within about 1 s of pressing play.
- Gemini TTS (v0.9.723+): Runs through OpenRouter and now plays sentence by sentence, so audio starts after the first sentence instead of waiting for the whole reply to synthesize.
- Cartesia (v0.9.723+): One API key powers both Sonic 3.5 (output) and Ink (input). There's no default voice—pick one in Settings > Voice before you enable it.
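Sentence-by-sentence playback, as described for macOS Kokoro and Gemini TTS, comes down to chunking the reply and handing each chunk to the synthesizer in order. The sketch below shows one way that could look. `splitSentences` and `speakBySentence` are hypothetical names, and the splitter is deliberately naive; it is not how Caiioo necessarily does it.

```typescript
// Naive splitter: a sentence ends at ., !, or ?. It will also split on
// decimals and abbreviations, which a real implementation would need to handle.
function splitSentences(text: string): string[] {
  const parts: string[] = text.match(/[^.!?]+[.!?]*/g) || [];
  return parts.map((p) => p.trim()).filter((p) => p.length > 0);
}

// Hand sentences to `speak` one at a time, in order. In a real player,
// `speak` would synthesize the sentence and queue its audio, so playback can
// begin once the first sentence is ready. Returns the number of sentences.
function speakBySentence(
  text: string,
  speak: (sentence: string, index: number) => void
): number {
  const sentences = splitSentences(text);
  sentences.forEach((sentence, index) => speak(sentence, index));
  return sentences.length;
}
```

The point of the design is latency: the first call to `speak` only needs the first sentence, so its audio can start while later sentences are still being synthesized.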
Playback speed: The speed slider covers 0.5×–2.0×. ElevenLabs and Cartesia apply the speed on the provider side, clamped to 0.7–1.2× and 0.6–1.5× respectively. Browser voices and Kokoro apply the speed locally. Resemble.ai and Gemini have no speed control and always play at normal rate.
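The speed rules can be expressed as a small clamping function. This is a minimal sketch under the ranges stated above; the `TtsProvider` labels and the `effectiveSpeed` function are hypothetical names chosen for illustration.

```typescript
// Hypothetical provider labels for the six text-to-speech options.
type TtsProvider = "browser" | "kokoro" | "gemini" | "elevenlabs" | "cartesia" | "resemble";

// Provider-side limits quoted above; providers without an entry are not clamped further.
const PROVIDER_SPEED_RANGE: Partial<Record<TtsProvider, [number, number]>> = {
  elevenlabs: [0.7, 1.2],
  cartesia: [0.6, 1.5],
};

// Providers with no speed control always play at normal rate.
const FIXED_RATE: TtsProvider[] = ["gemini", "resemble"];

function clampSpeed(value: number, min: number, max: number): number {
  return Math.min(max, Math.max(min, value));
}

// Speed that actually applies for a given slider position.
function effectiveSpeed(provider: TtsProvider, slider: number): number {
  if (FIXED_RATE.indexOf(provider) !== -1) return 1.0;
  const requested = clampSpeed(slider, 0.5, 2.0); // slider range
  const range = PROVIDER_SPEED_RANGE[provider];
  return range ? clampSpeed(requested, range[0], range[1]) : requested;
}
```

For example, a slider at 2.0× yields 1.2× on ElevenLabs and 1.5× on Cartesia, while Kokoro plays at the full 2.0×.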
To enable voice output:
- Go to Settings > Voice
- Pick a text-to-speech option
- Toggle "Auto-read responses" if you want the AI to read automatically
- Adjust playback speed if you like
If playback fails: Voice errors now appear as a toast instead of failing silently. If an API key is missing or invalid, or the chosen voice isn't compatible with the selected model (common with Resemble.ai and Cartesia), the toast tells you what to fix.
Local vs Cloud: Browser voices and Kokoro never send anything off your device. Gemini, ElevenLabs, Cartesia, and Resemble.ai send text to their servers (using your API keys) to generate the audio. See Privacy & Data for details.
Cost tracking: Voice costs (text-to-speech plus speech-to-text) roll up into voice_cost on the conversation, the same as on the one-shot path.
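As a rough illustration of the roll-up, voice_cost can be thought of as the sum of every text-to-speech and speech-to-text charge in a conversation. Only the `voice_cost` name comes from this guide; the `VoiceUsage` record, its fields, the `rollUpVoiceCost` function, and the sample amounts are assumptions made for the example.

```typescript
// Hypothetical usage record; field names other than voice_cost are assumptions.
interface VoiceUsage {
  kind: "tts" | "stt";
  cost: number; // cost of one synthesis or transcription call (assumed 0 for local options)
}

// Sum all voice charges, regardless of direction.
function rollUpVoiceCost(usages: VoiceUsage[]): number {
  return usages.reduce((total, usage) => total + usage.cost, 0);
}

// Example conversation with two cloud calls and one local (free) call.
const conversation = {
  voice_cost: rollUpVoiceCost([
    { kind: "tts", cost: 0.5 },
    { kind: "stt", cost: 0.25 },
    { kind: "tts", cost: 0 },
  ]),
};
```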
Voice Input (Speech-to-Text)
Dictate your messages instead of typing. Click the microphone icon in the composer to start recording. Caiioo transcribes what you say and drops it into the message field.
Choose how it transcribes:
| Option | Type | Privacy | Setup |
|---|---|---|---|
| Whisper (Browser) | Local | Fully private | Free, runs on your device |
| WhisperKit (iOS) | Local | Fully private | Free, on-device |
| whisper.cpp & Moonshine (Android) | Local | Fully private | Free, on-device |
| Browser Speech | Local | Private | Free, built-in |
| ElevenLabs Scribe | Cloud | Audio sent to ElevenLabs; accurate (great for non-English) | Add your ElevenLabs API key |
| Cartesia Ink | Cloud | Audio sent to Cartesia; accurate, low-latency | Add your Cartesia API key |
Local options (Whisper, WhisperKit, whisper.cpp, Moonshine, Browser Speech) keep your audio local—nothing is sent to any server. ElevenLabs and Cartesia send audio to their servers for transcription (using your API key) and offer higher accuracy, especially for non-English languages.
To use it:
- Click the microphone icon in the composer
- Speak your message
- Stop when you're done
- The transcript appears in the message field
- Edit if needed, then send
First-time setup: The first time you use an on-device speech model, it has to download and warm up. The composer shows the progress ("Downloading speech model… N%", then "Preparing"/"Loading"), so a brief pause on your first mic tap is expected, not a hang.
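The first-run states above map naturally onto a small state-to-label function. The sketch below is illustrative: the `SpeechModelState` type and `composerStatus` function are hypothetical, and only the label strings are taken from the description above.

```typescript
// Hypothetical states an on-device speech model passes through on first use.
type SpeechModelState =
  | { phase: "downloading"; percent: number }
  | { phase: "preparing" }
  | { phase: "loading" }
  | { phase: "ready" };

// Label shown in the composer for each state.
function composerStatus(state: SpeechModelState): string {
  if (state.phase === "downloading") {
    // Round and keep the percentage within 0-100.
    const pct = Math.min(100, Math.max(0, Math.round(state.percent)));
    return "Downloading speech model… " + pct + "%";
  }
  if (state.phase === "preparing") return "Preparing";
  if (state.phase === "loading") return "Loading";
  return ""; // ready: nothing to show
}
```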
System-Wide Dictation (macOS)
Pro subscribers on macOS can also install PrivateVoice, a separate companion app that adds a global press-to-talk hotkey for dictating into any application—not just Caiioo. See the desktop download page for details.
See Also
- Privacy & Data — How voice data is handled
- Platform & Setup — Desktop app and PrivateVoice availability
- Settings > Voice — Configure voice options for your setup
This guide is maintained by the Caiioo team using Slate, our built-in editor.