Fish Audio
Fish Audio S2-Pro is Fish Audio's next-generation expressive text-to-speech model for natural voice generation, open-ended emotion tags, multi-speaker dialogue, voice cloning, and 80+ language workflows.
Fish Audio currently recommends s2-pro for new projects. S2-Pro adds natural-language bracket control such as [whispers sweetly] or [laughing nervously], supports multi-speaker dialogue, covers 80+ languages, targets a 100 ms time-to-first-audio, and ships with an open-source SGLang-based serving stack. The previous s1 model remains available for existing integrations and is still useful when a workflow depends on its parenthesis-based emotion syntax. On this Martini page, Fish Audio is presented as an expressive voice foundation model for comparison with ElevenLabs, Minimax Speech, and other TTS systems; it is not wired into Martini's production generation menu unless a runtime integration is added separately.
| Variant | Description |
|---|---|
| Fish Audio S2-Pro | Recommended current model with bracket-style natural language control, multi-speaker dialogue, 80+ languages, and open-source serving. |
| Fish Audio S1 | Previous 4B-parameter model with parenthesis-based emotional control, kept for existing integrations. |
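The practical difference between the two variants is the delimiter around an expressive cue: brackets for S2-Pro, parentheses for S1. The sketch below shows one way a script could format the same line for either model. The helper name `tag_line`, the choice to put the cue before the text, and the use of `s2-pro` and `s1` as selector strings are illustrative assumptions, not part of any Fish Audio SDK. The helper only builds strings and makes no API calls.

```python
# Hypothetical helper: wraps an expressive cue in the delimiter style
# described for each model family (brackets for S2-Pro, parentheses for S1).
# Cue placement before the text is an assumption made for illustration.

DELIMITERS = {
    "s2-pro": ("[", "]"),  # bracket-style natural-language control
    "s1": ("(", ")"),      # parenthesis-based emotion syntax
}


def tag_line(text: str, cue: str, model: str = "s2-pro") -> str:
    """Return `text` prefixed with `cue` in the given model's delimiter style."""
    if model not in DELIMITERS:
        raise ValueError(f"unknown model: {model!r}")
    opener, closer = DELIMITERS[model]
    return f"{opener}{cue}{closer} {text}"


if __name__ == "__main__":
    print(tag_line("Come a little closer.", "whispers sweetly"))
    print(tag_line("Come a little closer.", "whispers sweetly", model="s1"))
```

Keeping the delimiter choice in one lookup table means an existing S1 script can be retargeted to S2-Pro by changing a single argument. Whether a given cue is understood by each model still needs to be checked against Fish Audio's own documentation.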
Connect Fish Audio S2 with video, image, script, and music nodes on Martini's infinite canvas. No GPU required — start free.
Fish Audio currently recommends s2-pro for new projects. It adds natural-language bracket control, multi-speaker dialogue, 80+ languages, low time-to-first-audio, and an open-source serving stack. S1 remains available for existing integrations.
This page is an SEO and comparison page. Fish Audio is not added to Martini's production audio generation menu by this change; that would require a separate runtime provider integration, billing, UI controls, and webhook handling.
Fish Audio S2 emphasizes open-source serving, flexible bracket-style control, and self-hosting options. ElevenLabs emphasizes a mature hosted voice ecosystem, Eleven v3 expressiveness, Multilingual v2 stability, and text-to-sound effects.