Fish Audio
Fish Audio S2-Pro is a text-to-speech model, not a dedicated sound-effects generator: its core job is expressive voice synthesis with bracket cues and multi-speaker dialogue. For pure foley (whoosh transitions, impact stingers, ambient room tone, UI feedback), ElevenLabs Sound Effects v2 is the right tool because it is built for non-vocal sound. Fish Audio S2-Pro plays a complementary role on the same canvas: it handles voice-driven sound design, meaning character vocalizations such as grunts, sighs, gasps, breathing, laughter, and crying, driven by bracket cues like [exhausted sigh], [sharp gasp], [nervous chuckle], and [exhausted breathing]. For a video that needs both foley (door slams, ambient beds) and human-vocal SFX (a character's gasp, a runner's breathing), use ElevenLabs SFX v2 for the foley cues and Fish Audio S2-Pro for the vocal cues, with both attached to the same canvas timeline.
Sort the SFX cues for the video into two buckets before picking the model. Bucket 1 (foley): door slams, glass breaks, footsteps, machinery, weather, ambient beds, UI clicks — these are non-vocal environmental sounds and route to ElevenLabs Sound Effects v2. Bucket 2 (vocal SFX): a character's gasp, a runner's heavy breathing, a frustrated sigh, a startled scream, a tired exhale — these are human-vocal cues and route to Fish Audio S2-Pro using bracket-only prompts inside an Audio node. The split matters because each model is built for its bucket; using SFX v2 for a "frustrated sigh" produces a generic sigh, while Fish Audio with [frustrated exhausted sigh] produces a sigh tied to a specific character voice.
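The two-bucket sort can be sketched in a few lines of Python. This is a minimal illustration, not Martini's or either vendor's API: the `Cue` class, the keyword list, and the model labels are all assumptions made up for the example, and naive keyword matching will misroute edge cases that a human pass should catch.

```python
from dataclasses import dataclass

# Hypothetical labels for this sketch; they are not API identifiers.
FOLEY_MODEL = "ElevenLabs Sound Effects v2"
VOCAL_MODEL = "Fish Audio S2-Pro"

# Illustrative keywords for human-vocal cues, taken from the examples in the text.
VOCAL_KEYWORDS = (
    "gasp", "breath", "sigh", "scream", "exhale",
    "pant", "laugh", "chuckle", "grunt",
)


@dataclass
class Cue:
    description: str      # plain-language cue from the edit notes
    start_seconds: float  # where the cue lands on the timeline


def route_cue(cue: Cue) -> str:
    """Return the model label for a cue: vocal keywords go to bucket 2, everything else to bucket 1."""
    text = cue.description.lower()
    if any(keyword in text for keyword in VOCAL_KEYWORDS):
        return VOCAL_MODEL
    return FOLEY_MODEL


def sort_cues(cues):
    """Group cues by model label, preserving their original order within each bucket."""
    buckets = {FOLEY_MODEL: [], VOCAL_MODEL: []}
    for cue in cues:
        buckets[route_cue(cue)].append(cue)
    return buckets
```

Calling `sort_cues` on a cue sheet before opening the canvas gives one list per model, which makes it easy to see how many Audio nodes each model will need.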
Fish Audio S2-Pro vocal SFX inherit the character of the selected voice. A gasp from a deep male voice (cloned narrator) sounds different from a gasp from a young female voice (prebuilt expressive). Pick the voice first — typically you're reusing a character voice already cast for dialogue in the same scene. For diegetic vocal SFX (a character on screen reacting), use that character's established voice. For off-screen vocal SFX (a generic crowd reaction, an unseen scream), use a different voice or a cloned background voice so it doesn't pull focus from the on-screen character. Voice consent applies here too if you cloned voices.
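The voice-selection rule above is simple enough to write down as a small helper. This is a sketch under stated assumptions: `pick_voice` and the voice names are hypothetical, and the function only encodes the diegetic versus off-screen rule, not anything about how voices are stored in a workspace.

```python
from typing import Optional


def pick_voice(
    diegetic: bool,
    character_voice: str,
    background_voice: Optional[str] = None,
) -> str:
    """Choose a voice for a vocal SFX cue.

    Diegetic cues reuse the on-screen character's voice. Off-screen cues
    must use a different voice so they do not pull focus.
    """
    if diegetic:
        return character_voice
    if background_voice is None or background_voice == character_voice:
        raise ValueError(
            "off-screen cues need a voice distinct from the on-screen character"
        )
    return background_voice
```

Raising an error for off-screen cues without a distinct voice is a design choice for the sketch; it forces the casting decision to be made explicitly rather than silently reusing the lead's voice.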
Vocal SFX prompts are bracket-only, with no spoken words. Examples: `[sharp gasp]`, `[exhausted breathing for 5 seconds]`, `[nervous chuckle]`, `[startled scream then silence]`, `[panting after running, slow recovery]`. The model interprets the bracket as the entire vocal performance, so no surrounding sentence is needed. This differs from Fish Audio's normal dialogue use, where brackets direct the delivery of a spoken line. For a panting-after-running cue, place the bracket-only prompt on its own Audio node, generate, listen, then attach the result to the chase scene's post-action recovery beat on the canvas timeline.
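A quick check can catch prompts that accidentally mix a bracket cue with spoken words before they reach an Audio node. The sketch below is an assumption about how one might lint prompts locally; `is_bracket_only` and `make_vocal_prompt` are hypothetical helpers, and the single-bracket rule is inferred from the examples above rather than from a published grammar.

```python
import re

# Matches a prompt that is exactly one bracketed cue with text inside and
# nothing outside the brackets (surrounding whitespace is tolerated).
BRACKET_ONLY = re.compile(r"^\s*\[([^\[\]]+)\]\s*$")


def is_bracket_only(prompt: str) -> bool:
    """True if the prompt is a single non-empty bracket cue with no spoken words."""
    match = BRACKET_ONLY.match(prompt)
    return bool(match and match.group(1).strip())


def make_vocal_prompt(description: str) -> str:
    """Wrap a plain description in brackets, tolerating input that is already bracketed."""
    cleaned = description.strip().strip("[]").strip()
    if not cleaned or "[" in cleaned or "]" in cleaned:
        raise ValueError("description must be non-empty text without nested brackets")
    return f"[{cleaned}]"
```

For example, `is_bracket_only("[sharp gasp]")` passes, while `is_bracket_only("[sigh] I can't go on")` fails because the spoken line makes it a dialogue prompt rather than a vocal SFX prompt.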
A scene with a chase moment typically needs three SFX layers plus dialogue: an ambient bed (alleyway echo from ElevenLabs SFX v2), foley (running footsteps from SFX v2), vocal SFX (panting recovery from Fish Audio S2-Pro), and then the spoken dialogue that follows (Fish Audio S2-Pro Dialogue mode or ElevenLabs Dialogue v3). Place each on its own Audio node and align it to the timeline. The Martini canvas handles the layering; for final delivery you can export a single audio mix or separate stems for handoff to a mixer. Note: Fish Audio in Martini is currently SEO-positioned; production runtime depends on workspace configuration. If Fish Audio isn't wired up for vocal SFX, ElevenLabs Eleven v3 with bracket-only prompts (e.g., `[gasp]` as a standalone) is a fallback, though tag coverage is narrower.
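The chase-scene layering can be written out as a small plan before building the nodes. This is only a planning sketch: `AudioLayer`, the start times, and the dialogue placeholder are assumptions invented for illustration, and nothing here calls Martini or either model.

```python
from dataclasses import dataclass


@dataclass
class AudioLayer:
    name: str             # layer role, e.g. "ambient bed"
    model: str            # human-readable label, not an API identifier
    prompt: str           # text placed on the Audio node
    start_seconds: float  # illustrative timeline position


def chase_scene_layers():
    """One Audio node per layer; the timings are made-up example values."""
    return [
        AudioLayer("ambient bed", "ElevenLabs Sound Effects v2", "alleyway echo", 0.0),
        AudioLayer("foley", "ElevenLabs Sound Effects v2", "running footsteps", 0.0),
        AudioLayer("vocal SFX", "Fish Audio S2-Pro",
                   "[panting after running, slow recovery]", 6.0),
        # Placeholder text: the actual spoken line comes from the script.
        AudioLayer("dialogue", "Fish Audio S2-Pro", "<spoken line>", 11.0),
    ]


def layers_by_model(layers):
    """Group layer names by model in timeline order (the sort is stable for ties)."""
    grouped = {}
    for layer in sorted(layers, key=lambda item: item.start_seconds):
        grouped.setdefault(layer.model, []).append(layer.name)
    return grouped
```

Grouping by model shows at a glance which nodes depend on Fish Audio being available, which is useful when deciding whether the Eleven v3 fallback is needed for a given scene.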
Vocal SFX prompt for a chase-scene recovery beat, with timing and intensity described inside the bracket. Use the same character voice as that scene's dialogue for diegetic continuity.
[panting after running, heavy chest, slow recovery over 5 seconds]
Reaction SFX for a horror or thriller jump-cut — bracket-only prompt, no surrounding spoken words. Place on the frame the visual reveal lands.
[sharp startled gasp then sudden silence, female voice]
Fish Audio S2-Pro is a TTS model, not a foley generator. Use it only for vocal SFX (gasps, sighs, breathing, laughter, screams) and route door slams, ambient beds, UI sounds to ElevenLabs Sound Effects v2.
Vocal SFX prompts are bracket-only, with no spoken words around the cue. The bracket is the entire performance: [sharp gasp], [exhausted breathing for 5 seconds], [nervous chuckle].
Pick the voice before the cue. Diegetic vocal SFX inherit the character voice; off-screen reactions should use a different voice so they don't pull focus.
Pair Fish Audio vocal SFX with ElevenLabs Sound Effects v2 foley on the same canvas timeline. Each model handles the bucket it's built for — foley vs. vocal — and the canvas keeps all cues aligned.
Voice consent matters for vocal SFX too. If you cloned a voice for character dialogue, the same consent applies to vocal SFX generated with that voice.
Fish Audio S2-Pro is the vocal-SFX complement to ElevenLabs Sound Effects v2 on Martini. Use it for character gasps, breathing, sighs, laughs, and similar human-vocal cues that inherit a chosen character voice; route foley (doors, ambience, footsteps, UI) to SFX v2 on the same canvas. The Martini canvas timeline accepts both models' outputs and aligns them to the video, so a 30-60 second edit can layer ambient bed (SFX v2) + foley (SFX v2) + vocal SFX (Fish Audio) + dialogue (Dialogue v3 or Fish Audio dialogue) without leaving the workspace. For pure foley work, or for English-only projects where polish matters most, use ElevenLabs end-to-end. For multilingual scenes, or when vocal SFX should match a previously cloned character voice, Fish Audio is the right node for the vocal cues specifically.
Connect Fish Audio S2-Pro with other AI models on Martini's infinite canvas. No GPU required — start free.