Fish Audio
Fish Audio S2-Pro is the open-source alternative to ElevenLabs cloning, with two real differentiators: natural-language bracket control inside the prompt (`[whispering]`, `[laughing nervously]`, `[pause]`) and an open serving stack you can self-host. Voice cloning needs a clean reference audio sample plus a matching transcript — Fish Audio uses the transcript text to disambiguate phonemes, so a misaligned transcript hurts cloning quality more than it does on ElevenLabs. Coverage is 80+ languages with automatic detection. Critical: only clone voices you own or have explicit written permission to clone. Fish Audio is open-source, which means consent enforcement is on you, not the platform — make the rights clearance explicit before you upload reference audio.
Document the rights to the voice before you upload. Fish Audio S2-Pro is open-source, so the consent burden sits with you — keep written permission on file for any voice that is not your own. Then prepare the reference: 30+ seconds of clean speech (no background music, no second speaker, 44.1kHz WAV preferred) and a verbatim transcript that exactly matches what was spoken. Fish Audio uses the transcript to align phonemes during cloning; a transcript that says "we will" when the audio actually says "we'll" introduces small alignment errors that compound when you generate longer scripts. If you don't have a transcript, run the audio through an STT model first, then proofread the result.
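The reference checks above (length, sample rate, single speaker) can be automated before upload. A minimal sketch using Python's standard `wave` module; the 30-second and 44.1 kHz thresholds come from the guidance above, and the mono check is a cheap proxy for "no second speaker on a separate channel":

```python
import wave

def check_reference(path: str) -> list[str]:
    """Flag common reference-audio problems before uploading for cloning."""
    issues = []
    with wave.open(path, "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
        if duration < 30:
            issues.append(f"too short: {duration:.1f}s (want 30s+)")
        if wav.getframerate() != 44100:
            issues.append(f"sample rate {wav.getframerate()} Hz (44.1 kHz preferred)")
        if wav.getnchannels() != 1:
            issues.append("not mono: check for a second channel or source")
    return issues
```

An empty list means the file clears the basic bar; it does not check for background music, so still listen to the clip once before uploading.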
Add an Audio node, select Fish Audio S2-Pro from the model picker, and upload the reference audio + transcript pair. The model returns a cloned voice ID that you can reuse across every Audio node in the project. Note: Fish Audio in Martini is currently positioned as an SEO-comparison surface — the model exists in Martini's registry alongside ElevenLabs, MiniMax, and others, but production runtime integration depends on your workspace's configuration. If your workspace doesn't have Fish Audio runtime wired up, fall back to ElevenLabs for production, and use Fish Audio for self-hosted experimentation outside Martini.
Fish Audio S2-Pro's biggest differentiator is open-ended bracket control. Instead of a fixed list of tokens, you can write [whispering sweetly], [laughing nervously], [pause for two seconds], [angry whisper] — natural language inside brackets, placed at the exact word where delivery should change. Keep the bracket cues local: a tag near a word affects that word and the immediate phrase, not the rest of the paragraph. Compare this to ElevenLabs' fixed inline tag set ([whispers], [laughs], [excited]): Fish Audio gives you wider expressive range at the cost of slightly less predictable interpretation. Test 2-3 deliveries before committing to a long-form script.
For high-stakes content, place a Fish Audio S2-Pro Audio node and an ElevenLabs Eleven v3 Audio node side by side on the canvas, both reading the same script with cloned voices. The comparison is usually revealing: ElevenLabs is more polished and confident in English emotional delivery, while Fish Audio handles open-ended emotion tags more flexibly and covers more languages without a separate model swap. Many production teams use Fish Audio for prototype recordings and multilingual experiments, then ship the released cut on ElevenLabs. The shared canvas means you don't have to leave Martini to run the comparison.
Course intro with cloned host voice — bracket cues are placed exactly where delivery should shift. Compare against the same line on ElevenLabs Eleven v3 (fixed tags) to see Fish Audio's wider expressive range.
[calmly] Welcome to the studio. [pause] Today, [thoughtfully] we're going to look at three patterns that show up again and again — [emphasizing] every single time — in successful product launches.
Multi-speaker dialogue using Fish Audio S2-Pro's speaker tags. Each speaker can be a different cloned voice, all in one Audio node. The [curious] cue inside the host turn directs delivery on that line only.
[Speaker:Host] Hey, thanks for joining us. [Speaker:Guest] Happy to be here. [Speaker:Host] [curious] So tell me — what made you start this project?
Reference audio + transcript pairing is the single biggest quality lever. A misaligned transcript ("we will" when the audio says "we'll") accumulates phoneme errors across long generations.
Bracket cues are open-ended natural language: [whispering sweetly], [laughing nervously], [pause for two seconds]. Local placement (right next to the word) gives the most predictable result.
Fish Audio covers 80+ languages with automatic detection — no need to switch model variants between English, Mandarin, Japanese, or Spanish on the same cloned voice.
Open-source serving means you can self-host outside Martini if needed — useful for sensitive voice content where you don't want the audio leaving your infrastructure.
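The transcript-alignment tip above can be checked mechanically: diff your transcript against an STT pass of the reference audio and review every word span where they disagree. A sketch using Python's standard `difflib`; the sentence pair here is a placeholder, so plug in your own transcript and STT output:

```python
import difflib

def transcript_mismatches(transcript: str, stt_text: str) -> list[tuple[str, str]]:
    """Return (transcript, heard) word-span pairs where the two disagree."""
    a, b = transcript.split(), stt_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]

# The transcript says "we will" but the audio (per STT) says "we'll".
print(transcript_mismatches(
    "next week we will ship the update",
    "next week we'll ship the update",
))
```

Every pair this surfaces is a decision point: either the transcript is wrong and should match the audio verbatim, or the STT misheard and you can ignore it.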
Prompts written for the legacy S1 model use parenthesis emotion syntax; S2-Pro uses brackets. Rewrite old (excited) cues as [excited] to work with the new model.
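For a whole prompt library, the S1-to-S2-Pro migration can be scripted. A sketch using an explicit cue allowlist so ordinary parenthetical text is left alone; the cue list is illustrative, so extend it to match whatever cue words your S1 prompts actually use:

```python
import re

# Known emotion cues to migrate (illustrative; extend for your prompt library).
S1_CUES = r"(excited|calmly|whispering|angry|sad|curious)"

def migrate_s1_prompt(prompt: str) -> str:
    """Rewrite S1 parenthesis emotion cues, e.g. (excited), as S2-Pro brackets."""
    return re.sub(rf"\({S1_CUES}\)", r"[\1]", prompt)

print(migrate_s1_prompt("(excited) Big news today! (calmly) Let's dig in."))
# [excited] Big news today! [calmly] Let's dig in.
```

The allowlist matters: a blanket parenthesis-to-bracket replacement would also mangle normal asides like "(see below)", so only listed cue words are converted.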
Fish Audio S2-Pro voice cloning is the open-source choice — wider language coverage, more flexible emotion control via natural-language brackets, and self-hostable infrastructure. Trade-off vs. ElevenLabs: less polished English emotional delivery, slightly less predictable interpretation of bracket cues, and the consent burden sits with you (no platform-level voice verification). For production-grade English narration on a single voice, ElevenLabs Eleven v3 is the safer bet. For multilingual projects, voice prototyping, self-hosted deployments, or workflows that need natural-language emotion tags beyond a fixed set, Fish Audio S2-Pro is worth the trade. The Martini canvas lets you A/B both on the same script without switching tools.
Connect Fish Audio S2-Pro with other AI models on Martini's infinite canvas. No GPU required — start free.
Get Started Free