ElevenLabs
ElevenLabs offers two voice cloning tiers that map directly to how much reference audio you have. Instant Voice Cloning trains on a 10-second sample and is ready in seconds — fine for internal narration drafts, prototype dubs, and personal video voiceover. Professional Voice Cloning needs 30+ minutes of clean studio audio, but the resulting voice can carry an entire course or audiobook without drifting. On Martini, both modes feed Eleven v3 (or Multilingual v2 for non-English work), so once your voice is registered you can generate new narration in 70+ languages with inline emotion tags. Critical: only clone voices you own or have explicit written permission to clone. ElevenLabs requires voice verification for your own voice, and consent matters whether the platform enforces it or not.
Before recording or uploading, get explicit written consent from the voice owner. ElevenLabs requires voice verification when you clone your own voice, and Martini follows the same policy. Decide between Instant Voice Cloning (IVC; 10-second sample, ready in seconds, good for drafts and short narration) and Professional Voice Cloning (PVC; 30+ minutes of clean studio audio, ready in 24 hours, the only mode acceptable for long-form courses, audiobooks, or branded narrator voices). If your sample is a 30-second phone recording, IVC is your only option, because PVC will reject low-quality input. If you control the recording session, plan for 30 minutes of varied scripted content with no background noise: that one-time effort buys you a voice that holds up across thousands of generations.
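The tier decision above can be written down as a small checklist function. This is only a sketch: the `Sample` class, its fields, and `choose_tier` are hypothetical names made up for illustration, and the thresholds are simply the 10-second and 30-minute figures quoted in the text, not values read from any ElevenLabs or Martini API.

```python
from dataclasses import dataclass

IVC_MIN_SECONDS = 10        # from the text: 10-second sample
PVC_MIN_SECONDS = 30 * 60   # from the text: 30+ minutes of studio audio


@dataclass
class Sample:
    """Hypothetical description of the reference audio you have on hand."""
    duration_seconds: float
    studio_quality: bool       # your own judgment: no noise, reverb, or hiss
    consent_documented: bool   # written permission (or it is your own voice)


def choose_tier(sample: Sample) -> str:
    """Return "IVC", "PVC", or a "blocked: ..." reason string."""
    if not sample.consent_documented:
        return "blocked: get written consent first"
    if sample.studio_quality and sample.duration_seconds >= PVC_MIN_SECONDS:
        return "PVC"
    if sample.duration_seconds >= IVC_MIN_SECONDS:
        return "IVC"
    return "blocked: record at least 10 seconds"


if __name__ == "__main__":
    # A 30-second phone recording falls through to IVC.
    print(choose_tier(Sample(30, studio_quality=False, consent_documented=True)))
```

Consent is checked first on purpose, so that no amount of good audio can produce a tier recommendation for a voice you are not cleared to clone.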
For Instant Voice Cloning, record 10 seconds of conversational speech in a quiet room — no music bed, no AC hum, no second speaker. Speak the way you want the cloned voice to sound: a measured, narrator pace produces a measured cloned voice; an excited reading produces an excited clone. For Professional Voice Cloning, you need 30+ minutes of studio-grade recordings: vary content (narrative paragraphs, conversational lines, technical readings, emotional ranges) so the model captures your full delivery range. Convert all uploads to 44.1kHz WAV or 320kbps MP3. Audio with hiss, room reverb, lip smacks, or breath pops will train into the clone — you cannot strip those out later.
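Before uploading, it can help to confirm that a file really is 44.1kHz and long enough. The sketch below uses only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; MP3 input would need a different tool. The function name `inspect_wav` and the keys of the returned dictionary are illustrative choices, and the check says nothing about hiss or reverb, which still need a listening pass.

```python
import wave

TARGET_RATE_HZ = 44_100   # from the text: 44.1kHz WAV
IVC_MIN_SECONDS = 10      # from the text: 10-second IVC sample


def inspect_wav(path: str) -> dict:
    """Read header fields from an uncompressed PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    duration = frames / rate  # seconds of audio in the file
    return {
        "sample_rate_hz": rate,
        "channels": channels,
        "duration_seconds": duration,
        "rate_ok": rate == TARGET_RATE_HZ,
        "long_enough_for_ivc": duration >= IVC_MIN_SECONDS,
    }
```

Calling `inspect_wav("my_sample.wav")` (a placeholder filename) returns the header values plus two booleans you can check before uploading.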
Add an Audio node, select ElevenLabs Eleven v3 (or Multilingual v2 for non-English work), and pick your newly cloned voice from the voice picker. Generate a test passage of about 30 seconds that uses sounds your sample didn't cover: an "s"-heavy word, a question intonation, a number sequence, an exclamation. This is where IVC clones fail and PVC clones hold up. If the IVC clone struggles on questions or numbers, that's the trade-off: re-record with more varied content or upgrade to PVC. Once the test passes, the cloned voice is reusable across every Audio node in the project, and you can wire it into Lipsync nodes (OmniHuman, Kling LipSync) for talking-head delivery.
A cloned voice still needs direction. Eleven v3 understands inline tags like [whispers], [laughs], [sighs], [excited], [pause] placed near the words they should affect. For a course intro: "Hi everyone, [excited] welcome to module three!" produces noticeably warmer delivery than the same line without the tag. Keep tags sparse and local — three tags in a 60-second narration is plenty; ten tags fight each other and produce inconsistent reads. Punctuation also drives pacing: ellipses create contemplative pauses, em dashes create sharp transitions, and short sentences read at a faster confident clip than long ones.
Podcast intro with cloned host voice — the [excited] tag bumps energy at the welcome line, the ellipsis sets up a contemplative pause before the show kickoff. Works equally well with IVC for prototype recordings or PVC for the released cut.
Hi everyone, welcome back to the show. [excited] Today we're diving into something I've been waiting weeks to talk about... let's get into it.
Course narration with cloned founder voice — numbered structure gives the cloned voice clear pacing markers, and the [pause] before "Ready?" creates a deliberate moment that mirrors how a real instructor would let the class catch up.
In this module, we'll cover three core concepts. First, we look at how the data flows through the pipeline. Then, we trace each transformation step. Finally, we audit the output for quality. [pause] Ready? Let's start.
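The "keep tags sparse" guidance can be checked mechanically before generating audio. The sketch below counts bracketed tags in a script and compares the count to the three-per-minute figure from the text. Two things are assumptions rather than anything the platform defines: the speaking pace of 150 words per minute used to estimate duration, and the choice to treat any script shorter than a minute as a full minute so that a single tag in a short line is not flagged. `tag_report` is a hypothetical helper name.

```python
import re

TAG_PATTERN = re.compile(r"\[([a-z]+)\]")   # matches tags like [excited], [pause]
WORDS_PER_MINUTE = 150                      # assumption: typical narration pace
MAX_TAGS_PER_MINUTE = 3                     # from the text: three per 60 seconds


def tag_report(script: str) -> dict:
    """Count inline tags and estimate tag density for a narration script."""
    tags = TAG_PATTERN.findall(script)
    spoken_text = TAG_PATTERN.sub(" ", script)       # drop tags before counting words
    word_count = len(spoken_text.split())
    estimated_seconds = word_count / WORDS_PER_MINUTE * 60
    minutes = max(estimated_seconds / 60, 1.0)       # floor short scripts at one minute
    tags_per_minute = len(tags) / minutes
    return {
        "tags": tags,
        "word_count": word_count,
        "estimated_seconds": estimated_seconds,
        "too_dense": tags_per_minute > MAX_TAGS_PER_MINUTE,
    }


if __name__ == "__main__":
    intro = ("Hi everyone, welcome back to the show. [excited] Today we're diving "
             "into something I've been waiting weeks to talk about... let's get into it.")
    print(tag_report(intro))
```

Both example scripts above carry a single tag each, so they pass this check; a script that stacks ten tags into one minute of narration would come back with `too_dense` set to `True`.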
Instant Voice Cloning needs 10 seconds of clean audio and is ready in seconds. Professional Voice Cloning needs 30+ minutes of audio and 24 hours of training time, but produces a voice that holds up across long-form content. There is no middle tier.
For non-English cloned voices, switch the Audio node from Eleven v3 to ElevenLabs Multilingual v2 — both can use the same cloned voice ID, but Multilingual v2 produces more natural prosody outside English.
Voice consent is non-optional. Document written permission for any voice that is not your own, even for internal drafts. ElevenLabs voice verification covers your own voice; permission for others is on you.
A cloned voice is reusable across every Audio node in the workspace and connects to Lipsync nodes (OmniHuman, Kling LipSync) for talking-head delivery.
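The model rule in the notes above (Eleven v3 for English, Multilingual v2 otherwise) is simple enough to encode if you script your narration jobs. This is a sketch only: `pick_model` is a hypothetical helper, the returned strings are the display names used in this guide rather than API identifiers, and the input is assumed to be a language code such as "en-US" or "es".

```python
def pick_model(language_code: str) -> str:
    """Return the model display name suggested by the guide for a language code."""
    # Normalize "EN_gb", "en-US", " en " and similar to the primary subtag "en".
    primary = language_code.strip().lower().replace("_", "-").split("-")[0]
    return "Eleven v3" if primary == "en" else "Multilingual v2"


if __name__ == "__main__":
    for code in ("en-US", "es", "ja-JP"):
        print(code, "->", pick_model(code))
```

Because both models can use the same cloned voice ID, a helper like this only changes the model selection and leaves the voice itself untouched.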
A cloned ElevenLabs voice is the closest you can get to your real voice without re-recording. IVC drafts in seconds and works for everything except marquee content; PVC takes a day to train and is the only mode you should ship to a course or audiobook. The output stays consistent across generations because the voice is registered once — every subsequent Audio node call reuses the same voice ID. Trade-off vs. Fish Audio S2-Pro: ElevenLabs has the broader voice ecosystem and stronger English emotional inflection; Fish Audio S2-Pro has open-source serving and natural-language bracket control. For a creator already running ElevenLabs voiceovers in their pipeline, cloning into the same family keeps everything in one canvas.