Audio Node
Audio nodes are used to generate speech, sound effects, music, and other audio content, adding an auditory dimension to your creations.
Generation Modes
Audio nodes support multiple audio generation types:
| Mode | Description | Input |
|---|---|---|
| Text to Speech | Convert text to speech | Text content (input within node) |
| Sound Effects | Generate sound effects | Text description (input within node) |
| Music | Generate background music | Lyrics/style description (input within node) |
| Voice Design | Custom voice creation | Voice feature description |
| Video to Audio | Video audio/sound effects | Video (requires Video node connection) |
⚠️ Important: Audio nodes do not accept connections from Text nodes. Text content is input directly within the Audio node.
Basic Usage
🗣️ Text to Speech
Convert text content into natural, fluent speech.
Steps:
- Add an Audio node
- Select Speech type at the top of the node
- Select a TTS model (e.g., OpenAI TTS-1-HD, Minimax Speech 2.5)
- Select a voice
- Enter the text to be read in the input field
- Click Generate
Text example:
Welcome to Martini, a powerful AI creative workflow platform. Here, you can easily generate images, videos, and audio content using drag-and-drop nodes.
Use cases:
- Video narration
- Audiobooks
- Educational explanations
- Product introductions
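For reference, the inputs the Text to Speech node collects map closely onto OpenAI's `audio/speech` request body. The sketch below validates those inputs and builds such a payload; `build_tts_request` is a hypothetical helper for illustration (Martini sets these fields through the node UI, not through code):

```python
# Hypothetical helper: validate TTS inputs and shape them like an
# OpenAI audio/speech request body. Not part of any official SDK.

VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

def build_tts_request(text: str, voice: str = "alloy", speed: float = 1.0) -> dict:
    """Return a payload mirroring the fields set in the Audio node."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if not 0.5 <= speed <= 2.0:  # the range exposed by the Speed slider
        raise ValueError("speed must be between 0.5 and 2.0")
    return {"model": "tts-1-hd", "input": text, "voice": voice, "speed": speed}

payload = build_tts_request(
    "Welcome to Martini, a powerful AI creative workflow platform.",
    voice="nova",
)
```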
🎵 Generate Sound Effects
Generate realistic sound effects based on descriptions.
Steps:
- Add an Audio node
- Select Sound Effects type
- Select a model (recommended: ElevenLabs Sound Effects v2)
- Enter sound effect description (English works better)
- Set duration
- Click Generate
Sound effect description examples:
| Scene | Prompt |
|---|---|
| Natural environment | Ocean waves crashing on a rocky shore |
| Urban scene | Busy city street with car horns and chatter |
| Action sound | Sword swoosh and metal clang |
| Ambient sound | Eerie wind blowing through abandoned building |
Parameters:
- Duration: 0.5-22 seconds
- Prompt Influence: How closely the output follows your description (higher = more literal)
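The duration limits above can be enforced before submitting a request. A minimal sketch, assuming you want out-of-range values snapped into the supported window (`clamp_duration` is a hypothetical helper, not part of any SDK):

```python
# Hypothetical helper: clamp a requested sound-effect duration to the
# 0.5-22 second range supported by ElevenLabs Sound Effects.

MIN_SECONDS, MAX_SECONDS = 0.5, 22.0

def clamp_duration(seconds: float) -> float:
    """Snap an out-of-range duration into the supported window."""
    return max(MIN_SECONDS, min(MAX_SECONDS, seconds))
```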
🎼 Generate Music
Create background music or complete songs.
Steps:
- Add an Audio node
- Select Music type
- Select a model (Suno V5 or Minimax Music v1.5)
- Enter lyrics or style description
- Set style tags (Genre, Mood)
- Click Generate
Music description examples:
| Type | Prompt |
|---|---|
| Instrumental | Upbeat electronic dance music, energetic, modern synths |
| With lyrics | Enter complete lyrics and the AI will compose and sing them |
| Film score | Epic orchestral score, dramatic strings, cinematic |
| Ambient music | Ambient meditation music, peaceful, soft piano |
Suno advanced parameters:
- Style: Music style (Pop, Rock, Classical, etc.)
- Mood: Emotion (Happy, Sad, Energetic)
- Instrumental: Pure music (no vocals)
- Vocal Gender: Male / Female
- Weirdness: Creativity level
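The style tags above can also be combined into a single prompt string. A sketch, where `compose_music_prompt` is a hypothetical helper (in Martini these fields are set in the node's parameter panel):

```python
# Hypothetical helper: compose a music prompt from style/mood tags
# like the Suno parameters listed above.

def compose_music_prompt(style: str, mood: str, instrumental: bool = False) -> str:
    parts = [style, mood]
    if instrumental:
        parts.append("instrumental, no vocals")
    return ", ".join(p for p in parts if p)

prompt = compose_music_prompt("Pop", "Energetic", instrumental=True)
# → "Pop, Energetic, instrumental, no vocals"
```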
🎨 Voice Design
Create unique AI voices.
Steps:
- Add an Audio node
- Select Voice Design type
- Select Minimax Voice Design model
- Describe desired voice characteristics (gender, age, tone)
- Enter preview text
- Click Generate
Voice description example:
A young female voice, warm and friendly, slightly husky, British accent
Uses:
- Create brand-specific voices
- Character voice design
- Diversified TTS voices
🎬 Video to Audio
Generate matching sound effects or background music based on video content.
Steps:
- Prepare a Video node
- Add an Audio node
- Connect Video → Audio
- Select Video to Audio type
- Select Mirelo SFX V1 model
- Set number of samples to generate (2-8)
- Click Generate
Workflow:
Video → Audio (Mirelo SFX V1)
Features:
- AI analyzes video content
- Generates 2-8 different sound effect variants
- You can choose the most suitable one
Use cases:
- Add sound effects to silent videos
- Generate background music
- Create sound effect libraries
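The 2-8 sample constraint from the steps above can be checked up front. A minimal sketch (`validate_sample_count` is a hypothetical helper mirroring the node's input limit):

```python
# Hypothetical helper: Mirelo SFX V1 generates between 2 and 8
# candidate tracks, so reject counts outside that range early.

def validate_sample_count(n: int) -> int:
    if not 2 <= n <= 8:
        raise ValueError("sample count must be between 2 and 8")
    return n
```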
Model Selection Guide
(To be added)
Parameters
Voice
TTS models provide multiple preset voices.
OpenAI TTS-1-HD voices:
- Alloy: Neutral, balanced
- Echo: Male, steady
- Fable: British accent, narrative feel
- Onyx: Male, deep
- Nova: Female, energetic
- Shimmer: Female, gentle
Minimax Speech 2.5 voices:
- Supports multiple Chinese voices
- Supports emotion control (happy, sad, angry, etc.)
Speed
Adjust playback speed of speech (TTS mode).
| Speed | Description |
|---|---|
| 0.5x | Very slow, suitable for teaching |
| 1.0x | Normal speed (recommended) |
| 1.5x | Fast, suitable for fast-paced content |
| 2.0x | Very fast |
Duration
Set the length of sound effects (Sound Effects mode).
ElevenLabs Sound Effects:
- Minimum: 0.5 seconds
- Maximum: 22 seconds
- Recommendation: Set based on actual needs
Emotion
Minimax Speech 2.5 supports emotion control.
| Emotion | Applicable Scenario |
|---|---|
| Neutral | Objective narration, news broadcast |
| Happy | Cheerful content, advertisements |
| Sad | Sad scenes, drama |
| Angry | Conflict scenes |
| Surprised | Surprise, amazement |
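The scenario-to-emotion table above can be expressed as a simple lookup. The scenario labels below are illustrative, not platform identifiers:

```python
# Illustrative mapping from content type to the Minimax Speech 2.5
# emotion tags listed above; unknown scenarios fall back to neutral.

EMOTION_FOR_SCENARIO = {
    "news": "neutral",
    "advertisement": "happy",
    "drama": "sad",
    "conflict": "angry",
    "reveal": "surprised",
}

def pick_emotion(scenario: str) -> str:
    return EMOTION_FOR_SCENARIO.get(scenario, "neutral")
```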
Connection Rules
Audio nodes can receive from:
| Upstream Node | Function | Mode |
|---|---|---|
| Video | Provide video content | Video to Audio |
| Image | Provide visual reference (some models) | Description generation |
⚠️ Audio nodes do not accept Text node connections. Please enter text content directly within the Audio node.
Audio nodes can connect to:
| Downstream Node | Function |
|---|---|
| Video | Serve as audio input for digital humans |
Workflow Examples
🎙️ Video Audio
Video (silent video) → Audio (Mirelo) → Generate multiple sound effect options
🗣️ Digital Human Voice
Audio (TTS) ────┐
                ├→ Video (Kling AI Avatar)
Image (person) ─┘
First generate speech with the Audio node, then connect it to the Video node to create a digital human.
🎬 Complete Short Video Production
Text → Image → Video (silent video)
Generate separately:
- Audio (TTS) → Narration
- Audio (Sound Effects) → Background sound effects
- Audio (Music) → Background music
Finally, composite in editing software.
Upload Audio (as Starting Point)
You can upload local audio files to Audio nodes:
Method:
- Drag and drop audio file onto canvas
- Or click upload area within Audio node
Supported formats: MP3, WAV, M4A
Uses:
- Serve as input for Video nodes (create digital humans)
- Material for audio editing
- Export to other tools
Common Questions
❓ Why can't Audio nodes connect to Text nodes?
This is a deliberate design decision: text for Audio nodes is entered directly within the node, which avoids extra connection complexity.
Correct usage:
- ❌ Text → Audio (not supported)
- ✅ Enter text directly in Audio node
❓ How to choose the right TTS voice?
Recommended process:
- First listen to previews of all voices
- Choose based on content:
- Serious content: Choose steady voices (Onyx, Echo)
- Casual content: Choose lively voices (Nova, Shimmer)
- Narrative content: Choose voices with storytelling feel (Fable, Ballad)
- Can switch voices and regenerate if not satisfied
❓ Can music generation specify specific genres?
Yes! In Suno model:
Method 1: Use style tags
- Select from Style dropdown (Pop, Rock, Jazz, etc.)
Method 2: Describe in Prompt
80s synth-pop with retro drum machines, nostalgic melody, upbeat tempo
❓ What if generated sound effect is too short/long?
Adjust Duration parameter:
- Sound Effects mode supports 0.5-22 seconds
- Directly adjust duration in parameter panel
If longer sound effects needed:
- Generate multiple sound effect segments
- Splice together in external audio editing software
❓ How to choose from multiple Video to Audio sound effects?
- Mirelo generates 2-8 sound effect variants
- Click audio waveforms within node to switch and preview
- Select the most satisfactory one
- Download it or connect it to a downstream node
❓ Can generated speech speed be adjusted?
Yes! In TTS parameters:
- Find the Speed slider
- Adjust to 0.5x-2.0x
- Regenerate
Advanced Features
Minimax Emotion Control
Minimax Speech 2.5 supports fine-grained emotion and tone control.
Adjustable parameters:
- Emotion: Happy, Sad, Angry, Surprised, etc.
- Speed: Speaking rate
- Pitch: Voice pitch
- Volume: Loudness
Suitable for:
- Audiobooks (require rich emotional expression)
- Drama dubbing
- Advertisement videos
ElevenLabs Context Awareness
ElevenLabs TTS supports context input to improve naturalness.
Usage:
- Enter previous text in Previous Text
- Enter following text in Next Text
- Current text will adjust tone based on context
Suitable for:
- Long-form reading (consistent tone between chapters)
- Dialogue scenes
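The context fields above can be sketched as a request body. The field names `previous_text` / `next_text` follow ElevenLabs' text-to-speech API, but treat the exact schema as an assumption and check the current API reference:

```python
# Hypothetical helper: shape a context-aware TTS request so the current
# sentence's tone can follow what comes before and after it.

def build_contextual_request(text: str, previous_text: str = "", next_text: str = "") -> dict:
    body = {"text": text}
    if previous_text:
        body["previous_text"] = previous_text
    if next_text:
        body["next_text"] = next_text
    return body

body = build_contextual_request(
    "He opened the door slowly.",
    previous_text="The hallway was silent.",
    next_text="Nothing could have prepared him for what he saw.",
)
```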
Suno Custom Mode
Suno V5 supports highly customizable music generation.
Parameter control:
- Style Weight: Style intensity
- Weirdness Constraint: Creativity level
- Audio Weight: Melody weight
Suitable for:
- Experimental music
- Precise control of specific styles
Operation Buttons and Features
Generate
Click to start generating audio.
Generation time:
- TTS: 5-15 seconds
- Sound Effects: 10-30 seconds
- Music: 30-120 seconds
Play
Click play button to preview audio.
Features:
- Play/pause
- Volume adjustment
- Loop playback
Download
Download generated audio file.
Formats:
- TTS: MP3
- Sound Effects: WAV/MP3
- Music: MP3
Workflow Examples
🎙️ Create Audio Content
1. Audio (TTS) → Generate narration
2. Audio (Sound Effects) → Generate background sound effects
3. Audio (Music) → Generate background music
Mix and composite in external audio software (e.g., Audacity).
🎬 Complete Video Audio Workflow
Text → Image → Video (silent video)
Audio (TTS) → Narration audio
Audio (Video to Audio) ← Video → Generate ambient sound effects
Composite in editing software.
🗣️ Digital Human Video
Audio (TTS, generate speech) ──┐
                               ├→ Video (Kling AI Avatar)
Image (person photo) ──────────┘
See Video Node - Digital Human
Common Usage Tips
📝 TTS Text Optimization
Punctuation affects pauses:
- Comma (,) = brief pause
- Period (.) = clear pause
- Question mark (?) = rising intonation
- Exclamation mark (!) = emphasis
Numbers and symbols:
- Write "one hundred" instead of "100" (unless you want "one zero zero")
- Write "first" instead of "1st"
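The number tips above can be sketched as a tiny pre-TTS normalizer. Real pipelines use a full number-to-words library; this toy mapping only covers the examples given:

```python
import re

# Toy replacement table covering only the examples from the tips above;
# a real normalizer would use a number-to-words library.
REPLACEMENTS = {
    r"\b100\b": "one hundred",
    r"\b1st\b": "first",
}

def normalize_for_tts(text: str) -> str:
    for pattern, words in REPLACEMENTS.items():
        text = re.sub(pattern, words, text)
    return text

print(normalize_for_tts("The 1st prize is 100 dollars."))
# The first prize is one hundred dollars.
```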
🎵 Sound Effect Generation Tips
Specific descriptions:
- ❌ water sound
- ✅ Heavy rain pouring on a tin roof
Add environment and distance:
- Close-up microphone of a crackling fireplace
- Distant thunder rolling across hills
🎼 Music Generation Tips
Structured description:
- Genre: Pop, Rock, Jazz, Classical
- Instruments: Piano, Guitar, Synth, Orchestra
- Mood: Upbeat, Melancholic, Epic
- Tempo: Fast, Slow, Moderate tempo
Example:
Acoustic folk song with gentle guitar strumming, warm male vocals, introspective lyrics, slow tempo, indie style
Common Questions
❓ Which languages does TTS support?
| Model | Supported Languages |
|---|---|
| OpenAI TTS | Chinese, English, and many other languages |
| Minimax Speech 2.5 | Chinese, English (better Chinese results) |
| LMNT | English |
| Hume | English |
❓ Are there copyright issues with generated music?
AI-generated music typically belongs to you for personal use, but it's recommended to:
- Check platform terms of use before commercial use
- Use original music for critical projects to be safe
❓ What if sound effect generation results are poor?
Optimization methods:
- Use English descriptions: AI understands English more accurately
- Add details: Describe sound effect texture, distance, environment
- Adjust Duration: Sound effect length should be reasonable
- Generate multiple times: Select best result
❓ How to add multiple audio layers to video?
Martini generates single-layer audio; multi-layer mixing requires external tools.
Recommended workflow:
- Generate separately in Martini:
- Narration (TTS)
- Sound effects (Sound Effects)
- Background music (Music)
- Export all audio
- Mix in Audacity / Premiere / Final Cut
- Composite final audio with video
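For the mixing step, ffmpeg's `amix` filter can combine the exported stems on the command line. The sketch below only builds the command; the filenames are placeholders for whatever you exported from Martini:

```python
# Build an ffmpeg command that mixes several exported audio stems into
# one file using the amix filter. Filenames are placeholders.

def ffmpeg_mix_command(stems: list[str], output: str) -> list[str]:
    return [
        "ffmpeg",
        *[arg for stem in stems for arg in ("-i", stem)],
        "-filter_complex", f"amix=inputs={len(stems)}:duration=longest",
        output,
    ]

cmd = ffmpeg_mix_command(["narration.mp3", "sfx.wav", "music.mp3"], "final_mix.mp3")
# run with subprocess.run(cmd, check=True) once the files exist
```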
Performance Optimization Recommendations
(To be added)
Next Steps
- Video Node — Combine audio to create digital human videos
- Workflow Examples — Complete audio-video workflows
