Audio Node
Audio nodes are used to generate speech, sound effects, music, and other audio content, adding an auditory dimension to your creations.
Generation Modes
Audio nodes support multiple audio generation types:
| Mode | Description | Input |
|---|---|---|
| Text to Speech | Convert text to speech | Text content (input within node) |
| Sound Effects | Generate sound effects | Text description (input within node) |
| Music | Generate background music | Lyrics/style description (input within node) |
| Voice Design | Custom voice creation | Voice feature description |
| Video to Audio | Video audio/sound effects | Video (requires Video node connection) |
⚠️ Important: Audio nodes do not accept connections from Text nodes. Text content is input directly within the Audio node.
Basic Usage
🗣️ Text to Speech
Convert text content into natural, fluent speech.
Steps:
- Add an Audio node
- Select Speech type at the top of the node
- Select a TTS model (e.g., OpenAI TTS-1-HD, Minimax Speech 2.5)
- Select a voice
- Enter the text to be read in the input field
- Click Generate
Text example:
Welcome to Martini, a powerful AI creative workflow platform. Here, you can easily generate images, videos, and audio content using drag-and-drop nodes.
Use cases:
- Video narration
- Audiobooks
- Educational explanations
- Product introductions
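For reference, the inputs the Text to Speech node collects map closely onto OpenAI's `audio/speech` request body. The sketch below validates those inputs and builds such a payload; `build_tts_request` is a hypothetical helper for illustration (Martini sets these fields through the node UI, not through code):

```python
# Hypothetical helper: validate TTS inputs and shape them like an
# OpenAI audio/speech request body. Not part of any official SDK.

VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}

def build_tts_request(text: str, voice: str = "alloy", speed: float = 1.0) -> dict:
    """Return a payload mirroring the fields set in the Audio node."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if voice not in VOICES:
        raise ValueError(f"unknown voice: {voice!r}")
    if not 0.5 <= speed <= 2.0:  # the range exposed by the Speed slider
        raise ValueError("speed must be between 0.5 and 2.0")
    return {"model": "tts-1-hd", "input": text, "voice": voice, "speed": speed}

payload = build_tts_request(
    "Welcome to Martini, a powerful AI creative workflow platform.",
    voice="nova",
)
```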
🎵 Generate Sound Effects
Generate realistic sound effects based on descriptions.
Steps:
- Add an Audio node
- Select Sound Effects type
- Select a model (recommended: ElevenLabs Sound Effects v2)
- Enter sound effect description (English works better)
- Set duration
- Click Generate
Sound effect description examples:
| Scene | Prompt |
|---|---|
| Natural environment | Ocean waves crashing on a rocky shore |
| Urban scene | Busy city street with car horns and chatter |
| Action sound | Sword swoosh and metal clang |
| Ambient sound | Eerie wind blowing through abandoned building |
Parameters:
- Duration: 0.5-22 seconds
- Prompt Influence: How closely the output follows your description (higher = more literal)
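The duration limits above can be enforced before submitting a request. A minimal sketch, assuming you want out-of-range values snapped into the supported window (`clamp_duration` is a hypothetical helper, not part of any SDK):

```python
# Hypothetical helper: clamp a requested sound-effect duration to the
# 0.5-22 second range supported by ElevenLabs Sound Effects.

MIN_SECONDS, MAX_SECONDS = 0.5, 22.0

def clamp_duration(seconds: float) -> float:
    """Snap an out-of-range duration into the supported window."""
    return max(MIN_SECONDS, min(MAX_SECONDS, seconds))
```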
🎼 Generate Music
Create background music or complete songs.
Steps:
- Add an Audio node
- Select Music type
- Select a model (Suno V5 or Minimax Music v1.5)
- Enter lyrics or style description
- Set style tags (Genre, Mood)
- Click Generate
Music description examples:
| Type | Prompt |
|---|---|
| Instrumental | Upbeat electronic dance music, energetic, modern synths |
| With lyrics | Enter complete lyrics and the AI will compose and sing them |
| Film score | Epic orchestral score, dramatic strings, cinematic |
| Ambient music | Ambient meditation music, peaceful, soft piano |
Suno advanced parameters:
- Style: Music style (Pop, Rock, Classical, etc.)
- Mood: Emotion (Happy, Sad, Energetic)
- Instrumental: Pure music (no vocals)
- Vocal Gender: Male / Female
- Weirdness: Creativity level
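The style tags above can also be combined into a single prompt string. A sketch, where `compose_music_prompt` is a hypothetical helper (in Martini these fields are set in the node's parameter panel):

```python
# Hypothetical helper: compose a music prompt from style/mood tags
# like the Suno parameters listed above.

def compose_music_prompt(style: str, mood: str, instrumental: bool = False) -> str:
    parts = [style, mood]
    if instrumental:
        parts.append("instrumental, no vocals")
    return ", ".join(p for p in parts if p)

prompt = compose_music_prompt("Pop", "Energetic", instrumental=True)
# → "Pop, Energetic, instrumental, no vocals"
```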
🎨 Voice Design
Create unique AI voices.
Steps:
- Add an Audio node
- Select Voice Design type
- Select Minimax Voice Design model
- Describe desired voice characteristics (gender, age, tone)
- Enter preview text
- Click Generate
Voice description example:
A young female voice, warm and friendly, slightly husky, British accent
Uses:
- Create brand-specific voices
- Character voice design
- Diversified TTS voices
🎬 Video to Audio
Generate matching sound effects or background music based on video content.
Steps:
- Prepare a Video node
- Add an Audio node
- Connect Video → Audio
- Select Video to Audio type
- Select Mirelo SFX V1 model
- Set number of samples to generate (2-8)
- Click Generate
Workflow:
Video → Audio (Mirelo SFX V1)
Features:
- AI analyzes video content
- Generates 2-8 different sound effect variants
- You can choose the most suitable one
Use cases:
- Add sound effects to silent videos
- Generate background music
- Create sound effect libraries
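The 2-8 sample constraint from the steps above can be checked up front. A minimal sketch (`validate_sample_count` is a hypothetical helper mirroring the node's input limit):

```python
# Hypothetical helper: Mirelo SFX V1 generates between 2 and 8
# candidate tracks, so reject counts outside that range early.

def validate_sample_count(n: int) -> int:
    if not 2 <= n <= 8:
        raise ValueError("sample count must be between 2 and 8")
    return n
```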
Model Selection Guide
(To be added)
Parameters
Voice
TTS models provide multiple preset voices.
OpenAI TTS-1-HD voices:
- Alloy: Neutral, balanced
- Echo: Male, steady
- Fable: British accent, narrative feel
- Onyx: Male, deep
- Nova: Female, energetic
- Shimmer: Female, gentle
Minimax Speech 2.5 voices:
- Supports multiple Chinese voices
- Supports emotion control (happy, sad, angry, etc.)
Speed
Adjust playback speed of speech (TTS mode).
| Speed | Description |
|---|---|
| 0.5x | Very slow, suitable for teaching |
| 1.0x | Normal speed (recommended) |
| 1.5x | Fast, suitable for fast-paced content |
| 2.0x | Very fast |
Duration
Set the length of sound effects (Sound Effects mode).
ElevenLabs Sound Effects:
- Minimum: 0.5 seconds
- Maximum: 22 seconds
- Recommendation: Set based on actual needs
Emotion
Minimax Speech 2.5 supports emotion control.
| Emotion | Applicable Scenario |
|---|---|
| Neutral | Objective narration, news broadcast |
| Happy | Cheerful content, advertisements |
| Sad | Sad scenes, drama |
| Angry | Conflict scenes |
| Surprised | Surprise, amazement |
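The scenario-to-emotion table above can be expressed as a simple lookup. The scenario labels below are illustrative, not platform identifiers:

```python
# Illustrative mapping from content type to the Minimax Speech 2.5
# emotion tags listed above; unknown scenarios fall back to neutral.

EMOTION_FOR_SCENARIO = {
    "news": "neutral",
    "advertisement": "happy",
    "drama": "sad",
    "conflict": "angry",
    "reveal": "surprised",
}

def pick_emotion(scenario: str) -> str:
    return EMOTION_FOR_SCENARIO.get(scenario, "neutral")
```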
Connection Rules
Audio nodes can receive from:
| Upstream Node | Function | Mode |
|---|---|---|
| Video | Provide video content | Video to Audio |
| Image | Provide visual reference (some models) | Description generation |
⚠️ Audio nodes do not accept Text node connections. Please enter text content directly within the Audio node.
Audio nodes can connect to:
| Downstream Node | Function |
|---|---|
| Video | Serve as audio input for digital humans |
Workflow Examples
🎙️ Video Audio
Video (silent video) → Audio (Mirelo) → Generate multiple sound effect options
🗣️ Digital Human Voice
Audio (TTS) ────┐
                ├→ Video (Kling AI Avatar)
Image (person) ─┘
First generate speech with the Audio node, then connect it to the Video node to create a digital human.
🎬 Complete Short Video Production
Text → Image → Video (silent video)
Generate separately:
- Audio (TTS) → Narration
- Audio (Sound Effects) → Background sound effects
- Audio (Music) → Background music
Finally, composite in editing software.
Upload Audio (as Starting Point)
You can upload local audio files to Audio nodes:
Method:
- Drag and drop audio file onto canvas
- Or click upload area within Audio node
Supported formats: MP3, WAV, M4A
Uses:
- Serve as input for Video nodes (create digital humans)
- Material for audio editing
- Export to other tools
Common Questions
❓ Why can't Audio nodes connect to Text nodes?
This is a deliberate design decision: text for Audio nodes is entered directly within the node, which avoids extra connection complexity.
Correct usage:
- ❌ Text → Audio (not supported)
- ✅ Enter text directly in Audio node
❓ How to choose the right TTS voice?
Recommended process:
- First listen to previews of all voices
- Choose based on content:
- Serious content: Choose steady voices (Onyx, Echo)
- Casual content: Choose lively voices (Nova, Shimmer)
- Narrative content: Choose voices with storytelling feel (Fable, Ballad)
- Can switch voices and regenerate if not satisfied
❓ Can music generation specify specific genres?
Yes! In Suno model:
Method 1: Use style tags
- Select from Style dropdown (Pop, Rock, Jazz, etc.)
Method 2: Describe in Prompt
80s synth-pop with retro drum machines, nostalgic melody, upbeat tempo
❓ What if generated sound effect is too short/long?
Adjust Duration parameter:
- Sound Effects mode supports 0.5-22 seconds
- Directly adjust duration in parameter panel
If longer sound effects needed:
- Generate multiple sound effect segments
- Splice together in external audio editing software
❓ How to choose from multiple Video to Audio sound effects?
- Mirelo generates 2-8 sound effect variants
- Click audio waveforms within node to switch and preview
- Select the most satisfactory one
- Download it or connect it to a downstream node
❓ Can generated speech speed be adjusted?
Yes! In TTS parameters:
- Find the Speed slider
- Adjust to 0.5x-2.0x
- Regenerate
Advanced Features
Minimax Emotion Control
Minimax Speech 2.5 supports fine-grained emotion and tone control.
Adjustable parameters:
- Emotion: Happy, Sad, Angry, Surprised, etc.
- Speed: Speaking rate
- Pitch: Voice pitch
- Volume: Loudness
Suitable for:
- Audiobooks (require rich emotional expression)
- Drama dubbing
- Advertisement videos
ElevenLabs Context Awareness
ElevenLabs TTS supports context input to improve naturalness.
Usage:
- Enter previous text in Previous Text
- Enter following text in Next Text
- Current text will adjust tone based on context
Suitable for:
- Long-form reading (consistent tone between chapters)
- Dialogue scenes
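The context fields above can be sketched as a request body. The field names `previous_text` / `next_text` follow ElevenLabs' text-to-speech API, but treat the exact schema as an assumption and check the current API reference:

```python
# Hypothetical helper: shape a context-aware TTS request so the current
# sentence's tone can follow what comes before and after it.

def build_contextual_request(text: str, previous_text: str = "", next_text: str = "") -> dict:
    body = {"text": text}
    if previous_text:
        body["previous_text"] = previous_text
    if next_text:
        body["next_text"] = next_text
    return body

body = build_contextual_request(
    "He opened the door slowly.",
    previous_text="The hallway was silent.",
    next_text="Nothing could have prepared him for what he saw.",
)
```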
Suno Custom Mode
Suno V5 supports highly customizable music generation.
Parameter control:
- Style Weight: Style intensity
- Weirdness Constraint: Creativity level
- Audio Weight: Melody weight
Suitable for:
- Experimental music
- Precise control of specific styles
Operation Buttons and Features
Generate
Click to start generating audio.
Generation time:
- TTS: 5-15 seconds
- Sound Effects: 10-30 seconds
- Music: 30-120 seconds
Play
Click play button to preview audio.
Features:
- Play/pause
- Volume adjustment
- Loop playback
Download
Download generated audio file.
Formats:
- TTS: MP3
- Sound Effects: WAV/MP3
- Music: MP3
Workflow Examples
🎙️ Create Audio Content
1. Audio (TTS) → Generate narration
2. Audio (Sound Effects) → Generate background sound effects
3. Audio (Music) → Generate background music
Mix and composite in external audio software (e.g., Audacity).
🎬 Complete Video Audio Workflow
Text → Image → Video (silent video)
Audio (TTS) → Narration audio
Audio (Video to Audio) ← Video → Generate ambient sound effects
Composite in editing software.
🗣️ Digital Human Video
Audio (TTS, generate speech) ──┐
                               ├→ Video (Kling AI Avatar)
Image (person photo) ──────────┘
See Video Node - Digital Human
Common Usage Tips
📝 TTS Text Optimization
Punctuation affects pauses:
- Comma (,) = brief pause
- Period (.) = clear pause
- Question mark (?) = rising intonation
- Exclamation mark (!) = emphasis
Numbers and symbols:
- Write "one hundred" instead of "100" (unless you want "one zero zero")
- Write "first" instead of "1st"
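The number tips above can be sketched as a tiny pre-TTS normalizer. Real pipelines use a full number-to-words library; this toy mapping only covers the examples given:

```python
import re

# Toy replacement table covering only the examples from the tips above;
# a real normalizer would use a number-to-words library.
REPLACEMENTS = {
    r"\b100\b": "one hundred",
    r"\b1st\b": "first",
}

def normalize_for_tts(text: str) -> str:
    for pattern, words in REPLACEMENTS.items():
        text = re.sub(pattern, words, text)
    return text

print(normalize_for_tts("The 1st prize is 100 dollars."))
# The first prize is one hundred dollars.
```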
🎵 Sound Effect Generation Tips
Specific descriptions:
- ❌ water sound
- ✅ Heavy rain pouring on a tin roof
Add environment and distance:
- Close-up microphone of a crackling fireplace
- Distant thunder rolling across hills
🎼 Music Generation Tips
Structured description:
- Genre: Pop, Rock, Jazz, Classical
- Instruments: Piano, Guitar, Synth, Orchestra
- Mood: Upbeat, Melancholic, Epic
- Tempo: Fast, Slow, Moderate tempo
Example:
Acoustic folk song with gentle guitar strumming, warm male vocals, introspective lyrics, slow tempo, indie style
Common Questions
❓ Which languages does TTS support?
| Model | Supported Languages |
|---|---|
| OpenAI TTS | Chinese, English, and many other languages |
| Minimax Speech 2.5 | Chinese, English (better Chinese results) |
| LMNT | English |
| Hume | English |
❓ Are there copyright issues with generated music?
AI-generated music typically belongs to you for personal use, but it's recommended to:
- Check platform terms of use before commercial use
- Use original music for critical projects to be safe
❓ What if sound effect generation results are poor?
Optimization methods:
- Use English descriptions: AI understands English more accurately
- Add details: Describe sound effect texture, distance, environment
- Adjust Duration: Sound effect length should be reasonable
- Generate multiple times: Select best result
❓ How to add multiple audio layers to video?
Martini generates single-layer audio; multi-layer mixing requires external tools.
Recommended workflow:
- Generate separately in Martini:
- Narration (TTS)
- Sound effects (Sound Effects)
- Background music (Music)
- Export all audio
- Mix in Audacity / Premiere / Final Cut
- Composite final audio with video
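For the mixing step, ffmpeg's `amix` filter can combine the exported stems on the command line. The sketch below only builds the command; the filenames are placeholders for whatever you exported from Martini:

```python
# Build an ffmpeg command that mixes several exported audio stems into
# one file using the amix filter. Filenames are placeholders.

def ffmpeg_mix_command(stems: list[str], output: str) -> list[str]:
    return [
        "ffmpeg",
        *[arg for stem in stems for arg in ("-i", stem)],
        "-filter_complex", f"amix=inputs={len(stems)}:duration=longest",
        output,
    ]

cmd = ffmpeg_mix_command(["narration.mp3", "sfx.wav", "music.mp3"], "final_mix.mp3")
# run with subprocess.run(cmd, check=True) once the files exist
```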
Performance Optimization Recommendations
(To be added)
Next Steps
- Video Node — Combine audio to create digital human videos
- Workflow Examples — Complete audio-video workflows
