Zapim
Labs
studio · v2
idle
An instrument for designing, cloning and directing speech — thirty languages, every nuance from a whisper to a stadium speech.
Begin a session
choose your method
Composition
describe the voice, then write what it says
0 characters
From the library
click any preset to load
Apply a style
styles fill the personality field · effects insert tags into your text
Emotion
Scenario
Effects · click to insert into your text
All non-verbal tags
You can type any of these directly in your text; the chips below are just shortcuts.
Laughs & sighs:
• [laugh] — quick laugh (alias: [chuckle])
• [sigh] — audible sigh
Pauses & fillers:
• [think] — thinking pause / "uhmm" (alias: [hesitate])
• [hush] — "shh" sound (alias: [quiet])
Question intonations (try each — they shape the rising lift differently):
• [ask] — ah-shape
• [wonder] — ei-shape
• [inquire] — en-shape
• [puzzle] — oh-shape
Surprise / amazement:
• [surprise] — "wa!" burst
• [amaze] — "yo!" reaction
Dissatisfaction:
• [displease] — disapproving grunt (alias: [frown])
Tip: spread tags across sentences rather than stacking them in one phrase; this gives the model room to set up each effect.
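The tag list and the tip above can be checked mechanically before a script is submitted. The following is a minimal sketch, not the studio's own validation: the function name `check_tags`, the alias table layout, the naive sentence splitter, and the "more than one tag per sentence" threshold are all assumptions made for illustration.

```python
import re

# Tags and aliases as listed in the panel above. The table layout is an
# illustrative assumption, not the studio's internal representation.
ALIASES = {
    "chuckle": "laugh",
    "hesitate": "think",
    "quiet": "hush",
    "frown": "displease",
}
CANONICAL = {
    "laugh", "sigh", "think", "hush",
    "ask", "wonder", "inquire", "puzzle",
    "surprise", "amaze", "displease",
}

TAG_RE = re.compile(r"\[([a-z]+)\]")


def check_tags(text):
    """Return (unknown_tags, crowded_sentences) for a script.

    A sentence counts as "crowded" when it holds more than one tag,
    which is an assumed reading of the tip about not stacking tags.
    """
    unknown = []
    crowded = []
    # Naive split: a sentence ends at '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        names = TAG_RE.findall(sentence)
        for name in names:
            if ALIASES.get(name, name) not in CANONICAL:
                unknown.append(name)
        if len(names) > 1:
            crowded.append(sentence)
    return unknown, crowded
```

For example, `check_tags("Well [think] I suppose so. [giggle] [sigh] Oh.")` reports `giggle` as unknown and flags the second sentence as crowded.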
Reference voice
5–30 seconds of clean audio
Drop a WAV, MP3, or M4A here
or click to choose a file
Read the passage below clearly at a natural pace. Aim for ~18–22 seconds.
0:00
Fine adjustments
advanced · safe to leave alone
Adherence
2.5
Adherence
How strictly the model follows your personality description.
• Lower (1.4–1.7): more variety + emotional range. Best for dramatic reads.
• Default (2.0): balanced.
• Higher (2.2–3.0): rigid, very stable, less surprise. Best for repeatable brand voice.
how strictly the model follows your description
Quality
24
Quality vs Speed
Refinement passes per audio clip.
• Low (4–8): faster, slightly rougher prosody.
• Default (10): balanced.
• High (15–25): smoother, more natural. Best for production.
refinement passes per audio clip
Output
Sample rate
Final audio frequency.
• Studio 48 kHz — full quality, default.
• Voice 16 kHz — speech grade, ~3× smaller files.
• Phone 8 kHz — telephony quality (G.711). How it sounds on a real phone call.
Lower rates do NOT speed up generation — only shrink the file.
studio, voice, or telephony rate
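The "~3× smaller" figure follows directly from the ratio of sample rates. A rough sketch of the arithmetic, assuming uncompressed 16-bit mono PCM (the bit depth, channel count, and the name `pcm_bytes` are assumptions, not stated by the studio):

```python
# Assumed for illustration: uncompressed 16-bit mono PCM.
BYTES_PER_SAMPLE = 2
CHANNELS = 1

# Sample rates of the three presets listed above.
RATES_HZ = {"studio": 48_000, "voice": 16_000, "phone": 8_000}


def pcm_bytes(preset, seconds):
    """Approximate audio payload size in bytes, ignoring the WAV header."""
    return int(RATES_HZ[preset] * BYTES_PER_SAMPLE * CHANNELS * seconds)
```

Under these assumptions a 10-second clip is about 960,000 bytes at Studio, 320,000 bytes at Voice, and 160,000 bytes at Phone, so Voice is one third and Phone one sixth of the Studio size.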
Background
Ambient background
Layer a low-volume ambient track under the speech to make it feel like a real environment: call center for support flows, rally for political speeches, street for door-to-door, etc. Mixed server-side.
layered ambient under the speech
Background level
65%
Background level
How loud the ambient bed sits under the speech.
• 100% — full per-category default (most prominent, can compete with speech).
• 70% — recommended default. Audible but not overpowering.
• 30–50% — barely-there, "this is happening in a place" feel.
• 0% — mute (effectively turns the background off without changing the dropdown).
Step: 10%.
how loud the ambient sits under the voice (0 mutes, 100 = full)
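One way to read the percentage is as a scale factor on the per-category default, since 100% is described as the full default and 0% as mute. The sketch below assumes linear scaling. The actual server-side mix may use a different curve, and both `background_gain` and `category_default_gain` are hypothetical names.

```python
def background_gain(level_percent, category_default_gain):
    """Scale a category's default gain by the slider value (0-100).

    Linear scaling is an assumption; the real mixer is not documented here.
    """
    if not 0 <= level_percent <= 100:
        raise ValueError("level_percent must be between 0 and 100")
    return category_default_gain * (level_percent / 100.0)
```

With a hypothetical category default of 0.5, a 70% setting yields a gain of 0.35 and 0% yields 0.0.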
Text cleanup
on
Text cleanup
Pre-process numbers ("123" → "one hundred twenty three"), expand abbreviations, normalize punctuation. Leave on for almost everything.
expand numbers and abbreviations
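The number-expansion step can be illustrated with a small sketch that reproduces the "123" example. It is not the studio's actual pre-processor: the function names are hypothetical, and it only handles whole numbers of up to six digits, leaving anything longer untouched.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
    "fifteen", "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = [
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety",
]


def number_to_words(n):
    """Spell out an integer in 0..999_999 without 'and' or hyphens."""
    if not 0 <= n <= 999_999:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else TENS[tens] + " " + ONES[ones]
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        head = ONES[hundreds] + " hundred"
        return head if rest == 0 else head + " " + number_to_words(rest)
    thousands, rest = divmod(n, 1000)
    head = number_to_words(thousands) + " thousand"
    return head if rest == 0 else head + " " + number_to_words(rest)


def expand_numbers(text):
    """Replace standalone runs of 1 to 6 digits with their spelled-out form."""
    return re.sub(
        r"\b\d{1,6}\b",
        lambda m: number_to_words(int(m.group())),
        text,
    )
```

For example, `expand_numbers("Call 123 now")` gives `"Call one hundred twenty three now"`.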
Result
just now
1.0×
Download WAV