Zapim Labs · Studio

Begin a session

choose your method

Composition

describe the voice, then write what it says

Voice personality

A short, comma-separated description of the speaker; the model uses it to design the voice. Your text is wrapped in parentheses and prepended to the actual line automatically, so you don't have to do it yourself.

Three layers to combine:
1) Identity (gender, age, role): middle-aged Indian woman, young man, news anchor, elderly grandfather
2) Texture (quality of the voice): warm, raspy, magnetic, breathy, nasal, low-pitched
3) Delivery (emotion + pace + scenario): slow steady pace, passionate, shouting campaign slogans, conspiratorial whisper

Examples:
• Middle-aged Indian woman, warm storyteller voice, slow steady pace
• Furious young man, shouting, raw and unhinged
• Elderly grandfather, soft gentle voice, slightly hoarse

Tip: stick to clean adjectives and avoid stacking too many. 3–6 descriptors work best.
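The wrapping step described above can be sketched roughly as follows. This is only an illustration: the function names `compose_prompt` and `within_recommended_range` are hypothetical, and the single space between the closing parenthesis and the line is an assumption, since the page does not state the exact separator the service uses.

```python
def split_descriptors(personality: str) -> list[str]:
    """Split a comma-separated personality into trimmed, non-empty descriptors."""
    return [d.strip() for d in personality.split(",") if d.strip()]


def compose_prompt(personality: str, line: str) -> str:
    """Wrap the personality in parentheses and prepend it to the spoken line.

    Assumption: one space separates the parenthesized description from the line.
    """
    descriptors = split_descriptors(personality)
    if not descriptors:
        # Nothing to prepend; return the line unchanged apart from trimming.
        return line.strip()
    return f"({', '.join(descriptors)}) {line.strip()}"


def within_recommended_range(personality: str) -> bool:
    """Check the 3-6 descriptor guideline from the tooltip."""
    return 3 <= len(split_descriptors(personality)) <= 6


prompt = compose_prompt(
    "Elderly grandfather,  soft gentle voice , slightly hoarse",
    "Come sit by the fire.",
)
print(prompt)
```

Normalizing the descriptors before joining keeps stray spaces and trailing commas out of the parenthesized prefix.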

0 characters

From the library

click any preset to load

Apply a style

styles fill the personality field · effects insert tags into your text

Emotion

Scenario

Effects · click to insert into your text

All non-verbal tags

You can type any of these directly in your text; the chips below are just shortcuts.

Laughs & sighs:
• [laugh]: quick laugh (alias: [chuckle])
• [sigh]: audible sigh

Pauses & fillers:
• [think]: thinking pause / "uhmm" (alias: [hesitate])
• [hush]: "shh" sound (alias: [quiet])

Question intonations (try each; they shape the rising lift differently):
• [ask]: ah-shape
• [wonder]: ei-shape
• [inquire]: en-shape
• [puzzle]: oh-shape

Surprise / amazement:
• [surprise]: "wa!" burst
• [amaze]: "yo!" reaction

Dissatisfaction:
• [displease]: disapproving grunt (alias: [frown])

Tip: spread tags across sentences rather than stacking them in one phrase; this gives the model room to set up each effect.
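When text is prepared outside the editor, a small pre-flight check can catch mistyped tags before submission. The sketch below uses only the tag and alias names listed above; the helper names are hypothetical, and it assumes tags are lowercase and written exactly as `[name]`, which the page does not confirm.

```python
import re

# Canonical tags and aliases exactly as listed in the tooltip.
CANONICAL = {
    "laugh", "sigh", "think", "hush",
    "ask", "wonder", "inquire", "puzzle",
    "surprise", "amaze", "displease",
}
ALIASES = {
    "chuckle": "laugh",
    "hesitate": "think",
    "quiet": "hush",
    "frown": "displease",
}

# Assumption: a tag is a lowercase word in square brackets.
TAG_RE = re.compile(r"\[([a-z]+)\]")


def canonical_tags(text: str) -> list[str]:
    """Return recognized tags in order of appearance, with aliases resolved."""
    found = []
    for name in TAG_RE.findall(text):
        name = ALIASES.get(name, name)
        if name in CANONICAL:
            found.append(name)
    return found


def unknown_tags(text: str) -> list[str]:
    """Return bracketed words that match neither a tag nor an alias."""
    return [
        name for name in TAG_RE.findall(text)
        if name not in CANONICAL and name not in ALIASES
    ]


print(canonical_tags("Well [chuckle] that went fine. [sigh] Or did it [wonder]"))
print(unknown_tags("Hi [giggle] there [hush]"))
```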

Reference voice

5–30 seconds of clean audio

Drop a WAV, MP3, M4A here

or click to choose a file

Fine adjustments

advanced · safe to leave alone

Adherence

How strictly the model follows your personality description.
• Lower (1.4–1.7): more variety + emotional range. Best for dramatic reads.
• Default (2.0): balanced.
• Higher (2.2–3.0): rigid, very stable, less surprise. Best for repeatable brand voice.

how strictly the model follows your description

2.5

Quality

Quality vs speed: refinement passes per audio clip.
• Low (4–8): faster, slightly rougher prosody.
• Default (10): balanced.
• High (15–25): smoother, more natural. Best for production.

refinement passes per audio clip

24

Output

Sample rate: final audio frequency.
• Studio 48 kHz: full quality, default.
• Voice 16 kHz: speech grade, ~3× smaller files.
• Phone 8 kHz: telephony quality (G.711). How it sounds on a real phone call.

Lower rates do NOT speed up generation; they only shrink the file.
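The "~3× smaller" figure follows directly from the ratio of sample rates for uncompressed audio. The sketch below estimates raw PCM payload sizes; the 16-bit mono format is an assumption for illustration (the page does not state the bit depth or channel count of the downloaded WAV), and `pcm_bytes` is a hypothetical helper that ignores the small WAV header.

```python
# Sample rates named in the tooltip, in Hz.
SAMPLE_RATES_HZ = {"studio": 48_000, "voice": 16_000, "phone": 8_000}


def pcm_bytes(rate_hz: int, seconds: float,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    """Approximate uncompressed PCM payload size.

    Assumption: 16-bit (2-byte) mono samples unless told otherwise.
    """
    return int(rate_hz * seconds * bytes_per_sample * channels)


for name, rate in SAMPLE_RATES_HZ.items():
    size_kb = pcm_bytes(rate, seconds=10) / 1000
    print(f"{name}: {size_kb:.0f} kB for a 10 s clip")

# Ratio between the studio and voice rates.
ratio = pcm_bytes(48_000, 10) / pcm_bytes(16_000, 10)
print(f"studio / voice size ratio: {ratio:.1f}")
```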

studio, voice, or telephony rate

Background

Ambient background: layer a low-volume ambient track under the speech to make it feel like a real environment, for example a call center for support flows, a rally for political speeches, or a street for door-to-door. Mixed server-side.

layered ambient under the speech

Background level

How loud the ambient bed sits under the speech.
• 100%: full per-category default (most prominent, can compete with speech).
• 70%: recommended default. Audible but not overpowering.
• 30–50%: barely there, a "this is happening in a place" feel.
• 0%: mute (effectively turns the background off without changing the dropdown).

Step: 10%.

how loud the ambient sits under the voice (0 mutes, 100 = full)

65%

Text cleanup

Pre-processes the text before synthesis: spells out numbers ("123" → "one hundred twenty three"), expands abbreviations, and normalizes punctuation. Leave on for almost everything.
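To make the number expansion concrete, here is a minimal sketch that reproduces the "123" → "one hundred twenty three" example. It is not the service's actual cleanup logic, which the page does not describe; the function names are hypothetical, and the sketch only handles standalone integers from 0 to 999, leaving longer digit runs untouched.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def words_under_1000(n: int) -> str:
    """Spell out an integer in the range 0-999, without 'and' or hyphens."""
    if not 0 <= n < 1000:
        raise ValueError("only 0-999 is supported in this sketch")
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens])
        if ones:
            parts.append(ONES[ones])
    elif rest or not hundreds:
        # Covers 1-19, and plain zero when there is no hundreds part.
        parts.append(ONES[rest])
    return " ".join(parts)


def expand_numbers(text: str) -> str:
    """Replace standalone 1-3 digit integers with their spoken form."""
    return re.sub(r"\b\d{1,3}\b",
                  lambda m: words_under_1000(int(m.group())), text)


print(expand_numbers("Call 123 now"))
```

A production normalizer would also need to handle decimals, ordinals, years, and currency, which this sketch deliberately leaves out.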

expand numbers and abbreviations

on

Result

just now

Playback 1.0×

Download WAV