Audio Generation
Generate speech, transcribe audio, and transform voices using ElevenLabs and OpenAI
Overview
Wazza Engine provides access to 10 audio models across 2 providers:
ElevenLabs
5 models for text-to-speech and speech-to-speech
• Eleven v3 (highest quality)
• Eleven TTV v3
• Eleven Flash v2.5 (fastest)
• Eleven Turbo v2.5
• Eleven Multilingual STS v2
Pricing: 1-3 credits per generation
OpenAI
5 models for TTS and speech-to-text
• GPT-4o Mini TTS
• Whisper 1
• GPT-4o Mini Transcribe
• GPT-4o Transcribe
• GPT-4o Transcribe Diarize
Pricing: 1-2 credits per generation
Text-to-Speech (TTS)
Convert text to natural-sounding speech:
ElevenLabs Example
import WazzaEngine from '@wazza/engine';
const wazza = new WazzaEngine({
  apiKey: process.env.WAZZA_API_KEY
});
// High-quality TTS with ElevenLabs v3
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_v3',
  text: 'Welcome to Wazza Engine. This is a demonstration of our text-to-speech capabilities.',
  parameters: {
    voiceId: 'en-US-female-1',
    stability: 0.75,
    similarityBoost: 0.75,
    style: 'professional'
  }
});
console.log('Audio URL:', response.output.url);
OpenAI TTS Example
// Fast TTS with GPT-4o Mini
const response = await wazza.generate({
  provider: 'openai',
  model: 'gpt-4o-mini-tts',
  text: 'This is a quick test of OpenAI text-to-speech.',
  parameters: {
    voice: 'alloy', // Options: alloy, echo, fable, onyx, nova, shimmer
    speed: 1.0
  }
});
Fast TTS (ElevenLabs Flash)
// Fastest TTS for real-time applications
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_flash_v2_5',
  text: 'Quick response for real-time applications.',
  parameters: {
    voiceId: 'en-US-male-1'
  }
});
Speech-to-Text (Transcription)
Transcribe audio files using Whisper or GPT-4o:
Whisper Transcription
// Transcribe audio with Whisper 1
const response = await wazza.generate({
  provider: 'openai',
  model: 'whisper-1',
  parameters: {
    audio: 'https://example.com/audio.mp3',
    language: 'en', // Optional
    format: 'text' // Options: text, srt, vtt
  }
});
console.log('Transcription:', response.output.text);
Speaker Diarization
// Transcribe with speaker identification
const response = await wazza.generate({
  provider: 'openai',
  model: 'gpt-4o-transcribe-diarize',
  parameters: {
    audio: 'https://example.com/meeting.mp3',
    numSpeakers: 3 // Optional hint
  }
});
// Output includes speaker labels
console.log('Diarized transcript:', response.output.transcript);
// Example output:
// [Speaker 1]: Hello everyone, welcome to the meeting.
// [Speaker 2]: Thank you for having me.
// [Speaker 1]: Let's get started with the agenda.
Speech-to-Speech (Voice Transformation)
Transform the voice in an audio file while preserving speech content:
// Voice transformation with ElevenLabs Multilingual STS
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_multilingual_sts_v2',
  parameters: {
    audio: 'https://example.com/original-speech.mp3',
    targetVoiceId: 'en-US-female-2',
    stability: 0.75
  }
});
console.log('Transformed audio URL:', response.output.url);
Voice Cloning
Create custom voices with ElevenLabs:
// Step 1: Create a custom voice from samples
const voiceResponse = await wazza.createCustomVoice({
  provider: 'eleven-labs',
  name: 'My Custom Voice',
  samples: [
    'https://example.com/voice-sample-1.mp3',
    'https://example.com/voice-sample-2.mp3',
    'https://example.com/voice-sample-3.mp3'
  ],
  description: 'Professional male voice with British accent'
});
const customVoiceId = voiceResponse.voiceId;
// Step 2: Use the custom voice for TTS
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_v3',
  text: 'This is using my custom cloned voice.',
  parameters: {
    voiceId: customVoiceId
  }
});
Best Practices
1. Model Selection for TTS
- Highest Quality: ElevenLabs v3 (best for final production)
- Balanced: ElevenLabs Turbo v2.5 (good quality, faster)
- Fastest: ElevenLabs Flash v2.5 (real-time applications)
- Cost-Effective: OpenAI GPT-4o Mini TTS
2. Model Selection for Transcription
- Standard: Whisper 1 (best accuracy/cost ratio)
- Complex Audio: GPT-4o Transcribe (noisy environments)
- Meetings: GPT-4o Transcribe Diarize (speaker identification)
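The recommendations in the two lists above can be captured in a small lookup table so that application code asks for a priority rather than hard-coding a model. This is a minimal sketch, not part of the Wazza SDK: `RECOMMENDED_MODELS` and `pickModel` are hypothetical names, and the IDs `eleven_turbo_v2_5` and `gpt-4o-transcribe` are assumed by analogy with the IDs used in the examples on this page, so check them against your model list before relying on them.

```javascript
// Hypothetical lookup table mirroring the recommendations above.
// ASSUMPTION: 'eleven_turbo_v2_5' and 'gpt-4o-transcribe' do not appear in the
// examples on this page; they are guessed from the naming of the other IDs.
const RECOMMENDED_MODELS = {
  tts: {
    quality: { provider: 'eleven-labs', model: 'eleven_v3' },
    balanced: { provider: 'eleven-labs', model: 'eleven_turbo_v2_5' },
    fastest: { provider: 'eleven-labs', model: 'eleven_flash_v2_5' },
    budget: { provider: 'openai', model: 'gpt-4o-mini-tts' }
  },
  transcription: {
    standard: { provider: 'openai', model: 'whisper-1' },
    noisy: { provider: 'openai', model: 'gpt-4o-transcribe' },
    meetings: { provider: 'openai', model: 'gpt-4o-transcribe-diarize' }
  }
};

// Return the { provider, model } pair for a task and priority,
// or throw if either key is not in the table.
function pickModel(task, priority) {
  const byTask = RECOMMENDED_MODELS[task];
  if (!byTask) {
    throw new Error(`Unknown task: ${task}`);
  }
  const choice = byTask[priority];
  if (!choice) {
    throw new Error(`Unknown priority "${priority}" for task "${task}"`);
  }
  return { ...choice }; // copy so callers cannot mutate the table
}
```

The returned object can be spread into a request, for example `wazza.generate({ ...pickModel('tts', 'fastest'), text, parameters })`.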
3. Audio Quality Tips
- Use clear, well-formatted text for TTS (avoid special characters)
- Break long text into smaller chunks (at most 5,000 characters each)
- For transcription, use high-quality audio (16 kHz sample rate or higher)
- Provide language hints for multilingual content
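The chunking tip above can be sketched as a small helper that splits text at sentence boundaries where possible. This is an illustrative sketch under stated assumptions: `chunkText` is a hypothetical helper, not an SDK function, and the 5,000-character default is taken from the tip above rather than from a documented per-model limit.

```javascript
// Split text into chunks of at most maxChars characters. Prefers to break
// after sentence-ending punctuation, then at a space, then with a hard cut.
function chunkText(text, maxChars = 5000) {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError('maxChars must be a positive integer');
  }
  const chunks = [];
  let rest = text.trim();
  while (rest.length > maxChars) {
    // Look one character past the limit so a boundary exactly at the limit counts.
    const head = rest.slice(0, maxChars + 1);
    let cut = -1;
    const sentenceEnd = /[.!?]\s/g;
    let match;
    while ((match = sentenceEnd.exec(head)) !== null) {
      cut = match.index + 1; // keep the punctuation with the chunk
    }
    if (cut <= 0) cut = head.lastIndexOf(' '); // fall back to a word boundary
    if (cut <= 0) cut = maxChars; // no boundary found: hard cut
    chunks.push(rest.slice(0, cut).trim());
    rest = rest.slice(cut).trim();
  }
  if (rest.length > 0) chunks.push(rest);
  return chunks;
}
```

Each chunk can then be sent as the `text` of its own `wazza.generate` call. Joining the resulting audio files is left to the caller.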
4. Voice Cloning Requirements
- At least 3 audio samples, each 30-60 seconds long
- Clear audio with minimal background noise
- Consistent recording environment across samples
- Only clone voices you have permission to use
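The count and duration requirements above can be checked before uploading samples. The following is a rough sketch, assuming the caller has already measured each sample's length. `validateVoiceSamples` and the `durationSeconds` field are hypothetical and not part of the Wazza SDK. Noise level, recording consistency, and permission cannot be checked this way.

```javascript
// Check a list of { url, durationSeconds } samples against the count and
// duration requirements above. Returns an array of human-readable problems;
// an empty array means these two requirements are met.
// ASSUMPTION: durationSeconds is supplied by the caller (measured elsewhere).
function validateVoiceSamples(
  samples,
  { minSamples = 3, minSeconds = 30, maxSeconds = 60 } = {}
) {
  const problems = [];
  if (samples.length < minSamples) {
    problems.push(`Need at least ${minSamples} samples, got ${samples.length}`);
  }
  samples.forEach((sample, i) => {
    const d = sample.durationSeconds;
    if (typeof d !== 'number' || Number.isNaN(d)) {
      problems.push(`Sample ${i + 1} has no known duration`);
    } else if (d < minSeconds || d > maxSeconds) {
      problems.push(`Sample ${i + 1} is ${d}s; expected ${minSeconds}-${maxSeconds}s`);
    }
  });
  return problems;
}
```

If the returned array is empty, the sample URLs can be passed as `samples` to `wazza.createCustomVoice` as in the Voice Cloning example.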