Audio Generation

Generate speech, transcribe audio, and transform voices using ElevenLabs and OpenAI models.

Overview

Wazza Engine provides access to 10 audio models across 2 providers:

ElevenLabs

5 models for text-to-speech and speech-to-speech

• Eleven v3 (highest quality)
• Eleven TTV v3
• Eleven Flash v2.5 (fastest)
• Eleven Turbo v2.5
• Eleven Multilingual STS v2

Pricing: 1-3 credits per generation

OpenAI

5 models for TTS and speech-to-text

• GPT-4o Mini TTS
• Whisper 1
• GPT-4o Mini Transcribe
• GPT-4o Transcribe
• GPT-4o Transcribe Diarize

Pricing: 1-2 credits per generation
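
Because each provider is priced as a range, it can help to budget with a best case and a worst case. The sketch below only multiplies the ranges quoted above by a number of generations. `estimateCredits` and `CREDIT_RANGES` are hypothetical names, not part of the Wazza Engine SDK, and actual per-model pricing may be finer-grained than these ranges.

```javascript
// Credit ranges per generation, copied from the pricing lines above.
const CREDIT_RANGES = {
  'eleven-labs': { min: 1, max: 3 },
  openai: { min: 1, max: 2 },
};

// Hypothetical helper: best-case and worst-case credit cost for a
// number of generations with one provider.
function estimateCredits(provider, generations) {
  const range = CREDIT_RANGES[provider];
  if (!range) {
    throw new Error(`Unknown provider: ${provider}`);
  }
  if (!Number.isInteger(generations) || generations < 0) {
    throw new RangeError('generations must be a non-negative integer');
  }
  return { min: range.min * generations, max: range.max * generations };
}

console.log(estimateCredits('eleven-labs', 100)); // { min: 100, max: 300 }
```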

Text-to-Speech (TTS)

Convert text to natural-sounding speech:

ElevenLabs Example

import WazzaEngine from '@wazza/engine';

const wazza = new WazzaEngine({
  apiKey: process.env.WAZZA_API_KEY
});

// High-quality TTS with ElevenLabs v3
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_v3',
  text: 'Welcome to Wazza Engine. This is a demonstration of our text-to-speech capabilities.',
  parameters: {
    voiceId: 'en-US-female-1',
    stability: 0.75,
    similarityBoost: 0.75,
    style: 'professional'
  }
});

console.log('Audio URL:', response.output.url);

OpenAI TTS Example

// Fast TTS with GPT-4o Mini
const response = await wazza.generate({
  provider: 'openai',
  model: 'gpt-4o-mini-tts',
  text: 'This is a quick test of OpenAI text-to-speech.',
  parameters: {
    voice: 'alloy', // Options: alloy, echo, fable, onyx, nova, shimmer
    speed: 1.0
  }
});

Fast TTS (ElevenLabs Flash)

// Fastest TTS for real-time applications
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_flash_v2_5',
  text: 'Quick response for real-time applications.',
  parameters: {
    voiceId: 'en-US-male-1'
  }
});

Speech-to-Text (Transcription)

Transcribe audio files using Whisper or GPT-4o:

Whisper Transcription

// Transcribe audio with Whisper 1
const response = await wazza.generate({
  provider: 'openai',
  model: 'whisper-1',
  parameters: {
    audio: 'https://example.com/audio.mp3',
    language: 'en', // Optional
    format: 'text' // Options: text, srt, vtt
  }
});

console.log('Transcription:', response.output.text);

Speaker Diarization

// Transcribe with speaker identification
const response = await wazza.generate({
  provider: 'openai',
  model: 'gpt-4o-transcribe-diarize',
  parameters: {
    audio: 'https://example.com/meeting.mp3',
    numSpeakers: 3 // Optional hint
  }
});

// Output includes speaker labels
console.log('Diarized transcript:', response.output.transcript);
// Example output:
// [Speaker 1]: Hello everyone, welcome to the meeting.
// [Speaker 2]: Thank you for having me.
// [Speaker 1]: Let's get started with the agenda.
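
If the diarized transcript arrives as plain text in the `[Speaker N]: ...` form shown in the example output, it may be convenient to turn it into structured segments. The following is a minimal sketch under that assumption. The real shape of `response.output.transcript` is not specified here, and `parseDiarizedTranscript` and `countTurnsBySpeaker` are illustrative helpers, not SDK functions.

```javascript
// Assumes one "[Speaker N]: text" entry per line, as in the example
// output above. The actual response format may differ.
function parseDiarizedTranscript(transcript) {
  const linePattern = /^\[(.+?)\]:\s*(.*)$/;
  const segments = [];
  for (const rawLine of transcript.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '') continue;
    const match = linePattern.exec(line);
    if (match) {
      segments.push({ speaker: match[1], text: match[2] });
    } else if (segments.length > 0) {
      // Unlabeled line: treat it as a continuation of the previous turn.
      segments[segments.length - 1].text += ' ' + line;
    }
    // Unlabeled lines before the first speaker label are ignored.
  }
  return segments;
}

// Count how many turns each speaker took.
function countTurnsBySpeaker(segments) {
  const counts = {};
  for (const { speaker } of segments) {
    counts[speaker] = (counts[speaker] || 0) + 1;
  }
  return counts;
}

const sample = [
  '[Speaker 1]: Hello everyone, welcome to the meeting.',
  '[Speaker 2]: Thank you for having me.',
  "[Speaker 1]: Let's get started with the agenda.",
].join('\n');

console.log(countTurnsBySpeaker(parseDiarizedTranscript(sample)));
// { 'Speaker 1': 2, 'Speaker 2': 1 }
```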

Speech-to-Speech (Voice Transformation)

Transform the voice in an audio file while preserving speech content:

// Voice transformation with ElevenLabs Multilingual STS
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_multilingual_sts_v2',
  parameters: {
    audio: 'https://example.com/original-speech.mp3',
    targetVoiceId: 'en-US-female-2',
    stability: 0.75
  }
});

console.log('Transformed audio URL:', response.output.url);

Voice Cloning

Create custom voices with ElevenLabs:

// Step 1: Create a custom voice from samples
const voiceResponse = await wazza.createCustomVoice({
  provider: 'eleven-labs',
  name: 'My Custom Voice',
  samples: [
    'https://example.com/voice-sample-1.mp3',
    'https://example.com/voice-sample-2.mp3',
    'https://example.com/voice-sample-3.mp3'
  ],
  description: 'Professional male voice with British accent'
});

const customVoiceId = voiceResponse.voiceId;

// Step 2: Use the custom voice for TTS
const response = await wazza.generate({
  provider: 'eleven-labs',
  model: 'eleven_v3',
  text: 'This is using my custom cloned voice.',
  parameters: {
    voiceId: customVoiceId
  }
});

Best Practices

1. Model Selection for TTS

• Highest Quality: ElevenLabs v3 (best for final production)
• Balanced: ElevenLabs Turbo v2.5 (good quality, faster)
• Fastest: ElevenLabs Flash v2.5 (real-time applications)
• Cost-Effective: OpenAI GPT-4o Mini TTS

2. Model Selection for Transcription

• Standard: Whisper 1 (best accuracy/cost ratio)
• Complex Audio: GPT-4o Transcribe (noisy environments)
• Meetings: GPT-4o Transcribe Diarize (speaker identification)
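
The two selection guides above can be encoded as lookup tables so that application code asks for a priority instead of hard-coding a model. This is a sketch only: `pickModel` and the priority keys are hypothetical, and the IDs marked "assumed" do not appear in the examples on this page, so check them against the model list before relying on them.

```javascript
// TTS guide from section 1.
const TTS_MODELS = {
  quality: { provider: 'eleven-labs', model: 'eleven_v3' },
  balanced: { provider: 'eleven-labs', model: 'eleven_turbo_v2_5' }, // assumed ID
  fastest: { provider: 'eleven-labs', model: 'eleven_flash_v2_5' },
  cost: { provider: 'openai', model: 'gpt-4o-mini-tts' },
};

// Transcription guide from section 2.
const TRANSCRIPTION_MODELS = {
  standard: { provider: 'openai', model: 'whisper-1' },
  noisy: { provider: 'openai', model: 'gpt-4o-transcribe' }, // assumed ID
  meeting: { provider: 'openai', model: 'gpt-4o-transcribe-diarize' },
};

// Hypothetical helper: returns a copy of the { provider, model } pair
// for a priority, or throws if the priority is not in the table.
function pickModel(table, priority) {
  if (!Object.prototype.hasOwnProperty.call(table, priority)) {
    const known = Object.keys(table).join(', ');
    throw new Error(`Unknown priority "${priority}". Expected one of: ${known}`);
  }
  return { ...table[priority] };
}

console.log(pickModel(TTS_MODELS, 'fastest'));
// { provider: 'eleven-labs', model: 'eleven_flash_v2_5' }
```

The returned object can be spread into a `wazza.generate` call alongside `text` or `parameters`.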

3. Audio Quality Tips

• Use clear, well-formatted text for TTS (avoid special characters)
• Break long text into smaller chunks (max 5000 characters each)
• For transcription, use high-quality audio (sample rate of 16 kHz or higher)
• Provide language hints for multilingual content
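
One way to follow the chunking tip is to split at sentence boundaries and fall back to a hard cut only when a single sentence exceeds the limit. The sketch below assumes the 5000-character limit is counted in JavaScript string length (UTF-16 code units), which may not match how the service counts. `chunkText` is an illustrative helper, not an SDK function.

```javascript
// Split text into chunks of at most maxChars, preferring sentence boundaries.
function chunkText(text, maxChars = 5000) {
  if (!Number.isInteger(maxChars) || maxChars < 1) {
    throw new RangeError('maxChars must be a positive integer');
  }
  // Each match is a sentence with its closing punctuation and trailing
  // whitespace, or the unpunctuated remainder at the end of the text.
  const sentences = text.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) || [];
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    if (current.length + sentence.length <= maxChars) {
      current += sentence;
      continue;
    }
    if (current.trim() !== '') chunks.push(current.trim());

    // Hard-split a sentence that is longer than the limit on its own.
    // Note: slicing by code unit can cut through a surrogate pair.
    let rest = sentence;
    while (rest.length > maxChars) {
      chunks.push(rest.slice(0, maxChars));
      rest = rest.slice(maxChars);
    }
    current = rest;
  }
  if (current.trim() !== '') chunks.push(current.trim());
  return chunks;
}

console.log(chunkText('One. Two. Three.', 10)); // [ 'One. Two.', 'Three.' ]
```

Each chunk can then be passed as `text` to a separate `wazza.generate` call, and the resulting audio files played or concatenated in order.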

4. Voice Cloning Requirements

• Minimum 3 audio samples, 30-60 seconds each
• Clear audio with minimal background noise
• Consistent recording environment across samples
• Only clone voices you have permission to use
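
The measurable requirements above (sample count and duration) can be checked before calling `createCustomVoice`. This is a minimal sketch: it assumes you already know each sample's duration, and the `{ url, durationSeconds }` shape and the `validateVoiceSamples` name are hypothetical. Audio clarity, recording consistency, and permission cannot be verified this way.

```javascript
// Requirements taken from the list above.
const MIN_SAMPLES = 3;
const MIN_SECONDS = 30;
const MAX_SECONDS = 60;

// samples: array of { url, durationSeconds } (hypothetical shape).
// Returns { ok, problems } instead of throwing, so every issue is reported.
function validateVoiceSamples(samples) {
  const problems = [];
  if (samples.length < MIN_SAMPLES) {
    problems.push(`Need at least ${MIN_SAMPLES} samples, got ${samples.length}`);
  }
  samples.forEach((sample, index) => {
    const d = sample.durationSeconds;
    // Written as a negated range check so NaN or undefined also fails.
    if (!(d >= MIN_SECONDS && d <= MAX_SECONDS)) {
      problems.push(
        `Sample ${index + 1} (${sample.url}) is ${d}s; expected ${MIN_SECONDS}-${MAX_SECONDS}s`
      );
    }
  });
  return { ok: problems.length === 0, problems };
}

const result = validateVoiceSamples([
  { url: 'https://example.com/voice-sample-1.mp3', durationSeconds: 45 },
  { url: 'https://example.com/voice-sample-2.mp3', durationSeconds: 20 },
]);
console.log(result.ok); // false
console.log(result.problems.length); // 2
```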
