
Speech To Text

Transcribe audio to text using ElevenLabs Scribe v2. Use when converting audio/video to text, generating subtitles, transcribing meetings, or processing spoken content.

$ npx promptcreek add speech-to-text

Auto-detects your installed agents and installs the skill to each one.

What This Skill Does

This skill transcribes audio into text using ElevenLabs' Speech-to-Text API. It supports over 90 languages and offers features like speaker diarization and word-level timestamps. It is useful for developers needing to convert audio files or streams into text for analysis, subtitles, or real-time applications.

When to Use

  • Transcribing meeting recordings for searchable notes.
  • Generating subtitles for video content.
  • Powering voice-controlled applications.
  • Analyzing audio data for sentiment or keywords.
  • Creating real-time transcriptions for live events.
  • Converting audiobooks to text format.

Key Features

  • Supports over 90 languages for transcription.
  • Offers speaker diarization to identify speakers.
  • Provides word-level timestamps for precise alignment.
  • Includes different models for batch or real-time transcription.
  • Can be used with Python, JavaScript, and cURL.
  • Enables keyterm prompting to improve accuracy.

Installation

Run in your project directory:
$ npx promptcreek add speech-to-text

Auto-detects your installed agents (Claude Code, Cursor, Codex, etc.) and installs the skill to each one.


ElevenLabs Speech-to-Text

Transcribe audio to text with Scribe v2 - supports 90+ languages, speaker diarization, and word-level timestamps.

> Setup: See Installation Guide. For JavaScript, use @elevenlabs/* packages only.

Quick Start

Python

from elevenlabs import ElevenLabs

client = ElevenLabs()

with open("audio.mp3", "rb") as audio_file:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")

print(result.text)

JavaScript

import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
import { createReadStream } from "fs";

const client = new ElevenLabsClient();

const result = await client.speechToText.convert({
  file: createReadStream("audio.mp3"),
  modelId: "scribe_v2",
});

console.log(result.text);

cURL

curl -X POST "https://api.elevenlabs.io/v1/speech-to-text" \
  -H "xi-api-key: $ELEVENLABS_API_KEY" \
  -F "file=@audio.mp3" \
  -F "model_id=scribe_v2"

Models

| Model ID           | Description                              | Best For                                        |
|--------------------|------------------------------------------|-------------------------------------------------|
| scribe_v2          | State-of-the-art accuracy, 90+ languages | Batch transcription, subtitles, long-form audio |
| scribe_v2_realtime | Low latency (~150ms)                     | Live transcription, voice agents                |

Transcription with Timestamps

Word-level timestamps include type classification and speaker identification:

result = client.speech_to_text.convert(
    file=audio_file, model_id="scribe_v2", timestamps_granularity="word"
)

for word in result.words:
    print(f"{word.text}: {word.start}s - {word.end}s (type: {word.type})")

Speaker Diarization

Identify WHO said WHAT - the model labels each word with a speaker ID, useful for meetings, interviews, or any multi-speaker audio:

result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    diarize=True
)

for word in result.words:
    print(f"[{word.speaker_id}] {word.text}")
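Printing one line per word is noisy for long recordings; the flat word stream can be folded into per-speaker utterances instead. A minimal sketch, assuming the `Word` dataclass below stands in for the SDK's word objects and that spacing/audio_event entries have already been filtered out:

```python
from dataclasses import dataclass

@dataclass
class Word:
    # Stand-in for an SDK word object (only the fields this sketch needs).
    text: str
    speaker_id: str

def group_by_speaker(words):
    """Merge consecutive words with the same speaker_id into one utterance."""
    segments = []
    for w in words:
        if segments and segments[-1][0] == w.speaker_id:
            segments[-1][1].append(w.text)
        else:
            segments.append((w.speaker_id, [w.text]))
    return [(spk, " ".join(parts)) for spk, parts in segments]

words = [
    Word("Hello", "speaker_0"),
    Word("there", "speaker_0"),
    Word("Hi", "speaker_1"),
]
for spk, text in group_by_speaker(words):
    print(f"[{spk}] {text}")
```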

Keyterm Prompting

Help the model recognize specific words it might otherwise mishear - product names, technical jargon, or unusual spellings (up to 100 terms):

result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    keyterms=["ElevenLabs", "Scribe", "API"]
)

Language Detection

Automatic detection with optional language hint:

result = client.speech_to_text.convert(
    file=audio_file,
    model_id="scribe_v2",
    language_code="eng"  # ISO 639-1 or ISO 639-3 code
)

print(f"Detected: {result.language_code} ({result.language_probability:.0%})")

Supported Formats

Audio: MP3, WAV, M4A, FLAC, OGG, WebM, AAC, AIFF, Opus

Video: MP4, AVI, MKV, MOV, WMV, FLV, WebM, MPEG, 3GPP

Limits: Up to 3GB file size, 10 hours duration
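A quick local pre-flight check against the 3 GB size cap can save a failed upload. A minimal sketch (duration cannot be checked without decoding the audio, so only file size is verified; the limit mirrors the documented cap above):

```python
import os

MAX_BYTES = 3 * 1024**3  # documented 3 GB upload limit

def check_upload_size(path):
    """Return True if the file is within the documented 3 GB size limit."""
    return os.path.getsize(path) <= MAX_BYTES
```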

Response Format

{
  "text": "The full transcription text",
  "language_code": "eng",
  "language_probability": 0.98,
  "words": [
    {"text": "The", "start": 0.0, "end": 0.15, "type": "word", "speaker_id": "speaker_0"},
    {"text": " ", "start": 0.15, "end": 0.16, "type": "spacing", "speaker_id": "speaker_0"}
  ]
}

Word types:

  • word - An actual spoken word
  • spacing - Whitespace between words (useful for precise timing)
  • audio_event - Non-speech sounds the model detected (laughter, applause, music, etc.)
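Because spacing and audio_event entries carry no subtitle text, a subtitle generator keeps only `word` entries. A minimal sketch that turns a word list shaped like the response above into SRT cues (chunking by a fixed `max_words` is a simplification; real subtitle tooling also limits line length and cue duration):

```python
def fmt(seconds):
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def to_srt(words, max_words=8):
    """Build simple SRT cues from word-level timestamps,
    skipping spacing and audio_event entries."""
    spoken = [w for w in words if w["type"] == "word"]
    lines = []
    for n, i in enumerate(range(0, len(spoken), max_words), 1):
        chunk = spoken[i:i + max_words]
        text = " ".join(w["text"] for w in chunk)
        lines.append(f"{n}\n{fmt(chunk[0]['start'])} --> {fmt(chunk[-1]['end'])}\n{text}\n")
    return "\n".join(lines)
```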

Error Handling

try:
    result = client.speech_to_text.convert(file=audio_file, model_id="scribe_v2")
except Exception as e:
    print(f"Transcription failed: {e}")

Common errors:

  • 401: Invalid API key
  • 422: Invalid parameters
  • 429: Rate limit exceeded
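For 429s, retrying with exponential backoff is the usual remedy. A hedged sketch: it matches the status code in the exception message only because the SDK's typed error classes aren't shown here; in practice you would catch the SDK's specific rate-limit exception instead.

```python
import time

def transcribe_with_retry(convert, max_attempts=4, base_delay=1.0):
    """Call convert(); on a rate-limit error, back off exponentially and retry."""
    for attempt in range(max_attempts):
        try:
            return convert()
        except Exception as e:
            # Crude 429 detection; prefer the SDK's typed exception in real code.
            if "429" not in str(e) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Usage: transcribe_with_retry(lambda: client.speech_to_text.convert(
#     file=audio_file, model_id="scribe_v2"))
```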

Tracking Costs

Monitor usage via request-id response header:

response = client.speech_to_text.convert.with_raw_response(file=audio_file, model_id="scribe_v2")
result = response.parse()
print(f"Request ID: {response.headers.get('request-id')}")

Real-Time Streaming

For live transcription with ultra-low latency (~150ms), use the real-time API. The real-time API produces two types of transcripts:

  • Partial transcripts: Interim results that update frequently as audio is processed - use these for live feedback (e.g., showing text as the user speaks)
  • Committed transcripts: Final, stable results after you "commit" - use these as the source of truth for your application

A "commit" tells the model to finalize the current segment. You can commit manually (e.g., when the user pauses) or use Voice Activity Detection (VAD) to auto-commit on silence.

Python (Server-Side)

import asyncio
from elevenlabs import ElevenLabs

client = ElevenLabs()

async def transcribe_realtime():
    async with client.speech_to_text.realtime.connect(
        model_id="scribe_v2_realtime",
        include_timestamps=True,
    ) as connection:
        await connection.stream_url("https://example.com/audio.mp3")
        async for event in connection:
            if event.type == "partial_transcript":
                print(f"Partial: {event.text}")
            elif event.type == "committed_transcript":
                print(f"Final: {event.text}")

asyncio.run(transcribe_realtime())

JavaScript (Client-Side with React)

import { useState } from "react";
import { useScribe, CommitStrategy } from "@elevenlabs/react";

function TranscriptionComponent() {
  const [transcript, setTranscript] = useState("");

  const scribe = useScribe({
    modelId: "scribe_v2_realtime",
    commitStrategy: CommitStrategy.VAD, // Auto-commit on silence for mic input
    onPartialTranscript: (data) => console.log("Partial:", data.text),
    onCommittedTranscript: (data) => setTranscript((prev) => prev + data.text),
  });

  const start = async () => {
    // Get token from your backend (never expose API key to client)
    const { token } = await fetch("/scribe-token").then((r) => r.json());
    await scribe.connect({
      token,
      microphone: { echoCancellation: true, noiseSuppression: true },
    });
  };

  return <button onClick={start}>Start Recording</button>;
}

Commit Strategies

| Strategy | Description |
|----------|-------------|
| Manual   | You call commit() when ready - use for file processing or when you control the audio segments |
| VAD      | Voice Activity Detection auto-commits when silence is detected - use for live microphone input |

// React: set commitStrategy on the hook (recommended for mic input)
import { useScribe, CommitStrategy } from "@elevenlabs/react";

const scribe = useScribe({
  modelId: "scribe_v2_realtime",
  commitStrategy: CommitStrategy.VAD,
  // Optional VAD tuning:
  vadSilenceThresholdSecs: 1.5,
  vadThreshold: 0.4,
});

// JavaScript client: pass vad config on connect
const connection = await client.speechToText.realtime.connect({
  modelId: "scribe_v2_realtime",
  vad: {
    silenceThresholdSecs: 1.5,
    threshold: 0.4,
  },
});

Event Types

| Event | Description |
|-------|-------------|
| partial_transcript | Live interim results |
| committed_transcript | Final results after commit |
| committed_transcript_with_timestamps | Final with word timing |
| error | Error occurred |

See real-time references for complete documentation.


Supported Agents

Claude Code, Cursor, Codex, Gemini CLI, Aider, Windsurf, OpenClaw

Details

License: MIT
Source: admin
Published: 3/18/2026
