---
name: local-whisper-stt
description: Local speech-to-text transcription using Faster-Whisper. Use when receiving voice messages in Telegram (or other channels) that need to be transcribed to text. Automatically downloads and transcribes audio files using local CPU-based Whisper models. Supports multiple model sizes (tiny, base, small, medium, large) with automatic language detection.
---

# Local Whisper STT

## Overview

Transcribes voice messages to text using local Faster-Whisper (CPU-based, no GPU required).

## When to Use

- A user sends a voice message in Telegram
- Audio needs to be transcribed to text locally (free, private)
- Any audio transcription task where cloud STT is not desired

## Models Available

| Model | Parameters | Speed | Accuracy | Use Case |
|-------|------------|-------|----------|----------|
| tiny | 39M | Fastest | Basic | Quick testing, low resources |
| base | 74M | Fast | Good | Default for most use |
| small | 244M | Medium | Better | Better accuracy needed |
| medium | 769M | Slower | Very Good | High accuracy, more RAM |
| large | 1550M | Slowest | Best | Maximum accuracy |

## Workflow

1. Receive the voice message (Telegram provides OGG/Opus)
2. Download the audio file to a temp location
3. Load the Faster-Whisper model (cached after first use)
4. Transcribe the audio to text
5. Return the transcription to the conversation
6. Clean up the temp file

## Usage

### From Telegram Voice Message

When a voice message arrives, the skill:

1. Downloads the voice file from Telegram
2. Transcribes it using the configured model
3. Returns the text to the agent context

### Manual Transcription

```python
# Transcribe a local audio file
from faster_whisper import WhisperModel

# int8 quantization keeps memory use low on CPU
model = WhisperModel("base", device="cpu", compute_type="int8")

# segments is a generator: transcription runs as it is iterated
segments, info = model.transcribe("/path/to/audio.ogg", beam_size=5)

for segment in segments:
    print(segment.text)
```

## Configuration

Default model: `base` (good balance of speed and accuracy on CPU)

To change the model, edit the script or set the environment variable:

```bash
export WHISPER_MODEL=small
```

## Requirements

- Python 3.8+
- faster-whisper package
- ~100MB-1.5GB disk space (depending on model)
- No GPU required (CPU-only)

## Resources

### scripts/

- `transcribe.py` - Main transcription script
- `telegram_voice_handler.py` - Telegram-specific voice message handler
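
## Example: Workflow Sketch

The workflow and configuration described above could be wired together roughly as follows. This is a minimal sketch, not the contents of `transcribe.py`: the helper names (`resolve_model_name`, `load_model`, `transcribe_file`, `transcribe_voice_bytes`) and the `VALID_MODELS` tuple are hypothetical, and the Telegram download step is assumed to have already produced the raw audio bytes. Only `WhisperModel`, `model.transcribe`, `segment.text`, and `info.language` come from faster-whisper itself.

```python
import os
import tempfile
from functools import lru_cache
from pathlib import Path

# Assumption: only the model names listed in this document are accepted.
VALID_MODELS = ("tiny", "base", "small", "medium", "large")


def resolve_model_name(default="base"):
    """Read WHISPER_MODEL from the environment, falling back to the default."""
    name = os.environ.get("WHISPER_MODEL", default).strip().lower()
    if name not in VALID_MODELS:
        raise ValueError(f"Unknown Whisper model: {name!r}")
    return name


@lru_cache(maxsize=1)
def load_model(name):
    """Load the model once; repeated calls with the same name reuse it."""
    # Imported lazily so the other helpers work without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel(name, device="cpu", compute_type="int8")


def transcribe_file(path, model=None):
    """Transcribe one audio file and return (text, detected_language)."""
    if model is None:
        model = load_model(resolve_model_name())
    segments, info = model.transcribe(str(path), beam_size=5)
    # segments is a generator; joining it drives the actual transcription.
    text = " ".join(segment.text.strip() for segment in segments)
    return text, info.language


def transcribe_voice_bytes(audio_bytes, model=None, suffix=".ogg"):
    """Write downloaded audio to a temp file, transcribe it, then clean up."""
    fd, tmp_name = tempfile.mkstemp(suffix=suffix)
    tmp_path = Path(tmp_name)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(audio_bytes)
        return transcribe_file(tmp_path, model=model)
    finally:
        # Remove the temp file even if transcription fails.
        tmp_path.unlink(missing_ok=True)
```

A handler would call `transcribe_voice_bytes(downloaded_bytes)` and pass the returned text back to the conversation. Accepting an optional `model` argument keeps the temp-file handling testable with a stand-in object, without downloading any Whisper weights.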