skills/local-whisper-stt/SKILL.md

---
name: local-whisper-stt
description: Local speech-to-text transcription using Faster-Whisper. Use when receiving voice messages in Telegram (or other channels) that need to be transcribed to text. Automatically downloads and transcribes audio files using local CPU-based Whisper models. Supports multiple model sizes (tiny, base, small, medium, large) with automatic language detection.
---

# Local Whisper STT

## Overview

Transcribes voice messages to text using local Faster-Whisper (CPU-based, no GPU required).

## When to Use

- User sends a voice message in Telegram
- Need to transcribe audio to text locally (free, private)
- Any audio transcription task where cloud STT is not desired

## Models Available

| Model | Parameters | Speed | Accuracy | Use Case |
|-------|------------|-------|----------|----------|
| tiny | 39M | Fastest | Basic | Quick testing, low resources |
| base | 74M | Fast | Good | Default for most use |
| small | 244M | Medium | Better | Better accuracy needed |
| medium | 769M | Slower | Very Good | High accuracy, more RAM |
| large | 1550M | Slowest | Best | Maximum accuracy |

## Workflow

1. Receive voice message (Telegram provides OGG/Opus)
2. Download audio file to temp location
3. Load Faster-Whisper model (cached after first use)
4. Transcribe audio to text
5. Return transcription to conversation
6. Cleanup temp file
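
The steps above can be sketched roughly as follows. This is a minimal sketch, not the skill's actual `transcribe.py`: `get_model`, `whisper_transcribe`, `transcribe_bytes`, and the `transcribe_fn` parameter are hypothetical names introduced for illustration, and "cached after first use" is interpreted here as an in-process cache via `functools.lru_cache`. The `try`/`finally` ensures the temp file is removed even if transcription fails.

```python
import os
import tempfile
from functools import lru_cache
from typing import Callable


@lru_cache(maxsize=None)
def get_model(name: str = "base"):
    """Load a Faster-Whisper model once per process (workflow step 3)."""
    # Imported lazily so this module loads even without the package installed.
    from faster_whisper import WhisperModel
    return WhisperModel(name, device="cpu", compute_type="int8")


def whisper_transcribe(path: str) -> str:
    """Transcribe one file with the cached model and join the segment texts."""
    segments, _info = get_model().transcribe(path, beam_size=5)
    # Segment texts normally carry their own leading space, so join without one.
    return "".join(segment.text for segment in segments).strip()


def transcribe_bytes(audio: bytes,
                     transcribe_fn: Callable[[str], str] = whisper_transcribe,
                     suffix: str = ".ogg") -> str:
    """Write audio to a temp file, transcribe it, and always clean up."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(audio)            # step 2: save the downloaded audio
        return transcribe_fn(path)      # step 4: transcribe
    finally:
        os.remove(path)                 # step 6: clean up the temp file
```

Taking the transcriber as a parameter keeps the file handling testable without loading a model; in normal use `transcribe_bytes(audio)` falls back to the Faster-Whisper path.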

## Usage

### From Telegram Voice Message

When a voice message arrives, the skill:
1. Downloads the voice file from Telegram
2. Transcribes using the configured model
3. Returns text to the agent context
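
For reference, downloading a voice file through the Telegram Bot API takes two requests: `getFile` resolves the message's `file_id` to a `file_path`, and the content is then fetched from the `/file/bot<token>/` endpoint. The sketch below uses only the standard library and assumes a bot token and `file_id` are already at hand. The function names are hypothetical, and `telegram_voice_handler.py` may well do this differently (for example through a bot framework).

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org"


def get_file_url(token: str, file_id: str) -> str:
    """URL of the Bot API getFile method for a given file_id."""
    query = urllib.parse.urlencode({"file_id": file_id})
    return f"{API_ROOT}/bot{token}/getFile?{query}"


def download_url(token: str, file_path: str) -> str:
    """URL from which the file content can be downloaded."""
    return f"{API_ROOT}/file/bot{token}/{file_path}"


def fetch_voice(token: str, file_id: str, timeout: float = 30.0) -> bytes:
    """Return the raw OGG/Opus bytes of a voice message (performs network I/O)."""
    with urllib.request.urlopen(get_file_url(token, file_id), timeout=timeout) as resp:
        payload = json.load(resp)
    if not payload.get("ok"):
        raise RuntimeError(f"getFile failed: {payload.get('description')}")
    file_path = payload["result"]["file_path"]
    with urllib.request.urlopen(download_url(token, file_path), timeout=timeout) as resp:
        return resp.read()
```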

### Manual Transcription

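```python
# Transcribe a local audio file
from faster_whisper import WhisperModel

# int8 quantization keeps memory use low on CPU
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("/path/to/audio.ogg", beam_size=5)

# The language is detected automatically when none is specified
print(f"Detected language: {info.language} ({info.language_probability:.2f})")

# segments is a generator: transcription runs as it is iterated
for segment in segments:
    print(segment.text)
```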
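
Each segment also carries `start` and `end` times in seconds, which can be useful for longer recordings. A small sketch of a formatter follows; `format_segments` is a hypothetical helper, not part of faster-whisper, and it works on any iterable of objects with `start`, `end`, and `text` attributes.

```python
from typing import Iterable


def format_segments(segments: Iterable) -> str:
    """Render segments as '[start -> end] text' lines, one per segment."""
    lines = []
    for seg in segments:
        # Strip the leading space that segment texts usually carry.
        lines.append(f"[{seg.start:6.2f}s -> {seg.end:6.2f}s] {seg.text.strip()}")
    return "\n".join(lines)
```

Passing the `segments` generator from `model.transcribe(...)` consumes it, so format or collect it only once.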

## Configuration

Default model: `base` (good balance of speed/accuracy on CPU)

To change the model, edit the script or set the `WHISPER_MODEL` environment variable:
```bash
export WHISPER_MODEL=small
```
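
How the scripts read this variable is not shown here. One plausible sketch, assuming the value is one of the five model names listed in this document and that anything else falls back to `base` (the `resolve_model_name` helper and both constants are hypothetical):

```python
import os
from typing import Mapping, Optional

VALID_MODELS = ("tiny", "base", "small", "medium", "large")
DEFAULT_MODEL = "base"


def resolve_model_name(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the model from WHISPER_MODEL, falling back to the default."""
    env = os.environ if env is None else env
    name = env.get("WHISPER_MODEL", DEFAULT_MODEL).strip().lower()
    if name not in VALID_MODELS:
        # Unknown or empty value: fall back rather than fail at load time.
        return DEFAULT_MODEL
    return name
```

Falling back silently is a design choice for a chat handler that should keep working; a stricter script could raise `ValueError` instead.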

## Requirements

- Python 3.8+
- faster-whisper package
- Roughly 75 MB (tiny) to 3 GB (large) of disk space for model weights
- No GPU required (CPU-only)

## Resources

### scripts/
- `transcribe.py` - Main transcription script
- `telegram_voice_handler.py` - Telegram-specific voice message handler

Initial commit: workspace setup with skills, memory, config (2026-02-10 14:37:49 -06:00)