Initial commit: workspace setup with skills, memory, config
79
skills/local-whisper-stt/SKILL.md
Normal file
@@ -0,0 +1,79 @@
---
name: local-whisper-stt
description: Local speech-to-text transcription using Faster-Whisper. Use when receiving voice messages in Telegram (or other channels) that need to be transcribed to text. Automatically downloads and transcribes audio files using local CPU-based Whisper models. Supports multiple model sizes (tiny, base, small, medium, large) with automatic language detection.
---

# Local Whisper STT

## Overview

Transcribes voice messages to text using local Faster-Whisper (CPU-based, no GPU required).

## When to Use

- User sends a voice message in Telegram
- Need to transcribe audio to text locally (free, private)
- Any audio transcription task where cloud STT is not desired

## Models Available

| Model | Parameters | Speed | Accuracy | Use Case |
|-------|------------|-------|----------|----------|
| tiny | 39M | Fastest | Basic | Quick testing, low resources |
| base | 74M | Fast | Good | Default for most use |
| small | 244M | Medium | Better | Better accuracy needed |
| medium | 769M | Slower | Very Good | High accuracy, more RAM |
| large | 1550M | Slowest | Best | Maximum accuracy |

## Workflow

1. Receive voice message (Telegram provides OGG/Opus)
2. Download audio file to temp location
3. Load Faster-Whisper model (cached after first use)
4. Transcribe audio to text
5. Return transcription to conversation
6. Cleanup temp file
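
The steps above can be sketched as a single function. This is a minimal sketch under stated assumptions, not the contents of the skill's `transcribe.py`: `load_model`, `transcribe_voice_message`, the `fetch_audio` callable, and the in-process `_MODEL_CACHE` are hypothetical names introduced for illustration. The download step is passed in as a callable so the sketch does not depend on any particular Telegram client.

```python
import os
import tempfile
from typing import Callable, Dict, Optional

# Hypothetical in-process cache so each model size is loaded only once (step 3).
_MODEL_CACHE: Dict[str, object] = {}


def load_model(size: str = "base"):
    """Return a cached WhisperModel, creating it on first use."""
    if size not in _MODEL_CACHE:
        # Imported lazily so this module can be imported without faster-whisper.
        from faster_whisper import WhisperModel

        _MODEL_CACHE[size] = WhisperModel(size, device="cpu", compute_type="int8")
    return _MODEL_CACHE[size]


def transcribe_voice_message(
    fetch_audio: Callable[[str], None],
    model_size: str = "base",
    model: Optional[object] = None,
) -> str:
    """Download, transcribe, and clean up one voice message.

    fetch_audio(path) is assumed to write the OGG/Opus bytes to `path`.
    """
    fd, path = tempfile.mkstemp(suffix=".ogg")  # temp location (step 2)
    os.close(fd)
    try:
        fetch_audio(path)                        # step 2: download
        if model is None:
            model = load_model(model_size)       # step 3: load (cached)
        segments, _info = model.transcribe(path, beam_size=5)  # step 4
        # Step 5: join the segment texts into one string for the conversation.
        return " ".join(s.text.strip() for s in segments)
    finally:
        os.remove(path)                          # step 6: cleanup, even on error
```

Placing the cleanup in a `finally` block means the temp file is removed even if the download or the transcription raises.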

## Usage

### From Telegram Voice Message

When a voice message arrives, the skill:
1. Downloads the voice file from Telegram
2. Transcribes using the configured model
3. Returns text to the agent context
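
The download step could look roughly like the following. It is a sketch, not the contents of `telegram_voice_handler.py`: the function names (`get_file_url`, `download_url`, `download_voice`) are hypothetical, and it assumes the handler already has the bot token and the voice message's `file_id`. It relies on the Telegram Bot API `getFile` method, which resolves a `file_id` to a `file_path`, and on the Bot API file download endpoint.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org"


def get_file_url(token: str, file_id: str) -> str:
    """URL of the Bot API getFile call that resolves a file_id to a file_path."""
    query = urllib.parse.urlencode({"file_id": file_id})
    return f"{API_ROOT}/bot{token}/getFile?{query}"


def download_url(token: str, file_path: str) -> str:
    """URL from which the file bytes can be fetched."""
    return f"{API_ROOT}/file/bot{token}/{file_path}"


def download_voice(token: str, file_id: str, dest: str) -> None:
    """Resolve file_id with getFile, then save the audio bytes to dest."""
    with urllib.request.urlopen(get_file_url(token, file_id), timeout=30) as resp:
        payload = json.load(resp)
    if not payload.get("ok"):
        raise RuntimeError(f"getFile failed: {payload.get('description')}")
    file_path = payload["result"]["file_path"]
    with urllib.request.urlopen(download_url(token, file_path), timeout=30) as resp:
        with open(dest, "wb") as out:
            out.write(resp.read())
```

A call such as `download_voice(token, file_id, "/tmp/voice.ogg")` would then leave an OGG/Opus file on disk that can be passed to the transcription step.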

### Manual Transcription

```python
# Transcribe a local audio file
from faster_whisper import WhisperModel

# int8 quantization keeps memory use low on CPU
model = WhisperModel("base", device="cpu", compute_type="int8")
segments, info = model.transcribe("/path/to/audio.ogg", beam_size=5)

# info carries the automatically detected language
print(f"Detected language: {info.language} ({info.language_probability:.2f})")

# segments is a generator; transcription runs as it is iterated
for segment in segments:
    print(segment.text)
```
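Each segment returned by `model.transcribe` also carries `start` and `end` times in seconds, so the output can be turned into a timestamped transcript. The helpers below are a small sketch; `format_timestamp` and `format_segments` are hypothetical names, not part of faster-whisper or of this skill's scripts.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as MM:SS (fractions are truncated)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def format_segments(segments) -> str:
    """Build one line per segment: [start -> end] text."""
    return "\n".join(
        f"[{format_timestamp(s.start)} -> {format_timestamp(s.end)}] {s.text.strip()}"
        for s in segments
    )
```

Passing the `segments` generator from the example above to `format_segments` consumes it, so call it instead of the plain `for` loop, not after it.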

## Configuration

Default model: `base` (good balance of speed/accuracy on CPU)

To change the model, edit the script or set the `WHISPER_MODEL` environment variable:

```bash
export WHISPER_MODEL=small
```
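How the scripts read this variable is not shown here. One plausible way, given as an assumption rather than the skill's actual code, is to fall back to `base` when the variable is unset and reject names outside the model list; `resolve_model_name`, `VALID_MODELS`, and `DEFAULT_MODEL` are hypothetical names.

```python
import os

# Model names taken from the "Models Available" list.
VALID_MODELS = ("tiny", "base", "small", "medium", "large")
DEFAULT_MODEL = "base"


def resolve_model_name(environ=None) -> str:
    """Return the model named by WHISPER_MODEL, or the default if unset."""
    env = os.environ if environ is None else environ
    name = env.get("WHISPER_MODEL", "").strip().lower() or DEFAULT_MODEL
    if name not in VALID_MODELS:
        raise ValueError(
            f"Unknown WHISPER_MODEL {name!r}; expected one of {', '.join(VALID_MODELS)}"
        )
    return name
```

Failing early on an unknown name gives a clearer error than letting the model loader fail later during download.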

## Requirements

- Python 3.8+
- faster-whisper package
- ~100MB-1.5GB disk space (depending on model)
- No GPU required (CPU-only)

## Resources

### scripts/
- `transcribe.py` - Main transcription script
- `telegram_voice_handler.py` - Telegram-specific voice message handler