jarvis-memory/skills/local-whisper-stt/SKILL.md

name: local-whisper-stt
description: Local speech-to-text transcription using Faster-Whisper. Use when receiving voice messages in Telegram (or other channels) that need to be transcribed to text. Automatically downloads and transcribes audio files using local CPU-based Whisper models. Supports multiple model sizes (tiny, base, small, medium, large) with automatic language detection.

Local Whisper STT

Overview

Transcribes voice messages to text using local Faster-Whisper (CPU-based, no GPU required).

When to Use

  • User sends a voice message in Telegram
  • Audio needs to be transcribed locally (no API cost, audio stays on the machine)
  • Any audio transcription task where cloud STT is not desired

Models Available

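| Model  | Parameters | Speed   | Accuracy  | Use Case                     |
|--------|------------|---------|-----------|------------------------------|
| tiny   | 39M        | Fastest | Basic     | Quick testing, low resources |
| base   | 74M        | Fast    | Good      | Default for most use         |
| small  | 244M       | Medium  | Better    | Better accuracy needed       |
| medium | 769M       | Slower  | Very Good | High accuracy, more RAM      |
| large  | 1550M      | Slowest | Best      | Maximum accuracy             |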

Workflow

  1. Receive voice message (Telegram provides OGG/Opus)
  2. Download audio file to temp location
  3. Load Faster-Whisper model (cached after first use)
  4. Transcribe audio to text
  5. Return transcription to conversation
  6. Clean up temp file
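
Steps 2 to 6 can be sketched as a single function. This is a minimal sketch, not the skill's actual script: `transcribe_voice`, `fetch_audio`, and `transcribe_file` are hypothetical names, and the download and transcription steps are passed in as callables so that the temp-file handling can be read on its own.

```python
import os
import tempfile
from typing import Callable


def transcribe_voice(
    fetch_audio: Callable[[], bytes],
    transcribe_file: Callable[[str], str],
    suffix: str = ".ogg",
) -> str:
    """Download audio to a temp file, transcribe it, and always clean up.

    Both callables are hypothetical stand-ins: fetch_audio returns the raw
    audio bytes, transcribe_file maps a file path to text.
    """
    # Step 2: write the downloaded bytes to a temp location.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(fetch_audio())
        # Steps 3-5: load the model, transcribe, and return the text.
        return transcribe_file(path).strip()
    finally:
        # Step 6: remove the temp file even if transcription failed.
        if os.path.exists(path):
            os.remove(path)
```

Putting the cleanup in `finally` means a failed download or a transcription error does not leave audio files behind in the temp directory.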

Usage

From Telegram Voice Message

When a voice message arrives, the skill:

  1. Downloads the voice file from Telegram
  2. Transcribes using the configured model
  3. Returns text to the agent context
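
Downloading a voice file through the Telegram Bot API takes two requests: `getFile` resolves the message's `file_id` to a `file_path`, and the file itself is then fetched from the `/file/bot<token>/` endpoint. The sketch below uses only the standard library; the function names are hypothetical and may not match what `telegram_voice_handler.py` does.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.telegram.org"


def get_file_url(token: str, file_id: str) -> str:
    """URL of the Bot API getFile method for a given file_id."""
    query = urllib.parse.urlencode({"file_id": file_id})
    return f"{API_ROOT}/bot{token}/getFile?{query}"


def download_url(token: str, file_path: str) -> str:
    """URL the file content is served from once file_path is known."""
    return f"{API_ROOT}/file/bot{token}/{file_path}"


def download_voice(token: str, file_id: str, dest: str) -> str:
    """Resolve file_id to file_path, then save the audio to dest."""
    with urllib.request.urlopen(get_file_url(token, file_id), timeout=30) as resp:
        payload = json.load(resp)
    if not payload.get("ok"):
        raise RuntimeError(f"getFile failed: {payload.get('description')}")
    file_path = payload["result"]["file_path"]
    with urllib.request.urlopen(download_url(token, file_path), timeout=60) as resp:
        data = resp.read()
    with open(dest, "wb") as handle:
        handle.write(data)
    return dest
```

For a voice message, the `file_id` is found at `message.voice.file_id` in the incoming update.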

Manual Transcription

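# Transcribe a local audio file
from faster_whisper import WhisperModel

# int8 quantization keeps memory use low on CPU
model = WhisperModel("base", device="cpu", compute_type="int8")

# The language is auto-detected when none is passed explicitly
segments, info = model.transcribe("/path/to/audio.ogg", beam_size=5)
print(f"Detected language: {info.language} ({info.language_probability:.2f})")

# segments is a generator; transcription runs as it is iterated
for segment in segments:
    print(segment.text)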
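
To hand a single string back to the conversation rather than printing segment by segment, the segment texts can be joined. A small sketch: `join_segments` is a hypothetical helper that only assumes each segment exposes a `.text` attribute, as Faster-Whisper segments do.

```python
from typing import Iterable


def join_segments(segments: Iterable) -> str:
    """Concatenate segment texts into one whitespace-normalized string."""
    parts = (segment.text.strip() for segment in segments)
    # Skip segments that are empty after stripping.
    return " ".join(part for part in parts if part)


# Usage with the model from the example above:
#   segments, info = model.transcribe("/path/to/audio.ogg", beam_size=5)
#   text = join_segments(segments)
```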

Configuration

Default model: base (good balance of speed/accuracy on CPU)

To change the model, edit the script or set the WHISPER_MODEL environment variable:

export WHISPER_MODEL=small
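
How the script reads this variable is not shown here, so the following is only a sketch of one reasonable approach: fall back to `base` when `WHISPER_MODEL` is unset, reject unknown sizes early, and load each model once per process. `resolve_model_name` and `get_model` are hypothetical names.

```python
import os
from functools import lru_cache

VALID_MODELS = ("tiny", "base", "small", "medium", "large")
DEFAULT_MODEL = "base"


def resolve_model_name(env=None) -> str:
    """Return the model size from WHISPER_MODEL, falling back to the default."""
    env = os.environ if env is None else env
    name = env.get("WHISPER_MODEL", "").strip().lower() or DEFAULT_MODEL
    if name not in VALID_MODELS:
        raise ValueError(
            f"WHISPER_MODEL={name!r} is not one of {', '.join(VALID_MODELS)}"
        )
    return name


@lru_cache(maxsize=None)
def get_model(name: str):
    """Load a model once per process; later calls reuse the same object."""
    # Imported lazily so the config helper works without the package installed.
    from faster_whisper import WhisperModel

    return WhisperModel(name, device="cpu", compute_type="int8")
```

Calling `get_model(resolve_model_name())` then pays the model load cost only on the first voice message.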

Requirements

  • Python 3.8+
  • faster-whisper package
  • Roughly 75 MB to 3 GB of disk space for the downloaded model weights (depending on model)
  • No GPU required (CPU-only)
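
A quick preflight check for the first two requirements might look like the sketch below. `missing_requirements` is a hypothetical helper, not part of the skill's scripts; note that the package installs as `faster-whisper` but imports as `faster_whisper`.

```python
import importlib.util
import sys


def missing_requirements() -> list:
    """Return human-readable problems; an empty list means ready to run."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python 3.8+ is required")
    # find_spec returns None when the module cannot be imported.
    if importlib.util.find_spec("faster_whisper") is None:
        problems.append(
            "faster-whisper is not installed (pip install faster-whisper)"
        )
    return problems
```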

Resources

scripts/

  • transcribe.py - Main transcription script
  • telegram_voice_handler.py - Telegram-specific voice message handler