Voxtype

Push-to-talk voice-to-text for Linux. Optimized for Wayland, works on X11 too.

Hold a hotkey (default: ScrollLock) while speaking, release to transcribe and output the text at your cursor position.

Features

Works on any Linux desktop - Uses compositor keybindings (Hyprland, Sway, River) with evdev fallback for X11 and other environments
Fully offline by default - Uses whisper.cpp for local transcription, with optional remote server support
7 transcription engines - Whisper, Parakeet, Moonshine, SenseVoice, Paraformer, Dolphin, and Omnilingual (see Supported Engines below)
Chinese, Japanese, Korean, and 1600+ languages - SenseVoice, Dolphin, and Omnilingual add native support for CJK and other non-Latin scripts
Meeting mode - Continuous meeting transcription with chunked processing, speaker attribution, and export to Markdown, JSON, SRT, or VTT
Fallback chain - Types via wtype (best CJK support), falls back to dotool (keyboard layout support), ydotool, then clipboard
Push-to-talk or Toggle mode - Hold to record, or press once to start/stop
Audio feedback - Optional sound cues when recording starts/stops
Configurable - Choose your hotkey, model size, output mode, and more
Waybar integration - Optional status indicator shows recording state in your bar

Quick Start

# 1. Build
cargo build --release

# 2. Install typing backend (Wayland)
# Fedora:
sudo dnf install wtype
# Arch:
sudo pacman -S wtype
# Ubuntu:
sudo apt install wtype

# 3. Download whisper model
./target/release/voxtype setup --download

# 4. Add keybinding to your compositor
# See "Compositor Keybindings" section below

# 5. Run
./target/release/voxtype

Compositor Keybindings

Voxtype works best with your compositor's native keybindings. Add these to your compositor config.

Not sure which compositor you have? Run echo $XDG_CURRENT_DESKTOP in a terminal. Common values: Hyprland, sway, river, KDE, GNOME.

Hyprland (~/.config/hypr/hyprland.conf):

bind = SUPER, V, exec, voxtype record start
bindr = SUPER, V, exec, voxtype record stop

Sway (~/.config/sway/config):

bindsym --no-repeat $mod+v exec voxtype record start
bindsym --release $mod+v exec voxtype record stop

River (~/.config/river/init):

riverctl map normal Super V spawn 'voxtype record start'
riverctl map -release normal Super V spawn 'voxtype record stop'

KDE Plasma (KWin):

KDE does not support key-release events, so use toggle mode. Open System Settings > Shortcuts > Custom Shortcuts, create a new shortcut, and set the command to:

voxtype record toggle

Assign your preferred key combination (e.g., Meta+V). Since KDE handles the keybinding, the built-in hotkey should be disabled (see below).

Then disable the built-in hotkey in your config:

# ~/.config/voxtype/config.toml
[hotkey]
enabled = false

X11 / Built-in hotkey fallback: If you're on X11 or prefer voxtype's built-in hotkey (ScrollLock by default), add yourself to the input group: sudo usermod -aG input $USER and log out/in. See the User Manual for details.

Omarchy / Multi-modifier keybindings: If using keybindings with multiple modifiers (e.g., SUPER+CTRL+X), releasing keys slowly can cause typed text to trigger window manager shortcuts instead of inserting text. See Modifier Key Interference in the troubleshooting guide for the solution using output hooks and Hyprland submaps.

Usage

Run voxtype (it runs as a foreground daemon)
Hold ScrollLock (or your configured hotkey)
Speak
Release the key
Text appears at your cursor (or in clipboard if typing isn't available)

Press Ctrl+C to stop the daemon.

Toggle Mode

If you prefer to press once to start recording and again to stop (instead of holding):

# Via command line
voxtype --toggle

# Or in config.toml
[hotkey]
key = "SCROLLLOCK"
mode = "toggle"

Meeting Mode

For longer recordings like meetings and interviews, meeting mode provides continuous transcription with automatic chunking, speaker attribution, and export.

# Start a meeting
voxtype meeting start --title "Weekly standup"

# Check status
voxtype meeting status

# Stop and export
voxtype meeting stop
voxtype meeting export latest --format markdown --speakers --timestamps

Meetings are stored locally and can be exported to Markdown, plain text, JSON, SRT, or VTT. Use voxtype meeting list to see past meetings, and voxtype meeting summarize latest to generate an AI summary via Ollama.

Configuration

Config file location: ~/.config/voxtype/config.toml

See config/default.toml for the full annotated default configuration.

# State file for Waybar/polybar integration (enabled by default)
state_file = "auto"  # Or custom path, or "disabled" to turn off

[hotkey]
key = "SCROLLLOCK"  # Or: PAUSE, F13-F24, RIGHTALT, etc.
modifiers = []      # Optional: ["LEFTCTRL", "LEFTALT"]
# mode = "toggle"   # Uncomment for toggle mode (press to start/stop)

[audio]
device = "default"  # Or specific device from `pactl list sources short`
sample_rate = 16000
max_duration_secs = 60

# Audio feedback (sound cues when recording starts/stops)
# [audio.feedback]
# enabled = true
# theme = "default"   # "default", "subtle", "mechanical", or path to custom dir
# volume = 0.7        # 0.0 to 1.0

[whisper]
model = "base.en"   # tiny, base, small, medium, large-v3, large-v3-turbo
language = "en"     # Or "auto" for detection, or language code (es, fr, de, etc.)
translate = false   # Translate non-English speech to English
# threads = 4       # CPU threads for inference (omit for auto-detect)
# on_demand_loading = true  # Load model only when recording (saves memory)

[output]
mode = "type"       # "type", "clipboard", or "paste"
fallback_to_clipboard = true
type_delay_ms = 0   # Increase if characters are dropped
# auto_submit = true  # Send Enter after transcription (for chat apps, terminals)
# Note: "paste" mode copies to clipboard then simulates Ctrl+V
#       Useful for non-US keyboard layouts where ydotool typing fails

[output.notification]
on_recording_start = false  # Notify when PTT activates
on_recording_stop = false   # Notify when transcribing
on_transcription = true     # Show transcribed text

# Text processing (word replacements, spoken punctuation)
# [text]
# spoken_punctuation = true  # Say "period" → ".", "open paren" → "("
# replacements = { "vox type" = "voxtype", "oh marky" = "Omarchy" }

Audio Feedback

Enable audio feedback to hear a sound when recording starts and stops:

[audio.feedback]
enabled = true
theme = "default"  # Built-in themes: default, subtle, mechanical
volume = 0.7       # 0.0 to 1.0

Built-in themes:

default - Clear, pleasant two-tone beeps
subtle - Quiet, unobtrusive clicks
mechanical - Typewriter/keyboard-like sounds

Custom themes: Point theme to a directory containing start.wav, stop.wav, and error.wav files.

Text Processing

Voxtype can post-process transcribed text with word replacements and spoken punctuation.

Word replacements fix commonly misheard words:

[text]
replacements = { "vox type" = "voxtype", "oh marky" = "Omarchy" }

Spoken punctuation (opt-in) converts spoken words to symbols - useful for developers:

[text]
spoken_punctuation = true

With this enabled, saying "function open paren close paren" outputs function(). Supports period, comma, brackets, braces, newlines, and many more. See CONFIGURATION.md for the full list.

Post-Processing Command (Advanced)

For advanced cleanup, you can pipe transcriptions through an external command
like a local LLM for grammar correction, filler word removal, or text formatting:

[output.post_process]
command = "ollama run llama3.2:1b 'Clean up this dictation. Fix grammar, remove filler words:'"
timeout_ms = 30000  # 30 second timeout for LLM

The command receives text on stdin and outputs cleaned text on stdout. On any
failure (timeout, error), Voxtype gracefully falls back to the original transcription.

See CONFIGURATION.md for more examples including scripts for LM Studio, Ollama, and llama.cpp.

CLI Options

voxtype [OPTIONS] [COMMAND]

Commands:
  daemon      Run as background daemon (default)
  transcribe  Transcribe an audio file
  setup       Setup and installation utilities
  config      Show current configuration
  status      Show daemon status (for Waybar/polybar integration)
  record      Control recording from external sources (compositor keybindings, scripts)
  meeting     Meeting transcription (start, stop, export, summarize)

Setup subcommands:
  voxtype setup              Run basic dependency checks (default)
  voxtype setup --download   Download the configured Whisper model
  voxtype setup systemd      Install/manage systemd user service
  voxtype setup waybar       Generate Waybar module configuration
  voxtype setup model        Interactive model selection and download
  voxtype setup gpu          Manage GPU acceleration (switch CPU/Vulkan)
  voxtype setup onnx         Switch between Whisper and ONNX engines

Status options:
  voxtype status --format json       Output as JSON (for Waybar)
  voxtype status --follow            Continuously output on state changes
  voxtype status --extended          Include model, device, backend in JSON
  voxtype status --icon-theme THEME  Icon theme (emoji, nerd-font, material, etc.)

Record subcommands (for compositor keybindings):
  voxtype record start                     Start recording (send SIGUSR1 to daemon)
  voxtype record start --output-file PATH  Write transcription to a file
  voxtype record stop                      Stop recording and transcribe (send SIGUSR2 to daemon)
  voxtype record toggle                    Toggle recording state

Options:
  -c, --config <FILE>    Path to config file
  -v, --verbose          Increase verbosity (-v, -vv)
  -q, --quiet            Quiet mode (errors only)
  --clipboard            Force clipboard mode
  --paste                Force paste mode (clipboard + Ctrl+V)
  --model <MODEL>        Override transcription model
  --engine <ENGINE>      Override transcription engine (whisper, parakeet, moonshine, sensevoice, paraformer, dolphin, omnilingual)
  --hotkey <KEY>         Override hotkey
  --toggle               Use toggle mode (press to start/stop)
  -h, --help             Print help
  -V, --version          Print version

Whisper Models

Model	Size	English WER	Speed
tiny.en	39 MB	~10%	Fastest
base.en	142 MB	~8%	Fast
small.en	466 MB	~6%	Medium
medium.en	1.5 GB	~5%	Slow
large-v3	3 GB	~4%	Slowest
large-v3-turbo	1.6 GB	~4%	Fast

For most uses, base.en provides a good balance of speed and accuracy. If you have a GPU, large-v3-turbo offers excellent accuracy with fast inference.

Multilingual Support

The .en models are English-only but faster and more accurate for English. For other languages, use large-v3 which supports 99 languages.

Use Case 1: Transcribe in the spoken language (speak French, output French)

[whisper]
model = "large-v3"
language = "auto"     # Auto-detect and transcribe in that language
translate = false

Use Case 2: Translate to English (speak French, output English)

[whisper]
model = "large-v3"
language = "auto"     # Auto-detect the spoken language
translate = true      # Translate output to English

Use Case 3: Force a specific language (always transcribe as Spanish)

[whisper]
model = "large-v3"
language = "es"       # Force Spanish transcription
translate = false

With GPU acceleration, large-v3 achieves sub-second inference while supporting all languages.

Supported Engines

Voxtype ships separate binaries for Whisper and ONNX engines. Use voxtype setup onnx --enable to switch to the ONNX binary, or --disable to switch back.

Engine	Languages	Architecture	Best For
Whisper (default)	99 languages	Encoder-decoder (whisper.cpp)	General use, multilingual
Parakeet	English	FastConformer TDT (ONNX)	Fast English transcription
Moonshine	English	Encoder-decoder (ONNX)	Edge devices, low memory
SenseVoice	zh, en, ja, ko, yue	CTC encoder (ONNX)	Chinese, Japanese, Korean
Paraformer	zh+en, zh+yue+en	Non-autoregressive (ONNX)	Chinese-English bilingual
Dolphin	40 languages + 22 Chinese dialects	CTC E-Branchformer (ONNX)	Eastern languages (no English)
Omnilingual	1600+ languages	wav2vec2 CTC (ONNX)	Low-resource and rare languages

To set the engine in your config:

engine = "sensevoice"  # or: whisper, parakeet, moonshine, paraformer, dolphin, omnilingual

Or override on the command line:

voxtype --engine sensevoice

GPU Acceleration

Voxtype supports optional GPU acceleration for significantly faster inference. With GPU acceleration, even the large-v3 model can achieve sub-second inference times.

Vulkan (AMD, NVIDIA, Intel)

Packages include a Vulkan binary. To enable GPU acceleration:

# Install Vulkan runtime (if not already installed)
# Arch:
sudo pacman -S vulkan-icd-loader

# Ubuntu/Debian:
sudo apt install libvulkan1

# Fedora:
sudo dnf install vulkan-loader

# Enable GPU acceleration
sudo voxtype setup gpu --enable

# Check status
voxtype setup gpu

To switch back to CPU: sudo voxtype setup gpu --disable

Building from Source (CUDA, Metal, ROCm)

For other GPU backends, build from source with the appropriate feature flag:

CUDA (NVIDIA)

# Install CUDA toolkit first, then:
cargo build --release --features gpu-cuda

Metal (macOS/Apple Silicon)

cargo build --release --features gpu-metal

HIP/ROCm (AMD alternative)

cargo build --release --features gpu-hipblas

Performance Comparison

Results vary by hardware. Example on AMD RX 6800:

Model	CPU	Vulkan GPU
base.en	~7x realtime	~35x realtime
large-v3	~1x realtime	~5x realtime

Requirements

System Requirements

Linux with glibc 2.38+ (Ubuntu 24.04+, Fedora 39+, Arch, Debian Trixie+)
Wayland or X11 desktop (GNOME, KDE, Sway, Hyprland, River, i3, etc.)

Runtime Dependencies

PipeWire or PulseAudio (for audio capture)
wtype (for typing output on Wayland) - recommended, best CJK/Unicode support
dotool - for non-US keyboard layouts (German, French, etc.) - supports XKB layouts
ydotool + daemon - for X11 or as Wayland fallback
wl-clipboard (for clipboard fallback on Wayland)

Permissions

Wayland compositors: No special permissions needed when using compositor keybindings
Built-in hotkey / X11: User must be in the input group (for evdev access)

Installing Dependencies

Fedora:

sudo dnf install wtype wl-clipboard

Ubuntu/Debian:

sudo apt install wtype wl-clipboard

Arch:

sudo pacman -S wtype wl-clipboard

Building from Source

# Install Rust if needed
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Install build dependencies
# Fedora:
sudo dnf install alsa-lib-devel

# Ubuntu:
sudo apt install libasound2-dev

# Build (Whisper engine only)
cargo build --release

# Build with ONNX engines (Parakeet, Moonshine, SenseVoice, etc.)
cargo build --release --features parakeet,moonshine,sensevoice,paraformer,dolphin

# Or just the engine you need
cargo build --release --features parakeet

# Binary is at: target/release/voxtype

ONNX engines require the corresponding Cargo feature at build time. Without it, setting
engine = "parakeet" in your config will fail with an error. The prebuilt release binaries
(-onnx-avx2, -onnx-cuda, etc.) include all ONNX engines.

AppImage (Universal)

AppImage works on any Linux distribution without installation:

# Download the appropriate AppImage from the GitHub release
chmod +x voxtype-*-x86_64.AppImage

# Move to a permanent location
mv voxtype-*-x86_64.AppImage ~/.local/bin/voxtype

# Run setup (downloads model, configures service)
~/.local/bin/voxtype setup

Available AppImage variants:

voxtype-{ver}-x86_64.AppImage - Whisper engine with CPU and Vulkan GPU support (recommended)
voxtype-{ver}-onnx-x86_64.AppImage - ONNX engines (Parakeet, Moonshine, etc.) + Vulkan Whisper
voxtype-{ver}-onnx-cuda-x86_64.AppImage - ONNX engines with NVIDIA CUDA + Vulkan Whisper

Each ONNX AppImage also includes the Vulkan Whisper binary, so you can switch between
engines via engine = "whisper" or engine = "parakeet" in your config without changing
AppImages. For GPU-accelerated Whisper in the Whisper-only AppImage, set VOXTYPE_GPU=1.

Waybar Integration

Add to your Waybar config:

"custom/voxtype": {
    "exec": "voxtype status --follow --format json",
    "return-type": "json",
    "format": "{}",
    "tooltip": true
}

The state file is enabled by default (state_file = "auto"). If you've disabled it, re-enable it:

state_file = "auto"

Extended Status Info

Use --extended to include model, device, and backend in the JSON output:

voxtype status --format json --extended

Output:

{
  "text": "🎙️",
  "class": "idle",
  "tooltip": "Voxtype ready\nModel: base.en\nDevice: default\nBackend: CPU (AVX-512)",
  "model": "base.en",
  "device": "default",
  "backend": "CPU (AVX-512)"
}

Waybar config with model display:

"custom/voxtype": {
    "exec": "voxtype status --follow --format json --extended",
    "return-type": "json",
    "format": "{} [{}]",
    "format-alt": "{model}",
    "tooltip": true
}

Troubleshooting

"Cannot open input device" error

This only affects the built-in evdev hotkey. You have two options:

Option 1: Use compositor keybindings (recommended)
Configure your compositor to call voxtype record start/stop and disable the built-in hotkey. See "Compositor Keybindings" above.

Option 2: Add yourself to the input group

sudo usermod -aG input $USER
# Log out and back in

Text not appearing / typing not working

Voxtype uses wtype (preferred), dotool, or ydotool for typing output:

# Check available typing backends
which wtype dotool ydotool

# For non-US keyboard layouts, install dotool and configure:
# In ~/.config/voxtype/config.toml:
# [output]
# dotool_xkb_layout = "de"  # Your layout (de, fr, es, etc.)

# If using ydotool fallback (X11/TTY), start the daemon:
systemctl --user start ydotool
systemctl --user enable ydotool  # Start on login

KDE Plasma / GNOME users: wtype does not work on these desktops. Voxtype automatically falls back to dotool (recommended for non-US layouts) or ydotool. See Troubleshooting for setup instructions.

No audio captured

Check your default audio input:

# List audio sources
pactl list sources short

# Test recording
arecord -d 3 -f S16_LE -r 16000 test.wav
aplay test.wav

Text appears slowly

If characters are being dropped, increase the delay:

[output]
type_delay_ms = 10

Architecture

flowchart LR
    subgraph Input
        Hotkey["Hotkey<br/>(compositor/evdev)"] --> Audio["Audio<br/>(cpal)"]
    end
    subgraph Transcription
        Audio --> Engine{Engine?}
        Engine -->|whisper| WhisperBackend{Backend?}
        Engine -->|onnx| ONNX["ONNX Engine<br/>(Parakeet, Moonshine,<br/>SenseVoice, Paraformer,<br/>Dolphin, Omnilingual)"]
        WhisperBackend -->|local| Whisper["Whisper<br/>(whisper-rs)"]
        WhisperBackend -->|cli| CLI["whisper-cli<br/>(subprocess)"]
        WhisperBackend -->|remote| Remote["Remote Server<br/>(HTTP API)"]
    end
    subgraph Output
        Whisper --> PostProcess["Post-Process<br/>(optional)"]
        CLI --> PostProcess
        Remote --> PostProcess
        ONNX --> PostProcess
        PostProcess --> PreHook["Pre-Output Hook"]
        PreHook --> TextOutput["Output<br/>(wtype/dotool/ydotool)"]
        TextOutput --> PostHook["Post-Output Hook"]
        PreHook -.-> Compositor["Compositor<br/>(submap/mode)"]
        PostHook -.-> Compositor
    end

Multiple transcription engines. Voxtype supports 7 transcription engines across two runtime backends:

Whisper (default): OpenAI's Whisper model via whisper.cpp. Supports local in-process, CLI subprocess, and remote HTTP backends. 99 languages.
ONNX engines (via ONNX Runtime): Parakeet (English), Moonshine (English), SenseVoice (zh/en/ja/ko/yue), Paraformer (zh+en bilingual), Dolphin (40 languages + Chinese dialects, no English), Omnilingual (1600+ languages). Switch engines with voxtype setup onnx.

Why compositor keybindings? Wayland compositors like Hyprland, Sway, and River support key-release events, enabling push-to-talk without special permissions. Voxtype's record start/stop commands integrate directly with your compositor's keybinding system.

Fallback: evdev hotkey. For X11 or compositors without key-release support, voxtype includes a built-in hotkey using evdev (the Linux input subsystem). This requires the user to be in the input group.

Why wtype + dotool + ydotool? On Wayland, wtype uses the virtual-keyboard protocol for text input, with excellent Unicode/CJK support and no daemon required. When wtype fails (KDE/GNOME), dotool provides keyboard layout support via XKB for non-US layouts. As a final fallback, ydotool uses uinput for text injection on X11/TTY. This combination ensures Voxtype works on any Linux desktop with proper keyboard layout support.

Post-processing. Transcriptions can optionally be piped through an external command before output. Use this to integrate local LLMs (Ollama, llama.cpp) for grammar correction, text expansion, or domain-specific vocabulary. Any command that reads stdin and writes stdout works.

Feedback

We want to hear from you! Voxtype is a young project and your feedback helps make it better.

Something not working? If Voxtype doesn't install cleanly, doesn't work on your system, or is buggy in any way, please open an issue. I actively monitor and respond to issues.
Like Voxtype? I don't accept donations, but if you find it useful:
- A GitHub star helps others discover the project
- Arch users: a vote on the AUR package helps keep it maintained

Contributors

Peter Jackson - Creator and maintainer
jvantillo - GPU acceleration patch, whisper-rs 0.15.1 compatibility
materemias - Paste output mode, on-demand model loading, single-instance safeguard, meeting mode post-processing, PKGBUILD fix
Dan Heuckeroth - NixOS Home Manager module design
Kevin Miller - NixOS module enhancements, ROCm support
reisset - Testing and feedback on post-processing feature
Goodroot - Testing, feedback, and documentation updates
robzolkos - Auto-submit feature for AI agent workflows
konnsim - Modifier key interference bug report
IgorWarzocha - Hyprland submap solution for modifier key fix
Zubair - dotool output driver with keyboard layout support
ayoahha - CLI backend for whisper-cli subprocess transcription
Loki Coyote - eitype output driver for KDE/GNOME support, media keys and numeric keycode hotkey support
Umesh - Documentation website
Sami Jawhar - Eager input processing wiring
KaiStarkk - Post-process trim and fallback_on_empty options
graysky - Flash attention config fix

License

MIT