You want to speak to your machine, but the settings menu is silent
You just finished configuring your workspace. You want to dictate notes, toggle media playback, or launch terminals without touching the keyboard. You open GNOME Settings, search for voice, and find nothing. The accessibility panel offers high contrast, screen readers, and pointer controls. It does not offer continuous voice command processing. You are not missing a hidden toggle. Fedora ships with a curated desktop environment that prioritizes stability and explicit user control over always-listening assistants. If you want voice control, you need to build the pipeline yourself.
What is actually happening under the hood
Voice control on a desktop Linux system is not a single application. It is a chain of three distinct components. First, the audio server captures raw PCM data from your microphone. Second, a speech recognition engine converts that audio stream into text tokens. Third, a dispatcher matches those tokens to system actions and executes them via D-Bus, shell commands, or window manager shortcuts.
Think of it like a relay race. The microphone hands the baton to the audio server. The audio server passes it to the recognition engine. The engine decodes the phrase and hands it to the dispatcher. If any link drops the baton, the command fails. Fedora handles the first link natively through PipeWire. The remaining two links require third-party software. You will install a recognition engine, configure it to listen on a direct stream, and write a lightweight script to translate recognized phrases into desktop actions.
Run systemctl --user status pipewire before you start. Verify the audio server is active and not in a failed state. A broken audio daemon will silently swallow your microphone input.
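The check is a single command:
systemctl --user status pipewire # Should report "active (running)" for the user-session audio daemon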
Configure the audio pipeline before installing anything
Speech recognition fails most often because of sample rate mismatches, not because of bad AI models. PipeWire defaults to 48000 Hz. Many open-source recognition engines expect 16000 Hz. If you feed 48 kHz audio to a 16 kHz model, the engine will hear static and return empty results.
Check your current audio routing and default sample rate. Run this to verify PipeWire is active and list your capture devices.
pactl info | grep "Default Sample Rate" # Confirms the system-wide audio rate
pactl list short sources # Shows available microphones and virtual inputs
If your default rate is 48000, you will need to resample on the fly or configure your application to request 16000 Hz. Most modern engines handle resampling internally, but verifying the baseline prevents hours of debugging later. Match the engine to the hardware.
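A quick way to prove the capture path at the model's rate is to record a few seconds with PipeWire's own command-line tools and listen to the result. This assumes the pw-record and pw-play utilities that ship with PipeWire are installed:
pw-record --rate 16000 --channels 1 /tmp/mic-test.wav # Captures mono 16 kHz audio from the default source; stop with Ctrl+C
pw-play /tmp/mic-test.wav # Plays the capture back so you can hear whether it is clean or garbled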
Install and test the recognition engine
Two engines come up in most guides: Mycroft and Vosk. Mycroft reached end-of-life in 2023 and its repositories are archived. Vosk remains actively maintained, runs entirely offline, and exposes a clean Python API. It is the practical choice for a Fedora desktop.
Install a minimal audio capture library and the Vosk Python bindings. Fedora's packaging keeps Python libraries in the python3- namespace; Vosk itself is not in the official repositories, so it comes from PyPI.
sudo dnf install python3-pyaudio # Installs the PortAudio-based audio I/O bindings from the Fedora repos
pip3 install --user vosk # Installs the Vosk recognition engine from PyPI into your home directory
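The model itself ships separately from the Python package. At the time of writing, the compact English model used below can be fetched from the Vosk model repository; adjust the file name if a newer release has replaced it:
curl -LO https://alphacephei.com/vosk/models/vosk-model-small-en-us-0.15.zip # Downloads the small English model (roughly 40 MB)
unzip vosk-model-small-en-us-0.15.zip # Unpacks into the directory name the scripts below expect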
Create a test script to verify the engine can decode speech from your microphone. This script opens a stream, feeds chunks to Vosk, and prints recognized text to stdout.
#!/usr/bin/env python3
import vosk
import pyaudio
import json
# Initialize the model. Replace with your actual model path.
# Vosk models are language-specific and must be downloaded separately.
model = vosk.Model("vosk-model-small-en-us-0.15") # Loads the compact English model into RAM
# Open a direct audio stream at 16000 Hz, mono, 16-bit
# This matches the model's training data exactly
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
# Start the recognition loop
rec = vosk.KaldiRecognizer(model, 16000) # Creates a decoder instance tuned to 16 kHz
print("Listening...")
while True:
    data = stream.read(4000, exception_on_overflow=False) # Reads one chunk, ignoring buffer overruns
    if rec.AcceptWaveform(data): # True when the engine has decoded a complete utterance
        result = json.loads(rec.Result()) # Parses the JSON output containing the transcript
        print("Recognized:", result.get("text", "")) # Prints the decoded phrase
    else:
        partial = json.loads(rec.PartialResult()) # PartialResult() also returns a JSON string
        print("Partial:", partial.get("partial", "")) # Shows in-progress decoding
Save this as test_vosk.py, make it executable, and run it. Speak clearly into your microphone. The terminal should print your words within a second. If it prints empty strings or garbled characters, your microphone is likely delivering 48 kHz audio. Add a resampling step or switch to a virtual source that delivers 16 kHz.
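If the capture still arrives at the wrong rate, one diagnostic option is to force the whole PipeWire graph to 16 kHz through its metadata interface. Treat this as a temporary test, since it affects every audio stream until you reset it:
pw-metadata -n settings 0 clock.force-rate 16000 # Forces the graph to run at 16 kHz for all clients
pw-metadata -n settings 0 clock.force-rate 0 # Resets the forced rate back to the default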
Build the command dispatcher
Raw transcription is useless without action. You need a dispatcher that watches the transcript stream and triggers desktop commands. The dispatcher should run as a background service, listen for specific keywords, and execute lightweight actions. Avoid heavy GUI automation tools. Use xdotool for window management and dbus-send for GNOME integration, keeping in mind that xdotool only controls X11 windows; on a default Fedora Wayland session, lean on D-Bus and loginctl instead.
Install the automation dependencies.
sudo dnf install xdotool dbus-tools # Provides X11 window control and the dbus-send messaging utility
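Before wiring anything into the dispatcher, confirm each action works when typed by hand, so any later failure points at the recognition side rather than the command itself:
xdotool getactivewindow getwindowname # Prints the title of the focused window; only works in an X11 session
loginctl lock-session # Locks the screen immediately, exactly as the dispatcher will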
Write a dispatcher that wraps the Vosk engine and maps phrases to actions. This example handles three commands: opening a terminal, toggling media playback, and locking the screen.
#!/usr/bin/env python3
import subprocess
import vosk
import pyaudio
import json
model = vosk.Model("vosk-model-small-en-us-0.15")
audio = pyaudio.PyAudio()
stream = audio.open(format=pyaudio.paInt16, channels=1, rate=16000, input=True, frames_per_buffer=8000)
rec = vosk.KaldiRecognizer(model, 16000)
# Define command mappings. Keep phrases short and distinct.
# Vosk works best with clear, unambiguous triggers.
COMMANDS = {
    "open terminal": "gnome-terminal",
    # Replace "vlc" below with the MPRIS bus name of the player you actually use
    "play pause": "dbus-send --print-reply --dest=org.mpris.MediaPlayer2.vlc /org/mpris/MediaPlayer2 org.mpris.MediaPlayer2.Player.PlayPause",
    "lock screen": "loginctl lock-session"
}
print("Voice dispatcher active. Say a command.")
while True:
    data = stream.read(4000, exception_on_overflow=False)
    if rec.AcceptWaveform(data):
        result = json.loads(rec.Result())
        text = result.get("text", "").lower().strip()
        # Match the recognized text against known commands
        for phrase, action in COMMANDS.items():
            if phrase in text:
                print(f"Executing: {action}")
                subprocess.run(action, shell=True) # Runs the mapped command in a subshell
                break
        else: # Runs only when the for loop found no matching phrase
            if text:
                print(f"Unrecognized: {text}")
Run this script in the background. Test each phrase. The terminal should launch, media should toggle, and the screen should lock. If a command fails, check the exact string Vosk outputs. Speech recognition rarely returns exact matches. Use substring matching or phonetic aliases in your dictionary.
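Aliases are nothing more than extra dictionary keys pointing at the same action. A hypothetical pair of additions to the COMMANDS table above:
COMMANDS["open a terminal"] = COMMANDS["open terminal"] # Catches the filler word Vosk sometimes hears
COMMANDS["pause music"] = COMMANDS["play pause"] # Alternative phrasing that triggers the same MPRIS call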
Run it as a persistent user service
Running a Python script in a terminal window is fine for testing. It is not reliable for daily use. You need the dispatcher to start automatically when you log in and restart if it crashes. Create a systemd user service file in ~/.config/systemd/user/voice-dispatcher.service.
[Unit]
Description=Offline Voice Command Dispatcher
# Ensures the audio server is ready before the dispatcher starts
After=pipewire.service

[Service]
# Points to your dispatcher script; adjust the path and make the file executable
ExecStart=/home/youruser/bin/voice_dispatcher.py
# Automatically restarts the dispatcher if the process exits unexpectedly
Restart=on-failure
# Waits five seconds before retrying to avoid rapid restart loops
RestartSec=5

[Install]
# Starts when your user session initializes
WantedBy=default.target
Reload the user manager and enable the service.
systemctl --user daemon-reload # Picks up the new unit file
systemctl --user enable --now voice-dispatcher.service # Starts the service and enables auto-start
Check the service status immediately. Use systemctl --user status voice-dispatcher.service to verify it is active. If it fails, run journalctl --user -xeu voice-dispatcher.service to read the actual error before guessing.
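Both checks in full:
systemctl --user status voice-dispatcher.service # Confirms the unit is active and shows its last few log lines
journalctl --user -xeu voice-dispatcher.service # Prints the full error context if the unit failed to start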
Verify it worked
Run the dispatcher service and speak each command twice. Watch the terminal output. Confirm that xdotool or dbus-send triggers the expected behavior. Check system logs for permission denials or audio buffer underruns.
journalctl --user -u voice-dispatcher.service --since "5 minutes ago" | grep -i error # Checks for application-level failures
journalctl --user -u pipewire --since "5 minutes ago" | grep -i underrun # Checks for audio pipeline starvation
If the commands execute reliably and the audio stream stays stable, the pipeline is solid. If something misbehaves right after an install or update, log out and back in, or reboot, before deep debugging; a stale session is often the whole problem.
Common pitfalls and what the error looks like
Voice control setups fail in predictable ways. Recognize the symptoms early.
You will see an OSError reporting Input overflowed (PortAudio error -9981) when the audio buffer fills faster than the script drains it. This happens when heavy computation blocks the read loop. Increase frames_per_buffer to give the stream more headroom, keep slow work out of the loop, or pass exception_on_overflow=False to stream.read as the example scripts already do.
You will see ALSA lib confmisc.c:767:(parse_card) cannot find card '0' when the script tries to open a microphone that PipeWire has hidden or renamed. List active sources with pactl list short sources and pass the correct input_device_index to audio.open.
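If you need that index, a short snippet like this (a hypothetical helper, using the same PyAudio session style as the scripts above) lists every capture-capable device and the number to hand to input_device_index:
import pyaudio

audio = pyaudio.PyAudio()
for index in range(audio.get_device_count()): # Walks every device PortAudio can see
    info = audio.get_device_info_by_index(index)
    if info.get("maxInputChannels", 0) > 0: # Keeps only devices that can capture audio
        print(index, info["name"]) # This index is what input_device_index expects
audio.terminate() # Releases PortAudio resources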
SELinux can block D-Bus commands when the dispatcher runs in a context that is not allowed to talk to the target service, which can happen once you move it from an interactive shell into a background unit. You will see Permission denied in the journal. Run ausearch -m avc -ts recent to find the exact denial. Adjust the policy or run the dispatcher under your normal user session. Never disable SELinux to fix a script. SELinux denials also show up in journalctl -t setroubleshoot with a one-line summary. Read those before changing security contexts.
Finally, prefer packaged tools over hand-rolled installs where possible: files that dnf manages stay consistent across upgrades, while manual edits drift.
When to use this vs alternatives
Use Vosk with a custom dispatcher when you want offline processing, zero telemetry, and full control over command mapping. Use a cloud-based assistant when you need natural language understanding, multi-turn conversations, and integration with smart home ecosystems. Use GNOME's built-in accessibility shortcuts when you only need keyboard-driven navigation and screen reader support. Stay on the terminal-based pipeline if you are comfortable writing Python and managing background services.