Voice Control

Voice Control allows you to use speech-to-text functionality to dictate your prompts directly into the chat input field. This feature provides a hands-free way to interact with AiderDesk, making it easier to input long or complex prompts without typing.

Supported Providers

Voice control is currently supported with the following AI providers:

OpenAI: Uses OpenAI's real-time speech-to-text API
Google Gemini: Uses Gemini's live audio input capabilities

Note: Only one provider can have voice control enabled at a time. This ensures proper audio stream management and avoids conflicts.

Enabling Voice Control

Step 1: Configure a Provider Profile

Voice control uses the provider profile that has Voice Control enabled (OpenAI or Gemini).

Open the Model Library (database icon in the top bar)
Select your preferred provider profile (OpenAI or Gemini)
Enter your API key in the provider settings
Save the configuration

Step 2: Enable Voice Control (Model Library)

You can enable voice control directly on a provider profile:

Open the Model Library
Select the provider profile you configured in Step 1
Find the Voice Control section
Enable Voice Control and save

Note: Only one provider profile can have voice control enabled at a time.

Step 3: Configure Voice Options (Settings → Voice)

Use Settings → Voice for the detailed configuration:

Open Settings
Go to the Voice tab
Select the Provider Profile you want to use for voice control
Configure voice options as needed:
- Model (provider-specific)
- Microphone (choose a device or keep Default)
- Idle timeout (silence duration before auto-stop; default is 5 seconds)
- System instructions (what the speech-to-text session should expect)
- OpenAI: Language
- Gemini: Temperature

The microphone icon will appear in the chat input when a supported provider profile has voice enabled.

Using Voice Control

Starting Voice Recording

Once voice control is enabled, you'll see a microphone icon in the chat input area:

Click the microphone icon to start recording
Speak clearly into your microphone
The audio analyzer will show visual feedback of your voice input
Click the microphone icon again to stop recording

Real-time Transcription

As you speak, the system will transcribe your speech in real-time:

Live Transcription: Text appears as you speak
Automatic Silence Detection: Recording stops automatically after the configured idle timeout (default: 5 seconds)
Visual Feedback: Audio level indicators show when your voice is being detected

After Recording

Once you stop recording:

The transcribed text will appear in the chat input field
You can edit the text before sending if needed
Press Enter or click Send to submit your prompt

Technical Implementation

Audio Processing

The voice control system uses Web Audio API for:

Audio Capture: Accesses microphone through navigator.mediaDevices.getUserMedia()
Audio Processing: Real-time audio level monitoring and silence detection
Format Conversion: Converts audio to PCM format for provider APIs

Provider Integration

OpenAI Integration

Uses WebRTC for real-time audio streaming
Supports OpenAI's real-time speech-to-text API
Handles audio buffer management and transcription events

Gemini Integration

Uses Google GenAI SDK for live audio input
Uses the voice model gemini-2.5-flash-native-audio-preview-12-2025 (current default)
Implements an idle timeout (silence) auto-stop (default: 5 seconds)

Security and Privacy

Local Processing: Audio is processed locally before being sent to providers
Secure Transmission: All audio data is transmitted using encrypted HTTPS connections
No Local Storage: Audio recordings are not stored locally after transcription
Permission Required: Microphone access requires explicit user permission

Requirements and Limitations

System Requirements

Microphone: A working microphone is required
Browser Permissions: Microphone access must be granted
Network Connection: Stable internet connection for provider API communication
API Key: Valid API key for OpenAI or Gemini

Current Limitations

One Provider at a Time: Only one provider profile can have voice control enabled
No Voice Commands: Voice control only transcribes speech, it doesn't execute voice commands
No Audio Playback: The system doesn't provide text-to-speech capabilities

Platform Support

Desktop: Full support on Windows, macOS, and Linux
Microphone Access: Requires microphone permissions on all platforms
Electron Security: Microphone access is properly sandboxed within the application

Troubleshooting

Common Issues

Microphone Not Working

Check if your microphone is connected and working
Verify microphone permissions in your system settings
In Settings → Voice, try selecting a specific microphone device (instead of Default)
Ensure no other application is using the microphone
Restart the application

Voice Control Not Available

Verify you have a provider profile configured with a valid API key
Open Settings → Voice and select a supported provider profile (OpenAI/Gemini)
Ensure only one provider profile has voice control enabled
Restart the application if changes don't take effect

Poor Transcription Quality

Speak clearly and at a moderate pace
Ensure minimal background noise
Check your microphone quality and positioning
Try moving closer to the microphone

Connection Issues

Verify your internet connection is stable
Check if your API key is valid and has sufficient credits
Ensure the provider's API is operational
Try switching to a different provider if available

Configuration Options

Audio Settings

The voice control system includes several configurable parameters:

Microphone: Select a specific input device (or keep Default)
Idle timeout: Automatically stops recording after a period of silence (default: 5 seconds)

Provider Settings

Each provider has specific configuration options (available in Settings → Voice):

OpenAI

Model: gpt-4o-transcribe or gpt-4o-mini-transcribe
Language: Selectable (default: en)
System instructions: Customizable
Idle timeout: Customizable

Gemini

Model: gemini-2.5-flash-native-audio-preview-12-2025 (current default)
Temperature: Slider from 0 to 1 (default: 0.7)
System instructions: Customizable
Idle timeout: Customizable

Future Enhancements

Planned improvements for voice control include:

Voice Commands: Ability to execute commands through voice
Multiple Provider Support: Enable voice control on multiple providers simultaneously
Text-to-Speech: Add audio feedback for system responses

Supported Providers​

Enabling Voice Control​

Step 1: Configure a Provider Profile​

Step 2: Enable Voice Control (Model Library)​

Step 3: Configure Voice Options (Settings → Voice)​

Using Voice Control​

Starting Voice Recording​

Real-time Transcription​

After Recording​

Technical Implementation​

Audio Processing​

Provider Integration​

OpenAI Integration​

Gemini Integration​

Security and Privacy​

Requirements and Limitations​

System Requirements​

Current Limitations​

Platform Support​

Troubleshooting​

Common Issues​

Microphone Not Working​

Voice Control Not Available​

Poor Transcription Quality​

Connection Issues​

Configuration Options​

Audio Settings​

Provider Settings​

OpenAI​

Gemini​

Future Enhancements​